LAGO Help

Introduction

LAGO finds significant GO terms among a list of identifiers (e.g. gene names), helping you discover what they have in common in the context of a given annotation, a background population, and the GO ontologies.

In broad terms, it does this by mapping the provided identifier list, via the annotation, to terms in the ontology, and by mapping a background population in the same way. It then calculates the significance (p-value) of each term using the hypergeometric distribution and applies Bonferroni correction by default. Finally, it provides tables and graphs of the significant terms, omitting terms with a p-value higher than the cutoff, if one is provided. P-value computation and correction may be turned off if desired. Turning off p-value computation is equivalent to setting a cutoff of 1.0.

This tool is based largely on GOTermFinder, developed here at the Lewis-Sigler Institute at Princeton University and made available as part of the GMOD project, and on the GO-TermFinder Perl module by Gavin Sherlock. However, it is a complete rework of both in the C programming language, with many bug fixes, more efficient data structures, and optimizations. The result is a tool that is similar in functionality (with some improvements), is 50 times faster (or more) for common queries, and uses as little as 1/20 (or less) of the system memory. This allows more interactive analysis and much greater scalability.

For more information about GO::TermFinder and the method for calculating statistical significance, please see Boyle et al., Bioinformatics (2004). For publication, please cite that document and this tool (including the URL).

Form Options

1. Identifier list

The identifier list should contain one identifier per line, with no spaces. For example:

SPBC16C6.08c
SPBC16H5.06
SPBC29A3.18
SPCC1682.01
SPCC191.07
SPCC613.10
SPMIT.05
qcr10
qcr7
qcr8

For annotation-specific examples, please refer to the annotation files table.
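
Before uploading, it can help to check a list against the one-identifier-per-line rule. The following is a minimal sketch of such a pre-check, not part of LAGO itself; the function name `read_identifiers` is hypothetical, and skipping blank lines and trimming trailing whitespace are assumptions about what is convenient rather than documented LAGO behaviour.

```python
def read_identifiers(text):
    """Split a one-identifier-per-line list into clean identifiers and problem lines.

    Returns (identifiers, problems), where problems holds (line_number, raw_line)
    for every line that contains whitespace inside the identifier.
    """
    identifiers, problems = [], []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are simply skipped
        if any(ch.isspace() for ch in line):
            problems.append((number, raw))  # more than one token on the line
            continue
        identifiers.append(line)
    return identifiers, problems


if __name__ == "__main__":
    ids, bad = read_identifiers("SPBC16C6.08c\nqcr7 qcr8\nSPMIT.05\n")
    print(ids)
    print(bad)
```

Lines reported in the second list need to be split or corrected by hand before the file is uploaded.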

Batch processing

If the uploaded file name ends in ".tar", then it is assumed to be a tar archive containing individual files that each contain a separate list of identifiers. Each file is then analyzed independently.

You can create a tar file on unix (Linux, MacOS X, etc.) using a command like one of the following:

tar -cf batchList.tar orfs1.txt orfs2.txt ...
tar -cf batchList.tar *.txt
tar -cf batchList.tar clusterORFs/

Please note that all files within the tar archive will be processed. If it contains other types of files, this can significantly impact the performance of this tool, and the behaviour regarding those files is undefined.
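
Because every file in the archive is processed, one way to avoid stray files is to build the archive from an explicit selection. The sketch below uses Python's standard `tarfile` module; the function name `build_batch_archive` and the choice of filtering on a `.txt` suffix are assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def build_batch_archive(list_dir, archive_path, suffix=".txt"):
    """Write an uncompressed tar containing only regular files ending in `suffix`.

    Returns the names of the files that were added, in sorted order.
    """
    members = sorted(
        p for p in Path(list_dir).iterdir()
        if p.is_file() and p.suffix == suffix
    )
    with tarfile.open(str(archive_path), "w") as archive:  # "w" = plain tar
        for path in members:
            # Store each file under its bare name, without directory components.
            archive.add(str(path), arcname=path.name)
    return [p.name for p in members]
```

For example, `build_batch_archive("clusterORFs", "batchList.tar")` would collect only the `.txt` identifier lists from that directory and leave any other files out of the upload.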

2. Select annotation(s)

Select the GO annotation(s) you wish to use. You may also upload an annotation file in GAF 1 or GAF 2 format.

Performance of this tool is greatly affected by optimizations based on the naming schemes within the annotations. Some annotations may not currently be optimized. There is no optimization for uploaded annotations.

For information about the annotations themselves, please refer to the annotation files table. The table contains information about the annotations used by our various GO tools, and not all are yet available to this tool.

A note regarding GOA annotations: The annotations provided by the GOA tend to have a lot of ambiguity in the secondary IDs (synonyms). For example, CAT is a synonym for both P04040 and Q6IB77. The best way around this is to use only UniProt IDs (P04040 and Q6IB77 in this example), since they are primary identifiers in the annotation and are therefore always unique. Other ways to reduce ambiguity are to filter out evidence codes (especially IEA) or to provide your own annotation, perhaps reduced to only your aspect of interest or with redundant identifiers removed. If we can come up with an option for dealing with this within LAGO, we will make it available.
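
To see how much ambiguity an annotation file contains before using it, the secondary names can be tallied against the primary IDs. This is a rough sketch, not LAGO's own logic: it assumes the standard GAF column layout (DB Object ID in column 2, DB Object Symbol in column 3, pipe-separated DB Object Synonym in column 11) and treats both the symbol and the synonyms as secondary names. The function name `ambiguous_synonyms` is hypothetical.

```python
from collections import defaultdict


def ambiguous_synonyms(gaf_lines):
    """Return {secondary_name: set_of_primary_ids} for names shared by 2+ primary IDs."""
    owners = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # GAF header and comment lines start with "!"
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # not a complete annotation row
        primary_id = cols[1]
        # Assumption: symbol and synonyms are both usable as secondary names.
        names = [cols[2]] + [s for s in cols[10].split("|") if s]
        for name in names:
            owners[name].add(primary_id)
    return {name: ids for name, ids in owners.items() if len(ids) > 1}
```

Any name returned here is one that cannot be resolved to a single object, so replacing it in the query list with the corresponding primary ID avoids the ambiguity.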

3. Select ontology

Choose either the complete GO ontology or the generic slim GO ontology. You may also upload your own ontology in OBO format.

4. Aspect

Select the aspect you wish to query. You may also query all aspects at once. Note that using multiple aspects will not affect the uncorrected p-values; however, it will affect the correction and therefore the corrected p-values.

5. P-value options

This tool can compute p-values based on the hypergeometric distribution. The Bonferroni correction method is then applied, unless it is deselected. Both the corrected and uncorrected p-values are reported if correction is applied.

A cutoff can be specified for p-values. Terms with p-values higher than the cutoff are not reported; however, they are still counted towards the Bonferroni correction.
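
The calculation described in this section can be sketched as follows. This is an illustration under stated assumptions, not LAGO's C implementation: m and N follow the usage in the background settings section, while k (query identifiers annotated to the term) and n (query size) are names introduced here. It also assumes that the p-value is the upper tail P(X >= k), that the number of tests is simply the number of terms passed in, and that the cutoff is compared against the corrected p-value.

```python
from math import comb


def hypergeom_upper_tail(k, n, m, N):
    """Probability of seeing at least k annotated identifiers in a query of size n,
    drawn without replacement from a background of N identifiers, m of them annotated."""
    total = comb(N, n)
    upper = min(n, m)
    # comb() returns 0 when the lower index exceeds the upper, so impossible
    # combinations contribute nothing to the sum.
    return sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1)) / total


def bonferroni(p_values, cutoff=None):
    """Multiply each p-value by the number of tests, capped at 1.0.

    The test count is fixed before the cutoff is applied, so terms that are
    filtered out still count towards the correction.
    """
    tests = len(p_values)
    corrected = {term: min(1.0, p * tests) for term, p in p_values.items()}
    if cutoff is None:
        return corrected
    return {term: p for term, p in corrected.items() if p <= cutoff}
```

Note that the number of tests is taken from the full set of terms, which is why lowering the cutoff shortens the report without changing any corrected p-value.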

6. Graphing options

The tool can produce a directed acyclic graph (DAG) showing the tree of ontology terms, from the most significant terms (determined by the p-value and the p-value cutoff, if any) upwards to the root(s) of the aspect(s). If the cutoff is 1.0 or is not applied (which are internally equivalent), the tree begins at the lowest level terms to which identifiers are annotated within the whole aspect(s).

Nodes within the graph may be colored according to the (corrected) p-value computed for the term. Also, boxes containing query identifiers may be attached to their lowest-level terms. Either of these options may be selected as desired. Identifier boxes can become a problem when there are a large number of them, so this is deselected by default.

Edges in the graph are colored according to the type of link within the ontology. The coloring is as follows:

black	is_a
blue	part_of
orange	regulates
green	positively_regulates
red	negatively_regulates

For an explanation of these terms, see here. For examples of graphs, see the Output section below.

7. Evidence codes

Annotations can be filtered out based on evidence code. That is, annotations with selected codes will be ignored. This happens when the annotation is loaded, so it can indirectly affect the statistical background (see Optional background settings below).
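
The same kind of filtering can be applied to an annotation file before uploading it. The sketch below is an assumption-laden illustration rather than LAGO's loader: it relies on the standard GAF layout, in which the evidence code is column 7, and the function name `drop_evidence_codes` is hypothetical.

```python
def drop_evidence_codes(gaf_lines, excluded):
    """Yield GAF lines, skipping annotation rows whose evidence code is in `excluded`."""
    excluded = set(excluded)
    for line in gaf_lines:
        if line.startswith("!"):
            yield line  # keep header and comment lines untouched
            continue
        cols = line.split("\t")
        # Column 7 (index 6) holds the evidence code in GAF 1 and GAF 2.
        if len(cols) > 6 and cols[6] in excluded:
            continue
        yield line
```

An identifier whose only annotations carry an excluded code disappears from the filtered file entirely, which is how the filter can change the default background.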

You may wish to view the Guide to GO Evidence Codes for information about standard usage of evidence codes.

8. Optional background settings

Calculation of p-values based on the hypergeometric distribution requires a background. By default, the background is derived from the annotation itself. All unique identifiers are mapped to all their terms, from the lowest-level term (the direct annotation) up to the root (all indirect annotations). The number of identifiers mapped in this way to each term serves as the background for the term (corresponding to m in the hypergeometric distribution description).
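
The default background count can be sketched as a walk up the ontology from each direct annotation. The code below is a simplified illustration, not LAGO's data structures: `parents` and `direct` are hypothetical dictionaries, all relationship types are treated alike, and each identifier is counted at most once per term.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def background_counts(direct, parents):
    """Return {term: m}, the number of unique identifiers annotated to each term
    either directly or through one of its descendants."""
    members = {}
    for identifier, terms in direct.items():
        reached = set()
        for term in terms:
            reached |= ancestors(term, parents)
        for term in reached:
            members.setdefault(term, set()).add(identifier)
    return {term: len(ids) for term, ids in members.items()}


if __name__ == "__main__":
    # Toy ontology with made-up term names: C and D are children of B.
    parents = {"C": ["B"], "D": ["B", "root"], "B": ["root"]}
    direct = {"g1": ["C"], "g2": ["D"], "g3": ["C", "D"]}
    print(background_counts(direct, parents))
```

Using sets per term matters because an identifier annotated to two children of the same parent must still contribute only one to the parent's count.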

The background size (N in the hypergeometric distribution description) is usually interpreted as the size of the genome. For some known organisms, therefore, the size of the background is automatically adjusted. To see which organisms are affected and the background size selected, please refer to the annotation files table, comparing the Total Annotated Gene Products column and the Total Estimated Gene Products column.

Because this method of background selection isn't always ideal, the tool provides two options:

The overall size of the background (N) can be adjusted. Although this doesn't affect the number of background identifiers annotated to each term, it can be used to adjust significance based upon the size of the entire genome, for example.
The background list of identifiers (defining the background population) can be provided explicitly by uploading the identifier list. As with the query list, this should be a file consisting of identifiers, one per line, without any spaces. It is acceptable for this list to contain identifiers that do not appear in the annotation. In this case, those identifiers will be added to the background unannotated node but will still be counted towards the size of the background population (N).

Note that all identifiers within the query list should generally also be present within the background list (or be synonyms). That is, objects represented by the query list should also be present in the background. Identifiers that occur within the query list but not within the background are termed "discarded". See the section on Identifier classifications below.

Please refer to the GO ID classification flow diagram (follow the green lines) for more detail on how background objects are counted.

Output

An example of the output for one set of results is available here.

Each query identifier list and annotation produces an independent set of results and each set of results has its own row in the summary table.

Identifier classifications

Identifiers in the query list are separated into different categories, and a link to each list of identifiers is presented.

For those not within one of the above lists:
duplicated	Exactly duplicated within the input list
synonym	Not exactly duplicated but refers to the same unique object
ambiguous	Found within the annotation but ambiguous
discarded	When custom background list is supplied, not found within that list
annotated	Found within the annotation and within the aspect
unannotated	Annotated but not within the aspect; or, for a custom background, unannotated but included in the background (does not count those that are unannotated and are not in the custom background)
unknown	Not found within the annotation in any aspect

See the GO ID classification flow diagram for more detail.
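
As a rough guide to how the categories relate, the sketch below sorts a query list into a subset of them. It is a simplification and an assumption throughout: the order of the checks is guessed from the descriptions above, the synonym and ambiguous categories are omitted because they depend on the annotation's naming scheme, and the function and parameter names are hypothetical. The flow diagram remains the authoritative description.

```python
def classify(query, annotated_in_aspect, annotated_any_aspect, background=None):
    """Group query identifiers by category name.

    annotated_in_aspect:  identifiers annotated within the selected aspect
    annotated_any_aspect: identifiers annotated in any aspect
    background:           custom background identifiers, or None if not supplied
    """
    result = {}
    seen = set()
    for identifier in query:
        if identifier in seen:
            label = "duplicated"
        elif background is not None and identifier not in background:
            label = "discarded"
        elif identifier in annotated_in_aspect:
            label = "annotated"
        elif identifier in annotated_any_aspect:
            label = "unannotated"  # annotated, but not within the aspect
        elif background is not None:
            label = "unannotated"  # unannotated but in the custom background
        else:
            label = "unknown"
        seen.add(identifier)
        result.setdefault(label, []).append(identifier)
    return result
```

Running the same query with and without a custom background shows how an identifier can move from unknown to either unannotated or discarded.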

Number of terms

The number of terms that are reported. This number may include the unannotated node (see Sundry Details below) if that is significant.

Minimum p-value

The minimum p-value computed. This is especially useful to indicate whether the cutoff is too low.

Analysis results

HTML - This is an HTML table showing the GO ID, term, corrected (if applicable) and uncorrected p-values, the number of annotated objects in the query, the number of annotated objects in the background, and the annotated genes.

DAG - A directed acyclic graph of the ontology terms and, if selected, the query identifiers annotated. If selected, the term nodes are colored according to the magnitude of their p-value.

Tab-delimited text - A tab-delimited text table containing similar information to the HTML table, and perhaps more. Since this file is intended to be machine-readable for use with other applications, it is likely to contain other miscellaneous information and not be quite as human-friendly as the HTML table.

Sundry Details

Unannotated node

When an identifier in the query list cannot be found in the annotation, it is added to the unannotated node. Likewise, when a background list is provided and includes an identifier that is not found in the annotation, the background count for the unannotated node is incremented. This node is included in p-value calculations, which helps indicate whether unknown identifiers are potentially significant within your query list.

When p-value correction is applied, the unannotated node is counted if the number of unannotated identifiers is at least 2 more than the number of unknown identifiers in the query. This is for compatibility with GOTermFinder, and in the future there may be an option to change this behaviour.

Links

genomics.princeton.edu	Lewis-Sigler Institute for Integrative Genomics
go.princeton.edu, annotation files table	GO tools at Princeton LSI
www.geneontology.org, GAF 1 or GAF 2 format, OBO format, GO relations, GO evidence codes	Gene Ontology (GO) Project
GO-TermFinder	GO perl module from Gavin Sherlock
hypergeometric distribution, Bonferroni correction	relevant pages at Wolfram