Introduction

LAGO finds significant GO terms among a list of identifiers (e.g., gene names), helping you discover what they have in common in the context of a given annotation, a background population, and the GO ontologies.

In broad terms, it does this by mapping the provided identifier list, via the annotation, to terms in the ontology. It maps a background population in the same way. It then calculates the significance (p-value) of each term via the hypergeometric distribution and applies (by default) Bonferroni correction. Finally, it provides tables and graphs of the significant terms, omitting terms with a p-value higher than the cutoff, if one is provided. P-value computation and correction may be turned off if desired. Turning off p-value computation is equivalent to setting a cutoff of 1.0.
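The significance calculation described above can be sketched as follows. This is a minimal illustration and not LAGO's C implementation. N and m follow the notation used under "Optional background settings"; the names k (query identifiers annotated to the term), n (query size), and the function name are assumptions made for this example. It assumes the usual enrichment test, i.e. the upper tail of the hypergeometric distribution.

```python
from math import comb  # Python 3.8+


def hypergeom_pvalue(k, n, m, N):
    """Probability of seeing k or more annotated identifiers in the query.

    N: size of the background population
    m: background identifiers annotated to the term
    n: size of the query list (hypothetical name)
    k: query identifiers annotated to the term (hypothetical name)
    """
    upper = min(n, m)  # the query cannot contain more hits than this
    # Sum the hypergeometric probabilities for k, k+1, ..., upper hits.
    tail = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Background of 10 identifiers, 3 annotated to the term;
# query of 4 identifiers, 2 of them annotated: (63 + 7) / 210 = 1/3.
p = hypergeom_pvalue(k=2, n=4, m=3, N=10)
```

Exact integer arithmetic is used until the final division, which keeps the sketch simple but makes it slow for genome-sized inputs.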

This tool is based largely on GOTermFinder, developed here at the Lewis-Sigler Institute at Princeton University and made available as part of the GMOD project, and on the GO-TermFinder perl module by Gavin Sherlock. However, it is a complete rework of both in the C programming language, with many bug fixes, more efficient data structures, and optimizations. The result is a tool that is similar in functionality (with some improvements), is 50 times faster (or more) for common queries, and uses as little as 1/20 (or less) of the system memory. This allows more interactive analysis and much greater scalability.

For more information about GO::TermFinder and the method for calculating statistical significance, please see Boyle et al., Bioinformatics (2004). For publication, please cite that document and this tool (including the URL).

Form Options
1. Identifier list

The identifier list should contain one identifier per line, with no spaces. For example:

SPBC16C6.08c
SPBC16H5.06
SPBC29A3.18
SPCC1682.01
SPCC191.07
SPCC613.10
SPMIT.05
qcr10
qcr7
qcr8

For annotation-specific examples, please refer to the annotation files table.
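Before uploading, a list can be checked against the format described above with a few lines of Python. This is only a sketch: the function name is hypothetical, and skipping empty lines is an assumption, since this page does not say how LAGO treats them.

```python
def check_identifier_list(text):
    """Split text into identifiers and report lines that contain spaces.

    Returns (identifiers, problems), where problems is a list of
    (line_number, line) pairs for lines containing whitespace.
    """
    identifiers = []
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # assumption: empty lines are ignored
        if any(ch.isspace() for ch in line):
            problems.append((number, line))
        else:
            identifiers.append(line)
    return identifiers, problems


ids, bad = check_identifier_list("SPBC16C6.08c\nqcr10\nqcr 7\n")
# ids holds the two well-formed identifiers; bad flags line 3.
```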

Batch processing

If the uploaded file name ends in ".tar", then it is assumed to be a tar archive containing individual files that each contain a separate list of identifiers. Each file is then analyzed independently.

You can create a tar file on unix (Linux, MacOS X, etc.) using a command like one of the following:

tar -cf batchList.tar orfs1.txt orfs2.txt ...
tar -cf batchList.tar *.txt
tar -cf batchList.tar clusterORFs/

Please note that all files within the tar archive will be processed. If it contains other types of files, this can significantly impact the performance of this tool, and the behaviour regarding those files is undefined.
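To avoid packing stray files into the archive, the tar file can also be built with a short script. The sketch below uses Python's standard tarfile module and adds only regular files ending in ".txt" from one directory; the function name and the ".txt" convention are assumptions taken from the commands above, not requirements of the tool.

```python
import tarfile
from pathlib import Path


def build_batch_tar(list_dir, tar_path):
    """Write an uncompressed tar holding only the *.txt files in list_dir.

    Returns the names of the files that were added.
    """
    added = []
    # Mode "w" writes a plain (uncompressed) tar archive.
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(Path(list_dir).glob("*.txt")):
            if path.is_file():  # skip directories named like *.txt
                # Store each file at the top level of the archive.
                tar.add(path, arcname=path.name)
                added.append(path.name)
    return added
```

For example, build_batch_tar("clusterORFs", "batchList.tar") would pack the identifier lists in clusterORFs/ and leave any other files out.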

2. Select annotation(s)

Select the GO annotation(s) you wish to use. You may also upload an annotation file in GAF 1 or GAF 2 format.

Performance of this tool is greatly affected by optimizations based on the naming schemes within the annotations. Some annotations may not currently be optimized. There is no optimization for uploaded annotations.

For information about the annotations themselves, please refer to the annotation files table. The table contains information about the annotations used by our various GO tools, and not all are yet available to this tool.

A note regarding GOA annotations: The annotations provided by the GOA tend to have a lot of ambiguity in the secondary IDs (synonyms). For example, CAT is a synonym for both P04040 and Q6IB77. The best way around this is to use only UniProt IDs (P04040 and Q6IB77 in this example), since they are primary identifiers in the annotation and are therefore always unique. Other ways to reduce ambiguity are to filter out evidence codes (especially IEA) or to provide your own annotation, perhaps reduced to only your aspect of interest or with redundant identifiers removed. If we can come up with an option for dealing with this within LAGO, we will make it available.
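One way to see how much ambiguity an annotation file contains is to list every name that points to more than one primary ID. The sketch below reads GAF lines, which are tab-separated with the DB Object ID in column 2, the symbol in column 3, the evidence code in column 7, and pipe-separated synonyms in column 11. The function name is hypothetical, and treating the symbol as just another secondary name is an assumption made for this example.

```python
from collections import defaultdict


def ambiguous_synonyms(gaf_lines, skip_evidence=("IEA",)):
    """Map each symbol or synonym to the primary IDs that use it.

    Only names shared by two or more primary IDs are returned.
    Rows whose evidence code is in skip_evidence are ignored.
    """
    owners = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # comment/header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # not a complete annotation row
        object_id, symbol, evidence, synonyms = cols[1], cols[2], cols[6], cols[10]
        if evidence in skip_evidence:
            continue
        for name in [symbol] + synonyms.split("|"):
            if name:
                owners[name].add(object_id)
    return {name: ids for name, ids in owners.items() if len(ids) > 1}
```

Called on an open GAF file, this returns a dictionary such as {"CAT": {"P04040", "Q6IB77"}} for the case described above; any name that appears in the result is unsafe to use in a query list.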

4. Aspect

Select the aspect you wish to query. You may also query all aspects at once. Note that using multiple aspects will not affect the uncorrected p-values; however, it will affect the correction and therefore the corrected p-values.

5. P-value options

This tool can compute p-values based on the hypergeometric distribution. The Bonferroni correction method is then applied, unless it is deselected. Both the corrected and uncorrected p-values are reported if correction is applied.

A cutoff can be specified for p-values. Terms with p-values higher than the cutoff are not reported; however, they are still counted towards the Bonferroni correction.
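The interaction between the cutoff and the correction can be sketched as follows. Bonferroni correction multiplies each p-value by the number of tests (capped at 1.0), and the number of tests here includes terms that the cutoff later hides. The function name and the term names are hypothetical, and applying the cutoff to the corrected p-value is an assumption; this page does not say which value LAGO compares against the cutoff.

```python
def bonferroni_report(pvalues, cutoff=None):
    """Correct p-values, then drop terms above the cutoff.

    pvalues: dict mapping term -> uncorrected p-value
    Returns a list of (term, uncorrected, corrected), most significant first.
    """
    # Every tested term counts, whether or not it is reported.
    num_tests = len(pvalues)
    rows = []
    for term, p in pvalues.items():
        corrected = min(1.0, p * num_tests)
        # Assumption: the cutoff is compared with the corrected p-value.
        if cutoff is None or corrected <= cutoff:
            rows.append((term, p, corrected))
    return sorted(rows, key=lambda row: row[1])


tested = {"term_a": 0.001, "term_b": 0.02, "term_c": 0.5}
reported = bonferroni_report(tested, cutoff=0.05)
# Only term_a is reported, with a corrected p-value of about 0.003.
```

The same arithmetic explains why querying all aspects at once changes the corrected p-values: more terms are tested, so the multiplier grows.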

8. Optional background settings

Calculation of p-values based on the hypergeometric distribution requires a background. By default, the background is derived from the annotation itself. All unique identifiers are mapped to all their terms, from the lowest-level term (the direct annotation) up to the root (all indirect annotations). The number of identifiers thus mapped to each term serves as the background for the term (corresponding to m in the hypergeometric distribution description).
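The mapping from direct annotations up to the root can be sketched with a toy ontology. This is an illustration of the idea only, not LAGO's data structures; the function names, the term names, and the dictionary layout (each term mapped to a list of its parents) are all assumptions made for the example.

```python
def ancestors_inclusive(term, parents):
    """Return the term together with every term reachable via its parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def background_counts(direct, parents):
    """Count, per term, the identifiers annotated to it directly or indirectly.

    direct: dict mapping identifier -> set of directly annotated terms
    parents: dict mapping term -> list of parent terms
    """
    counts = {}
    for terms in direct.values():
        reached = set()
        for term in terms:
            reached |= ancestors_inclusive(term, parents)
        # Using a set means an identifier is counted once per term,
        # even if several paths lead to that term.
        for term in reached:
            counts[term] = counts.get(term, 0) + 1
    return counts


parents = {"leaf1": ["mid"], "leaf2": ["mid"], "mid": ["root"]}
direct = {"geneA": {"leaf1"}, "geneB": {"leaf1", "leaf2"}, "geneC": {"mid"}}
m_by_term = background_counts(direct, parents)
# leaf1 is reached by 2 identifiers, leaf2 by 1, mid and root by all 3.
```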

The background size (N in the hypergeometric distribution description) is usually interpreted as the size of the genome. For some known organisms, therefore, the size of the background is automatically adjusted. To see which organisms are affected and the background size selected, please refer to the annotation files table, comparing the Total Annotated Gene Products column and the Total Estimated Gene Products column.

Because this method of background selection isn't always ideal, the tool provides two options:

  • The overall size of the background (N) can be adjusted. Although this doesn't affect the number of background identifiers annotated to each term, it can be used to adjust significance based upon the size of the entire genome, for example.
  • The background list of identifiers (defining the background population) can be provided explicitly by uploading the identifier list. As with the query list, this should be a file consisting of identifiers, one per line, without any spaces. It is acceptable for this list to contain identifiers that do not appear in the annotation. In this case, those identifiers will be added to the background unannotated node but will still be counted towards the size of the background population (N).

Note that all identifiers within the query list should generally also be present within the background list (or be synonyms). That is, objects represented by the query list should also be present in the background. Identifiers that occur within the query list but not within the background are termed "discarded". See the section on Identifier classifications below.
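The relationship between the query list, an uploaded background list, and the annotation can be sketched with plain sets. This is a simplified illustration: it ignores synonyms, the function and key names are hypothetical, and the authoritative rules are those in the GO ID classification flow diagram.

```python
def classify(query, background, annotated):
    """Sort query identifiers into discarded, annotated, and unannotated.

    query: identifiers submitted for analysis
    background: identifiers in the uploaded background list
    annotated: identifiers that appear in the annotation
    """
    query, background, annotated = set(query), set(background), set(annotated)
    kept = query & background
    return {
        # In the query but missing from the background.
        "discarded": query - background,
        "annotated": kept & annotated,
        # Known to the background but absent from the annotation.
        "unannotated": kept - annotated,
        # Unannotated background identifiers still count towards N.
        "background_size": len(background),
        "background_unannotated": len(background - annotated),
    }


result = classify(
    query=["qcr10", "qcr7", "novel1", "stray1"],
    background=["qcr10", "qcr7", "qcr8", "novel1", "novel2"],
    annotated=["qcr10", "qcr7", "qcr8"],
)
# stray1 is discarded; novel1 lands in the unannotated node; N is 5.
```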

Please refer to the GO ID classification flow diagram (follow the green lines) for more detail on how background objects are counted.

Output

An example of the output for one set of results is available here.

Each query identifier list and annotation produces an independent set of results and each set of results has its own row in the summary table.

Number of terms

The number of terms that are reported. This number may include the unannotated node (see Sundry Details below) if that is significant.

Minimum p-value

The minimum p-value computed. This is especially useful to indicate whether the cutoff is too low.

Analysis results

HTML - This is an HTML table showing the GO ID, term, corrected (if applicable) and uncorrected p-values, the number of annotated objects in the query, the number of annotated objects in the background, and the annotated genes.

DAG - A directed acyclic graph of the ontology terms and, if selected, the query identifiers annotated. If selected, the term nodes are colored according to the magnitude of their p-value.

Tab-delimited text - A tab-delimited text table containing similar information to the HTML table, and perhaps more. Since this file is intended to be machine-readable for use with other applications, it is likely to contain other miscellaneous information and not be quite as human-friendly as the HTML table.

Sundry Details

Unannotated node

When an identifier in the query list cannot be found in the annotation, it is added to the unannotated node. Likewise, when a background list is provided and includes an identifier that is not found in the annotation, the background count for the unannotated node is incremented. This node is included in p-value calculations, which helps indicate whether unknown identifiers are potentially significant within your query list.

When p-value correction is applied, the unannotated node is counted if the number of unannotated identifiers is at least 2 more than the number of unknown identifiers in the query. This is for compatibility with GOTermFinder; in the future there may be an option to change this behaviour.

Links
genomics.princeton.edu Lewis-Sigler Institute for Integrative Genomics
go.princeton.edu
annotation files table
GO tools at Princeton LSI
www.geneontology.org
GAF 1 or GAF 2 format
OBO format
GO relations
GO evidence codes
Gene Ontology (GO) Project
GO-TermFinder perl module from Gavin Sherlock
hypergeometric distribution
Bonferroni correction
relevant pages at Wolfram