CAMERA[1] participants have sequenced random DNA fragments extracted from dozens of ocean water samples, from the Seychelles to New Caledonia and the Sargasso Sea. This has produced over 1.045 giganucleotides of microbial DNA sequence, submitted to the public sequence databanks without any annotation other than geophysical properties of the water sample (e.g. GPS position, sample depth). Bioinformatics provides the tools of choice to observe biodiversity at this molecular scale!
Your mission, should you accept it, is to attempt to identify the microbial origin of these sequences (archaea, protists, algae, viruses?), as well as to determine the putative functions of the coding sequences they contain. Some sequences look rather familiar, whilst others are totally novel or very strange indeed!
More background on the "Global Ocean Sampling" expedition can be found in the Open Access PLoS special issue.
[1] CAMERA stands for Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis
Your team will collectively annotate distinct DNA fragments randomly distributed from the available public sequence pool. Each registered annotator will be responsible for annotating a specific set of sequence fragments. For each fragment, annotators will produce a full report specifying if it is likely to be coding, the putative function of the protein product, as well as the most likely taxonomic classification of the host organism.
By practicing the core bioinformatics tools on a number of distinct sequences, each one a live piece of experimental data, you will become familiar with running and interpreting fundamental sequence analyses. Experience shows that after two or three distinct analyses, the focus shifts from bioinformatics to biological issues! All tools are available online, so all you need to start is a web browser.

User Guide
You can access the Annotathon from any computer connected to the Internet, irrespective of operating system (Mac, Windows or Linux).
We recommend that you simultaneously open the following pages (in different windows or tabs) in your browser:
If you don't have an Annotathon account (i.e. it is your first session), click on the "New account" tab in the permanent menu at the top of the Annotathon pages. Follow the instructions to open a new account; make sure you select the appropriate affiliation or you might end up being supervised inside another team, possibly using a different language. If you don't know your Team code, please ask your instructor for it. You are required to enter at least one first name/last name pair, and one email address in order to receive Annotathon-specific notifications. Your email address is secure and will under no circumstances be made public or passed on to any third party. Only low-traffic messages specific to your course will be mailed to this address; no further messages will be sent after the course is completed.
Finally, a click on "Open the account" should be followed by the message "Account 'XYZ' has been created". Use your 'username' and 'password' and click "Connect" in the form at the top of the page to open an Annotathon session. You will be reminded that your email address is not validated until you have followed the special link included in an email automatically sent to you at account creation.
The home page (also available by clicking the "Home" tab) gives an overview of the team's annotation progress. Note that once connected you will be able to locate your position in the team at the bottom of the stats page (after your annotations start being evaluated).
You can only add new sequence fragments to your cart when it is empty, or when you have already annotated all available sequences. Add new sequence fragments at your discretion (or until you reach the upper limit set by your supervisor).
Click the icon opposite the sequence you wish to view in your cart. The initial annotation is minimal: apart from the sequence itself and its geographic origin, each sequence fragment only has a unique Annotathon accession number. The remaining annotation is your responsibility!
Click the icon opposite the fragment you wish to edit. After having modified any annotations, remember to save your work on the Annotathon server by clicking the "Save your annotations" button! Should you leave this editing form without submitting, all modifications since the last save will be lost. Since you can submit your work as often as you wish, it is recommended that you save your work regularly.
When your annotations are completed, click the icon opposite the fragment. Its status will then shift from 'Annotation 1' to 'Evaluation 1' and it will be closed for editing until your work has been evaluated. After this initial evaluation, the fragment status shifts from 'Evaluation 1' to 'Annotation 2'; you are then invited to update your initial annotations following the evaluator's comments. When your second annotation pass is completed, click the icon to submit your annotation for the second and final round of evaluation.
The "Forum" tab opens access to the Annotathon internal forum (an icon signals that a new unread message has been posted to the forum). Click on a message subject to see its content. If you wish to reply to a message, use the form immediately under it and click "Post message". IMPORTANT! Only use this method to DIRECTLY REPLY TO THE CONTENT OF A SPECIFIC MESSAGE!
If you wish to open a new discussion thread, you MUST use the special new thread form available at the top of each of your annotation records (via the icon in your cart)! You can then select the appropriate forum for your new thread (e.g. Searching for homologues: BLAST). A link to your specific sequence fragment and associated annotations will automatically be included with your post. Note that the messages you post are also emailed directly to your supervisors and fellow annotators.
Although messages are often answered by supervisors, trainees who wish to help by answering fellow trainees' questions are strongly encouraged to do so. Constructive replies will be taken into account in trainee evaluations.
Announcements from your supervisors will be displayed at the top of Annotathon pages. Once read, tick the 'Read' box to transfer the message to your archive (available at the bottom of the Forum tab).

Sequence Annotations
IMPORTANT: for type 1 fields (raw results), boxes are initially filled with a standard template of the form:
PROTOCOL:
---------------------------------------------------------------------------------------------------
RESULTS ANALYSIS:
---------------------------------------------------------------------------------------------------
RAW RESULTS:
Under the "PROTOCOL" heading, specify the minimum information necessary to reproduce the exact same results. Usually, this would entail giving the name of the tool used, together with its run parameters. For instance, for the ORF finding results field, the protocol line could read:
PROTOCOL: SMS ORFinder / direct strand / frames 1, 2 & 3 / min 60 AA / 'any codon' initiation / 'universal' genetic code
Copy & paste the raw results of the analysis, in extenso, under the "RAW RESULTS" heading. If you have carried out more than one analysis (for instance two SMS ORFinder runs, one on the forward and one on the reverse strand), then reference the two analyses using an index exactly as follows:
PROTOCOL:
a) SMS ORFinder / forward strand / frames 1, 2 & 3 / min 60 AA / 'any codon' initiation / 'universal' genetic code
b) SMS ORFinder / reverse strand / frames 1, 2 & 3 / min 60 AA / 'any codon' initiation / 'universal' genetic code
---------------------------------------------------------------------------------------------------
RESULTS ANALYSIS:
[enter your observations here]
---------------------------------------------------------------------------------------------------
RAW RESULTS:
a) forward strand
>ORF number 1 in reading frame 1 on the direct strand extends from base 511 to base 744.
CGAGTGATAACTGGTCCAGTAATCGCGATACCGATCATCTTGTTGCGGATTGACGATGTT
AAAATCCCGATCAGGGCGGATATCCAGCCCCAGCCTTTCACAACGTTGCTGAATCACTTC
GGGGCGGCCTATGACGATGGGAACTTCGCTGGTTTCTTCCAAAACGGCCTGAGCGGCGCG
CAGCACCCGCTCGTCTTCGCCCTCGGCAAACACAATCCGTCGAGCGCTGCTTGA
>Translation of ORF number 1 in reading frame 1 on the direct strand.
RVITGPVIAIPIILLRIDDVKIPIRADIQPQPFTTLLNHFGAAYDDGNFAGFFQNGLSGA
QHPLVFALGKHNPSSAA*
---------------------------------------------------------------------------------------------------
b) reverse strand
>ORF number 1 in reading frame 1 on the reverse strand extends from base 517 to base 855.
CCTGATCTGTGGCGCTGTGGGCGAATTCAGATGGCATCTGAATTATATCGAGCAAATTTT
AGGCAGCAAAACCTTATCGCCAAGCGGCGCGCTGTCTTTGATGATTTTAGAAGACGGGCC
TCTGTTCATCGCAGACACCCACGTCTGGGCGGATCCCACCCCCATGCAAATTGCCCAAAC
CGCCAAAGGGGCCGCGCGCCATGTGCGCCGTTTTGGCATAGAGCCACAAGTCGCGCTGTG
CTCGCAATCACAATTTGGAAATCTGAACAGCGAGACTGGCAAGAAAATGCGCCAAGCATT
GGATATTCTCGATACCGAAAAGGTGACGTTTACCTATGA
>Translation of ORF number 1 in reading frame 1 on the reverse strand.
PDLWRCGRIQMASELYRANFRQQNLIAKRRAVFDDFRRRASVHRRHPRLGGSHPHANCPN
RQRGRAPCAPFWHRATSRAVLAITIWKSEQRDWQENAPSIGYSRYRKGDVYL*
Finally, use the "RESULTS ANALYSIS" section of the type 1 fields to set out your observations of the raw results. Results analysis, a pivotal part of scientific discourse, answers the question "what did we see that is notable when we carried out the experiment described in the protocol?". These rigorous factual observations, usually accompanied by precise numerical values (percentages, E-values, number of hits, number of amino acids etc.), are offered without far-reaching discussion. Keep the main discussion and interpretations for the "Conclusion" field.
Note: the last "Notepad" field at the bottom of the sequence editing form is available to store any data that isn't accommodated by other specific annotation fields. Use the Notepad to store data that can be useful for subsequent re-analyses (e.g. store your set of FASTA formatted homolog sequences here). The Notepad is your private space and is not consulted during evaluation.
Brief contextual help is available for each annotation field of the editing form by clicking the icons. The information expected for each annotation field is described below.
Remember that a Frequently Asked Questions page is available for in-depth explanations, tutorials and screenshots of each of the bioinformatics analyses needed to perform the sequence annotations.
Throughout your analyses, always keep in mind the two main focal points of your annotation, which consist of proposing:
The basic rule set below can be overridden by more specific or alternative rules given to you by your instructors. If in doubt, always consult your instructors.

ORF finding
The first investigation for each DNA fragment will involve the identification of putative Open Reading Frames (ORFs). There are many tools to tackle this issue, including the following:
Copy & paste the raw UNCENSORED ORF finding output in the 'ORF finding' field of the Annotathon editing form. Remember to conduct the analysis in all SIX frames, and do include the PROTOCOL line above each raw result.
If your sequence contains several ORFs, arbitrarily select the longest one for all subsequent analyses.
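As an illustration of what a six-frame scan with 'any codon' initiation does, here is a minimal Python sketch. It is not the code of SMS ORFinder or of any other tool mentioned in this guide: the function names (`find_orfs`, `longest_orf`) are hypothetical, ORFs are simply taken as stop-to-stop stretches in each of the six frames, and the standard ("universal") genetic code is assumed. Positions are reported on the scanned strand, which may differ from the convention used by a given online tool.

```python
# Standard ("universal") genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T; other letters are kept)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_aa=60):
    """Scan all six frames for stop-to-stop ORFs of at least min_aa residues.

    Returns tuples (strand, frame, start, end, protein). start and end are
    1-based positions on the scanned strand; end includes the STOP codon
    when the ORF is terminated by one.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            segments = translate(seq[frame:]).split("*")
            aa_start = 0  # index of the segment's first residue in the frame translation
            for n, seg in enumerate(segments):
                has_stop = n < len(segments) - 1  # every segment but the last ends on a STOP
                if len(seg) >= min_aa:
                    start = frame + 3 * aa_start + 1
                    end = start + 3 * (len(seg) + has_stop) - 1
                    orfs.append((strand, frame + 1, start, end,
                                 seg + ("*" if has_stop else "")))
                aa_start += len(seg) + 1  # skip over the STOP codon
    return orfs


def longest_orf(orfs):
    """Pick the ORF with the longest protein (STOP not counted), or None if there is none."""
    return max(orfs, key=lambda o: len(o[4].rstrip("*")), default=None)
```

Calling `longest_orf(find_orfs(sequence))` on a fragment mirrors the "select the longest one" rule above. The raw output pasted into the Annotathon form should still come from the tool named in your PROTOCOL line.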
You must decide if the best (longest) ORF is likely to be a true or a false positive. The key elements in support of a true positive ORF are:
- If the DNA fragment doesn't appear to contain any ORFs, tick the 'non-coding' box of the 'Status' field. The annotation of this fragment will be limited to populating the ORF finding and BLAST fields, as well as the conclusion of course!
- If the sequence appears to carry a true coding ORF, tick the 'coding' box of the 'Status' field. Indicate in the appropriate fields the start and end positions of the ORF, as well as the strand. Note that if the ORF is complete at the 3' end (i.e. finishes with a STOP), you need to subtract the 3 STOP codon nucleotides from the end position. Validate this ORF by clicking "Save annotations".
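The STOP codon adjustment can be checked with a little arithmetic. The sketch below uses the forward-strand ORF from the example output earlier in this guide (base 511 to base 744, translation ending in a STOP). The helper name `annotathon_end` is hypothetical, and the sketch assumes the ORF finder reports an end position that includes the STOP codon.

```python
def annotathon_end(orf_end, ends_with_stop):
    """Return the end position to enter in the form.

    Assumes orf_end is the last base reported by the ORF finder and that
    the reported ORF includes its STOP codon when ends_with_stop is True.
    """
    return orf_end - 3 if ends_with_stop else orf_end


start = 511
end = annotathon_end(744, ends_with_stop=True)
length_nt = end - start + 1
print(start, end, length_nt, length_nt // 3)  # 511 741 231 77
```

The 231 coding nucleotides correspond to the 77 residues shown before the `*` in the example translation.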
If the ORF satisfies the rules above, the translation will automatically be displayed; otherwise an error message will help you pinpoint the problem. The ORF can be incomplete, in which case simple informational messages to this effect are displayed.
[1] Indeed, the absence of homologs in public protein databases does not imply that a sequence is non-coding; it merely means that there is currently no known homolog. Other so-called ab initio approaches exist to identify true positive coding ORFs (for instance based on statistical codon usage biases), but these methods usually require organism-specific training sets of known genes or large chunks of genome sequence, and are hence not well suited to metagenome exploration.
Please refer to the Frequently Asked Questions for further details on ORF finding, in particular on the subtle issue of exact determination of the ORF start position.

Molecular weight
If the ORF is complete at both ends, compute its theoretical polypeptide molecular weight using for instance:
Only submit to the Annotathon domains that you have good reasons to believe are significant, that is to say:
Please refer to the Frequently Asked Questions for further details on running BLAST and most importantly on identifying conserved domains.

BLAST homolog search
Use BLAST to identify putative sequence homologs of your ORF in public sequence databases. You can find online BLAST services at:
Two approaches can be used to identify homologs of your sequence:
You should query the two following protein databases:
If homologs of your ORF exist, indicate the E-value threshold that, in your view, separates true positive homologs from false positive non-homologs.
Use the BLAST results (the lineage report is your friend here) to build two groups of homolog sequences which will serve, after multiple alignment, as a basis for phylogenetic tree reconstruction:
IMPORTANT: Remember that ALL sequences selected for inclusion in the study and external groups must be homologs of your ORF, i.e. their BLAST E-value must be below (more significant than) the E-value threshold determined above.
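The selection rule can be sketched in a few lines of Python. This assumes you have already collected your hits as (label, E-value) pairs; the function name `select_homologs`, the `weak_hit` entry and the threshold value are made up for illustration, and the other E-values are taken from the example report in this guide. A smaller E-value means a more significant hit, so hits are kept when their E-value is at or below the threshold.

```python
def select_homologs(hits, evalue_threshold):
    """Keep hits whose E-value is at or below the threshold, most significant first."""
    kept = [hit for hit in hits if hit[1] <= evalue_threshold]
    return sorted(kept, key=lambda hit: hit[1])


hits = [
    ("Linnocua", 8e-38),
    ("Cpelagibacter", 5e-89),
    ("weak_hit", 0.3),        # hypothetical non-homolog, above the threshold
    ("Tmaritima", 5e-22),
]

for label, evalue in select_homologs(hits, evalue_threshold=1e-10):
    print(label, evalue)
```

The threshold itself remains a judgment call that you must justify in your RESULTS ANALYSIS; the code only applies it consistently.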
IMPORTANT: Include under the RESULTS ANALYSIS heading of the Taxonomy report the COMPREHENSIVE list of all the sequences you have selected in the study and external groups: for each sequence, provide its accession number, the short name you have chosen for it (see below Multiple alignment of protein sequences), its BLAST E-value and score and its taxonomic group. For instance:
PROTOCOL:
BLASTp versus NR, NCBI default parameters apart from "Number of descriptions=500"
---------------------------------------------------------------------------------------------------
RESULTS ANALYSIS:
[describe here your analysis of the taxonomy report, followed by the selected list of homologs carried over to multiple sequence alignment:]
In group: proteobacteria
ref|ZP_01264926.1| Cpelagibacter 5e-89 Candidatus Pelagibacter ubique HTCC1002 (a-proteobacteria)
gb|AAI55631.1| Bcaryophylli 3e-79 Burkholderia caryophylli (b-proteobacteria)
gb|AAI55631.1| Gsulfurreducens 7e-59 Geobacter sulfurreducens (d-proteobacteria)
sp|Q8CXD9| Aaquariorum 2e-41 Aeromonas aquariorum (g-proteobacteria)
[...]
Out group: other bacteria (=non proteobacteria: Firmicutes, Cyanobacteria, Thermotogales)
ref|YP_249980.1| Linnocua 8e-38 Listeria innocua (firmicutes)
ref|ZP_01002095.1| Tmaritima 5e-22 Thermotoga maritima (thermotogales)
emb|CAD31286.1| Synechocystis 9e-21 Synechocystis sp. PCC 6803 (cyanobacteria)
[...]
---------------------------------------------------------------------------------------------------
RAW RESULTS:
cellular organisms
. Bacteria [bacteria]
. . Proteobacteria [proteobacteria]
. . . Sinorhizobium meliloti -------------------------------------------- 348 2 hits [a-proteobacteria] Sarcosine oxidase subunit alpha (Sarcosine oxidase subunit)
. . . Francisella novicida U112 ......................................... 80 1 hit [g-proteobacteria] Aminomethyltransferase (Glycine cleavage system T protein)
. . . Bdellovibrio bacteriovorus ........................................ 80 1 hit [d-proteobacteria] Aminomethyltransferase (Glycine cleavage system T protein)
[...]
Please refer to the Frequently Asked Questions for further details on running BLAST and most importantly on the sensitive issue of sequence selection for study and external groups.

Multiple sequence alignment
The aim of the multiple alignment is first to verify that the ORF integrates convincingly in its presumed homolog family: the alignment must hence present clear, well conserved regions. Secondly, the multiple alignment will serve as the basis for the phylogenetic tree inference: the alignment must therefore contain a sufficient number of mutations (informative positions) to allow the reconstruction of the evolutionary history! Take care not to include sequences that are too partial, as these can dramatically reduce the number of informative positions in the alignment.
It is common to have to reiterate the building of the multiple alignment many times, adding or taking away more or less divergent sequences, in order to finally obtain a satisfactory result.
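To compare successive versions of an alignment, it can help to count conserved and informative columns. The sketch below is only one possible way to do so: the function name `column_stats` is hypothetical, and "informative" is taken here to mean parsimony-informative (at least two different residues, each present in at least two sequences). This guide does not impose that particular definition.

```python
from collections import Counter


def column_stats(aligned):
    """Count conserved and parsimony-informative columns.

    aligned: dict mapping a sequence label to its aligned sequence
    (all of equal length, '-' for gaps).
    Returns (conserved, informative).
    """
    seqs = list(aligned.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    conserved = informative = 0
    for column in zip(*seqs):
        residues = [c for c in column if c != "-"]
        counts = Counter(residues)
        # Conserved: no gap and a single residue shared by every sequence.
        if len(residues) == len(column) and len(counts) == 1:
            conserved += 1
        # Informative: at least two residue types seen at least twice each.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return conserved, informative


toy = {"seqA": "MKTA", "seqB": "MKTA", "seqC": "MRSA", "seqD": "MR-A"}
print(column_stats(toy))  # (2, 1)
```

In the toy alignment, columns 1 and 4 are conserved and column 2 (K, K, R, R) is informative; column 3 contains a gap and a residue seen only once, so it counts as neither.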
IMPORTANT: before proceeding to the multiple alignment, insert a legible label directly in the FASTA header of each sequence in order to create useful labels for both the alignment and the phylogenetic tree. Collect FASTA formatted homolog sequences in the Notepad and insert sequence labels as follows:
FASTA sequence as produced by the NCBI (if left untouched, the sequence label will be a cryptic "gi|5581978"):
>gi|55819784|ref|YP_143054.1| serine protease inhibitor [Acanthamoeba mimivirus]
MDYSHKYIKYKKKYLSLRNKLDRENTPVIISRIEDNFSIDDKITQSNNNFTNNVFYNFDTSANIFSPMSL
TFSLALLQLAAGSETDKSLTKFLGYKYSLDDINYLFNIMNSSIMKLSNLLVVNNKYSINQEYRSMLNGIA
VIVQDDFITNKKLISQKVNEFVESETNAMIKNVINDSDIDNKSVFIMVNTIYFKANWKHKFPVDNTTKMR
FHRTQEDVVDMMYQVNSFNYYENKALQLIELPYNDEDYVMGIILPKVYNTDNVDYTINNV
FASTA sequence after insertion of a legible sequence label (the label is formed by the letters directly following the ">" sign up to the first space or up to 10 characters, whichever comes first):
>Amimivirus gi|55819784|ref|YP_143054.1| serine protease inhibitor [Acanthamoeba mimivirus]
MDYSHKYIKYKKKYLSLRNKLDRENTPVIISRIEDNFSIDDKITQSNNNFTNNVFYNFDTSANIFSPMSL
TFSLALLQLAAGSETDKSLTKFLGYKYSLDDINYLFNIMNSSIMKLSNLLVVNNKYSINQEYRSMLNGIA
VIVQDDFITNKKLISQKVNEFVESETNAMIKNVINDSDIDNKSVFIMVNTIYFKANWKHKFPVDNTTKMR
FHRTQEDVVDMMYQVNSFNYYENKALQLIELPYNDEDYVMGIILPKVYNTDNVDYTINNV
Choose an easily recognisable label, such as "Ecoli" for "Escherichia coli". It is crucial that your sequence labels are unique, or the following steps (multiple alignments and tree) will likely fail! If you have two "Ecoli" sequences, use for instance "Ecoli1" and "Ecoli2".
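Relabelling can be done by hand in the Notepad. For larger sets, a short script along the following lines may help. It is a sketch under stated assumptions: the function name `relabel_fasta` is hypothetical, labels are supplied in the same order as the records, and uniqueness is checked on the first 10 characters, following the rule given above for how labels are read.

```python
def relabel_fasta(fasta_text, labels):
    """Insert one short label after the '>' of each FASTA header, in record order.

    Raises ValueError if a label is empty or contains a space, if two labels
    share the same first 10 characters, or if the number of labels does not
    match the number of records.
    """
    if any(not label or " " in label for label in labels):
        raise ValueError("labels must be non-empty and contain no spaces")
    truncated = [label[:10] for label in labels]
    if len(set(truncated)) != len(truncated):
        raise ValueError("labels must be unique within their first 10 characters")
    out, n = [], 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if n >= len(labels):
                raise ValueError("more FASTA records than labels")
            out.append(">" + labels[n] + " " + line[1:])
            n += 1
        else:
            out.append(line)
    if n != len(labels):
        raise ValueError("more labels than FASTA records")
    return "\n".join(out)


record = (">gi|55819784|ref|YP_143054.1| serine protease inhibitor "
          "[Acanthamoeba mimivirus]\nMDYSHKYIKYKKKYLSLRNKLDRENTPVIISRIEDNFSIDD\n")
print(relabel_fasta(record, ["Amimivirus"]))
```

Rejecting duplicate labels up front avoids the alignment and tree failures mentioned above.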
Build a multiple alignment (including all the in and out group sequences, as well as your ORF, naturally) using an online version of one of the following programs: ClustalW (widely used), MUSCLE (fast and a little more efficient) or T-COFFEE (slower but highly robust method with very useful colored conserved alignment blocks). These methods are available on the web site of:
The limitation in the number of sequences to align is simply due to the computation time of multiple alignment programs, as well as of the subsequent phylogenetic tree reconstruction. Computation time is reasonable up to around thirty to fifty sequences of a few hundred residues.
Copy & paste the "ClustalW" formatted multiple alignment in the 'Multiple Alignment' Annotathon field.

Phylogenetic tree
Use the above multiple alignment to infer a phylogenetic tree using two distinct tree reconstruction approaches:
Please refer to the Frequently Asked Questions for further details and screen shots on running phylogenetic analyses.
Copy & paste the textual tree representation in the 'Tree' Annotathon field. Remember to include a protocol line in the 'Tree' field that includes the program name and run parameters (e.g. 'Phylip / Protdist+neighbor / Randomized input - Random number seed = 11 / rooted on: Coccidioides immitis (ascomycetes)').
Add after each tree leaf label the taxonomic group in brackets, e.g. (alpha-proteobacteria). Your textual tree representation should look like this - notice the (taxonomic group) labels added:
PROTOCOL:
a) Phylogeny.fr / BioNJ method / out group: Coccidioides immitis (ascomycetes)
b) Phylogeny.fr / PhyML method / no bootstrap / default substitution model / out group: Coccidioides immitis (ascomycetes)
---------------------------------------------------------------------------------------------------
RESULTS ANALYSIS:
[for each tree produced, explain:
-is the tree coherent with the reference phylogeny of species?
-is the tree coherent with the tree produced by the alternate method?
-to which taxonomic group does the metagenomic sequence appear to belong?]
---------------------------------------------------------------------------------------------------
RAW RESULTS:
a) BioNJ
        +---------------------Roseovarius (alpha-proteobacteria)
        !
        !     +----------------------------------aproteobac (alpha-proteobacteria)
+------10     !
!       !     !  +-------------------------------Bparapertu (beta-proteobacteria)
!       !     !  !
!       +----13  !                       +----------Jannaschia (alpha-proteobacteria)
!             !  !                 +-----5
!             !  !        +--------8     +-----------Ogranulos (alpha-proteobacteria)
!             !  !        !        !
!             !  !        !        +---------------------GOS_OT2311
!             +-15        !
!                !        !                +-------Rhodobact (alpha-proteobacteria)
!                !  +----12        +-------6
!                !  !     !  +-----9       +----Roseobacter (alpha-proteobacteria)
!                !  !     !  !     !
!                !  !     !  !     +--------------Obatsensis (alpha-proteobacteria)
!                !  !     +-11
!                +-14        !                      +Paerugin1 (gamma-proteobacteria)
!                   !        !        +-------------1
!                   !        +--------7             +Paerugin2 (gamma-proteobacteria)
!                   !                 !
!                   !                 +------------Bcenocepacia (beta-proteobacteria)
!                   !
!                   +----------------------------------------Oceanobacter (gamma-proteobacteria)
!
!            +--Aspergillus terreus (ascomycetes)
!      +-----2
4------3     +---Aspergillus niger (ascomycetes)
!      !
!      +-----------Aspergillus oryzae (ascomycetes)
!
+-------------Coccidioides immitis (ascomycetes)
b) PhyML
[...]
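Adding the taxonomic group after each leaf label is easily done by hand. For larger trees, the step can be sketched as below. This is an illustrative helper only: the name `tag_leaves` and the toy tree are hypothetical, and the code assumes a text tree in which each leaf label ends its line and directly follows a '-' or '+' character.

```python
def tag_leaves(tree_text, groups):
    """Append ' (group)' to every line of a text tree that ends with a known leaf label.

    groups: dict mapping a leaf label to its taxonomic group.
    A label only matches when it directly follows '-' or '+', so that
    node numbers and partial label matches are left untouched.
    """
    tagged = []
    for line in tree_text.splitlines():
        line = line.rstrip()
        for label, group in groups.items():
            if line.endswith(("-" + label, "+" + label)):
                line += " (" + group + ")"
                break
        tagged.append(line)
    return "\n".join(tagged)


toy_tree = "  +--Ecoli1\n--1\n  +--Ecoli2"
toy_groups = {"Ecoli1": "gamma-proteobacteria", "Ecoli2": "gamma-proteobacteria"}
print(tag_leaves(toy_tree, toy_groups))
```

Lines that carry only node numbers or branch characters are returned unchanged, so the shape of the tree is preserved.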
Note that the "NCBI numerical identifier" box has precedence over the "Scientific Name" box, so if you wish to change the taxonomic classification of your sequence you must delete the numeric code in order to enter a new value in the "Scientific Name" box.
Once the taxonomic group is correctly specified, the full lineage should appear:
Rhodobacterales
Rank: order - Genetic Code: Bacterial and Plant Plastid - NCBI Identifier: 204455
Kingdom: Bacteria - Phylum: Proteobacteria - Class: Alphaproteobacteria - Order: Rhodobacterales
Bacteria; Proteobacteria; Alphaproteobacteria; Rhodobacterales;
IMPORTANT: unless your DNA sequence is 100% identical to an existing GenBank entry, you should probably not specify a precise species! Since, without further evidence, the precise taxonomic definition of the organism carrying the metagenomic DNA fragment is impossible, specify as the likely taxonomic group the node immediately above your sequence in the phylogenetic tree.

Biological process & molecular function
When your ORF's homologs have known functions, or if the ORF presents known conserved domains, select in the available "Biological Process" & "Molecular Function" lists the terms that most specifically describe your functional hypotheses for the ORF. These terms are a subset of the comprehensive and hierarchical "Gene Ontology", most often referred to as GO annotations:
For gene symbol examples, check out those already attributed to metagenome fragments during the Annotathon on Metagenes.

Conclusion
This field is central to your evaluation: write up your interpretations and hypotheses based on the observations you have made in the preceding "RESULTS ANALYSIS" sections. Imagine you are trying to convince a very sceptical colleague: use rigorous argumentation, cite precise evidence and numerical values whenever possible, highlight important findings, and cross-check information from independent sources. Remember that in silico analyses generally do not constitute final proof, only suggestions. Terms such as "putative", "suggests" or "probably" can show understanding of the limitations of computational biology results.
Make sure you have at least covered: