SEED, Annotations, and Subsystems
Identifiers
What does FIG stand for? As in "fig|12021.1.peg.1"
Fellowship for Interpretation of Genomes.
All the peg IDs have "fig|" prepended; in fact, anything with a FIG ID starts with fig|.
Other identifiers have other prefixes, for example gi| for GI numbers, sp| for Swiss-Prot, etc.
A feature (an RNA, protein, prophage etc) is considered a region on a contig in a genome. There is a feature_location method that returns a list of locations. (A feature is usually at a single location, but very occasionally can be split across two different locations.)
The feature location is in the format contig_start_stop.
If start > stop the feature is on the opposite strand. A gotcha here is that the contig ID can also (and often does) contain an underscore, so the regexp that matches these is
m/^(.*)\_(\d+)\_(\d+)/
You can't just split on '_' and assume you'll get an array of (contig, start, stop).
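As a minimal sketch of the point above, here is the same greedy match written in Python rather than Perl. The function name and the example locations are made up for illustration, and the trailing `$` anchor is a small addition that is not in the regexp above.

```python
import re

# Greedy (.*) keeps any underscores that belong to the contig ID itself.
# The trailing $ is an addition for strictness, not part of the original regexp.
LOCATION_RE = re.compile(r"^(.*)_(\d+)_(\d+)$")


def parse_location(location):
    """Split 'contig_start_stop' into (contig, start, stop, strand)."""
    match = LOCATION_RE.match(location)
    if match is None:
        raise ValueError(f"not a contig_start_stop location: {location!r}")
    contig = match.group(1)
    start = int(match.group(2))
    stop = int(match.group(3))
    # start > stop means the feature is on the opposite strand.
    strand = "-" if start > stop else "+"
    return contig, start, stop, strand


# Hypothetical locations whose contig IDs contain underscores.
print(parse_location("NC_000913_190_255"))
print(parse_location("Contig_1_5020_4400"))

# The naive approach gives four pieces instead of three.
print("NC_000913_190_255".split("_"))
```

Because a feature can occasionally be split across locations, a caller would apply this to each entry in the list of locations rather than to a single string.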
Also note that contig IDs are only considered unique within a genome, not between genomes. We have lots of genomes that have a contig whose ID is Contig1.
What is in a subsystem cell?
Can a subsystem specify more than one gene in a single organism? Can a subsystem specify more than one gene in a single 'cell' of the table that represents the subsystem?
Yes, and yes, although it shouldn't. A basic SEED concept is that each functional role is unique in a cell. There is only one enzyme that performs any function. (There is a slight biological argument here about gene duplication, but that is usually involved in divergent evolution and on further study the two copies are doing different things.)
However, a cell (the intersection between functional role and genome) can have more than one protein in it.
For annotators' sake, we use the evidence code ISU ("in subsystem unique") to indicate that we are more confident that this is the only protein with that role in the organism, versus IDU ("in subsystem duplicate"), which indicates that there is more than one protein with that role in the organism and most likely they need deconvoluting. In shorthand, it tells me that I, as an annotator, should be more cautious about changing the former than the latter.
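The ISU/IDU distinction can be sketched in a few lines of Python. This is only an illustration of the rule described above (one protein in a cell versus more than one), not the SEED's actual implementation; the function name, role names, and peg IDs are hypothetical.

```python
from collections import defaultdict


def cell_evidence_codes(assignments):
    """Label each subsystem cell as ISU or IDU.

    assignments: iterable of (genome, role, peg) tuples for one subsystem.
    Returns a dict mapping (genome, role) to "ISU" or "IDU".
    """
    cells = defaultdict(set)
    for genome, role, peg in assignments:
        cells[(genome, role)].add(peg)
    # One protein in the cell: unique. More than one: duplicate.
    return {
        cell: ("ISU" if len(pegs) == 1 else "IDU")
        for cell, pegs in cells.items()
    }


# Hypothetical subsystem contents, for illustration only.
example = [
    ("1021.1", "Role A", "fig|1021.1.peg.23"),
    ("1021.1", "Role B", "fig|1021.1.peg.40"),
    ("1021.1", "Role B", "fig|1021.1.peg.41"),
]
for cell, code in sorted(cell_evidence_codes(example).items()):
    print(cell, code)
```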
Definitions
Subsystem
Currently used to describe a group of genes that are somehow connected in the mind of an annotator, for the annotator's convenience. There may be (and usually are) many genes from the same organism in a single subsystem.
Column
Columns are ''functional roles''.
(Actually they're rows in the subsystem top spreadsheets. They are really columns in a later spreadsheet. Very confusing). Each column is intended to hold all genes with a given gene function that are to be annotated together. A gene may appear in multiple columns if it has multiple functions. It should be very rare for an organism to have multiple genes in the same column, as it implies duplicated function, which in evolutionary terms is not generally stable (so we don't often see it, at least not in bacteria).
It is important that columns consist of orthologs, not merely proteins with the same biosynthetic function. Accordingly, the subsystem "glutamine synthetases" is divided into different types of enzymes whose main biosynthetic functions are the same (type I, type II, etc) but are quite different structurally.
There is also something else called "subset of roles", which combines columns. An example is in the subsystem "Calvin-Benson_cycle", where the following columns are combined into a supercolumn *RuBisCo:
3: RuBisCo large subunit [Type I?]
4: RuBisCo small subunit [Type I?]
5: RuBisCo [Type II?]
6: RuBisCo Type III
If you choose the tab labeled limit display you can show or hide subsets, and collapse or uncollapse all of them.
Where do SEED genomes come from
We started off by collecting organisms from all over the place. Obviously GenBank was a good place to start, but back in the old days GenBank was very slow at releasing new genomes and we wanted as many as we could get for comparative purposes, so we went to the sources: JGI, KEGG, Sanger, WUSTL, Baylor, along with other things we found online or that people asked us to add. For example, I have several ''Salmonella'' genomes in the SEED that are not in GenBank because we never finished them.
Over the last few years, we have really transitioned away from that model. We have exceeded 10,000 submissions to our annotation pipeline, RAST. For example, a well-known bioinformatics institute has resubmitted every genome in GenBank through RAST for reannotation. The question now is which ones go from RAST to SEED (and how they do that). We've stopped going out looking for genomes; they come to us!
Initially we were inclined to include everything, but in reality that appears to just clutter up the genome space. Sure, having 100 group A Strep genomes is important, and there are some interesting questions you can answer with them; it's just not clear that we need to have 100 group A Streps in the SEED. We certainly could, but it would make things like chromosomal clustering very tricky (because you need to ignore all those close neighbors). (I chose that example because they have just been published in PNAS by Musser.) I am sure that if someone pressed us (paid us?) to build a solution we could, but it's not at the top of the list.
More recently we have been selective about what we add. If we have 10 ''Burkholderia'', do we need 20? For sure, the preference on adding genomes follows the money. If you compare the genomes in SEED, KEGG, NCBI, etc., the results would be different for ''Vibrio'', ''Campylobacter'', ''Listeria'', etc., which we were funded to study, compared to an organism like ''Burkholderia''. We can add things, and in fact users can add things to the SEED, kind of. If you submit a genome to RAST there is a button that says "make this genome publicly available". (As an aside, we are not an INSDC organization, so we have no requirement to make them public and we are not "acceptable" to the journals as a public repository.)
For the plasmids, it's a bit trickier. We received a small amount of money to build a plasmid version of RAST, and so we have some plasmids where people have not sequenced the genomes. Soon (hopefully in a week or two) we will also have a phage version of RAST, so people can submit phages there for annotation. We can capture those into our phage SEED.
General issues and questions
How to distinguish Phages from Eukaryotic genomes
In the SEED we have the notion of attributes, basically tuples of {fid, key, value, url}. The url is optional, but often we point to a website somewhere (although we did argue about whether we really need a separate url, since for most of the websites the base link is static and we just use the value as a CGI parameter). In the database, the fid of an attribute is broken down into {genome, ftype, id}, where genome is the genome number, ftype is peg, rna, etc., and id is the feature id in the genome (imagine parsing fig|1021.1.peg.23 into 1021.1, peg, and 23). This is for optimizing lookups, and it also allows us to add attributes to just genomes, and to arbitrary features.
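The {genome, ftype, id} breakdown can be illustrated with a short Python sketch. The regular expression and function name here are assumptions made for illustration, not the SEED's own parsing code, and the pattern only covers IDs shaped like the example above.

```python
import re

# Assumed shape: fig|<genome number>.<feature type>.<feature number>
FIG_ID_RE = re.compile(r"^fig\|(\d+\.\d+)\.([A-Za-z_]+)\.(\d+)$")


def split_fid(fid):
    """Break a FIG feature ID into (genome, ftype, id)."""
    match = FIG_ID_RE.match(fid)
    if match is None:
        raise ValueError(f"not a FIG feature ID: {fid!r}")
    genome, ftype, number = match.groups()
    return genome, ftype, int(number)


print(split_fid("fig|1021.1.peg.23"))
```

Storing the three parts separately means a lookup by genome or by feature type does not have to re-parse every ID string.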
I added a genome attribute called virus_type that currently has two values: Phage and Eukaryotic.
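As a rough sketch of how the virus_type attribute could be used, assume the attributes are available as {fid, key, value, url} tuples. The `Attribute` type, the helper function, and the genome numbers below are all made up for illustration; this is not the SEED API.

```python
from collections import namedtuple

# Mirrors the {fid, key, value, url} tuple described above; url is optional.
Attribute = namedtuple("Attribute", ["fid", "key", "value", "url"])


def genomes_with_virus_type(attributes, wanted):
    """Return the sorted genome IDs whose virus_type attribute equals wanted."""
    return sorted(
        {a.fid for a in attributes if a.key == "virus_type" and a.value == wanted}
    )


# Hypothetical genome numbers, for illustration only.
attributes = [
    Attribute("99990001.1", "virus_type", "Phage", None),
    Attribute("99990002.1", "virus_type", "Eukaryotic", None),
    Attribute("99990003.1", "virus_type", "Phage", None),
]
print(genomes_with_virus_type(attributes, "Phage"))
print(genomes_with_virus_type(attributes, "Eukaryotic"))
```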
Amino Acid Codes
Very occasionally there are random amino acids in the translation strings. Here are a couple of unusual ones:
# Z means "Glutamine or Glutamic acid" (Glx). See note d at the [http://www.chem.qmul.ac.uk/iupac/AminoAcid/AA1n2.html IUPAC page].
# U means "Selenocysteine" (Sec). See this [http://www.chem.qmul.ac.uk/iubmb/newsletter/1999/item3.html IUBMB newsletter item].
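If you want to find these in a translation string, a small Python sketch like the following will do. It only knows the two codes listed above; the function name and the example peptide are made up, and any other non-standard letter is reported without a meaning.

```python
# The 20 standard one-letter amino acid codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Only the two unusual codes described above.
UNUSUAL = {
    "Z": "Glutamine or Glutamic acid (Glx)",
    "U": "Selenocysteine (Sec)",
}


def unusual_residues(translation):
    """Return (position, letter, meaning) for each non-standard letter.

    Positions are 1-based. Letters not covered above get meaning None.
    """
    hits = []
    for position, letter in enumerate(translation.upper(), start=1):
        if letter not in STANDARD:
            hits.append((position, letter, UNUSUAL.get(letter)))
    return hits


# Made-up peptide for illustration.
for hit in unusual_residues("MKZLUA"):
    print(hit)
```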
On complete versus finished
Why are some genomes marked as complete when they have many contigs? How do we discriminate between those genomes and those where every base is accurately known? When do the contigs correspond to natural entities, i.e. chromosomes and plasmids, as opposed to artificial entities, i.e. sequences that end where humans ran out of interest or money? Users should be able to distinguish true ends from artificial ends (nature from artifact). Actually, in bacterial genomes there are very seldom ends, since the sequences are usually (but not always) circular.
We identified 211 bacteria from the SEED marked as "complete" but with more than 10 contigs. Are any of them complete in the natural sense? It seems unlikely, as most bacteria have a single replicon, with almost all of the rest having one chromosome and one to several plasmids.
We could use the concept of "completeness" to mean what most microbiologists will expect it to mean: the whole thing done. (Granted, the human genome is popularly described as "complete" when in fact it has many more contigs than chromosomes, but that's for the press releases.) The problem is that most genomes are not, and increasingly never will be, fully complete. It is extremely expensive to finish a genome: the last quote I heard was between $60,000 and $100,000 for a microbial genome, while you can get down to <100 contigs right now for under $2,000 (depending on the size, of course). Therefore, many people argue that we should sequence diversity rather than figuring out every base of every genome. It may actually be a Sisyphean task, since the genome is constantly changing anyway. In the sequencing community this concept is called finishing (or finished), i.e. a genome can be complete but not finished (like the human genome), but presumably if it is finished it is also complete.
We have an automatic way of determining completeness for microbes: the total amount of sequence must be >300,000 bp and >70% of the contigs must be >20,000 bp. There are a few special cases where a complete genome can be less than 300,000 bp, but they are the exceptional few and we can mark those as complete manually. An alternative proposal is counting the number of contigs and limiting them to 10 (for bacteria) or 2 (for bacteriophage). That will work almost all the time. The situation is more complicated with eukaryotic viruses, however, because they often have multiple independent DNA strands (e.g. influenza has 8 segments).
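The two rules above can be sketched in Python as follows. This is a literal reading of the thresholds as stated, not the SEED's actual code: in particular, it assumes ">70% of the contigs" means 70% of the number of contigs, and that "limiting them to 10" means 10 or fewer. The function names are made up.

```python
def looks_complete(contig_lengths, min_total=300_000,
                   min_contig=20_000, min_fraction=0.70):
    """Automatic rule: >300,000 bp in total and >70% of contigs >20,000 bp.

    Assumes the 70% is a fraction of the contig count (one reading of the rule).
    """
    if not contig_lengths:
        return False
    total = sum(contig_lengths)
    long_contigs = sum(1 for length in contig_lengths if length > min_contig)
    return total > min_total and long_contigs / len(contig_lengths) > min_fraction


# Alternative proposal: a simple cap on the number of contigs.
CONTIG_LIMITS = {"bacteria": 10, "bacteriophage": 2}


def within_contig_limit(contig_lengths, kind="bacteria"):
    """Alternative rule: at most 10 contigs (bacteria) or 2 (bacteriophage)."""
    return len(contig_lengths) <= CONTIG_LIMITS[kind]


# A single 4.6 Mbp contig passes both rules.
print(looks_complete([4_600_000]), within_contig_limit([4_600_000]))

# Five long and five short contigs: enough sequence, but only 50% are long.
draft = [100_000] * 5 + [5_000] * 5
print(looks_complete(draft), within_contig_limit(draft))
```

The exceptional small genomes mentioned above would fail `looks_complete` and still need to be marked by hand.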
These limits are completely arbitrary, but if you limit it to those that are finished you will not have very many genomes to play with.
What to do with small things (phages, plasmids, viruses) is indeed a complex problem, and one that we don't have a good answer for. We definitely could add a "finished" flag, similar to the "complete" one.
Jonathan Eisen also ran [http://phylogenomics.blogspot.com/2010/01/wantedfeedback-on-importance-of.html a poll] on this subject.