Sequence Databases
Primary vs. Secondary Databases
Original experimental data
The original data are sequencing chromatograms, gels, and comparable data traces that should be archived in the originating laboratory
Important Molecular Biological Databases
NCBI, EMBL, DDBJ
Three interlinked database centers
Each sponsors several interlinked databases (e.g., GenBank, PubMed, RefSeq, Taxonomy; all at NCBI), and provides other tools and services.
Each accepts submissions independently, and the three exchange data daily.
We will generally just consider GenBank, and treat all of these as equivalent. They were, however, established independently, and each has its own peculiarities.
Strictly speaking, all of these are secondary databases. The data are not raw data; they have been subject to interpretation. However, the data are still fairly close to the original source. Additional databases have been developed by further reprocessing of GenBank, and it is these that are usually called "secondary databases."
Swiss-Prot, PIR
TIGR
JGI
Celera Genomics - One of several private sequence databases, involved in sequencing the human genome.
NCBI
NCBI's databases are some of the most important databases in bioinformatics.
NCBI is the National Center for Biotechnology Information, which is part of the National Library of Medicine (NLM), which is itself a part of the National Institutes of Health (NIH), a government agency.
Thus you know the URL: http://www.ncbi.nlm.nih.gov
If you want more training on the tools available at GenBank, the University of Maryland sponsors an annual workshop offered by NCBI on how to use GenBank, and similar workshops are offered at other universities nationwide.
Nucleotide and amino-acid data, and allied databases.
Composed of assembled sequences from the literature, unpublished submissions, and annotation provided by the original authors, by other commentary, or by the curators.
GenBank sequences are typically the product of assembly of several overlapping fragments, and have had coding regions, introns, and other features identified by a variety of methods.
As with any form of interpretation, these annotations can be incorrect
Beware of plasmid sequences, primer sequences, and other artifacts
Growth of the sequence databases has been exponential (a straight line on a logarithmic plot) since the mid-1980s, and shows no sign of slowing down.
Most sequences are submitted directly by the authors, typically as a part of publication, and the submitting author continues to "own" the submission. This means that updates and changes are normally done with the permission of the author.
Data can be deposited, but held confidential until the article is published
Most sequences are available for any use, although a few have been patented, or have other legal restrictions.
GenBank data are also linked externally, to databases not maintained by GenBank
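Because GenBank records are public and identified by accession number, they can also be retrieved programmatically. The following is a minimal sketch, assuming NCBI's E-utilities `efetch` endpoint and its `db`, `id`, `rettype`, and `retmode` parameters; the helper name `genbank_record_url` is hypothetical, and U49845 is used only as an example accession. The code builds the request URL and does not contact the network.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession, db="nuccore"):
    """Build an efetch URL requesting one record in GenBank flat-file format.

    `genbank_record_url` is an illustrative helper, not part of any NCBI library.
    """
    params = {
        "db": db,           # nucleotide database
        "id": accession,    # accession number of the record
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return EFETCH_BASE + "?" + urlencode(params)


# Example accession, used here only for illustration.
url = genbank_record_url("U49845")
print(url)
```

Opening the printed URL in a browser, or passing it to `urllib.request.urlopen`, should return the annotated flat file for that accession, provided the service is reachable.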
Amino Acid (Protein) Data
Originally easier to obtain, and consequently more common than DNA sequence data; now mostly inferred sequences translated from DNA sequences.
GenBank has a parallel set of accession numbers for a protein database.
Nucleotide Data
Now the most common original data type.
The amino acid data in a GenBank file are often inferred sequences translated from the DNA sequence, but in some cases represent actual polypeptide sequencing. The annotations should tell you how a sequence was obtained.
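To make "inferred sequences translated from the DNA sequence" concrete, here is a minimal sketch of conceptual translation. It assumes the standard genetic code, the forward strand, and reading frame 0; the function name `translate` and the short input sequences are illustrative only, and real annotation pipelines also handle alternative codes, introns, and partial codons.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence in frame 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as 'X'.
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

A protein sequence produced this way is only as reliable as the underlying DNA sequence and the annotated coding region, which is why the record's annotations matter.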
Associated databases
Swiss-Prot
Protein Information Resource (PIR)
Swiss-Prot and PIR are derived databases in which data from GenBank have been further analyzed and annotated.
Protein Data Bank (PDB)
The main database for protein structural (primarily X-ray crystallographic) data.
Protein Families (Pfam)
Profile HMM alignment database
Annotation
Seed alignment
Profile HMM
Full alignment (large, some with over 2500 sequences)
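As a rough illustration of how a seed alignment feeds a profile, the sketch below computes per-column residue frequencies from a small alignment. This is not a profile HMM: it omits insert and delete states, transition probabilities, and pseudocounts, and shows only the column counts that match-state emission probabilities start from. The function name `column_frequencies` and the toy alignment are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one dict per alignment column mapping residue -> relative frequency.

    Gap characters ('-') are ignored when counting.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        if total:
            profile.append({res: n / total for res, n in counts.items()})
        else:
            profile.append({})  # column containing only gaps
    return profile


# Toy seed alignment, invented for illustration only.
seed = ["ACDE", "ACDE", "A-DE", "GCDE"]
profile = column_frequencies(seed)
for position, frequencies in enumerate(profile):
    print(position, frequencies)
```

A conserved column yields a single residue with frequency 1.0, while a variable column spreads the frequency across several residues.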