Bioinformatics - Databases

Sequence Databases

Primary vs. Secondary Databases

Original experimental data

The original data are sequencing chromatograms, gels, and comparable data traces that should be archived in the originating laboratory

Important Molecular Biological Databases

NCBI, EMBL, DDBJ

Three interlinked database centers

Each sponsors several interlinked databases (e.g., GenBank, PubMed, RefSeq, Taxonomy; all at NCBI), and provides other tools and services.

Each accepts submissions independently, and the three centers exchange data daily.

We will generally just consider GenBank, and treat all of these as equivalent. They were, however, established independently, and each has its own peculiarities.

Strictly speaking, all of these are secondary databases. The data are not raw data -- they have been subject to interpretation. However, the data are still fairly close to the original source. Additional databases have been developed by further reprocessing of GenBank. These are often called "secondary databases."

Swiss-Prot, PIR

TIGR

JGI

Celera Genomics - One of several private sequence databases, involved in sequencing the human genome.

NCBI

NCBI's databases are some of the most important databases in bioinformatics.

The National Center for Biotechnology Information (NCBI) is part of the National Library of Medicine (NLM), which is itself part of the National Institutes of Health (NIH), a government agency.

Thus you know the URL: http://www.ncbi.nlm.nih.gov

If you want more training on the tools available at GenBank, the University of Maryland sponsors annual workshops offered by NCBI on how to use GenBank, and similar workshops are offered at other universities nationwide.

Nucleotide and amino-acid data, and allied databases.

Composed of assembled sequences from the literature and unpublished submissions, with annotation provided by the original authors, drawn from other commentary, or added by the curators.

GenBank sequences are typically the product of assembly of several overlapping fragments, and have had coding regions, introns, and other features identified by a variety of methods.

As with any form of interpretation, these annotations can be incorrect

Beware of plasmid sequences, primer sequences, and other artifacts
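
One simple way to look for such artifacts is to search a sequence for known primer or vector fragments on both strands. The sketch below is only an illustration: the function names, the "primer" sequence, and the query are all made up, and it finds exact matches only. Real screening normally relies on alignment against a vector database rather than exact string search.

```python
# Hypothetical sketch: exact-match screen for primer/vector fragments.
# The primer and query sequences used here are invented for illustration.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_contaminants(seq, contaminants):
    """Report exact matches of each contaminant on either strand.

    contaminants maps a name to a DNA fragment.
    Returns a list of (name, strand, 0-based start) tuples.
    """
    seq = seq.upper()
    hits = []
    for name, fragment in contaminants.items():
        fragment = fragment.upper()
        patterns = (("+", fragment), ("-", reverse_complement(fragment)))
        for strand, pattern in patterns:
            pos = seq.find(pattern)
            while pos != -1:
                hits.append((name, strand, pos))
                pos = seq.find(pattern, pos + 1)
    return hits


# Made-up query containing the made-up primer on both strands.
query = "TTTT" + "GATTACAGG" + "AAAA" + "CCTGTAATC" + "GG"
print(find_contaminants(query, {"primer": "GATTACAGG"}))
```

A hit near either end of a submitted sequence is a hint that primer or vector sequence was not trimmed before submission.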

Growth of the sequence databases has been exponential (a straight line on a logarithmic scale) since the mid-1980s, and shows no sign of slowing down.

Most sequences are submitted directly by the authors, typically as a part of publication, and the submitting author continues to "own" the submission. This means that updates and changes are normally done with the permission of the author.

Data can be deposited, but held confidential until the article is published

Most sequences are available for any use, although a few have been patented, or have other legal restrictions.

GenBank data are also linked externally, to databases not maintained by GenBank

Amino Acid (Protein) Data

Originally easier to obtain, and consequently more common, than DNA sequence data; now mostly inferred sequences obtained by translating DNA sequences.

GenBank has a parallel set of accession numbers for a protein database

Nucleotide Data

Now the most common original data type.

The amino acid data in a GenBank file are often inferred sequences translated from the DNA sequence, but in some cases represent actual polypeptide sequencing. The annotations should tell you how a sequence was obtained.
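
To make "inferred by translation" concrete, here is a minimal sketch of conceptual translation using the standard genetic code. The function name, the `to_stop` parameter, and the example sequences are assumptions for illustration. The sketch reads a single forward frame from the first base and ignores alternative genetic codes, introns, and partial codons, all of which matter for real records.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(cds, to_stop=True):
    """Translate a coding sequence in frame 1.

    Codons containing anything other than A, C, G, T become 'X'.
    A trailing partial codon is ignored.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[i:i + 3], "X")
        if residue == "*" and to_stop:
            break  # stop codon ends the inferred protein
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))                 # MAK
print(translate("ATGGCCAAATAA", to_stop=False))  # MAK*
```

A protein sequence produced this way is only as good as the DNA sequence and the annotated reading frame it came from.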

Associated databases

Swiss-Prot

Protein Information Resource (PIR)

Swiss-Prot and PIR are derived databases in which data from GenBank have been further analyzed and annotated.

Protein Data Bank (PDB)

The main database for protein structural (X-ray crystallographic) data.

Protein Families (Pfam)

Profile HMM alignment database

Annotation

Seed alignment

Profile HMM

Full alignment (large, some with over 2500 sequences)
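
The idea of turning a seed alignment into a model that scores new sequences can be sketched with a much simpler stand-in: a per-column residue frequency profile. This is not a profile HMM (there are no insert or delete states, and no transition probabilities), and it is not how Pfam builds its models. The function names, the pseudocount, the uniform background, and the toy alignment are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(seed_alignment, pseudocount=1.0):
    """Per-column residue probabilities from an ungapped toy alignment."""
    length = len(seed_alignment[0])
    if any(len(seq) != length for seq in seed_alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in seed_alignment)
        total = sum(counts[aa] for aa in AMINO_ACIDS)
        total += pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, candidate):
    """Sum of log2(profile probability / uniform background) per column."""
    if len(candidate) != len(profile):
        raise ValueError("candidate must match the profile length")
    background = 1.0 / len(AMINO_ACIDS)
    return sum(
        math.log2(column[aa] / background)
        for column, aa in zip(profile, candidate)
    )


# Invented seed alignment of three short, ungapped sequences.
seed = ["MKVL", "MKIL", "MRVL"]
profile = column_profile(seed)
print(log_odds_score(profile, "MKVL"))  # positive: resembles the seed
print(log_odds_score(profile, "WWWW"))  # negative: unlike the seed
```

Sequences that score well against the model are the ones that would be gathered into a full alignment.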

Assessing the reliability of data

Know the provenance of the data you are working with!
Because GenBank attempts to be comprehensive, it is a very large database.
It is not possible to verify every sequence. Consequently, some data in GenBank will be erroneous.
Some things to consider about your data:
1. Are they original data from a highly skilled and reputable lab?
2. Were they generated by an automated system without human intervention?
3. Are they preliminary data from an EST sequencing project, or a similar technique expected to have a relatively high error rate?
4. Are gene-identity assignments made directly from biochemical data, or are they second-generation (or worse) inferences made on the basis of sequence similarity?
If you unwittingly work with bad data, your findings may prove to be invalid.
What can you do to ensure that you are working with "clean" data?
1. Work with a trusted source.
2. Check the data yourself.
3. Use methods that can detect, correct, or accommodate invalid data.
  1. In general your studies should always include an internal check for the validity of the data.
4. Use Swiss-Prot, PIR, or another database where the identity of the sequences has been more carefully checked. But be warned that this means you have to trust the work of the curators.
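
As one example of an internal validity check, the sketch below reports unexpected characters and the fraction of ambiguous bases in a nucleotide sequence. The function name and the report fields are hypothetical, and what counts as an acceptable ambiguous fraction depends on your study, so no threshold is suggested here. A check like this catches only gross problems. It says nothing about whether the annotation is correct.

```python
# IUPAC nucleotide codes: the four bases plus the ambiguity symbols.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")
UNAMBIGUOUS = set("ACGT")


def sequence_report(seq):
    """Summarize basic quality indicators for a DNA sequence."""
    seq = seq.upper()
    invalid = sorted(set(seq) - IUPAC_DNA)
    ambiguous = sum(
        1 for base in seq if base in IUPAC_DNA and base not in UNAMBIGUOUS
    )
    return {
        "length": len(seq),
        "invalid_characters": invalid,
        "ambiguous_fraction": ambiguous / len(seq) if seq else 0.0,
    }


print(sequence_report("ACGTNNAC"))
print(sequence_report("ACGU*"))
```

Running such a report over every sequence before analysis makes it easy to set aside records that need a closer look.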