Comparative Bioinformatics - BSCI 348s - Introduction
What is bioinformatics?
The application of information technology to biological data, particularly DNA sequence and genomic data. This class will emphasize bioinformatic approaches that use the comparative method.
Comparative bioinformatics emphasizes information that can be gleaned by comparing genomes against each other and taking advantage of evolutionary information.
Functional bioinformatics uses structural and expression information to make inferences about the function of components of a given genome.
These fields are complementary, and to some extent overlap, so don't take this distinction too seriously.
Bioinformatics is a rapidly moving field
Moore's Law (see Intel's site for a PR take on Moore's Law, or Wikipedia for a more balanced view) is often cited as objective evidence that information technology is changing at an extremely rapid pace. In 1965 Gordon Moore (then the director of research at Fairchild Semiconductor and later one of the founders of Intel Corporation) noted that the number of transistors that would fit on a single chip was doubling at a steady rate. The figure usually quoted today is a doubling roughly every two years.
Molecular biology is also undergoing extremely rapid development. In recent years the rate at which the GenBank database has grown exceeds the pace set by Moore's Law.
Because bioinformatics functions at the boundary between information technology and molecular biology, it is undergoing remarkably rapid change. It is not clear how long this rate of change will be maintained, but it is clear that biological data now require computer analysis.
The collection of genomic information has created a pressing need for bioinformatics.
Part of what has fueled the explosive growth of GenBank has been the development of rapid DNA sequencing technologies and the efforts to determine complete genome sequences.
Although it has much deeper roots, the effort to sequence the human genome developed in the mid-1980s. The Human Genome Initiative became a joint project of the DOE and NIH by 1990. This effort was not limited to determining the DNA sequence of the human genome; it also included developing the necessary technologies and determining genome sequences from other representative organisms. It is important to know the sequences of several organisms' genomes so that they can be compared.
The first article describing the complete genome of a free-living organism was published in July 1995, and presented the genome of Haemophilus influenzae, a bacterium. Five years later, in the year 2000, roughly 12 genome sequences were published, and in 2002 nearly 40 genome sequences were published. By 2003 over 150 genome sequences were publicly available, and in 2006 there were 322 bacterial, archaeal, and viral genomes listed on the TIGR web site alone. Progress on eukaryotic genomes has been slower, mostly because of the large size of most eukaryotic genomes, but as of 2006 there are genome-scale data from dozens of eukaryotes, with nearly complete genomes available from at least 25 (see JGI, NCBI).
Genomes vary greatly in size, so simply counting the number of genomes published gives a very rough sense of the scale of data involved, but a huge amount of information has become available in a very short time.
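To put the publication counts above on the same footing as Moore's Law, one can estimate a doubling time from two of them. The sketch below is only illustrative: it assumes steady exponential growth between the two years, uses the approximate counts quoted in the text (roughly 12 genomes in 2000, nearly 40 in 2002), and the function name `doubling_time` is made up for this example.

```python
import math

def doubling_time(count_start, count_end, years_elapsed):
    """Years needed for a quantity to double, assuming steady exponential growth."""
    return years_elapsed * math.log(2) / math.log(count_end / count_start)

# Approximate counts of genome sequences published per year, taken from the text.
genome_doubling = doubling_time(12, 40, 2002 - 2000)

print(f"Genome publications doubled about every {genome_doubling:.2f} years")
print("Moore's Law is usually quoted as a doubling about every 2 years")
```

Two data points give only a rough estimate, and as noted above, a count of genomes says little about the number of bases involved.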
Complete genomes consist of so much sequence data that it would be impractical for a person to analyze the data manually.
Bioinformatic analyses act like a microscope, permitting the scientist to see the information that is present in the genome. According to Joel E. Cohen: "Mathematics is biology's next microscope, only better; biology is mathematics' next physics, only better" (PLOS Biology 2:e439).
Why determine complete genomes?
Conventions
Draw DNA sequences with the 5' end on the left, 3' end on the right (5'NNNNNN3')
Double stranded DNA is drawn with the forward strand (5'->3') on the top, reverse strand on the bottom.
Amino acid sequences are drawn with the amino terminus on the left and the carboxy terminus on the right (amino.....carboxy)
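The DNA conventions can be illustrated with a short Python sketch. It is a minimal example rather than a standard tool: the helper names `reverse_complement` and `draw_duplex` are invented here, and only the four unambiguous bases are handled.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse strand written 5'->3' (left to right)."""
    return seq.translate(COMPLEMENT)[::-1]

def draw_duplex(forward):
    """Draw double-stranded DNA with the forward strand (5'->3') on top.

    The bottom line is the complement without reversal, so it reads 3'->5'
    from left to right and each base sits under its partner.
    """
    bottom = forward.translate(COMPLEMENT)
    return f"5'-{forward}-3'\n3'-{bottom}-5'"

print(draw_duplex("ATGGCCTA"))
print("Reverse strand, 5'->3':", reverse_complement("ATGGCCTA"))
```

Note that the reverse strand written on its own is still given 5' to 3', which is why it has to be reversed as well as complemented.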
Genomes vary in size and complexity, and some are very large
Organism | Genome Size (bp) | ORFs | Classification |
Mycoplasma genitalium | 580,000 | 470 | Bacteria (but not free living) |
Methanococcus jannaschii | 1,664,974 | 1,738 | Archaea |
Haemophilus influenzae | 1,830,137 | 1,727 | Bacteria |
Escherichia coli | 4,639,221 | 3,574 | Bacteria |
Saccharomyces cerevisiae | 12,067,280 | 5,800 | Eukarya (fungus) |
Arabidopsis thaliana | 115,400,000 | 25,498 | Eukarya (angiosperm) |
Caenorhabditis elegans | 97,000,000 | 19,099 | Eukarya (nematode) |
Drosophila melanogaster | 116,000,000 | 13,601 | Eukarya (insect) |
Homo sapiens | > 2,693,000,000 | ~39,114 | Eukarya (mammal) |
Zea mays | ~5,000,000,000 | ? | Eukarya (angiosperm) |
Pinus resinosa | ~68,000,000,000 | ? | Eukarya (pine tree) |
Amoeba dubia | ~670,000,000,000 | ? | Eukarya (amoebozoan) |
Note that Mycoplasma genitalium relies on its host for many nutrients.
Methanococcus jannaschii and Escherichia coli are probably much more typical prokaryotes.
Saccharomyces cerevisiae is a eukaryote, but its genome contains less than twice as many ORFs as does that of E. coli (about 1.6 times as many). Note, however, that the total genome size is disproportionately large when compared to prokaryotes: the yeast genome is about 2.6 times the size of the E. coli genome.
The Homo sapiens (human) genome is one of the few fully sequenced genomes that was not selected for small genome size. This gives it extra theoretical interest.
The Amoeba dubia genome is far larger than the human genome.
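One way to make these comparisons concrete is to divide genome size by ORF count for the organisms in the table where both numbers are given. The sketch below copies a few rows from the table; the helper name `bp_per_orf` is invented for illustration, and the human figures are approximate in the source (a lower bound on size and an estimated ORF count), so that row is only a rough guide.

```python
# (organism, genome size in bp, ORF count), copied from the table above.
genomes = [
    ("Mycoplasma genitalium", 580_000, 470),
    ("Haemophilus influenzae", 1_830_137, 1_727),
    ("Escherichia coli", 4_639_221, 3_574),
    ("Saccharomyces cerevisiae", 12_067_280, 5_800),
    ("Caenorhabditis elegans", 97_000_000, 19_099),
    ("Homo sapiens", 2_693_000_000, 39_114),  # approximate values in the table
]

def bp_per_orf(size_bp, orfs):
    """Average number of base pairs of genome per annotated ORF."""
    return size_bp / orfs

for name, size_bp, orfs in genomes:
    print(f"{name:26s} {bp_per_orf(size_bp, orfs):>10,.0f} bp per ORF")
```

The prokaryotes come out at roughly one ORF per thousand or so base pairs, while the eukaryotic genomes carry progressively more sequence per ORF.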
Cohen, J.E. 2004. Mathematics is biology's next microscope, only better; biology is mathematics' next physics, only better. PLOS Biology 2:e439.
Science magazine's history of the Human Genome Project: http://www.sciencemag.org/cgi/content/full/291/5507/1195
Schrödinger, E. What is Life? The Physical Aspect of the Living Cell. Cambridge University Press, New York. 1946.