Introduction to Comparative Bioinformatics

Comparative Bioinformatics - BSCI 348s - Introduction

What is bioinformatics?

The application of information technology to biological data, particularly DNA sequence and genomic data. This class will emphasize bioinformatic approaches that use the comparative method.

Comparative bioinformatics emphasizes information that can be gleaned by comparing genomes against each other and taking advantage of evolutionary information.

Functional bioinformatics uses structural and expression information to make inferences about the function of components of a given genome.

These fields are complementary, and to some extent overlap, so don't take this distinction too seriously.

Bioinformatics is a rapidly moving field

Moore's Law (see Intel's site for a PR take on Moore's Law, or Wikipedia for a more balanced view) is often cited as objective evidence that information technology is changing at an extremely rapid pace. In 1965 Gordon Moore (then the director of research at Fairchild Semiconductor and later one of the founders of Intel Corporation) noted that the number of components (transistors) that would fit on a single chip was doubling roughly every year; in 1975 he revised the estimate to a doubling roughly every two years.

Molecular biology is also undergoing extremely rapid development. In recent years the rate at which the GenBank database has grown exceeds the pace set by Moore's Law.

Because bioinformatics functions at the boundary between information technology and molecular biology, it is undergoing remarkably rapid change. It is not clear how long this rate of change will be maintained, but it is clear that biological data now require computer analysis.

The collection of genomic information has created a pressing need for bioinformatics.

Part of what has fueled the explosive growth of GenBank has been the development of rapid DNA sequencing technologies and the efforts to determine complete genome sequences.

Although it has much deeper roots, the effort to sequence the human genome developed in the mid-1980s. The Human Genome Initiative became a joint project of the DOE and NIH by 1990. This effort was not limited to determining the DNA sequence of the human genome; it also included the development of the necessary technologies and the determination of genome sequences from other representative organisms. It is important to know the sequences of several organisms' genomes so that they can be compared.

The first article describing the complete genome of a free-living organism was published in July 1995, and presented the genome of Haemophilus influenzae, a bacterium. Five years later, in the year 2000, roughly 12 genome sequences were published, and in 2002 nearly 40 genome sequences were published. By 2003 over 150 genome sequences were publicly available, and in 2006 there were 322 bacterial, archaeal, and viral genomes listed on the TIGR web site alone. Progress on eukaryotic genomes has been slower, mostly because of the large size of most eukaryotic genomes, but as of 2006 there are genome-scale data from dozens of eukaryotes, with nearly complete genomes available from at least 25 (see JGI, NCBI).

Genomes vary greatly in size, so simply counting the number of genomes published gives a very rough sense of the scale of data involved, but a huge amount of information has become available in a very short time.

Complete genomes consist of so much sequence data that it would be impractical for a person to analyze the data manually.

Bioinformatic analyses act like a microscope, permitting the scientist to see the information that is present in the genome. According to Joel E. Cohen, "Mathematics is biology's next microscope, only better; biology is mathematics' next physics, only better" (PLOS Biology 2:e439).

Why determine complete genomes?

- Completeness: a complete genome provides a list of all the parts and mechanisms needed to make an organism.
  - Caveats:
    - It may not be possible to read that information with current knowledge.
    - An organism is not just a list of parts and procedures; each cell is in a particular state at any given time, and the genome is expressed in the context of that preexisting state.
- Easy access to intergenic and other noncoding regions.
- Information about what is not present in the genome.
  - Without a complete genome, it can be very difficult to be certain that a given protein is not present.
- A bulk of data that are amenable to statistical treatment.
  - For example, base compositional information can be used to help recognize genuinely protein-coding ORFs.
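
As a small illustration of the kind of statistic mentioned in the last point, the following Python sketch computes the base composition and GC content of a DNA sequence. The function names and the example sequence are made up for illustration; the sketch only measures composition and does not by itself decide whether an ORF is protein-coding.

```python
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T in a DNA sequence.

    Characters other than A, C, G and T (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(seq):
    """Return the combined fraction of G and C in a DNA sequence."""
    composition = base_composition(seq)
    return composition["G"] + composition["C"]


# Hypothetical example sequence, not taken from any real genome.
example = "ATGGCGCGCTAA"
print(base_composition(example))
print(gc_content(example))
```

In practice one might compare the composition of a candidate ORF with that of known genes from the same genome, but the choice of comparison is left open here.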

Conventions

Draw DNA sequences with the 5' end on the left, 3' end on the right (5'NNNNNN3')

Double-stranded DNA is drawn with the forward strand (5'->3') on the top and the reverse strand (3'->5', read left to right) on the bottom.

Amino acid sequences are drawn with the amino terminus on the left and the carboxy terminus on the right (amino.....carboxy)
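
The DNA conventions above can be sketched in a few lines of Python. This is only an illustration: the function names and the `5'-...-3'` output layout are choices made here, not a standard format. The forward strand is written 5' to 3' on the top line, with the complementary strand written beneath it, 3' to 5'.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Return the complementary strand, read in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the reverse strand written 5' to 3'."""
    return complement(seq)[::-1]


def draw_duplex(forward):
    """Draw double-stranded DNA with the forward strand on top."""
    top = "5'-" + forward + "-3'"
    bottom = "3'-" + complement(forward) + "-5'"
    return top + "\n" + bottom


print(draw_duplex("ATGC"))
print(reverse_complement("ATGC"))
```

Note that the bottom line of the drawing is the complement, not the reverse complement: the reverse strand only reads 5' to 3' when it is written on its own.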

Genomes vary in size and complexity, and some are very large

Organism	Genome Size (bp)	ORFs	Classification
Mycoplasma genitalium	580,000	470	Bacteria (but not free living)
Methanococcus jannaschii	1,664,974	1,738	Archaea
Haemophilus influenzae	1,830,137	1,727	Bacteria
Escherichia coli	4,639,221	3,574	Bacteria
Saccharomyces cerevisiae	12,067,280	5,800	Eukarya (fungus)
Arabidopsis thaliana	115,400,000	25,498	Eukarya (angiosperm)
Caenorhabditis elegans	97,000,000	19,099	Eukarya (nematode)
Drosophila melanogaster	116,000,000	13,601	Eukarya (insect)
Homo sapiens	> 2,693,000,000	~39,114	Eukarya (mammal)
Zea mays	~5,000,000,000	?	Eukarya (angiosperm)
Pinus resinosa	~68,000,000,000	?	Eukarya (pine tree)
Amoeba dubia	~670,000,000,000	?	Eukarya (amoebozoan)

Note that Mycoplasma genitalium relies on its host for many nutrients.

Methanococcus jannaschii and Escherichia coli are probably much more typical prokaryotes.

Saccharomyces cerevisiae is a eukaryote, but its genome contains fewer than twice as many ORFs as does that of E. coli. Note, however, that its total genome size is disproportionately large by prokaryotic standards: it is about 2.6 times the size of the E. coli genome.

The Homo sapiens (human) genome is one of the few fully sequenced genomes that was not selected for its small size. This gives it extra theoretical interest.

The Amoeba dubia genome is far larger than the human genome.
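
The comparison of ORF counts with genome sizes can be made concrete by dividing genome size by the number of ORFs. The Python sketch below does this for the rows of the table that list an ORF count (the human row is left out because its values are approximate). The dictionary and the function name are introduced here for illustration only, and base pairs per ORF is just a rough indicator of how densely a genome is packed with ORFs.

```python
# (genome size in bp, number of ORFs), copied from the table above.
genomes = {
    "Mycoplasma genitalium": (580_000, 470),
    "Methanococcus jannaschii": (1_664_974, 1_738),
    "Haemophilus influenzae": (1_830_137, 1_727),
    "Escherichia coli": (4_639_221, 3_574),
    "Saccharomyces cerevisiae": (12_067_280, 5_800),
    "Caenorhabditis elegans": (97_000_000, 19_099),
    "Arabidopsis thaliana": (115_400_000, 25_498),
    "Drosophila melanogaster": (116_000_000, 13_601),
}


def bp_per_orf(size_bp, orfs):
    """Average number of base pairs of genome per annotated ORF."""
    return size_bp / orfs


for name, (size_bp, orfs) in genomes.items():
    print(f"{name:26s} {bp_per_orf(size_bp, orfs):10.0f} bp per ORF")
```

On these numbers E. coli has roughly 1,300 bp of genome per ORF and S. cerevisiae roughly 2,100 bp, which is one way to see that the yeast genome is larger than its ORF count alone would suggest.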

Base composition: The relative abundance of each of the four nucleotides in a sequence.
Gene: Formally, the fundamental unit of heredity. The concept of gene predated knowledge of DNA.
ORF: Open Reading Frame. A region of DNA that starts with a start codon and ends with a stop codon. In some cases the term implies that the distance from start to stop codon is longer than would be expected at random.
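
The ORF definition above can be turned into a minimal Python sketch. It rests on several assumptions that are not part of the definition: the standard genetic code (ATG as the only start codon; TAA, TAG and TGA as stop codons), the forward strand only, and a hypothetical `min_codons` parameter standing in for the idea that an ORF should be longer than expected at random. It is not a gene finder.

```python
START = "ATG"                  # assumed start codon (standard genetic code)
STOPS = {"TAA", "TAG", "TGA"}  # assumed stop codons (standard genetic code)


def find_orfs(seq, min_codons=0):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start and end are 0-based positions with end exclusive, so
    seq[start:end] runs from the start codon through the stop codon.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i              # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                end = i + 3            # stop codon closes the ORF
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None
    return orfs


# Made-up example: one short ORF in frame 0.
print(find_orfs("ATGAAATAG"))
```

To scan the reverse strand as well, the same function could be applied to the reverse complement of the sequence, with the coordinates mapped back accordingly.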

Cohen, J.E. 2004. Mathematics is biology's next microscope, only better; biology is mathematics' next physics, only better. PLOS Biology 2:e439.

Science magazine's history of the Human Genome Project: http://www.sciencemag.org/cgi/content/full/291/5507/1195

Schrödinger, E. What is Life? The Physical Aspect of the Living Cell. Cambridge University Press, New York. 1946.