Strategies for Genome Sequencing

Systematic approach

Traditional

Develop a map of the genome

First must identify markers

May be phenotypically expressed alleles

Alternatively, can be a distinctive sequence signature, ideally unique within the genome

Also makes use of restriction sites

Restriction enzymes cut where predictable oligonucleotide sequences are found

Limitations include sensitivity to methylation and other technical peculiarities
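The idea that a restriction enzyme cuts wherever its recognition sequence occurs can be illustrated with a minimal sketch. The function name digest, the example sequence, and the single-strand treatment are assumptions made for illustration; EcoRI (recognition site GAATTC, cutting after the first G) is used as the example enzyme, and methylation sensitivity is ignored.

```python
import re

def digest(sequence, site, cut_offset):
    """Cut a linear sequence at every occurrence of `site`.

    cut_offset is the number of bases into the site at which the cut falls
    (1 for EcoRI, which cuts G^AATTC on the top strand).
    Hypothetical helper: only the top strand is modeled.
    """
    # A lookahead pattern lets overlapping sites be found as well.
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]

# Made-up sequence containing two EcoRI sites.
example = "AAGAATTCCCGGGAATTCTT"
fragments = digest(example, "GAATTC", 1)
fragment_lengths = [len(f) for f in fragments]
```

The list of fragment lengths is the kind of information a restriction map is built from: the same DNA cut with different enzymes gives different, overlapping sets of fragments.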

Genetic map

In eukaryotes, based on frequency of crossing over, i.e., recombination

Not all regions of the eukaryotic genome undergo recombination with equal frequency
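As a rough sketch of how a genetic distance follows from crossover frequency, the snippet below converts recombinant offspring counts into centimorgans (1 cM corresponds to 1% recombination). The function names and the counts are hypothetical, and the simple proportionality is only a reasonable approximation for closely linked markers.

```python
def recombination_frequency(recombinant, total):
    """Fraction of offspring that are recombinant for a pair of markers."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinant / total

def map_distance_cm(recombinant, total):
    """Approximate map distance in centimorgans (1 cM = 1% recombination).

    Assumes the markers are close enough that double crossovers are rare.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * recombinant / total

# Hypothetical cross: 120 recombinant offspring out of 1000 scored.
frequency = recombination_frequency(120, 1000)
distance = map_distance_cm(120, 1000)
```

Because recombination rates vary along the chromosome, two marker pairs the same number of centimorgans apart can be separated by very different numbers of nucleotides.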

Physical map

Based on actual number of nucleotides between markers

Generate overlapping clone libraries using several restriction enzymes

Typically use large cloning vectors: cosmids, bacterial artificial chromosomes (BACs), or yeast artificial chromosomes (YACs)

Determine which markers are on which vectors

By examining overlapping clones, piece together the sequence of markers

Forms a huge logical puzzle
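The puzzle can be sketched on a toy scale: each clone carries a contiguous stretch of the genome, so a valid marker order must place the markers of every clone next to one another. The brute-force search below is an illustrative assumption, not how real mapping projects work (it is only feasible for a handful of markers), and the clone names and marker sets are made up.

```python
from itertools import permutations

def consistent_marker_orders(clones):
    """Return marker orders in which every clone's markers are contiguous.

    `clones` maps a clone name to a non-empty set of markers found on it.
    Mirror-image orders are reported once, since a map has no inherent direction.
    """
    markers = sorted(set().union(*clones.values()))
    orders = []
    for order in permutations(markers):
        if order > order[::-1]:
            continue  # skip the mirror image of an order already considered
        position = {marker: i for i, marker in enumerate(order)}
        if all(
            max(position[m] for m in found) - min(position[m] for m in found)
            == len(found) - 1
            for found in clones.values()
        ):
            orders.append(order)
    return orders

# Hypothetical marker content of four overlapping clones.
clones = {
    "clone1": {"A", "B"},
    "clone2": {"B", "C"},
    "clone3": {"C", "D"},
    "clone4": {"D", "E"},
}
orders = consistent_marker_orders(clones)
```

With too few overlapping clones, more than one order survives, which is why libraries are made with several restriction enzymes.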

Once the physical map is known, select appropriate vectors, subclone, and start methodical sequencing

Advantages

Know where you are at each step of the process

First step is to develop a map, which is itself useful

Sequence assembly is relatively straightforward

Disadvantages

Relatively slow

Labor intensive

Requires a relatively large number of skilled technicians

Random approach

Shear DNA into random fragments of a convenient size

Automated sequencers can get reliable reads of a few hundred to a couple thousand nucleotides

Therefore need a very large clone library of relatively small fragments

Sequence clones in an arbitrary order

Assemble what sequences you can

Once the total amount of sequence reaches roughly 5-10x the genome size, the fragments can in principle be assembled into a single contig

In eukaryotes, one would expect multiple contigs, with one for each chromosome

This works for smaller genomes without a lot of repetitive DNA
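The 5-10x figure can be motivated with a back-of-the-envelope calculation in the spirit of the Lander-Waterman analysis. The sketch assumes reads land uniformly at random and ignores repeats and the minimum overlap needed to join two reads; the genome size and read length are hypothetical.

```python
import math

def coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size

def expected_uncovered_fraction(c):
    """Chance that a given base lies in no read, under a Poisson model."""
    return math.exp(-c)

def expected_gaps(num_reads, c):
    """Approximate number of gaps: reads whose far end no other read overlaps."""
    return num_reads * math.exp(-c)

# Hypothetical 2 Mb genome sequenced with 500-nucleotide reads.
genome_size = 2_000_000
read_length = 500
for target in (1, 5, 10):
    num_reads = target * genome_size // read_length
    c = coverage(num_reads, read_length, genome_size)
    print(f"{c:.0f}x: uncovered fraction {expected_uncovered_fraction(c):.5f}, "
          f"expected gaps {expected_gaps(num_reads, c):.1f}")
```

Under these idealized assumptions the expected number of gaps falls off exponentially with coverage, which is why only a handful remain by about 10x for a genome of this size.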

For larger genomes and genomes with repetitive DNA, need longer-distance guideposts

Make additional clone libraries from larger fragments of sheared DNA

Could be anywhere from several kb to hundreds of kb

Sequence the ends of these clones only

This gives the information that the two end sequences are a given distance apart, and oriented in a known direction with respect to each other
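A minimal sketch of how such a constraint might be checked against a draft assembly: the two end reads of one clone should be placed on opposite strands, pointing toward each other, and roughly one insert length apart. The Placement record, the function name, and the numbers are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    start: int   # leftmost coordinate of the read on the contig
    end: int     # rightmost coordinate (exclusive)
    strand: str  # "+" or "-"

def pair_is_consistent(a, b, insert_size, tolerance):
    """Check that two end reads fit the expected distance and orientation."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    facing_inward = left.strand == "+" and right.strand == "-"
    span = right.end - left.start  # distance covered by the whole clone insert
    return facing_inward and abs(span - insert_size) <= tolerance

# Hypothetical 10 kb clone whose ends were placed on one contig.
good = pair_is_consistent(
    Placement(1000, 1500, "+"), Placement(10500, 11000, "-"),
    insert_size=10_000, tolerance=1_000,
)
# Ends placed far too close together, as might happen if a repeat was collapsed.
bad = pair_is_consistent(
    Placement(1000, 1500, "+"), Placement(3000, 3500, "-"),
    insert_size=10_000, tolerance=1_000,
)
```

Pairs that violate the constraint flag a likely misassembly, and pairs whose ends fall in different contigs indicate how those contigs are ordered and oriented.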

Other tricks

Advantages

Fast

Inherently includes oversampling, which is also used for error correction

Much of the work can be done by unskilled workers

Celera corporation is having most of the sequencing itself done by technicians with trade school degrees or the equivalent

Potentially can collect information about population-level sequence variation at the same time
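The error-correction benefit of oversampling listed above can be sketched as a majority vote at each position. The snippet assumes the reads have already been aligned and padded to equal length with "-" where a read gives no coverage; the reads are made up, and real assemblers also weigh base quality scores.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over equal-length, pre-aligned reads.

    '-' marks positions a read does not cover; uncovered columns become 'N'.
    Ties are broken arbitrarily in this simplified sketch.
    """
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)

# Three hypothetical overlapping reads; the second has one miscalled base.
reads = [
    "ACGTAC--",
    "ACGAACGT",
    "--GTACGT",
]
called = consensus(reads)
```

A position where the reads split consistently rather than showing one stray base may reflect real variation between the individuals sampled instead of a sequencing error.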

Disadvantages

Until large contigs begin to be identified, information is of relatively limited use

Very computationally difficult

Assembly of the human genome will require one of the largest supercomputing arrays in the world

Definitely efficient with small genomes, but may be proportionately more difficult with larger genomes

Repetitive sequences can cause considerable confusion

Repetitive DNA

Repeats can be of different sizes

Microrepeats (AKA microsatellites) are a few (~2-5) nucleotides that may be repeated thousands of times

Minirepeats have a larger number of nucleotides (~5-50) in the repetitive pattern
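Short tandem repeats of this kind are easy to spot with a regular expression. The sketch below is an illustration only: the function name, the thresholds, and the test sequence are assumptions, and it finds perfect repeats only, whereas real repeats often contain mismatches.

```python
import re

def find_microsatellites(sequence, min_unit=2, max_unit=5, min_repeats=4):
    """Find perfect tandem repeats of a short unit.

    Returns (start, unit, copies) tuples. The unit is matched lazily so the
    shortest repeating unit is reported (e.g. "CA" rather than "CACA").
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(sequence)
    ]

# Made-up sequence with six copies of the dinucleotide CA.
example = "TTG" + "CA" * 6 + "GGT"
hits = find_microsatellites(example)
```

Raising max_unit toward 50 would make the same approach pick up minirepeat-sized units, at the cost of slower matching.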

Large segments of the genome may be repeated

Ribosomal RNA genes in most eukaryotes are in a "tandem array", a direct repeat of up to several hundred copies.

In some cases very large regions with many genes may be repeated

Caused by unequal crossing over, viral integration, and other genetic phenomena

Direct repeats have the sequences arranged head to tail

Inverted repeats are arranged head-to-head or tail-to-tail
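The difference between the two arrangements can be made concrete: read along one strand, a direct repeat shows the same sequence twice, while an inverted repeat shows the sequence and then its reverse complement. The helper names and sequences below are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def classify_repeat(first, second):
    """Classify two segments read from the same strand.

    A segment that equals its own reverse complement is reported as direct.
    """
    if first == second:
        return "direct"
    if reverse_complement(first) == second:
        return "inverted"
    return "unrelated"

direct = classify_repeat("GATTC", "GATTC")
inverted = classify_repeat("GATTC", "GAATC")
```

This is also why an assembler has to compare each read against both the forward and reverse-complemented forms of other reads.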

EST and cDNA Analysis

Complements genomic sequencing, and permits a relatively rapid and inexpensive survey of protein-coding regions.

Generate cDNA library

cDNAs are cloned fragments of DNA reverse transcribed from mRNA

The EST (Expressed Sequence Tag) approach is to sequence short regions of a large number of randomly selected cDNAs

This can be used to survey the expressed regions of the genome

Sequences have already been processed, so introns are gone

This can be a big advantage: the gene involved in muscular dystrophy spans over 10⁶ base pairs of genomic DNA, most of which is intron sequence

Can only study expressed sequences

Subject to bias depending upon the developmental stage used to develop the cDNA library
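The point about introns can be sketched with a toy gene: a cDNA corresponds to the exons joined together, so its length reflects only the processed message. The sequence, the exon coordinates, and the function name are made up for illustration, with the intron written in lowercase.

```python
def splice(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    return "".join(genomic[start:end] for start, end in sorted(exons))

# Toy gene: two 6-nucleotide exons separated by a 12-nucleotide intron.
genomic = "ATGGCC" + "gtaagtttttag" + "GATTAA"
exons = [(0, 6), (18, 24)]

message = splice(genomic, exons)
intron_fraction = 1 - len(message) / len(genomic)
```

For a gene that is mostly intron, the intron fraction approaches 1, so sequencing the cDNA reaches the protein-coding information with far less sequencing than the genomic region would require.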