Strategies for sequencing genomes

Systematic approach

Traditional

Develop a map of the genome

First must identify markers

    May be phenotypically expressed alleles

    Alternatively, can be a distinctive sequence signature, ideally unique within the genome

    Also makes use of restriction sites

    Restriction enzymes cut where predictable oligonucleotide sequences are found

    Limitations include sensitivity to methylation and other technical peculiarities
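
As a rough illustration of how restriction sites can serve as markers, the sketch below scans a sequence for a recognition site and reports the fragment lengths a complete digest would give. The example sequence and the helper names (find_sites, fragment_lengths) are invented for illustration; EcoRI (recognition sequence GAATTC, cutting after the first G) is used as the example enzyme, and methylation sensitivity is ignored.

```python
import re

def find_sites(dna, recognition):
    """Return 0-based start positions of every occurrence of the recognition sequence."""
    # A lookahead is used so that overlapping occurrences are also reported.
    return [m.start() for m in re.finditer(f"(?={recognition})", dna.upper())]

def fragment_lengths(dna, recognition, cut_offset):
    """Fragment lengths after cutting cut_offset bases into each recognition site."""
    cuts = [pos + cut_offset for pos in find_sites(dna, recognition)]
    bounds = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

# Hypothetical 28-nucleotide sequence containing two EcoRI sites.
example = "AAGAATTCTTTTGGATCCAAGAATTCGG"
print(find_sites(example, "GAATTC"))           # [2, 20]
print(fragment_lengths(example, "GAATTC", 1))  # [3, 18, 7]
```

Comparing fragment-length patterns from several enzymes is one way such sites can be placed relative to each other.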

Genetic map

    In eukaryotes, based on frequency of crossing over, i.e., recombination

    Not all regions of the eukaryotic genome undergo recombination with equal frequency
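
A minimal sketch of how a genetic distance might be derived from crossover data, assuming the common convention that 1% recombinant offspring corresponds to 1 centimorgan (cM). The function names and the offspring counts are hypothetical, and the simple conversion is only reasonable for closely linked markers, since it ignores double crossovers.

```python
def recombination_frequency(recombinant, total):
    """Fraction of offspring that are recombinant for a pair of markers."""
    if total <= 0:
        raise ValueError("total offspring must be positive")
    return recombinant / total

def map_distance_cm(recombinant, total):
    """Approximate genetic distance in centimorgans (1% recombination = 1 cM)."""
    if total <= 0:
        raise ValueError("total offspring must be positive")
    return 100.0 * recombinant / total

# Hypothetical cross: 120 recombinant offspring out of 1000 scored.
print(map_distance_cm(120, 1000))  # 12.0
```

Because recombination frequency varies along the chromosome, two marker pairs with the same distance in cM can be separated by very different numbers of nucleotides.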

Physical map

    Based on actual number of nucleotides between markers

    Generate overlapping clone libraries using several restriction enzymes

    Typically use large cloning vectors: cosmids, bacterial artificial chromosomes (BACs), or yeast artificial chromosomes (YACs)

    Determine which markers are on which vectors

    By examining overlapping clones, piece together the sequence of markers

    Forms a huge logical puzzle

    Once the physical map is known, select appropriate vectors, subclone and start methodical sequencing
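
The "logical puzzle" of ordering markers from overlapping clones can be sketched as a toy search: find every marker order in which each clone's markers sit next to each other. This is a brute-force illustration only, assuming error-free data and non-empty marker sets; the clone and marker names are hypothetical, and real mapping projects need far more scalable methods.

```python
from itertools import permutations

def clone_is_contiguous(order, markers):
    """True if the clone's markers occupy consecutive positions in the order."""
    positions = sorted(order.index(m) for m in markers)
    return positions[-1] - positions[0] + 1 == len(positions)

def consistent_orders(clones):
    """All marker orders in which every clone covers a contiguous run of markers."""
    all_markers = sorted(set().union(*clones.values()))
    return [
        order
        for order in permutations(all_markers)
        if all(clone_is_contiguous(order, ms) for ms in clones.values())
    ]

# Hypothetical clone library: which markers were detected on which clone.
clones = {
    "cloneA": {"m1", "m2"},
    "cloneB": {"m2", "m3"},
    "cloneC": {"m3", "m4"},
}
orders = consistent_orders(clones)
print(orders)  # the order m1, m2, m3, m4 and its mirror image
```

The two surviving orders are mirror images, which reflects that clone overlaps alone fix the marker order but not the orientation of the map.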

Advantages

    Know where you are at each step of the process

    First step is to develop a map, which is itself useful

    Sequence assembly is relatively straightforward

Disadvantages

    Relatively slow

    Labor intensive

    Requires a relatively large number of skilled technicians

Random approach

Shear DNA into random fragments of a convenient size

    Automated sequencers can get reliable reads of a few hundred to a couple thousand nucleotides

    Therefore need a very large clone library of relatively small fragments

Sequence clones in an arbitrary order

Assemble what sequences you can

    When the total amount of sequence reaches roughly 5-10x the genome size, the fragments can usually be assembled into a single contig

    In eukaryotes, one would expect multiple contigs, with one for each chromosome
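
The 5-10x figure can be made plausible with a simple calculation. The sketch below assumes reads land uniformly at random (a Poisson model), in which case the expected fraction of the genome covered by at least one read is 1 - e^(-c), where c is the coverage. The function names, genome size, and read length are illustrative assumptions; the model ignores repeats and cloning bias.

```python
import math

def coverage(num_reads, read_length, genome_size):
    """Average number of times each nucleotide is sequenced."""
    return num_reads * read_length / genome_size

def expected_fraction_covered(c):
    """Expected fraction of the genome hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-c)

# Hypothetical 2 Mb genome sequenced with 500-nucleotide reads.
genome_size = 2_000_000
read_length = 500
for num_reads in (4_000, 20_000, 40_000):
    c = coverage(num_reads, read_length, genome_size)
    print(f"{c:.0f}x coverage: {expected_fraction_covered(c):.5f} of genome covered")
```

Under this model roughly a third of the genome is still unsequenced at 1x, while at 5x and above only a small fraction of a percent is expected to remain uncovered.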

This works for smaller genomes without a lot of repetitive DNA

    For larger genomes and genomes with repetitive DNA, need longer-distance guideposts

    Make additional clone libraries from larger fragments of sheared DNA

    Could be anywhere from several kb to hundreds of kb

    Sequence the ends of these clones only

    This gives the information that the two end sequences are a given distance apart, and oriented in a known direction with respect to each other
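
A minimal sketch of how the end-sequence information could be used to check an assembly: the two end reads of one clone should face each other and span roughly the clone's insert size. The PlacedRead type, the function name, and all coordinates are hypothetical, and the tolerance is an arbitrary illustrative choice.

```python
from collections import namedtuple

# A read placed on a contig: 0-based leftmost position, length, and strand ("+" or "-").
PlacedRead = namedtuple("PlacedRead", ["start", "length", "strand"])

def mate_pair_consistent(a, b, insert_size, tolerance):
    """True if two end reads from one clone agree with the clone's expected size."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # End reads of one clone point toward each other: left on "+", right on "-".
    if (left.strand, right.strand) != ("+", "-"):
        return False
    span = right.start + right.length - left.start
    return abs(span - insert_size) <= tolerance

# Hypothetical placements for a clone with a 10 kb insert.
read1 = PlacedRead(start=1_000, length=500, strand="+")
read2 = PlacedRead(start=10_500, length=500, strand="-")
print(mate_pair_consistent(read1, read2, insert_size=10_000, tolerance=1_000))  # True
```

A pair that fails this check suggests a misassembly, for example two copies of a repeat that were collapsed into one.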

Other tricks

Advantages

    Fast

    Inherently includes oversampling, which also serves for error correction

    Much of the work can be done by unskilled workers

    Celera corporation is having most of the sequencing itself done by technicians with trade school degrees or the equivalent

    Potentially can collect information about population-level sequence variation at the same time

Disadvantages

    Until large contigs begin to be identified, information is of relatively limited use

    Very computationally difficult

    Assembly of the human genome will require one of the largest supercomputing arrays in the world

    Definitely efficient with small genomes, but may be disproportionately more difficult with larger genomes

    Repetitive sequences can cause considerable confusion

Repetitive DNA

    Repeats can be of different sizes

    Microrepeats (AKA microsatellites) are a few (~2-5) nucleotides that may be repeated thousands of times

    Minirepeats have a larger number of nucleotides (~5-50) in the repetitive pattern

    Large segments of the genome may be repeated

    Ribosomal RNA genes in most eukaryotes are in a "tandem array", a direct repeat of up to several hundred copies.

    In some cases very large regions with many genes may be repeated

    Caused by unequal crossing over, viral integration, and other genetic phenomena

    Direct repeats have the sequences arranged head to tail

    Inverted repeats are arranged head-to-head or tail-to-tail
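
As a small illustration of direct (head-to-tail) repeats of short units, the sketch below uses a regular expression with a backreference to find runs of a 2-5 nucleotide unit. The function name, the default thresholds, and the example sequence are assumptions chosen for illustration; dedicated repeat-finding tools handle imperfect repeats, which this does not.

```python
import re

def find_tandem_repeats(dna, min_unit=2, max_unit=5, min_copies=4):
    """Return (start, unit, copies) for perfect tandem repeats of short units."""
    # Group 1 is the repeat unit (shortest tried first); \1 requires further exact copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(dna.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

# Hypothetical sequence with a (CA)5 microrepeat starting at position 2.
print(find_tandem_repeats("TTCACACACACAGG"))  # [(2, 'CA', 5)]
```

Reads that fall entirely inside such a run carry no information about where in the run they came from, which is one reason repeats confuse assembly.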

EST and cDNA Analysis

Complements genomic sequencing, and permits a relatively rapid and inexpensive survey of protein-coding regions.

Generate cDNA library

cDNAs are cloned fragments of DNA reverse transcribed from mRNA

The EST (Expressed Sequence Tag) approach is to sequence short regions of a large number of randomly selected cDNAs

This can be used to survey the expressed regions of the genome

Sequences have already been processed, so introns are gone

This can be a big advantage: the gene involved in muscular dystrophy spans over 10^6 base pairs of genomic DNA, almost all of it intronic

Can only study expressed sequences

Subject to bias depending upon the developmental stage used to develop the cDNA library