Traditional
Develop a map of the genome
First must identify markers
May be phenotypically expressed alleles
Alternatively, can be a distinctive sequence signature, ideally unique within the genome
Also makes use of restriction sites
Restriction enzymes cut where predictable oligonucleotide sequences are found
Limitations include sensitivity to methylation and other technical peculiarities
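The idea that restriction enzymes cut at predictable sequences can be sketched in a few lines of Python. This is only an illustration: the function name `digest`, the `cut_offset` parameter, and the toy sequence are made up for the example, and the sketch ignores methylation, the second strand, and circular molecules. EcoRI (recognition site GAATTC, cutting after the G) is used as the example enzyme.

```python
def digest(sequence, site, cut_offset):
    """Cut a linear DNA sequence at every occurrence of a recognition site.

    cut_offset is the number of bases into the site at which the top
    strand is cut (for EcoRI, G^AATTC, the offset is 1).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # whatever is left after the last cut
    return fragments


# Hypothetical toy sequence containing two EcoRI sites.
toy = "TTGAATTCAAGGGAATTCCC"
fragments = digest(toy, "GAATTC", 1)
print(fragments)
print([len(f) for f in fragments])  # fragment sizes, as read off a gel
```

The fragment lengths are what a restriction map is built from; the fragment sequences themselves are normally not known at this stage.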
Genetic map
In eukaryotes, based on frequency of crossing over, i.e., recombination
Not all regions of the eukaryotic genome undergo recombination with equal frequency
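As a rough sketch of how a crossover frequency becomes a genetic map distance: the simplest convention treats 1% recombination as 1 centimorgan (cM). The Haldane mapping function shown alongside it is not mentioned in these notes; it is included only as one common correction for multiple crossovers, and it assumes crossovers occur independently. The function names and the counts are hypothetical.

```python
import math


def recombination_frequency(recombinants, total_offspring):
    """Fraction of offspring that are recombinant for two markers."""
    return recombinants / total_offspring


def simple_map_distance_cm(r):
    """Naive map distance: 1% recombination = 1 centimorgan."""
    return 100.0 * r


def haldane_map_distance_cm(r):
    """Haldane mapping function (assumes no crossover interference).

    Valid for 0 <= r < 0.5; unlinked markers (r = 0.5) have no finite distance.
    """
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical cross: 100 recombinant offspring out of 1000 scored.
r = recombination_frequency(100, 1000)
print(simple_map_distance_cm(r), haldane_map_distance_cm(r))
```

Because recombination is not uniform along the chromosome, two intervals with the same distance in cM can contain very different numbers of nucleotides.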
Physical map
Based on actual number of nucleotides between markers
Generate overlapping clone libraries using several restriction enzymes
Typically use large cloning vectors: cosmids, bacterial artificial chromosomes (BACs), or yeast artificial chromosomes (YACs)
Determine which markers are on which vectors
By examining overlapping clones, piece together the sequence of markers
Forms a huge logical puzzle
Once the physical map is known, select appropriate vectors, subclone, and start methodical sequencing
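The "logical puzzle" of ordering markers from overlapping clones can be illustrated with a brute-force sketch. It assumes each clone is a contiguous piece of the chromosome, so in the true order every clone's markers must sit next to each other. The function name and the clone and marker names are hypothetical, and trying every permutation is only feasible for a handful of markers; real mapping projects use far more efficient methods and must tolerate errors.

```python
from itertools import permutations


def consistent_orders(clones):
    """Return every marker order in which each clone's markers are contiguous.

    clones maps a clone name to the (non-empty) set of markers found on it.
    Mirror-image orders are reported only once.
    """
    markers = sorted(set().union(*clones.values()))
    orders = []
    for order in permutations(markers):
        position = {marker: i for i, marker in enumerate(order)}
        ok = True
        for content in clones.values():
            idx = [position[m] for m in content]
            # Contiguous means the span of positions equals the marker count.
            if max(idx) - min(idx) + 1 != len(idx):
                ok = False
                break
        if ok and order <= order[::-1]:  # keep one of each mirror pair
            orders.append(order)
    return orders


# Hypothetical clone library: which markers were detected on which clone.
clones = {
    "clone1": {"A", "B"},
    "clone2": {"B", "C"},
    "clone3": {"C", "D"},
}
print(consistent_orders(clones))
```

If more than one order is returned, the library does not contain enough overlaps to resolve the map, and more clones or markers are needed.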
Advantages
Know where you are at each step of the process
First step is to develop a map, which is itself useful
Sequence assembly is relatively straightforward
Disadvantages
Relatively slow
Labor intensive
Requires a relatively large number of skilled technicians
Random approach
Shear DNA into random fragments of a convenient size
Automated sequencers can get reliable reads of a few hundred to a couple thousand nucleotides
Therefore need a very large clone library of relatively small fragments
Sequence clones in an arbitrary order
Assemble what sequences you can
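A minimal sketch of what "assemble what sequences you can" means: repeatedly merge the pair of reads with the longest suffix-to-prefix overlap. This greedy strategy is one simple approach chosen for illustration, not necessarily what any production assembler does; it assumes error-free reads that all come from the same strand, and the function names, the `min_len` threshold, and the reads are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads by best overlap until no overlaps remain; return contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return reads  # nothing left to merge
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]


# Hypothetical reads sampled from the toy sequence ATGGCGTACGTTAGC.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))
```

Each string returned is a contig. With too few reads, or with repeats longer than the overlaps, the result stays fragmented or is merged incorrectly.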
When the total amount of sequence reaches roughly 5-10x the genome size, all of the fragments can, in principle, be assembled into a single contig
In eukaryotes, one would expect multiple contigs, with one for each chromosome
This works for smaller genomes without a lot of repetitive DNA
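Why several-fold oversampling is needed can be seen from an idealized Poisson model of random sequencing (often associated with Lander and Waterman). The sketch below assumes reads land uniformly at random, ignores the minimum overlap needed to join two reads, and ignores repeats, so it is a best case rather than a prediction. The function names and the genome and read sizes are hypothetical.

```python
import math


def coverage(num_reads, read_len, genome_size):
    """Average number of times each base is sequenced (the 'x' in 5-10x)."""
    return num_reads * read_len / genome_size


def fraction_unsequenced(c):
    """Expected fraction of the genome not covered by any read."""
    return math.exp(-c)


def expected_contigs(num_reads, c):
    """Expected number of contigs, ignoring minimum-overlap requirements."""
    return num_reads * math.exp(-c)


# Hypothetical 2 Mb genome sequenced with 500-nucleotide reads.
genome_size = 2_000_000
read_len = 500
for c in (1, 2, 5, 10):
    n = c * genome_size // read_len
    print(c, n, fraction_unsequenced(c), expected_contigs(n, c))
```

In this model the number of gaps falls off exponentially with coverage, which is why a few-fold increase in sequencing effort makes the difference between hundreds of contigs and a handful.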
For larger genomes and genomes with repetitive DNA, need longer-distance guideposts
Make additional clone libraries from larger fragments of sheared DNA
Could be anywhere from several kb to hundreds of kb
Sequence the ends of these clones only
This gives the information that the two end sequences are a given distance apart, and oriented in a known direction with respect to each other
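How a pair of end sequences acts as a long-distance guidepost can be sketched as simple arithmetic: if the two ends of one clone fall in different contigs, the known clone size constrains the gap between those contigs. The function name, the coordinate convention, and the numbers are assumptions for illustration; it is also assumed that the clone's forward end lies in contig A and its reverse end in contig B, with both contigs in the same orientation.

```python
from statistics import mean


def estimate_gap(contig_a_len, fwd_start_in_a, rev_end_in_b, insert_size):
    """Estimate the gap between the end of contig A and the start of contig B.

    fwd_start_in_a: 0-based position in contig A where the clone begins.
    rev_end_in_b:   end-exclusive position in contig B where the clone ends.
    insert_size:    approximate length of the cloned fragment.
    A negative result suggests the two contigs actually overlap.
    """
    covered_in_a = contig_a_len - fwd_start_in_a
    covered_in_b = rev_end_in_b
    return insert_size - covered_in_a - covered_in_b


# Hypothetical end-sequence pairs linking the same two contigs:
# (start in A, end in B, insert size); contig A is 10,000 nucleotides long.
pairs = [(8_000, 1_500, 5_000), (9_000, 2_400, 5_000), (7_500, 900, 5_000)]
gaps = [estimate_gap(10_000, a, b, size) for a, b, size in pairs]
print(gaps, mean(gaps))
```

Insert sizes are only approximately known, so several independent pairs spanning the same gap are averaged to get a usable estimate.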
Other tricks
Advantages
Fast
Inherently includes oversampling, which is also used for error correction
Much of the work can be done by unskilled workers
Celera Corporation is having most of the sequencing itself done by technicians with trade school degrees or the equivalent
Potentially can collect information about population-level sequence variation at the same time
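The use of oversampling for error correction can be sketched as a majority vote at each position. The sketch assumes the reads have already been aligned to common coordinates, with "-" marking positions a read does not cover; the function name and the reads are hypothetical, and real base callers also weigh per-base quality scores.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over reads aligned to the same coordinates.

    '-' marks positions a read does not cover; 'N' is emitted where no
    read covers a position at all.
    """
    length = max(len(r) for r in aligned_reads)
    result = []
    for i in range(length):
        column = [r[i] for r in aligned_reads if i < len(r) and r[i] != "-"]
        if column:
            result.append(Counter(column).most_common(1)[0][0])
        else:
            result.append("N")
    return "".join(result)


# Hypothetical alignment; the second read has a miscalled base at position 6.
aligned = [
    "ACGTACGT--",
    "--GTACCTAG",
    "ACGTACGTAG",
    "-CGTACGTA-",
]
print(consensus(aligned))
```

A position where the vote is consistently split across many reads may reflect real variation between the individuals sampled rather than sequencing error.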
Disadvantages
Until large contigs begin to be identified, information is of relatively limited use
Very computationally difficult
Assembly of the human genome will require one of the largest supercomputing arrays in the world
Definitely efficient with small genomes, but may be proportionately more difficult with larger genomes
Repetitive sequences can cause considerable confusion
Repetitive DNA
Repeats can be of different sizes
Microrepeats (AKA microsatellites) are a few (~2-5) nucleotides that may be repeated thousands of times
Minirepeats have a larger number of nucleotides (~5-50) in the repetitive pattern
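Short tandem repeats of this kind are easy to spot computationally. The sketch below uses a regular expression with a backreference to find a short unit followed by further exact copies of itself. The function name, the default thresholds, and the toy sequence are hypothetical, and imperfect repeats (with substitutions or indels) are not detected.

```python
import re


def find_tandem_repeats(sequence, min_unit=2, max_unit=5, min_copies=3):
    """Find exact tandem repeats of short units.

    Returns (start, unit, copies) tuples, where start is the 0-based
    position of the first copy.
    """
    # A short unit (shortest tried first) followed by more copies of itself.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Hypothetical sequence with a dinucleotide and a trinucleotide repeat.
toy = "GGTCACACACACATTAGCAGCAGCAGTT"
print(find_tandem_repeats(toy))
```

Raising `max_unit` toward 50 would extend the same search to minirepeats, at the cost of more spurious matches in real sequence.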
Large segments of the genome may be repeated
Ribosomal RNA genes in most eukaryotes are in a "tandem array", a direct repeat of up to several hundred copies.
In some cases very large regions with many genes may be repeated
Caused by unequal crossing over, viral integration, and other genetic phenomena
Direct repeats have the sequences arranged head to tail
Inverted repeats are arranged head-to-head or tail-to-tail
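The difference between direct and inverted repeats can be made concrete in code: read along one strand, a direct repeat shows the same sequence twice, while an inverted repeat shows the sequence followed later by its reverse complement. The function names and the example sequences are hypothetical, and only exact copies are considered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_repeat(sequence, unit):
    """Report how a second copy of unit occurs downstream of the first.

    Returns a list containing 'direct' and/or 'inverted', or an empty
    list if the unit is absent or has no second copy.
    """
    first = sequence.find(unit)
    if first == -1:
        return []
    rest = sequence[first + len(unit):]
    kinds = []
    if unit in rest:
        kinds.append("direct")
    if reverse_complement(unit) in rest:
        kinds.append("inverted")
    return kinds


# Hypothetical examples built around the unit GATTC.
print(classify_repeat("GATTCAAAAGATTC", "GATTC"))  # same orientation
print(classify_repeat("GATTCAAAAGAATC", "GATTC"))  # opposite orientation
```

A unit that equals its own reverse complement (a palindromic site such as GAATTC) is reported as both, since the two orientations are indistinguishable.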
cDNA sequencing complements genomic sequencing, and permits a relatively rapid and inexpensive survey of protein-coding regions.
Generate cDNA library
cDNAs are cloned fragments of DNA reverse transcribed from mRNA
The EST (Expressed Sequence Tag) approach is to sequence short regions of a large number of randomly selected cDNAs
This can be used to survey the expressed regions of the genome
Sequences have already been processed, so introns are gone
This can be a big advantage: the gene involved in muscular dystrophy occupies over 10^6 base pairs of genomic DNA
Can only study expressed sequences
Subject to bias depending upon the developmental stage used to develop the cDNA library
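The advantage of working with processed sequences can be illustrated by joining exons and comparing the result with the genomic length. The toy gene, its exon coordinates, and the function names are invented for the example; in practice exon boundaries are not known in advance, which is part of why cDNA sequence is so useful.

```python
def splice(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) into a processed sequence."""
    return "".join(genomic[start:end] for start, end in exons)


def intron_fraction(gene_length, exons):
    """Fraction of the gene's genomic span that is removed by splicing."""
    exon_bases = sum(end - start for start, end in exons)
    return 1 - exon_bases / gene_length


# Hypothetical toy gene: two 6-nucleotide exons flanking a 12-nucleotide intron.
toy_gene = "ATGGCC" + "GTAAGTTTTTAG" + "GATTAA"
toy_exons = [(0, 6), (18, 24)]
print(splice(toy_gene, toy_exons))
print(intron_fraction(len(toy_gene), toy_exons))
```

For a gene spanning more than 10^6 base pairs, the intron fraction is far higher than in this toy case, so the cDNA is a small fraction of what genomic sequencing would have to cover.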