The Problem
Align several homologous sequences
Analogous to pairwise sequence alignment, but applied to more than two sequences
This is a computationally challenging problem
Consider the calculations involved in the Smith-Waterman algorithm; in principle it would be possible to extend the Smith-Waterman algorithm to an n-dimensional matrix and calculate an exact alignment of multiple sequences. In practice this is not computationally feasible.
What are the applications of multiple sequence alignment
Phylogenetic analysis
A powerful technique in multiple sequence analysis, it requires a multiple sequence alignment
Infer relationships both within and among gene families
Can provide functional information
Distinguish orthologs from paralogs
Gene duplication is a critical phenomenon in genome evolution
Infer organismal relationships
Study the evolution of genes, gene families, and genomes
Identify patterns of conservation
Conserved sites, key functional domains are conserved, less important regions are much more variable
Development of a generic representation of a protein family (a profile)
This can be used for database searching, to quickly add additional sequences to the alignment, and to characterize the protein.
Substitution patterns
Amino Acid
PAM - Used an alignment of proteins with a "known" phylogeny
BLOSUM - Only uses ungapped alignments
Nucleotide
Why is multiple seqence alignment difficult?
If sequences are all the same length, alignment is trivial:
KNMITGAAQMDGAILVVAATDGPMPQTREHVLLARQVEVP
KNMITGAAQMDGAILVVSATDGAMPQTKEHILLARQVGVP
KNMITGAAQMDGAILVVSATDGAMPQTKEHILLARQVGVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP
KNMITGAAQMDGAILVVSSADGPMPQTREHILLARQVGVP
KNMITGAAQMDGAILVVSAPDGPMPQTREHILLARQVQVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP
KNMITGAAQMDAAILLVAADSGAEPQTKEHLLLAQRMGIK
KNMITGAAQMDGAILLVAADSGPEPQTREHILLAKQVGVA
KNMITGAAQMDGAILVVAGTDGPMPQTREHILLARQVNVP
KNMVTGAAQMDGAILVVAATDGPMPQTREHILLAHQVGVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP
KNMITGAAQMDGAIIVVAATDGQMPQTREHLLLARQVGVQ
KNMITGAAQMDGGILVVSGPDGPMPQTKEHILLARQVGVP
KNMITGTAPLDGCILVVAANDGPMPQTREHLLLARQIGVE
KNMITGAAQMEGAILVVAATDGPMPQTREHLLLARQVGVP
KNMITGAAQMDGAILVCSAADGPMPQTREHILLARQVGVP
KNMVTGAAQMDGAIIVVAATDGPMPQTREHILLARQVNVP
KNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVP
KNMITGAAQMDGAILVVASTDGPMPQTREHILLSRQVGVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLSKNVGVP
KNMITGAAQMDGAILVVSATDSVMPQTREHILLARQVGVPThis sequence alignment is unambiguous because there is no length variation among the sequences. No indels are needed to make the alignment, and the ungapped sequences can simply be arranged together.
However, if the sequenes are of various lengths, problem is more complex, potentially very complex:
RGSALKAVEAPNDPNHEA......YKPIQELLDAMDN.....YIPDPQRDVDKPFL
RGSALKALEGDAAYIEKVR..........ELMQAVDD.....NIPTPEREIDKPFL
RGSALKALE.....IEKVR..........ELMQAGDAAYVDDNIPTPEREIDKPFL
RGSALLALEEMHKNPKTKRGENEWVDKIWELLDAIDE.....YIPTPVRDVDKPFL
RGSALLALEQMHRNPKTRRGENEWVDKIWELLDAIDE.....YIPTPVRDVDKPFL
KGSALQALEALQANPKTARGEDKWVDRIWELLDAVDS.....YIPTPERATDKTFL
RGTARNALESPSKDIN....APEY.KCILELMNAVDE.....YIPTPQRAVDQPFL
KGSALQALE....NAE....DEEKTKCIWELLQAMDD.....YIPAPERDIDKPFL
KGSAFGAMS....NPE....DPESTKCVKELLESMDN.....YFDLPERDIDKPFL
RGSAFAAMS....KPD....DPAATKCLDELLDTMDK.....YFVIPERALDKPFL
KGSALNALN..........GDPEGEKAIMELMDAVDD.....YIPEPVRDVDKPFL
RGSALKALE..........GDPEYQDKVMELMNACDE.....YIPLPQRDTDKPFL
KGSAKLALEGDKGEL....GEGA....ILKLAEALDT.....YIPTPERAVDGAFL
MGSALCALEGRQPEI....GEQA....IMKLLDAVDE.....YIPTPERDLNKPFL
RGSALSALQGTNDEI....GRQA....ILKLMDAVDE.....YIPDPVRVLDKPFL
VGSALCALEGRDPEL....GLKS....VQKLLDAVDT.....YIPVPARDLEKPFL
FGSALCALEGKQPEI....GEEA....VKQLLEVLDN.....KFVIPERKVNEEPM
KGSAKLALEGDTGEL....GEVA....IMSLADALDT.....YIPTPERAVDGAFL
QGSALGALN..........GVEKWEDKVMELMEAVDT.....WIPLPPRDVDKPFL
RGSALKALE..........GDAEWEAKILELAGFLDS.....YIPEPERAIDKPFL
QGSALKALE..........GEPEWEAKILELAAALDS.....YIPEPQRDIDKPFL
KGSALKALE..........GDAEWEAKIFELMDAVDE.....YIPTPERDTEKPFM
YGSALKALE..........GDPKWEAKIHDLIKAVDE.....WIPTPTREVDKPFL
QGSALKALE..........GDSKYEDIIMELMNTVDE.....YIPEPERDTEKPLL
RTSALKALE..........GDPQWVKSVEDLMDAVDE.....YIPDPVRDKDKPFL
RVSALKALE..........GDAKWVASVEELMNAVDE.....SIPDPVRETDKPFL
AGSALMALEKMASEPKLIRGKDDWVDCIYSLMDAVDA.....YIPTPERAIDKPFL
AGSALQAVDAMIAKGTTKQSENEWVDKIHKLMAEVDA.....FIPTPERIIDKPFL
AGSALKAVEALQANSSIGKGEDEWVDKIHDLVAQVDE.....YIPAPERDIDKPFL
SGSALKALDFLTENPKTTRGENDWVDKIHALMDEVDA.....YIPTPERDIDKGLLInsert gaps to take indels (insertion/deletion events) into account
One basic strategy is to first align sequences that are most similar to each other
Once similar sequences have been aligned, align the resulting clusters of sequences against each other
This would be very simple in cases where there are no conficts among the alignments
However, in many cases the best position to place an indel is ambiguous
Ideally, one would know the phylogeny for the sequences; this would help infer the sequence of indels
Unfortunately one normally needs a multiple sequence alignment to make inferences about how the sequences are related
Most alignment algorithms make a quick approximation of phylogeny, and then base alignments on these
Alternatively, if external information on phylogeny is available, can use this information to guide alignment algorithms
Algorithms
Exact algorithms
Needleman-Wunsch, Smith-Waterman, etc. can be extended to an n-dimensional matrix
Not practical at present, so divide and conquer...
Cluster alignment algorithms
Clustal W, Clustal X
- Calculate all possible pairwise alignments, record the score for each pair
- Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining)
- Find the two most closely related sequences
- Align these sequences (algorithm: Smith-Waterman).
- Calculate a consensus of this alignment
- Replace the two sequences with the consensus
- Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
- Iterate until all sequences have been aligned
- Expand the consensus sequences with the (gapped) original sequences
- Report the multiple sequence alignment
PILEUP
(GCG)Pileup is conceptually similar to Clustal, but it uses UPGMA to determine the guide tree, and Needleman-Wunsch for alignments.
Successive alignment algorithms
MultAlign
Motif-based alignment and related algorithms
These tools generally perform both database searching and alignment
Rely on motif libraries
Considerations in preparing alignments
Amino-acid vs nucleotide alignments
As with pairwise sequence alignment, amino acid sequences are more data-rich for alignment
Aligning divergent structural RNA sequences (as with rRNA or intron sequences) can be problematic
Spacer regions are generally under minimal selection, so they diverge very quickly and are difficult or impossible to align at large phylogenetic differences
The causes of sequence similarity
Sequence similarity does not guarantee homology
Possible reasons for sequence similarity
Phylogeny
Descent from a common ancestor
Such sequences are homologous
Function
Functionally similar sequences may undergo convergent evolution
While lengthy sequences are unlikely to be similar because of convergent evolution, sequence motifs may well be
Physical constraints
Membrane-spanning domains have to be hydrophobic, etc.
Repetitive sequences
The information content of repetitive sequences is low, so similarity may be deceptive
A fundamental goal of sequence analysis is to distinguish among these influences
Profile Analysis
Hidden Markov Model (HMM) alignment methods
meme, hmmalign
What is a Hidden Markov Model?
A statistical model of a system that can be in different states
The state of the system can change
Prior states held by the system do not affect the probability of future state changes
Clustal: Available from the EMBL Bioinformatics outstation (ftp://ftp.ebi.ac.uk//pub/)