Multiple Sequence Alignment

The Problem

Align several homologous sequences

Analogous to pairwise sequence alignment, but applied to more than two sequences

This is a computationally challenging problem

Consider the calculations involved in the Smith-Waterman algorithm; in principle it would be possible to extend the Smith-Waterman algorithm to an n-dimensional matrix and calculate an exact alignment of multiple sequences. In practice this is not computationally feasible.

What are the applications of multiple sequence alignment

Phylogenetic analysis

A powerful technique in multiple sequence analysis, it requires a multiple sequence alignment

Infer relationships both within and among gene families

Can provide functional information

Distinguish orthologs from paralogs

Gene duplication is a critical phenomenon in genome evolution

Infer organismal relationships

Study the evolution of genes, gene families, and genomes

Identify patterns of conservation

Conserved sites, key functional domains are conserved, less important regions are much more variable

Development of a generic representation of a protein family (a profile)

This can be used for database searching, to quickly add additional sequences to the alignment, and to characterize the protein.

Substitution patterns

Amino Acid

PAM - Used an alignment of proteins with a "known" phylogeny

BLOSUM - Only uses ungapped alignments

Nucleotide

Infer models of nucleotide evoution

Why is multiple seqence alignment difficult?

If sequences are all the same length, alignment is trivial:

KNMITGAAQMDGAILVVAATDGPMPQTREHVLLARQVEVP
KNMITGAAQMDGAILVVSATDGAMPQTKEHILLARQVGVP
KNMITGAAQMDGAILVVSATDGAMPQTKEHILLARQVGVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP
KNMITGAAQMDGAILVVSSADGPMPQTREHILLARQVGVP
KNMITGAAQMDGAILVVSAPDGPMPQTREHILLARQVQVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP
KNMITGAAQMDAAILLVAADSGAEPQTKEHLLLAQRMGIK
KNMITGAAQMDGAILLVAADSGPEPQTREHILLAKQVGVA
KNMITGAAQMDGAILVVAGTDGPMPQTREHILLARQVNVP
KNMVTGAAQMDGAILVVAATDGPMPQTREHILLAHQVGVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP
KNMITGAAQMDGAIIVVAATDGQMPQTREHLLLARQVGVQ
KNMITGAAQMDGGILVVSGPDGPMPQTKEHILLARQVGVP
KNMITGTAPLDGCILVVAANDGPMPQTREHLLLARQIGVE
KNMITGAAQMEGAILVVAATDGPMPQTREHLLLARQVGVP
KNMITGAAQMDGAILVCSAADGPMPQTREHILLARQVGVP
KNMVTGAAQMDGAIIVVAATDGPMPQTREHILLARQVNVP
KNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVP
KNMITGAAQMDGAILVVASTDGPMPQTREHILLSRQVGVP
KNMITGAAQMDGAILVVSAADGPMPQTREHILLSKNVGVP
KNMITGAAQMDGAILVVSATDSVMPQTREHILLARQVGVP

This sequence alignment is unambiguous because there is no length variation among the sequences. No indels are needed to make the alignment, and the ungapped sequences can simply be arranged together.

However, if the sequenes are of various lengths, problem is more complex, potentially very complex:

    RGSALKAVEAPNDPNHEA......YKPIQELLDAMDN.....YIPDPQRDVDKPFL
    RGSALKALEGDAAYIEKVR..........ELMQAVDD.....NIPTPEREIDKPFL
    RGSALKALE.....IEKVR..........ELMQAGDAAYVDDNIPTPEREIDKPFL
    RGSALLALEEMHKNPKTKRGENEWVDKIWELLDAIDE.....YIPTPVRDVDKPFL
    RGSALLALEQMHRNPKTRRGENEWVDKIWELLDAIDE.....YIPTPVRDVDKPFL
    KGSALQALEALQANPKTARGEDKWVDRIWELLDAVDS.....YIPTPERATDKTFL
    RGTARNALESPSKDIN....APEY.KCILELMNAVDE.....YIPTPQRAVDQPFL
    KGSALQALE....NAE....DEEKTKCIWELLQAMDD.....YIPAPERDIDKPFL
    KGSAFGAMS....NPE....DPESTKCVKELLESMDN.....YFDLPERDIDKPFL
    RGSAFAAMS....KPD....DPAATKCLDELLDTMDK.....YFVIPERALDKPFL
    KGSALNALN..........GDPEGEKAIMELMDAVDD.....YIPEPVRDVDKPFL
    RGSALKALE..........GDPEYQDKVMELMNACDE.....YIPLPQRDTDKPFL
    KGSAKLALEGDKGEL....GEGA....ILKLAEALDT.....YIPTPERAVDGAFL
    MGSALCALEGRQPEI....GEQA....IMKLLDAVDE.....YIPTPERDLNKPFL
    RGSALSALQGTNDEI....GRQA....ILKLMDAVDE.....YIPDPVRVLDKPFL
    VGSALCALEGRDPEL....GLKS....VQKLLDAVDT.....YIPVPARDLEKPFL
    FGSALCALEGKQPEI....GEEA....VKQLLEVLDN.....KFVIPERKVNEEPM
    KGSAKLALEGDTGEL....GEVA....IMSLADALDT.....YIPTPERAVDGAFL
    QGSALGALN..........GVEKWEDKVMELMEAVDT.....WIPLPPRDVDKPFL
    RGSALKALE..........GDAEWEAKILELAGFLDS.....YIPEPERAIDKPFL
    QGSALKALE..........GEPEWEAKILELAAALDS.....YIPEPQRDIDKPFL
    KGSALKALE..........GDAEWEAKIFELMDAVDE.....YIPTPERDTEKPFM
    YGSALKALE..........GDPKWEAKIHDLIKAVDE.....WIPTPTREVDKPFL
    QGSALKALE..........GDSKYEDIIMELMNTVDE.....YIPEPERDTEKPLL
    RTSALKALE..........GDPQWVKSVEDLMDAVDE.....YIPDPVRDKDKPFL
    RVSALKALE..........GDAKWVASVEELMNAVDE.....SIPDPVRETDKPFL
    AGSALMALEKMASEPKLIRGKDDWVDCIYSLMDAVDA.....YIPTPERAIDKPFL
    AGSALQAVDAMIAKGTTKQSENEWVDKIHKLMAEVDA.....FIPTPERIIDKPFL
    AGSALKAVEALQANSSIGKGEDEWVDKIHDLVAQVDE.....YIPAPERDIDKPFL
    SGSALKALDFLTENPKTTRGENDWVDKIHALMDEVDA.....YIPTPERDIDKGLL

    Insert gaps to take indels (insertion/deletion events) into account

    One basic strategy is to first align sequences that are most similar to each other

    Once similar sequences have been aligned, align the resulting clusters of sequences against each other

This would be very simple in cases where there are no conficts among the alignments

    However, in many cases the best position to place an indel is ambiguous

Ideally, one would know the phylogeny for the sequences; this would help infer the sequence of indels

    Unfortunately one normally needs a multiple sequence alignment to make inferences about how the sequences are related

    Most alignment algorithms make a quick approximation of phylogeny, and then base alignments on these

    Alternatively, if external information on phylogeny is available, can use this information to guide alignment algorithms

Algorithms

Exact algorithms

Needleman-Wunsch, Smith-Waterman, etc. can be extended to an n-dimensional matrix

Not practical at present, so divide and conquer...

Cluster alignment algorithms

    Clustal W, Clustal X

    1. Calculate all possible pairwise alignments, record the score for each pair
    2. Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining)
    3. Find the two most closely related sequences
    4. Align these sequences (algorithm: Smith-Waterman).
      1. Calculate a consensus of this alignment
      2. Replace the two sequences with the consensus
      3. Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
      4. Iterate until all sequences have been aligned
    5. Expand the consensus sequences with the (gapped) original sequences
    6. Report the multiple sequence alignment

    PILEUP (GCG)

    Pileup is conceptually similar to Clustal, but it uses UPGMA to determine the guide tree, and Needleman-Wunsch for alignments.

Successive alignment algorithms

    MultAlign

Motif-based alignment and related algorithms

    These tools generally perform both database searching and alignment

    Rely on motif libraries

Considerations in preparing alignments

    Amino-acid vs nucleotide alignments

    As with pairwise sequence alignment, amino acid sequences are more data-rich for alignment

    Aligning divergent structural RNA sequences (as with rRNA or intron sequences) can be problematic

    Spacer regions are generally under minimal selection, so they diverge very quickly and are difficult or impossible to align at large phylogenetic differences

The causes of sequence similarity

    Sequence similarity does not guarantee homology

    Possible reasons for sequence similarity

    Phylogeny

    Descent from a common ancestor

    Such sequences are homologous

    Function

    Functionally similar sequences may undergo convergent evolution

    While lengthy sequences are unlikely to be similar because of convergent evolution, sequence motifs may well be

    Physical constraints

    Membrane-spanning domains have to be hydrophobic, etc.

    Repetitive sequences

    The information content of repetitive sequences is low, so similarity may be deceptive

    A fundamental goal of sequence analysis is to distinguish among these influences

Profile Analysis

Hidden Markov Model (HMM) alignment methods

meme, hmmalign

What is a Hidden Markov Model?

A statistical model of a system that can be in different states

The state of the system can change

Prior states held by the system do not affect the probability of future state changes

 


Clustal: Available from the EMBL Bioinformatics outstation (ftp://ftp.ebi.ac.uk//pub/)

 

Bioinformatics Home
Syllabus
Links
Reading