BSCI 348S - Pairwise Sequence Alignment

Pairwise Sequence Alignment

The context for sequence alignment

Homologous sequences

Homology

As the term is normally used today, two sequences are homologous if they are descended by evolution from the same sequence in the genome of a common ancestor

Similarity

Similarity refers to a measurable property, i.e., the fraction of identical nucleotides or chemically similar amino acids in the sequence

Statements about homology are making an evolutionary inference about the sequence.

Unfortunately, confusion about the distiction between similarity and homology has resulted in a large body of literature that uses the terms as essentially synonomous. This mistake is so widespread that it is difficult to criticize an author for using the terms incorrectly, but you will communicate your thoughts more effectively if you understand and use the terms correctly.

Sequence Evolution

Sequences from different organisms (or different loci) resemble each other primarily because they are homologous, and natural selection has conserved features of the sequence.

Convergent evolution is an important biological phenomenon, but in most cases proteins that have converged on similar function have quite different gene sequences.

This is because there are a huge number of possible ways to make a functional protein.

As a consequence, convergent genes are usually easy to recognize (but there is no guarantee this will always be true!)

Consequently the study of sequence conservation is the study of natural selection.

Major forms of mutation are:

Point mutation

Silent substitutions - change nucleotide sequence, but not the amino acid sequence

Nonsilent substitutions - change the amino acid sequence

Insertion and deletion mutations, referred to as "indels"

Measuring similarity

Identity vs. Similarity

Identity refers to an exact match between two nucleotides or amino acids

Similarity refers to a resemblance between two residues that is greater than one would expect at random.

Nucleotide sequences

Simple fraction of identical residues

Genetic distances based on models of DNA sequence evolution

Peptide sequences

Genetic distances based on models of amino acid sequence evolution

Measured "Probability of Accepted Mutation" (PAM) matrices

PAM matrices first calculated in the 1970s

Newer equivalents are BLOSUM matrices

Gene duplication

Complicates matters; one organism may have two or more sequences derived from the same ancestral sequence.

This has resulted in the development of terms that refer to special cases of homology:

Orthology

Two seqences are orthologs if they are descended from the same copy of a duplicated sequence.

Paralogy

Two sequences are paralogs if they are descended from different copies of a duplicated gene.

Gene Families

All of the descendants of a duplicated gene -- even if they had evolved different functions -- are said to be members of a gene family
There are many gene families, many of which can be grouped into superfamilies, etc.

Performing pairwise sequence alignment -- Exact algorithms

Aligment would be trivial except for indels -- insertions and deletions

The computer has to decide where to put indels

To do so, the computer must maximize the number of similar residues in alignment, and insert no more indels than are absolutely necessary

Example: the Needleman-Wunsch algorithm

Described by Needleman & Wunsch, 1970, modified by Gotoh, 1982. Refer also to Durbin et al., 1998.

Aligns two amino-acid sequences completely from end-to-end (a global alignment)

Not very practical for this reason, but illustrates important concepts

Used when an objective and optimal measure is needed to compare two sequences and it is valid to assume that the length of the sequences is equivalent.

Will determine an optimally scoring alignment of two sequences

In some cases there may be more than one equally good alignment

Algorithm

Construct matrix with one index for each sequence.

Fill that matrix recursively with scores representing the best alignment from the beginning to the index point.

Use a PAM or BLOSUM matrix to calculate the score for each pair of aligned amino acids (we will use BLOSUM).

Score a gap penalty for any case where an amino acid must be paired with a gap (i.e., an indel).

The score of an alignment will be the sum of scores for individual amino acid pairs.

Start with a score of zero

The matrix can then be filled in recursively using the maximum of three possibilities:

No new gap must be inserted.

The score of the sub-alignment is the sum of the partial alignment plus the BLOSUM score for the next two amino acids.

A gap must be inserted in sequence i.

The score of the sub-alignment is the sum of the partial alignment plus the gap penalty

A gap must be inserted in sequence j.

The score of the sub-alignment is the sum of the partial alignment plus the gap penalty.

These three possibilites are represented in the matrix by the three adjoining cells above and to the left of each cell.

For each cell in the matrix, fill in the maximum of these three possibilities, and record from which cell the calculation was propagated.

After recording the score for the cell, also record the traceback, i.e., which one of the three alternatives gave the best score.

When the entire matrix is full, follow the traceback (path of best options) from the lower right corner to the upper left to yield the alignment.

The Smith-Waterman Algorithm (the Gotoh implementation of this approach is more widely used)

Finds the highest scoring segment of alignment between two sequences (a local alignment)

Closely related to the Needleman-Wunsch algorithm

Add zero as a fourth possibility to the selection list

In other words, if any cell would be filled with a negative value, enter zero

Fill the alignment matrix as with the Needleman-Wunsch algorithm, and keep track of traceback

When the matrix is completely filled, find the highest score in the matrix and follow the traceback from there

The ends of the aligned region need not correspond to the ends of the alignment

Note that this approach assumes a scoring matrix that can distinguish random similarities from meaningful matches

The general dynamic programming approach can be extended to more complex situations

Repeated matches

Overlap matches

Others

Gibas & Jambeck, Chapter 7, pp. 172-182 (note that the discussion of homology in the text is not particularly good).

Needleman, S.B., and C.D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology 48:443-453.

Gotoh, O. 1982. An improved algorithm for matching biological sequences. Journal of Molecular Biology 162:705-708.

Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. CambridgeUniversity Press, Cambridge.

Bioinformatics Home

Syllabus

Links

Reading