Models of Amino Acid Substitution

The mutational process

Random processes alter the DNA sequence that encodes a protein

Silent vs. nonsilent substitutions

Some amino acid substitutions require more than one nucleotide substitution
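The two points above can be checked directly against the standard genetic code. The following is a minimal sketch, not part of the original notes: the helper names `is_silent` and `min_nucleotide_changes` are made up for illustration, and the codon table is the standard nuclear code written in the usual TCAG order.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def is_silent(codon_a, codon_b):
    """True if both codons encode the same amino acid (hypothetical helper)."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


def min_nucleotide_changes(aa_a, aa_b):
    """Smallest number of nucleotide differences between any codon for aa_a
    and any codon for aa_b (hypothetical helper)."""
    codons_a = [c for c, aa in CODON_TABLE.items() if aa == aa_a]
    codons_b = [c for c, aa in CODON_TABLE.items() if aa == aa_b]
    return min(
        sum(x != y for x, y in zip(ca, cb))
        for ca in codons_a
        for cb in codons_b
    )


silent_example = is_silent("CTT", "CTC")        # both encode Leu
nonsilent_example = is_silent("GAT", "GAA")     # Asp vs. Glu
asp_to_glu = min_nucleotide_changes("D", "E")   # one change is enough
met_to_trp = min_nucleotide_changes("M", "W")   # ATG vs. TGG
phe_to_lys = min_nucleotide_changes("F", "K")   # TTY vs. AAR
```

Pairs such as Phe and Lys differ at all three codon positions, so a direct substitution between them is expected to be rare on mutational grounds alone.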

Chemical properties of the amino acids

Dramatically different amino acids should be substituted less often

Amino Acid Scoring Matrices

PAM matrices (Dayhoff et al., 1978)

Percent Accepted Mutation (or Point Accepted Mutation, or Probability of Accepted Mutation...)

Based on sound evolutionary principles, but original matrices were calculated in 1978

Based on ancestral states reconstructed for 71 groups of sequences, drawn from 34 superfamilies, in which the sequences were at least 85% similar. A total of 1572 changes were counted.

These are very small numbers by modern standards

Based on a Markov model: the probability of a change depends only on the current amino acid, not on previous states.

For each protein, trace changes on a phylogenetic tree

Reconstruct ancestral states

In case of ambiguity, distribute among alternative possibilities

Method of tree construction was not specified.

Count the number of changes for each AA, and divide by a normalization factor (the "exposure to mutation").

1 PAM is a 1% change in a sequence, i.e., an average of one accepted amino acid substitution per 100 amino acids.
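The counting and normalization steps can be sketched on a toy three-letter alphabet. This is a simplified illustration under stated assumptions, not Dayhoff's actual calculation: the counts and frequencies are invented, the "exposure to mutation" is approximated by the amino acid frequency alone, and the function name `pam1_from_counts` is hypothetical.

```python
def pam1_from_counts(counts, freqs, target_change=0.01):
    """Build a toy 1-PAM mutation probability matrix.

    counts[i][j]: number of accepted changes observed between i and j (symmetric).
    freqs[j]: frequency of amino acid j (stands in for exposure to mutation).
    Returns matrix[j][i] = probability that j is replaced by i.
    """
    aas = sorted(freqs)
    # Total number of changes involving each amino acid.
    changes = {j: sum(counts[i][j] for i in aas if i != j) for j in aas}
    # Relative mutability: changes divided by the normalization factor.
    mutability = {j: changes[j] / freqs[j] for j in aas}
    # Scale so that the expected fraction of changed sites is 1%.
    lam = target_change / sum(freqs[j] * mutability[j] for j in aas)
    matrix = {}
    for j in aas:
        matrix[j] = {}
        for i in aas:
            if i == j:
                matrix[j][i] = 1.0 - lam * mutability[j]
            else:
                matrix[j][i] = lam * mutability[j] * counts[i][j] / changes[j]
    return matrix


# Invented toy data (assumption, for illustration only).
toy_freqs = {"A": 0.5, "S": 0.3, "T": 0.2}
toy_counts = {
    "A": {"S": 30, "T": 10},
    "S": {"A": 30, "T": 20},
    "T": {"A": 10, "S": 20},
}
toy_pam1 = pam1_from_counts(toy_counts, toy_freqs)
expected_change = sum(toy_freqs[a] * (1.0 - toy_pam1[a][a]) for a in toy_freqs)
```

Each row of the resulting matrix sums to 1, and the frequency-weighted probability of change is 1%, which is what defines a distance of 1 PAM.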

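Several matrices (PAM100, PAM250, etc.) are extrapolations of the original PAM1 matrix, obtained by multiplying the matrix by itself the corresponding number of times. A "PAM250" matrix corresponds to 250 substitutions per 100 amino acids (nominally in 2500 My). Because many sites change more than once, sequences at this distance should still have about 20% identity.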
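Why 250% substitution does not mean 0% identity can be seen by raising a 1-PAM matrix to the 250th power. The sketch below uses a deliberately unrealistic toy matrix in which all 20 amino acids are equally frequent and equally likely to replace one another (an assumption made only to keep the example self-contained); `mat_mul` and `mat_pow` are illustrative helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


N = 20
off_diagonal = 0.01 / (N - 1)  # the 1% change is spread evenly over 19 alternatives
toy_pam1 = [[0.99 if i == j else off_diagonal for j in range(N)] for i in range(N)]

toy_pam250 = mat_pow(toy_pam1, 250)
# Expected fraction of sites that still show the original amino acid.
identity_250 = sum(toy_pam250[i][i] for i in range(N)) / N
```

In this uniform toy model roughly 12% of sites remain identical after 250 PAMs, because repeated and back substitutions hit the same sites. Real matrices, with unequal frequencies and mutabilities, retain somewhat more identity.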

Usually presented as a log-odds table:

Ratio of the likelihood of an aligned amino acid pair in related sequences to its likelihood in unrelated sequences (i.e., by chance).

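The ratio is converted to a logarithm (base two gives scores in bits; Dayhoff's original tables used 10 × log10), so that scores can be added along an alignment.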
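A small worked example of the log-odds conversion, assuming made-up probabilities. The function name `log_odds` and its `base` and `scale` parameters are illustrative, not taken from any particular package.

```python
import math


def log_odds(p_related, p_chance, base=2.0, scale=1.0):
    """Scaled logarithm of the odds ratio for one aligned pair."""
    return scale * math.log(p_related / p_chance, base)


# Invented numbers: a pair seen in 2% of related alignments, where each
# amino acid has a background frequency of 10% (chance = 0.1 * 0.1).
favoured = log_odds(0.02, 0.1 * 0.1)      # ratio 2   -> about +1 bit
disfavoured = log_odds(0.005, 0.1 * 0.1)  # ratio 0.5 -> about -1 bit

# Adding log-odds scores along an alignment is the same as multiplying
# the underlying odds ratios.
summed = favoured + disfavoured
product_of_ratios = (0.02 / 0.01) * (0.005 / 0.01)
```

Positive scores mark pairs seen more often in related sequences than expected by chance, and negative scores mark pairs seen less often.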

Assumptions in the PAM model of sequence evolution

Markov model

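Extrapolation (substitution patterns observed among closely related sequences are assumed to hold over long evolutionary distances)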

IID (independent and identically distributed, i.e., each site in the sequence evolves in the same manner and independently of other sites)

BLOSUM (Henikoff and Henikoff, 1992) - Blocks Amino Acid Substitution Matrices

More recent than Dayhoff matrices, and consequently based on a larger number of proteins.

Not based on reconstructions on phylogenetic trees, but the method of construction tends to implicitly incorporate evolutionary information.

In practice these matrices seem to perform quite well.

Examine protein families in the Prosite database

For each family, identify one or more highly conserved sequence motifs

Blocks were found using the software tool MOTIF, which finds patterns of matching amino acids at arbitrary intervals (aa1 d1 aa2 d2 aa3 d3 ...)

Protein families were defined in this case by the presence of a sequence motif in every member of the family.

Roughly 500 protein families and 2000 blocks (amino acid patterns) were identified.

String together motifs into ungapped blocks of sequence, providing an ungapped alignment

Because the blocks are regions of a fixed length flanked by perfectly conserved amino acids, they are assumed to be homologous.

Count the number of substitutions within each column
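The column-counting step can be sketched as follows, assuming a block is given as a list of equal-length ungapped strings. The block shown is invented and the function name `count_column_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(block):
    """Count every unordered amino acid pair within each column of a block."""
    pair_counts = Counter()
    for column in zip(*block):  # iterate over alignment columns
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    return pair_counts


# Invented three-sequence block, three columns wide.
toy_block = ["ACD", "ACE", "SCD"]
toy_pairs = count_column_pairs(toy_block)
```

Matches such as C paired with C are counted along with mismatches, since both are needed to estimate how often each pair occurs in true alignments.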

A potential problem is that substitutions in closely related proteins will tend to be overrepresented.

To compensate for this, a series of matrices was calculated in which sequences at least 60% identical (BLOSUM60), 80% identical (BLOSUM80), etc. were clustered and weighted as a single sequence.

More recently, BLOSUM matrices have been calculated from blocks with a range of similarities (e.g., at least 60% but less than 80%).

Searches of more distantly related proteins should use matrices calculated from relatively dissimilar blocks and vice versa.
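As described by Henikoff and Henikoff (1992), the contribution of closely related sequences is reduced by clustering sequences within a block that exceed a chosen percent identity and weighting each cluster as a single sequence. The sketch below illustrates that idea with simple single-linkage clustering; the function names, the toy block, and the use of 1/cluster-size weights are assumptions for illustration, not the published implementation.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions at which two equal-length segments match."""
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def cluster_weights(block, threshold):
    """Give each sequence a weight of 1 / (size of its cluster).

    Two sequences join the same cluster if they are at least `threshold`
    percent identical, directly or through a chain of such pairs.
    """
    n = len(block)
    cluster = list(range(n))  # start with every sequence in its own cluster
    for i in range(n):
        for j in range(i + 1, n):
            if percent_identity(block[i], block[j]) >= threshold:
                old, new = cluster[j], cluster[i]
                cluster = [new if c == old else c for c in cluster]
    sizes = {c: cluster.count(c) for c in set(cluster)}
    return [1.0 / sizes[c] for c in cluster]


# Invented block: the first two segments are 80% identical to each other.
toy_block = ["ACDEF", "ACDEY", "SCWKF"]
weights_62 = cluster_weights(toy_block, 62)  # first two merge into one cluster
weights_90 = cluster_weights(toy_block, 90)  # nothing merges
```

A lower threshold merges more sequences, so the resulting counts are dominated by comparisons between more divergent sequences.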

Similarity scores are converted into log-odds scores (See Mount 2001, p. 87)
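The conversion from pair counts to log-odds scores can be sketched as below, following the general scheme of Henikoff and Henikoff (1992): observed pair frequencies are compared with the frequencies expected from the amino acid composition, and the result is scaled to half-bits and rounded. The counts are invented and the function name `blosum_scores` is hypothetical.

```python
import math
from collections import defaultdict


def blosum_scores(pair_counts):
    """Convert unordered pair counts into rounded half-bit log-odds scores."""
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Frequency of each amino acid implied by the pairs it occurs in.
    freq = defaultdict(float)
    for (a, b), q in observed.items():
        if a == b:
            freq[a] += q
        else:
            freq[a] += q / 2
            freq[b] += q / 2

    scores = {}
    for (a, b), q in observed.items():
        # Chance expectation: p_a * p_a for matches, 2 * p_a * p_b otherwise.
        expected = freq[a] * freq[b] if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = round(2 * math.log2(q / expected))
    return scores


# Invented counts for a two-letter alphabet.
toy_counts = {("A", "A"): 60, ("A", "S"): 20, ("S", "S"): 20}
toy_scores = blosum_scores(toy_counts)
```

Here the A/S mismatch is seen less often than expected by chance and gets a negative score, while both matches score positively.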

Other scoring matrices

See Mount, 2001, p. 89.

Site-specific scoring matrices

The same amino acid can serve different functions at different positions within a protein. Consequently, the expected substitution probabilities vary from site to site.
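A site-specific matrix can be sketched as one column of scores per alignment position. The example below is a minimal illustration under simplifying assumptions (uniform background frequencies, a flat pseudocount, an invented alignment); it is not the procedure used by any particular program, and `build_pssm` is a hypothetical name.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores per column of an ungapped alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        denom = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / denom) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


# Invented alignment: position 0 is invariant, positions 1 and 2 vary.
toy_alignment = ["GKS", "GRT", "GKS", "GHT"]
toy_pssm = build_pssm(toy_alignment)

serine_at_0 = toy_pssm[0]["S"]  # never observed at this site
serine_at_2 = toy_pssm[2]["S"]  # observed in half the sequences
```

The same amino acid (here Ser) receives a different score at each position, which is the point of a site-specific matrix.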

PSI-BLAST

Calculated dynamically from the results of BLAST searches.

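Can perform well, but involves circular reasoning (the hits from one search define the matrix used for the next), and this can lead to spectacular failure when an unrelated sequence is included.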

Calculated from reported experimental mutations

Relies on a smaller sample size than other empirical methods, and doesn't incorporate all changes in fitness caused by a mutation.

Calculated by folding algorithms from changes in free energy

Conceptually very exciting, but computationally difficult.

 


Henikoff, S. and Henikoff, J. G. 1992. Amino acid substitution matrices from protein blocks. Proc. Nat. Acad. Sci. USA 89: 10915-10919.

Jones, D. T., Taylor, W. R. and Thornton, J. M. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8: 275-282.

Dayhoff, M. O., Schwartz, R. M. and Orcutt, B. C. 1978. A model of evolutionary change in proteins. In Dayhoff, M. O. [ed] Atlas of protein sequence and structure, supplement 3. National Biomedical Research Foundation, Washington, DC, pp. 345-352.

Schwartz, R. M. and Dayhoff, M. O. 1978. Matrices for detecting distant relationships. In Dayhoff, M. O. [ed] Atlas of protein sequence and structure, supplement 3. National Biomedical Research Foundation, Washington, DC, pp. 353-358.
