Models of Amino Acid Substitution
The mutational process
Random processes alter the DNA sequence that encodes a protein
Silent vs. nonsilent substitutions
Some amino acid substitutions require more than one nucleotide substitution
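As a rough illustration of the two points above, the following sketch uses the standard genetic code to test whether a codon change is silent and to find the minimum number of nucleotide substitutions separating two amino acids. The function names (`codons_for`, `min_nucleotide_changes`, `is_silent`) are hypothetical helpers introduced here, not part of any library.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(aa):
    """Return every codon that encodes the one-letter amino acid `aa`."""
    return [codon for codon, encoded in CODON_TABLE.items() if encoded == aa]


def is_silent(codon1, codon2):
    """True if both codons encode the same amino acid (or both are stops)."""
    return CODON_TABLE[codon1] == CODON_TABLE[codon2]


def min_nucleotide_changes(aa1, aa2):
    """Smallest number of base differences between any codon of aa1 and any codon of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


print(is_silent("CTT", "CTC"))           # both encode Leu
print(min_nucleotide_changes("F", "L"))  # Phe and Leu are one step apart
print(min_nucleotide_changes("D", "W"))  # Asp and Trp need three changes
```

Pairs that need two or three nucleotide changes are expected to be exchanged less often over short evolutionary distances, independently of their chemical similarity.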
Chemical properties of the amino acids
Dramatically different amino acids should be substituted less often
Amino Acid Scoring Matrices
PAM matrices (Dayhoff et al., 1978)
Percent Accepted Mutation (or Point Accepted Mutation, or Probability of Accepted Mutation...)
Based on sound evolutionary principles, but original matrices were calculated in 1978
Used reconstructed ancestral states for 71 groups of sequences, drawn from 34 superfamilies, in which the sequences were at least 85% similar. Based on a total of 1572 changes.
These are very small numbers by modern standards
Based on a Markov model: the probability of each change depends only on the current amino acid, not on the earlier history of the site.
For each protein, trace changes on a phylogenetic tree
Reconstruct ancestral states
In case of ambiguity, distribute among alternative possibilities
Method of tree construction was not specified.
Count the number of changes for each AA, and divide by a normalization factor (the "exposure to mutation").
1 PAM is 1% change in a sequence, i.e., one amino acid substitution per 100 amino acids.
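Several matrices (PAM100, PAM250, etc.) are extrapolated from the original PAM1 matrix by repeated matrix multiplication (e.g., PAM250 is PAM1 raised to the 250th power). A "PAM 250" matrix corresponds to 250 accepted substitutions per 100 amino acids (nominally in 2500 My). Because many sites change more than once or revert to an earlier state, these sequences should still have about 20% identity.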
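The extrapolation can be sketched by raising a one-step transition matrix to a power. The example below is a toy model and not the Dayhoff data: it assumes 20 equally frequent amino acids that are all equally mutable, with a 1% chance of change per step. The names `toy_pam1`, `mat_pow`, and `expected_identity` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


def toy_pam1(n_states=20, change=0.01):
    """Toy one-step matrix (an assumption): every residue changes with
    probability `change`, spread evenly over the other residues."""
    off_diagonal = change / (n_states - 1)
    return [
        [1.0 - change if i == j else off_diagonal for j in range(n_states)]
        for i in range(n_states)
    ]


def expected_identity(m):
    """Expected fraction of unchanged sites, assuming uniform composition."""
    n = len(m)
    return sum(m[i][i] for i in range(n)) / n


pam1 = toy_pam1()
pam250 = mat_pow(pam1, 250)
print(round(expected_identity(pam1), 4))
print(round(expected_identity(pam250), 4))
```

In this uniform toy model the expected identity after 250 steps is roughly 12%, well above zero even though 250 substitutions per 100 sites have accumulated. The empirical matrix, with unequal mutabilities and amino acid frequencies, retains more identity than the toy model does.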
Usually presented as a log-odds table:
The likelihood of an aligned pair of amino acids in related sequences divided by the likelihood of the same pair in unrelated sequences (i.e., aligned by chance).
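This ratio is converted to a logarithm (base 10, multiplied by 10, in the original Dayhoff tables; base 2, giving scores in bits, in many later tables), so that scores for successive positions in an alignment can be added.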
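A minimal sketch of the log-odds calculation follows. The symbols are assumptions introduced for illustration: `q_ij` is the probability of finding amino acids i and j aligned in related sequences, and `p_i`, `p_j` are their background frequencies, so that `p_i * p_j` is the probability of the pair arising by chance.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0, scale=1.0):
    """Return scale * log_base(q_ij / (p_i * p_j)).

    Positive scores mean the pair is seen more often in related sequences
    than expected by chance; negative scores mean less often.
    """
    return scale * math.log(q_ij / (p_i * p_j), base)


# A pair seen twice as often as chance scores about +1 bit,
# and a pair seen half as often scores about -1 bit.
print(log_odds(0.02, 0.1, 0.1))
print(log_odds(0.005, 0.1, 0.1))
```

Setting `base=10.0` and `scale=10.0` gives scores on the scale used in the original Dayhoff tables.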
Assumptions in the PAM model of sequence evolution
Markov model
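Extrapolation (substitution probabilities estimated from closely related sequences are assumed to hold over long time scales)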
IID, independent and identically distributed (i.e., every site in the sequence evolves in the same manner and independently of the other sites)
BLOSUM (Henikoff and Henikoff, 1992) - Blocks Amino Acid Substitution Matrices
More recent than Dayhoff matrices, and consequently based on a larger number of proteins.
Not based on reconstructions on phylogenetic trees, but the method of construction tends to implicitly incorporate evolutionary information.
In practice these matrices seem to perform quite well.
Examine protein families in Prosite database
For each family, identify one or more highly conserved sequence motifs
Blocks were found using the software tool motif, which finds patterns of matching amino acids at arbitrary intervals (aa1 d1 aa2 d2 aa3 d3 ...)
Protein families were defined in this case by the presence of a sequence motif in every member of the family.
Roughly 500 protein families and 2000 blocks (amino acid patterns) were identified.
String together motifs into ungapped blocks of sequence, providing an ungapped alignment
Because the blocks are regions of a fixed length flanked by perfectly conserved amino acids, they are assumed to be homologous.
Count the number of substitutions within each column
A potential problem is that substitutions in closely related proteins will tend to be overrepresented.
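To compensate for this, sequences within a block that are at least a threshold percentage identical are clustered and weighted as a single sequence before substitutions are counted. A series of matrices was calculated with different thresholds: 60% (BLOSUM60), 80% (BLOSUM80), etc.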
More recently, BLOSUM matrices have been calculated from blocks with a range of similarities (i.e., at least 60% but less than 80%).
Searches of more distantly related proteins should use matrices calculated from relatively dissimilar blocks and vice versa.
Similarity scores are converted into log-odds scores (See Mount 2001, p. 87)
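The counting and log-odds steps can be sketched as below. This is a simplified illustration of the pair-counting scheme described by Henikoff and Henikoff (1992): it omits the clustering of similar sequences and the rounding to integers, and the function names `pair_counts` and `blosum_scores` are hypothetical. Scores are in half-bits (2 × log2).

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def pair_counts(block):
    """Count unordered amino acid pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def blosum_scores(block):
    """Return unrounded half-bit log-odds scores for every observed pair."""
    counts = pair_counts(block)
    total = sum(counts.values())
    # Observed frequency of each unordered pair.
    q = {pair: n / total for pair, n in counts.items()}

    # Frequency of each amino acid among all pairs.
    p = defaultdict(float)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected pair frequency by chance; mixed pairs can occur in two orders.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return scores


# One-column block with nine A residues and one S residue.
block = ["A"] * 9 + ["S"]
print(pair_counts(block))
print(blosum_scores(block))
```

With nine A residues and one S in a single column there are 36 A-A pairs and 9 A-S pairs, so the A-S pair is slightly more frequent than expected by chance and receives a small positive score.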
Other scoring matrices
See Mount, 2001, p. 89.
Site-specific scoring matrices
A given amino acid can serve different functions at different positions within a protein. Consequently, the expected substitution probabilities vary from site to site.
PSI-BLAST
Calculated dynamically from the results of BLAST searches.
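Can perform well, but involves circular reasoning: the matrix is built from the hits it is then used to find, so an early false positive can contaminate later iterations and lead to spectacular failure.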
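The idea of a site-specific matrix can be sketched as follows. This is a deliberately simplified illustration and not the PSI-BLAST procedure itself: it assumes an ungapped alignment, a uniform background frequency, and a flat pseudocount, and the names `build_pssm` and `score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build one column of log-odds scores (in bits) per alignment position.

    Assumptions: no gaps, uniform background, flat pseudocount per residue.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of a segment the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


alignment = ["ACD", "ACE", "ACD", "GCD"]
pssm = build_pssm(alignment)
print(round(score(pssm, "ACD"), 2))
print(round(score(pssm, "WWW"), 2))
```

The same residue receives a different score at each position, depending on how often it was observed there. Adding a false hit to `alignment` and rebuilding the matrix shifts the scores toward that hit, which is the source of the circularity noted above.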
Calculated from reported experimental mutations
Relies on a smaller sample size than other empirical methods, and doesn't incorporate all changes in fitness caused by a mutation.
Calculated by folding algorithms from changes in free energy
Conceptually very exciting, but computationally difficult.
References
Henikoff, S. and Henikoff, J. G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.
Jones, D. T., Taylor, W. R. and Thornton, J. M. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8: 275-282.
Dayhoff, M. O., Schwartz, R. M. and Orcutt, B. C. 1978. A model of evolutionary change in proteins. In Dayhoff, M. O. [ed] Atlas of protein sequence and structure, supplement 3. National Biomedical Research Foundation, Washington, DC, pp. 345-352.
Schwartz, R. M. and Dayhoff, M. O. 1978. Matrices for detecting distant relationships. In Dayhoff, M. O. [ed] Atlas of protein sequence and structure, supplement 3. National Biomedical Research Foundation, Washington, DC, pp. 353-358.