Modeling Nucleotide Sequence Evolution
- The characteristics of nucleotide sequences
- Four nucleotides
- A, C, G, T or U
- Ambiguity codes: R, Y, S, W, K, M, B, D, H, V, N, X
- Arranged in a series with directionality
- Some sequences are protein-coding, others are not
- Replicated and maintained by dedicated cellular machinery
- The characteristics of the replication and repair mechanisms of
the cell affect how nucleotide sequences evolve.
- For example, some species have biased base compositions
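As a rough illustration of base composition, the following Python sketch tallies the frequencies of the four unambiguous nucleotides in a sequence. The function names and the example sequence are invented for illustration. Skipping ambiguity codes, as done here, is only one possible convention.

```python
from collections import Counter

# IUPAC ambiguity codes and the nucleotides each one can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def base_composition(seq):
    """Return the frequency of A, C, G and T among unambiguous sites.

    U is treated as T; ambiguity codes and gaps are ignored (an assumption).
    """
    seq = seq.upper().replace("U", "T")
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return {base: counts[base] / total for base in "ACGT"}


def count_ambiguous(seq):
    """Count the sites that carry an IUPAC ambiguity code."""
    return sum(1 for base in seq.upper() if base in IUPAC)


if __name__ == "__main__":
    example = "ATGAATTTAARATTAN"  # invented AT-rich sequence
    freqs = base_composition(example)
    print(freqs)
    print("GC content:", freqs["G"] + freqs["C"])
    print("ambiguous sites:", count_ambiguous(example))
```

A strongly skewed result (for example, a GC content far from 50%) is the kind of compositional bias that the equal-frequency models described later cannot represent.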
- Markov models of nucleotide substitution
- Markov models assume that there is no "memory" in the system:
the probability of future change depends only on the current state of a
character, not on its history
- The probability of change from state i to state j depends upon the amount
of time that has passed and the substitution rate.
- Because well-determined time points are not usually available for molecular
data, the product of rate and time (equivalent to a genetic distance)
is more commonly used.
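The two properties above can be sketched numerically. The example assumes the simplest case, in which all substitutions are equally likely (the Jukes-Cantor model described later in these notes), and measures distance d as the expected number of substitutions per site. The function names are illustrative only.

```python
import math


def jc_probabilities(d):
    """Return (P_same, P_diff) after d expected substitutions per site.

    P_same is the probability that a site is in its starting state;
    P_diff is the probability that it is in one particular other state.
    """
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def compose(d1, d2):
    """P_same over two consecutive intervals, combined step by step."""
    same1, diff1 = jc_probabilities(d1)
    same2, diff2 = jc_probabilities(d2)
    # Either stay put in both intervals, or move to one of the three
    # other states in the first interval and move back in the second.
    return same1 * same2 + 3.0 * diff1 * diff2


if __name__ == "__main__":
    # Only the product rate * time matters: both pairs give d = 0.2.
    for rate, time in ((0.1, 2.0), (0.4, 0.5)):
        print(rate, time, jc_probabilities(rate * time))
    # No memory: one interval of 0.5 matches intervals of 0.2 then 0.3.
    print(jc_probabilities(0.5)[0], compose(0.2, 0.3))
```

The last line prints two values that agree up to rounding. Splitting the elapsed distance into pieces changes nothing, because the process keeps no record of how a site reached its current state.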
- Mutation
- Point mutation
- Categories of point mutation
- Transitions & transversions
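- Purines: A & G; pyrimidines: C & T
- Transitions exchange two purines or two pyrimidines; transversions
exchange a purine and a pyrimidine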
- Despite the fact that each nucleotide has only one transition but
two transversions available, in most sequence comparisons, transitions
are found to occur more frequently.
- This is because most living cells have mechanisms to detect mismatched
base pairs.
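A minimal sketch of how transitions and transversions might be told apart and counted in a pairwise comparison. The function names and example sequences are hypothetical. Sites with gaps or ambiguity codes are skipped, which is an assumption rather than a fixed rule.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify(a, b):
    """Classify a change between two unambiguous bases (A, C, G or T)."""
    if a == b:
        return "identical"
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_changes(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip gaps, ambiguity codes and identical sites.
        if a in "ACGT" and b in "ACGT" and a != b:
            counts[classify(a, b)] += 1
    return counts


if __name__ == "__main__":
    # Each base has one possible transition and two possible transversions.
    for base in "ACGT":
        print(base, [(other, classify(base, other))
                     for other in "ACGT" if other != base])
    print(count_changes("AACCGGTT", "GACTGATA"))
    # prints {'transition': 3, 'transversion': 1}
```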
- Insertion and deletion events ("indels")
- We will not consider indels here, but they are an important part
of sequence evolution. Development of effective models of indel evolution
is a current area of research.
- Superimposed substitutions ("multiple hits")
- Critical to models of nucleotide evolution is the realization that because
there are only four possible character states, it is expected that as
genetic distance increases, some sites will undergo multiple superimposed
substitutions.
- In this case, some sites that have undergone change will have reverted
to the state they were originally in.
- To reliably reconstruct evolutionary relationships among divergent
sequences, this expected reversion must be taken into account.
- Simple measures of distance that do not take multiple substitutions
into account are said to be uncorrected. Corrected distances
use one of several models of sequence evolution to estimate the number
of sites that have undergone multiple substitutions.
- Uncorrected methods tend to underestimate genetic distance.
- While uncorrected measures are clearly inadequate, selecting the
most appropriate measure to use requires an understanding of models
of sequence substitution, and the assumptions that underlie each of
them.
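The underestimation caused by superimposed substitutions can be seen in a small simulation. This is a toy sketch under stated assumptions: substitutions fall on randomly chosen sites, every change is equally likely, and the sequence length, seed and event counts are arbitrary choices made for illustration.

```python
import random

BASES = "ACGT"


def evolve(seq, n_events, rng):
    """Apply n_events substitutions, each at a randomly chosen site."""
    sites = list(seq)
    for _ in range(n_events):
        i = rng.randrange(len(sites))
        # Replace the current base with one of the three other bases.
        sites[i] = rng.choice([b for b in BASES if b != sites[i]])
    return "".join(sites)


def p_distance(seq1, seq2):
    """Uncorrected distance: the proportion of sites that differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    length = 1000
    ancestor = "".join(rng.choice(BASES) for _ in range(length))
    for n_events in (50, 500, 2000):
        descendant = evolve(ancestor, n_events, rng)
        print("actual substitutions per site:", n_events / length,
              "observed proportion different:",
              p_distance(ancestor, descendant))
```

At small distances the two numbers are close. As the number of substitutions grows, repeated hits and reversions hide more and more of the change, so the observed proportion falls further below the actual value.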
- Models that assume all nucleotides occur at equal frequencies (25%)
- The Jukes-Cantor (JC) model
- All substitutions are equally likely.
- All nucleotides occur at the same frequency (25%).
- One parameter: the rate of substitution (alpha).
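Under these assumptions, the standard Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch follows; the function name is illustrative.

```python
import math


def jc_distance(p):
    """Jukes-Cantor corrected distance from the proportion of differing sites."""
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the sequences are saturated and the formula is undefined.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.01, 0.10, 0.30, 0.60):
        print("uncorrected:", p, "corrected:", jc_distance(p))
```

The corrected value is always at least as large as p, and the gap widens quickly as p approaches 0.75, the proportion expected between two unrelated sequences under this model.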
- Kimura two parameter (K2P) model
- Transitions and transversions happen at different rates.
- All nucleotides occur at the same frequency.
- Two parameters: transition rate (alpha) and transversion rate (beta).
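For two aligned sequences, the K2P distance is usually computed from the observed proportion of sites showing a transition (P) and the proportion showing a transversion (Q), as d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The sketch below assumes P and Q have already been counted; the function and argument names are illustrative.

```python
import math


def k2p_distance(p_transitions, q_transversions):
    """Kimura two-parameter distance from transition/transversion proportions."""
    w1 = 1.0 - 2.0 * p_transitions - q_transversions
    w2 = 1.0 - 2.0 * q_transversions
    if w1 <= 0.0 or w2 <= 0.0:
        raise ValueError("sequences are too divergent for the K2P correction")
    return -0.5 * math.log(w1) - 0.25 * math.log(w2)


if __name__ == "__main__":
    # Same total difference (0.30), split differently between the two classes.
    print(k2p_distance(0.10, 0.20))  # transversions twice as common
    print(k2p_distance(0.20, 0.10))  # transitions twice as common
```

When transversions are exactly twice as frequent as transitions (Q = 2P), which is what equal rates for all changes would produce, the formula reduces to the Jukes-Cantor correction.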
- Models that allow the four nucleotides to be present in different frequencies
- Felsenstein (F84) & Hasegawa-Kishino-Yano (HKY85) models
- Two closely related models -- they use different calculations to
model essentially the same thing
- Transitions and transversions occur at different rates
- Nucleotides occur at different frequencies
- General time reversible (GTR) model
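- Assumes time reversibility: each pair of nucleotides shares a single
(symmetric) exchangeability parameter
- In other words, at equilibrium the amount of change from A to T equals
the amount of change from T to A. The two rates themselves are equal only
when the base frequencies are equal.
- Each of the six nucleotide pairs has its own rate parameter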
- Nucleotides can occur at different frequencies
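A sketch of how a GTR rate matrix is commonly assembled, assuming the usual parameterization: the rate from base i to base j is the exchangeability of that pair multiplied by the equilibrium frequency of j. In this form, what is symmetric is the set of six exchangeability parameters. The rates themselves are equal in both directions only when base frequencies are equal. All numerical values below are invented for illustration.

```python
from itertools import combinations

BASES = "ACGT"


def gtr_rate_matrix(exchangeabilities, freqs):
    """Build a GTR rate matrix as a nested dict q[from_base][to_base].

    exchangeabilities: dict keyed by unordered pair ("AC", "AG", ...).
    freqs: dict of equilibrium base frequencies summing to 1.
    """
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i, j in combinations(BASES, 2):
        r = exchangeabilities[i + j]  # one shared parameter per pair
        q[i][j] = r * freqs[j]        # rate of change from i to j
        q[j][i] = r * freqs[i]        # rate of change from j to i
    for i in BASES:
        # Each row sums to zero.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    # Rescale so that one unit of time equals one expected substitution.
    mean_rate = -sum(freqs[i] * q[i][i] for i in BASES)
    return {i: {j: q[i][j] / mean_rate for j in BASES} for i in BASES}


if __name__ == "__main__":
    freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # invented values
    exch = {"AC": 1.0, "AG": 4.0, "AT": 1.0,
            "CG": 1.0, "CT": 4.0, "GT": 1.0}           # invented values
    q = gtr_rate_matrix(exch, freqs)
    print("rate A->T:", q["A"]["T"], "rate T->A:", q["T"]["A"])
    print("flux A->T:", freqs["A"] * q["A"]["T"],
          "flux T->A:", freqs["T"] * q["T"]["A"])
```

With unequal base frequencies the two printed rates differ, while the two fluxes (frequency times rate) match. That equality of fluxes for every pair of bases is the time-reversibility condition.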
- Relationship among these models
- These models are closely related
- Nested models are special cases of more general models.
- A model is said to be nested in another model if the simpler model is
equivalent to a specific setting of the more complex model.
- Thus the JC model is a special case of the K2P model: if the transition
and transversion parameters of the K2P model are set to the same value,
it is equivalent to the JC model.
More complex model(s) | Corresponding nested model(s)
GTR | JC, K2P, F84, HKY85
F84, HKY85 | JC, K2P
K2P | JC
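Nesting can be made concrete with a small sketch. It uses an HKY85-style parameterization in which the rate of change to base j is its frequency, multiplied by a ratio kappa when the change is a transition. The name kappa and the helper functions are assumptions made for illustration, and the rates are left unscaled.

```python
BASES = "ACGT"
PURINES = frozenset("AG")


def is_transition(a, b):
    """True if a and b are both purines or both pyrimidines."""
    return (a in PURINES) == (b in PURINES)


def hky_rates(kappa, freqs):
    """Unscaled HKY85-style off-diagonal rates, keyed by (from, to)."""
    return {
        (i, j): (kappa if is_transition(i, j) else 1.0) * freqs[j]
        for i in BASES for j in BASES if i != j
    }


def distinct_rates(rates):
    """Sorted list of the different rate values present in the matrix."""
    return sorted(set(rates.values()))


if __name__ == "__main__":
    equal = {base: 0.25 for base in BASES}
    unequal = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # invented values
    # Equal frequencies and kappa = 1: a single rate, as in JC.
    print(distinct_rates(hky_rates(1.0, equal)))
    # Equal frequencies and kappa = 2: two rate classes, as in K2P.
    print(distinct_rates(hky_rates(2.0, equal)))
    # Unequal frequencies: the full HKY85-style model.
    print(distinct_rates(hky_rates(2.0, unequal)))
```

Each simpler model appears when parameters of the more general one are fixed to particular values, which is what it means for the simpler model to be nested.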
- Other models
- Several other models of sequence change are available.
- Several other special cases of the GTR model have been described
and named
- Models that are not time reversible (i.e., have asymmetric substitution
matrices) have been described
- Some methods (e.g., LogDet) are available that use dramatically
different models of sequence substitution
- These methods are of particular interest because they are not
nested within the GTR model, and consequently have different underlying
assumptions
- The methods described here all assume that each position is evolving
independently and identically
- Site-to-site rate variation has also been modeled
- Invariant sites model
- Rate distribution models
- Gamma model
- Van de Peer's method
- Special models for protein-coding sequences are in development
- Non-independence is a serious concern. Some work has been done to
examine the effects of non-independence of sites, but this needs more
attention.
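As one example of the gamma model of site-to-site rate variation listed above, the Jukes-Cantor correction has a well-known variant for gamma-distributed rates: d = (3/4) a [(1 - 4p/3)^(-1/a) - 1], where a is the gamma shape parameter. The sketch below assumes a is already known, whereas in practice it must be estimated. The function names are illustrative.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance assuming every site evolves at the same rate."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_gamma_distance(p, shape):
    """Jukes-Cantor distance with gamma-distributed rates across sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    if shape <= 0.0:
        raise ValueError("shape must be positive")
    return 0.75 * shape * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / shape) - 1.0)


if __name__ == "__main__":
    p = 0.30
    print("equal rates:", jc_distance(p))
    for shape in (0.5, 2.0, 100.0):
        print("gamma shape", shape, ":", jc_gamma_distance(p, shape))
```

Small shape values mean strong rate variation and give larger corrected distances for the same observed p. As the shape parameter grows, the result approaches the equal-rates correction.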
- Lineage-specific models of sequence evolution
- A further complication is introduced if different lineages are evolving
differently.
- Base compositional bias
- LogDet has been used successfully in cases where base compositional
bias would violate the assumptions of the GTR family of models.
- Linked Markov chains and other variations on Markov models can also
model lineage-specific evolution
- Are there other aspects of sequence evolution that can be modeled?
- Indels
- Constraints imposed by RNA secondary structure
- Others?
- How to choose what model to use
- In general, use the simplest model that adequately explains the data
- If a more complex model yields a greater improvement in tree score (or
other measure of goodness of fit to data) than would be expected if applied
to random data, then use the more complex model.
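When the measure of fit is a log-likelihood and one model is nested in the other, a likelihood ratio test is one common way to make this comparison. The sketch below covers only the case in which the two models differ by a single free parameter (for example JC versus K2P), and it relies on the usual chi-square approximation. The log-likelihood values are invented for illustration.

```python
import math


def likelihood_ratio_test(lnl_simple, lnl_complex):
    """Likelihood ratio test for nested models differing by ONE parameter.

    Returns (statistic, p_value), using a chi-square distribution with
    one degree of freedom as the reference distribution.
    """
    statistic = 2.0 * (lnl_complex - lnl_simple)
    if statistic < 0.0:
        raise ValueError("the more complex nested model cannot fit worse")
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Invented log-likelihoods for the same tree under two nested models.
    statistic, p_value = likelihood_ratio_test(-1000.0, -995.0)
    print("statistic:", statistic, "p-value:", p_value)
    if p_value < 0.05:
        print("the extra parameter is justified at the 5% level")
    else:
        print("prefer the simpler model")
```

Models that differ by more than one parameter need a chi-square distribution with correspondingly more degrees of freedom, which this sketch does not implement.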