Models of character evolution
A model is an abstract representation of reality. Models can be used to relate data to a hypothesis or, in other words, to provide a context for the data. They have many uses in science, but in general they allow the scientist to perceive patterns in data that would otherwise be too chaotic to understand, and to generate synthetic data with properties that mimic those of real data.
Commonly used, nested models of DNA sequence evolution
Jukes-Cantor
Assumes that all nucleotides are present with equal frequencies
Assumes equal probabilities for all possible nucleotide substitutions
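If the substitution rate is u, then with 4 nucleotides the same process can be described as events occurring at rate (4/3)u, each replacing the base with one drawn at random from all four nucleotides (so one quarter of these events leave the base unchanged).
In each small interval of time, dt, the probability of no event occurring is 1 - (4/3)u dt, so over a time t it is exp(-(4/3)ut).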
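As a rough illustration of the Jukes-Cantor reasoning above, the following Python sketch assumes that replacement events arrive at rate (4/3)u and that each event draws the new base at random from all four nucleotides. The function names are illustrative only and are not part of the course material.

```python
import math

def jc_prob_no_event(u, t):
    """Probability that no replacement event hits a site during time t."""
    return math.exp(-(4.0 / 3.0) * u * t)

def jc_prob_different(u, t):
    """Probability that a site shows a different base after time t."""
    # Once at least one event has occurred the base is a random draw,
    # so it differs from the starting base with probability 3/4.
    return 0.75 * (1.0 - jc_prob_no_event(u, t))

# The proportion of differing sites rises with u*t and saturates near 0.75.
for ut in (0.01, 0.1, 1.0, 10.0):
    print(ut, round(jc_prob_different(1.0, ut), 4))
```

For small values of u*t the probability of a difference is close to u*t itself, which is why u can be read as the substitution rate.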
Kimura two-parameter
Transition-transversion ratio
 | A | C | G | T |
A | - | Transversion | Transition | Transversion |
C | Transversion | - | Transversion | Transition |
G | Transition | Transversion | - | Transversion |
T | Transversion | Transition | Transversion | - |
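The table above can be expressed as a small Python sketch: a change within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any other change is a transversion. The function names and the two short sequences are hypothetical, chosen only for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(x, y):
    """Classify the difference between two nucleotides."""
    if x == y:
        return "identical"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def count_ts_tv(seq1, seq2):
    """Count observed transitions and transversions between aligned sequences."""
    ts = tv = 0
    for x, y in zip(seq1, seq2):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        kind = substitution_type(x, y)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv

ts, tv = count_ts_tv("AACGT", "AGCTT")
print(ts, tv)
```

The raw ratio ts/tv from such counts is only an observed ratio. It underestimates the true ratio when multiple substitutions have occurred at the same site.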
Assumes that all nucleotides are present with equal frequencies
Felsenstein 1984
Nucleotide frequency
Transition-transversion ratio
General time reversible
Additional variants
Model Parameters
What is a parameter?
Parameter Estimation
In the context of likelihood, the data are held fixed while the model and the hypothesis can be changed
Thus one can use likelihood to estimate the parameter values as well as the tree topology
It is generally fastest to hold the tree topology constant and estimate the parameters on that tree
Because the parameters do not vary greatly among similar trees, it is usually safe to estimate parameters on any reasonable tree
The concern is often raised that this form of parameter estimation involves circular reasoning.
This is a legitimate concern.
However, parameter estimation and topology searching are both part of an overall process of global optimization.
Thus the real concern is that parameter estimation will lead to a local, rather than a global optimum.
One could also re-estimate parameters for each tree being evaluated, but this would be extremely slow.
To explore these concerns, you should try doing parameter estimation starting with a variety of tree topologies.
One can also estimate parameters on a "star" phylogeny (a fully unresolved tree), from a distance matrix, or by Monte Carlo simulation.
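To make the idea of estimating a parameter on a fixed tree concrete, here is a minimal Python sketch. It assumes the simplest possible fixed topology, a two-taxon tree, whose only free parameter under Jukes-Cantor is the distance d between the two sequences. The site counts (20 sites, 12 differences) and the function names are illustrative assumptions, and the golden-section search stands in for whatever optimizer a real program would use.

```python
import math

def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d for two aligned sequences under JC."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site ends at the same base
    p_diff = 0.25 - 0.25 * e   # probability of one particular different base
    # Each site contributes (1/4) * P(x -> y); 1/4 is the base frequency.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def maximise(f, lo, hi, tol=1e-6):
    """Golden-section search for the maximum of a unimodal function."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d   # the maximum lies in [a, d]
        else:
            a = c   # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0

d_hat = maximise(lambda d: jc_log_likelihood(d, 20, 12), 1e-6, 10.0)
print(round(d_hat, 3))
```

For this two-sequence case the numerical estimate should agree with the closed-form Jukes-Cantor distance, -(3/4) ln(1 - (4/3)p) with p = 12/20. On a real tree there is no closed form, and the same kind of search is run over all parameters with the topology held constant.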
Models of site to site rate variation
Invariant sites model
DNArates model
Gamma distribution
Invariant sites + gamma
Assumptions made by parsimony methods
Jukes-Cantor (JC)
 | A | C | G | T |
A | -3a | a | a | a |
C | a | -3a | a | a |
G | a | a | -3a | a |
T | a | a | a | -3a |
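The rate matrix above gives instantaneous rates. The probabilities of change over a finite time t are the entries of exp(Qt). The following Python sketch, which uses a truncated Taylor series as a simple stand-in for a proper matrix exponential routine, compares the numerical result with the known closed form for Jukes-Cantor. The function names and the values a = 0.1, t = 1 are illustrative assumptions.

```python
import math

def jc_rate_matrix(a):
    """Jukes-Cantor rate matrix: a off the diagonal, -3a on it."""
    return [[-3.0 * a if i == j else a for j in range(4)] for i in range(4)]

def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, terms=60):
    """exp(Q t) by truncated Taylor series; adequate when entries of Q t are small."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current term (Q t)^k / k!
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

a, t = 0.1, 1.0
p = mat_exp(jc_rate_matrix(a), t)
p_same = 0.25 + 0.75 * math.exp(-4.0 * a * t)   # closed form, no net change
p_diff = 0.25 - 0.25 * math.exp(-4.0 * a * t)   # closed form, one specific change
print(abs(p[0][0] - p_same) < 1e-12, abs(p[0][1] - p_diff) < 1e-12)
```

Each row of the resulting matrix sums to 1, and as t grows every entry approaches 1/4, the equilibrium base frequency under this model.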
Kimura Two Parameter (K2P)
 | A | C | G | T |
A | -a-2b | b | a | b |
C | b | -a-2b | b | a |
G | a | b | -a-2b | b |
T | b | a | b | -a-2b |
Hasegawa, Kishino, Yano 1985 (HKY85)
 | A | C | G | T |
A | -m(kpG+pY) | mpC | mkpG | mpT |
C | mpA | -m(kpT+pR) | mpG | mkpT |
G | mkpA | mpC | -m(kpA+pY) | mpT |
T | mpA | mkpC | mpG | -m(kpC+pR) |
Where pR=pA+pG and pY=pC+pT. With equal base frequencies (pA=pC=pG=pT=1/4), HKY85 reduces to K2P with a=mk/4 and b=m/4
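The nesting of these models can be checked numerically. The Python sketch below builds the K2P and HKY85 rate matrices as laid out in the tables above and confirms that HKY85 with equal base frequencies matches K2P, taking the transition rate as mk/4 and the transversion rate as m/4. The function names and parameter values are assumptions made for illustration.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def fill_diagonal(q):
    """Set each diagonal entry so that its row sums to zero."""
    for i, row in enumerate(q):
        row[i] = 0.0
        row[i] = -sum(row)
    return q

def k2p_rate_matrix(a, b):
    """K2P: rate a for transitions, rate b for transversions."""
    q = [[0.0 if x == y else (a if (x, y) in TRANSITIONS else b)
          for y in BASES] for x in BASES]
    return fill_diagonal(q)

def hky85_rate_matrix(m, k, freqs):
    """HKY85: rate to base y is m*k*freqs[y] for transitions, m*freqs[y] otherwise."""
    q = [[0.0 if x == y
          else m * (k if (x, y) in TRANSITIONS else 1.0) * freqs[y]
          for y in BASES] for x in BASES]
    return fill_diagonal(q)

equal = {base: 0.25 for base in BASES}
hky = hky85_rate_matrix(m=0.4, k=2.0, freqs=equal)
k2p = k2p_rate_matrix(a=0.4 * 2.0 / 4.0, b=0.4 / 4.0)
same = all(abs(hky[i][j] - k2p[i][j]) < 1e-12
           for i in range(4) for j in range(4))
print(same)
```

Setting k = 1 as well collapses the matrix further to Jukes-Cantor, which is the sense in which these models are nested.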
Consider two distantly related sequences:
AAAAACCCCCGGGGGTTTTT
ACGTACGTACGTACGTACGT
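Fxy = the proportion of sites with each pair of bases (rows: base in the first sequence, columns: base in the second sequence; entries labeled a through p, row by row):
 | A | C | G | T |
A | a = 0.10 | b = 0.05 | c = 0.05 | d = 0.05 |
C | e = 0.05 | f = 0.10 | g = 0.05 | h = 0.05 |
G | i = 0.05 | j = 0.05 | k = 0.10 | l = 0.05 |
T | m = 0.05 | n = 0.05 | o = 0.05 | p = 0.10 |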
p distance:
dxy = 1-(a+f+k+p) = 1-(0.1+0.1+0.1+0.1) = 0.60
JC:
D = 1-(a+f+k+p) = 0.60
dxy = -(3/4) ln(1 - (4/3)D) = -(3/4) ln(0.20) = 1.21 (approximately)
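The two distances can be recomputed directly from the pair of sequences given above. This Python sketch tabulates the proportions of each base pair, takes the p distance as one minus the sum of the diagonal, and applies the Jukes-Cantor correction. The function names are illustrative, not taken from any particular program.

```python
import math

BASES = "ACGT"

def divergence_matrix(seq_x, seq_y):
    """Proportion of sites with base i in seq_x and base j in seq_y."""
    n = len(seq_x)
    counts = [[0] * 4 for _ in range(4)]
    for x, y in zip(seq_x, seq_y):
        counts[BASES.index(x)][BASES.index(y)] += 1
    return [[c / n for c in row] for row in counts]

def p_distance(f):
    """One minus the proportion of identical sites (the diagonal of F)."""
    return 1.0 - sum(f[i][i] for i in range(4))

def jc_distance(p):
    """Jukes-Cantor correction; undefined when p >= 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

seq_x = "AAAAACCCCCGGGGGTTTTT"
seq_y = "ACGTACGTACGTACGTACGT"
f = divergence_matrix(seq_x, seq_y)
p = p_distance(f)
print(round(p, 2), round(jc_distance(p), 2))   # about 0.6 and 1.21
```

The corrected distance is roughly twice the observed proportion of differences, which shows how strongly multiple substitutions at the same site are hidden in distantly related sequences.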
e is the base of natural logarithms (approximately 2.718). It is the value of the exponential function exp(z) when z = 1.
exp (ln x) = x = ln (exp x)
The Poisson distribution is an approximation of the binomial distribution, and is appropriate when the probability of an event occurring is small but there are many opportunities for that event to occur.
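A quick numerical comparison shows how close the approximation is. The Python sketch below assumes n = 1000 opportunities, each with probability 0.002, so the expected number of events is 2. These values are arbitrary and chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected number is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 1000, 0.002
for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, n * p), 5))
```

In the substitution setting, each short interval of time is one "opportunity" and the chance of an event in any one interval is tiny, which is why the number of events along a branch is treated as Poisson.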
HMM, page 434
Huelsenbeck, J.P., and Rannala, B. 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276:227-232.