CBMG 688N/O

Models of character evolution

A model is an abstract representation of reality. Models can be used to relate data to a hypothesis or, in other words, to provide a context for the data. They have many uses in science, but in general they allow the scientist to perceive patterns in data that would otherwise be too chaotic to understand, and to generate synthetic data with properties that mimic those of real data.

Commonly used, nested models of DNA sequence evolution

Jukes-Cantor

    Assumes that all nucleotides are present with equal frequencies

    Assumes equal probabilities for all possible nucleotide substitutions

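    If the substitution rate is u, then with 4 nucleotides, events occur at rate (4/3)u, where each event replaces the base with one drawn at random from all four (so one quarter of events leave the base unchanged).

    In each small interval of time, dt, the probability of no event occurring is 1 - (4/3)u dt, so the probability of no event over a time t is e^(-(4/3)ut).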
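
As a minimal sketch of the Jukes-Cantor bullets above, the Python below assumes the common formulation in which events occur at rate (4/3)u and each event draws the new base uniformly from all four nucleotides. The function names and the values of u and t are illustrative choices, not part of these notes.

```python
import math

def jc_no_event_prob(u, t):
    """Probability that no event hits a site during time t (event rate 4u/3)."""
    return math.exp(-4.0 * u * t / 3.0)

def jc_transition_probs(u, t):
    """Return (p_same, p_diff) under Jukes-Cantor.

    p_same: probability a site shows the same base after time t.
    p_diff: probability it shows one specific different base.
    """
    no_event = jc_no_event_prob(u, t)
    # If at least one event occurred, the base is a uniform draw from all four.
    p_same = no_event + (1.0 - no_event) / 4.0
    p_diff = (1.0 - no_event) / 4.0
    return p_same, p_diff

p_same, p_diff = jc_transition_probs(u=1.0, t=0.1)
print(p_same, p_diff)
```

Because there are three possible different bases, p_same + 3 * p_diff is 1 for any t, and both probabilities approach 1/4 as t grows.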

Kimura two-parameter

    Transition-transversion ratio

            A             C             G             T
        A   -             Transversion  Transition    Transversion
        C   Transversion  -             Transversion  Transition
        G   Transition    Transversion  -             Transversion
        T   Transversion  Transition    Transversion  -

    Assumes that all nucleotides are present with equal frequencies
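
The transition/transversion table above can be expressed as a short rule: a change within the purines (A, G) or within the pyrimidines (C, T) is a transition, and anything else is a transversion. The sketch below is one way to write that rule in Python; the function names are hypothetical, and the two sequences are the ones used in the Examples section of these notes.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(x, y):
    """Classify a pair of aligned bases as identical, transition, or transversion."""
    if x == y:
        return "identical"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def count_changes(seq_x, seq_y):
    """Tally the three categories over two aligned sequences of equal length."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"identical": 0, "transition": 0, "transversion": 0}
    for x, y in zip(seq_x, seq_y):
        counts[classify(x, y)] += 1
    return counts

counts = count_changes("AAAAACCCCCGGGGGTTTTT", "ACGTACGTACGTACGTACGT")
print(counts)
```

For these two sequences the tally is 8 identical sites, 4 transitions, and 8 transversions.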

Felsenstein 1984

    Nucleotide frequency

    Transition-transversion ratio

General time reversible

Additional variants

Model Parameters

What is a parameter?

Parameter Estimation

    In the context of likelihood, both the model parameters and the hypothesis (the tree) can be varied

    Thus one can use likelihood to estimate the parameter values as well as the tree topology

    It is generally fastest to hold the tree topology constant and estimate the parameters on that tree

    Because the parameters do not vary greatly among similar trees, it is usually safe to estimate parameters on any reasonable tree

    The concern is often raised that this form of parameter estimation involves circular reasoning.

    This is a legitimate concern.

    However, parameter estimation and topology searching are part of an overall process of global optimization.

    Thus the real concern is that parameter estimation will lead to a local, rather than a global optimum.

    One could also re-estimate parameters for each tree being evaluated, but this would be extremely slow.

    To explore these concerns, you should try doing parameter estimation starting with a variety of tree topologies.

    One can also estimate parameters on a "star" phylogeny (a fully unresolved tree), from a distance matrix, or by Monte Carlo simulation.
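
To make the idea of likelihood-based parameter estimation concrete, here is a deliberately small sketch. It assumes the simplest possible case: two sequences (so the topology is trivially fixed), the Jukes-Cantor model, and a single parameter, the distance d in expected substitutions per site. The grid search and the function names are illustrative choices; real programs use faster numerical optimizers and many more parameters.

```python
import math

def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of a pairwise alignment under Jukes-Cantor at distance d."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)  # joint probability of one matching site
    p_diff = 0.25 * (0.25 - 0.25 * decay)  # joint probability of one specific mismatch
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

def estimate_distance(n_sites, n_diff, step=0.001, max_d=5.0):
    """Grid-search the distance that maximizes the log-likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))

# 20 aligned sites with 12 differences, as in the Examples section of these notes.
d_hat = estimate_distance(n_sites=20, n_diff=12)
print(round(d_hat, 3))
```

The grid maximum should land next to the closed-form Jukes-Cantor distance, -3/4 ln(1 - 4/3 * 0.6), which is about 1.207.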

     

Models of site to site rate variation

    Invariant sites model

    DNArates model

    Gamma distribution

    Invariant sites + gamma
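
As a rough illustration of the invariant sites + gamma idea, the sketch below draws a relative rate for each site: with probability p_invariant the site never changes (rate 0), and otherwise its rate comes from a gamma distribution with shape alpha and mean 1. The parameter names and values are assumptions made for the example, and this is a simulation sketch rather than the discrete-gamma approximation that phylogenetics programs typically use.

```python
import random

def sample_site_rates(n_sites, alpha, p_invariant=0.0, seed=None):
    """Draw one relative rate per site under an invariant sites + gamma model."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_invariant:
            rates.append(0.0)  # invariant site: never changes
        else:
            # gammavariate(shape, scale); scale = 1/alpha gives a mean rate of 1
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates

rates = sample_site_rates(n_sites=10, alpha=0.5, p_invariant=0.2, seed=1)
print([round(r, 3) for r in rates])
```

Small values of alpha give strongly skewed rates (most sites slow, a few fast), while large values of alpha approach equal rates at all variable sites.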

Assumptions made by parsimony methods


Jukes-Cantor (JC)

  A C G T
A -3a a a a
C a -3a a a
G a a -3a a
T a a a -3a

Kimura Two Parameter (K2P)

  A C G T
A -a-2b b a b
C b -a-2b b a
G a b -a-2b b
T b a b -a-2b

Hasegawa, Kishino, Yano 1985 (HKY85)

  A C G T
A -m(kpG+pY) mpC mkpG mpT
C mpA -m(kpT+pR) mpG mkpT
G mkpA mpC -m(kpA+pY) mpT
T mpA mkpC mpG -m(kpC+pR)

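Where pR=pA+pG and pY=pC+pT. With equal base frequencies (pA=pC=pG=pT=1/4), the HKY85 matrix reduces to the K2P matrix with a=mk/4 and b=m/4.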
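
The three rate matrices above are nested, which a small sketch can make visible. The Python below builds the HKY85 matrix from m, k, and the base frequencies, following the entries listed above; the function and variable names are hypothetical. With equal base frequencies the rows take the K2P pattern (one transition rate, two equal transversion rates), and with k = 1 as well they take the Jukes-Cantor pattern (all changes equally likely).

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(x, y):
    """True when x and y are both purines or both pyrimidines."""
    return (x in PURINES) == (y in PURINES)

def hky85_rate_matrix(m, k, freqs):
    """Build the HKY85 rate matrix as a dict of dicts, q[from_base][to_base].

    freqs maps each base to its equilibrium frequency (values sum to 1).
    """
    q = {}
    for x in BASES:
        row = {}
        for y in BASES:
            if x != y:
                rate = m * freqs[y]
                if is_transition(x, y):
                    rate *= k  # transitions are scaled by k
                row[y] = rate
        # The diagonal makes each row sum to zero.
        row[x] = -sum(row.values())
        q[x] = row
    return q

equal = {b: 0.25 for b in BASES}
q = hky85_rate_matrix(m=1.0, k=2.0, freqs=equal)
print([q["A"][b] for b in BASES])
```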

Examples

Consider two distantly related sequences:

AAAAACCCCCGGGGGTTTTT
ACGTACGTACGTACGTACGT

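Fxy (the proportion of sites showing each pair of bases; rows are bases in the first sequence, columns are bases in the second) =

  A C G T
A 0.10 0.05 0.05 0.05
C 0.05 0.10 0.05 0.05
G 0.05 0.05 0.10 0.05
T 0.05 0.05 0.05 0.10

The entries are labelled a through p in order, so the diagonal entries are a, f, k, and p.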

p distance:

dxy = 1-(a+f+k+p) = 1-(0.1+0.1+0.1+0.1) = 0.60

JC:

D = 1-(a+f+k+p) = 0.60

dxy = -3/4 ln(1 - 4/3*D) = -3/4 ln(1 - 4/3*0.60) = -3/4 ln(0.20) = 1.207 (approximately)
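
The worked example can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the divergence matrix is stored as a dictionary keyed by (base in first sequence, base in second sequence).

```python
import math
from collections import Counter

BASES = "ACGT"

def divergence_matrix(seq_x, seq_y):
    """Proportion of aligned sites showing each ordered pair of bases."""
    n = len(seq_x)
    counts = Counter(zip(seq_x, seq_y))
    return {(x, y): counts[(x, y)] / n for x in BASES for y in BASES}

def p_distance(f):
    """One minus the sum of the diagonal of the divergence matrix."""
    return 1.0 - sum(f[(b, b)] for b in BASES)

def jc_distance(p):
    """Jukes-Cantor correction of a p distance (requires p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

f = divergence_matrix("AAAAACCCCCGGGGGTTTTT", "ACGTACGTACGTACGTACGT")
p = p_distance(f)
print(round(p, 2), round(jc_distance(p), 3))
```

The corrected distance is roughly twice the p distance here, which reflects how many multiple substitutions are expected to be hidden between sequences this divergent.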


e is the base of natural logarithms. It is the value of the exponential function exp(z) when z = 1, approximately 2.71828.

exp (ln x) = x = ln (exp x)

The Poisson distribution is an approximation of the binomial distribution, and is appropriate to use in cases where the probability of an event occurring is small, but there are many opportunities for that event to occur.
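
A quick numerical comparison shows how close the approximation is. In the sketch below, n = 1000 opportunities and p = 0.002 per opportunity are arbitrary illustrative values; the Poisson mean is set to n * p.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected number is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 1000, 0.002
for k in range(5):
    # Columns: k, binomial probability, Poisson probability
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, n * p), 4))
```

With these values the two columns should agree to about three decimal places. The agreement degrades as p grows, which is why the approximation is reserved for rare events.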

 


HMM, page 434

Huelsenbeck, J.P., and Rannala, B. 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276:227-232.