Models of DNA Sequence Evolution

Models of character evolution

A model is an abstract representation of reality. A model can be used to relate data to a hypothesis or, in other words, to provide a context for the data. Models have many uses in science, but in general they allow the scientist to perceive patterns in data that would otherwise be too chaotic to understand, and to generate synthetic data with properties that mimic those of real data.

Commonly used, nested models of DNA sequence evolution

Jukes-Cantor

Assumes that all nucleotides are present with equal frequencies

Assumes equal probabilities for all possible nucleotide substitutions

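If the substitution rate is u per unit time, it is convenient to imagine events occurring at rate (4/3)u, where each event replaces the current base with one of the 4 nucleotides chosen at random. One time in four the base is replaced by itself, so actual substitutions occur at rate (3/4)(4/3)u = u.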

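In each small interval of time, dt, the probability of no event occurring is 1 - (4/3)u dt, so over a time t the probability that no event occurs is e^(-(4/3)ut).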

Kimura two-parameter

Transition-transversion ratio

	A	C	G	T
A		Transversion	Transition	Transversion
C	Transversion		Transversion	Transition
G	Transition	Transversion		Transversion
T	Transversion	Transition	Transversion
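The table above can be expressed as a small lookup rule: a change within the purines (A, G) or within the pyrimidines (C, T) is a transition, and anything else is a transversion. The following is a minimal sketch; the function name `classify_substitution` is a hypothetical helper, not part of any library.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x: str, y: str) -> str:
    """Return 'identical', 'transition', or 'transversion' for two bases."""
    x, y = x.upper(), y.upper()
    valid = PURINES | PYRIMIDINES
    if x not in valid or y not in valid:
        raise ValueError(f"expected A, C, G, or T, got {x!r} and {y!r}")
    if x == y:
        return "identical"
    # Both purines or both pyrimidines: transition. Mixed: transversion.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"
```

For example, `classify_substitution("A", "G")` gives a transition and `classify_substitution("A", "C")` gives a transversion, matching the corresponding cells of the table.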

Assumes that all nucleotides are present with equal frequencies

Felsenstein 1984

Nucleotide frequency

Transition-transversion ratio

General time reversible

Additional variants

Model Parameters

What is a parameter?

Parameter Estimation

In the context of likelihood, the model and the hypothesis can be changed

Thus one can use likelihood to estimate the parameter values as well as the tree topology.

It is generally fastest to hold the tree topology constant and estimate the parameters on that tree.

Because the parameters do not vary greatly among similar trees, it is usually safe to estimate parameters on any reasonable tree.

The concern is often raised that this form of parameter estimation involves circular reasoning.

This is a legitimate concern.

However, parameter estimation and topology searching are part of an overall process of global optimization.

Thus the real concern is that parameter estimation will lead to a local, rather than a global optimum.

One could also re-estimate parameters for each tree being evaluated, but this would be extremely slow.

To explore these concerns, you should try doing parameter estimation starting with a variety of tree topologies.

One can also estimate parameters on a "star" phylogeny (a fully unresolved tree), from a distance matrix, or by Monte Carlo simulation.
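As a toy illustration of estimating a parameter by likelihood while the topology is held fixed, the sketch below uses the simplest possible case: two sequences (so the "tree" is a single branch) under Jukes-Cantor, with the branch length d as the only free parameter. The function names, the golden-section search, and the search bounds are all assumptions made for illustration; real phylogenetics software uses more elaborate optimizers. The counts (20 sites, 12 differences) are those of the two sequences in the Examples section.

```python
import math


def jc_log_likelihood(d: float, n_sites: int, n_diff: int) -> float:
    """Log-likelihood of branch length d (substitutions per site) under JC."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # P(base unchanged after distance d)
    p_diff = 0.25 - 0.25 * decay   # P(change to one specific other base)
    # Each site contributes (1/4) * P(x -> y | d).
    return (n_sites * math.log(0.25)
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))


def golden_section_max(f, lo: float, hi: float, tol: float = 1e-9) -> float:
    """Locate the maximum of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d   # maximum lies in [a, d]
        else:
            a = c   # maximum lies in [c, b]
    return (a + b) / 2.0


# Lower bound is slightly above 0 to avoid log(0) when differences exist.
d_hat = golden_section_max(lambda d: jc_log_likelihood(d, 20, 12), 1e-6, 10.0)
```

Under these assumptions the numerical optimum should agree with the closed-form JC distance, -3/4 ln(1 - 4/3 * 0.6). Starting the search from different bounds is a cheap way to check that the optimizer is not stuck at a local optimum, which is the concern raised above.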

Models of site to site rate variation

Invariant sites model

DNArates model

Gamma distribution

Invariant sites + gamma

Assumptions made by parsimony methods

Jukes-Cantor (JC)

	A	C	G	T
A	-3a	a	a	a
C	a	-3a	a	a
G	a	a	-3a	a
T	a	a	a	-3a
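The rate matrix above gives instantaneous rates; the probabilities of change over a finite time t follow from it in closed form. For this matrix (every off-diagonal rate equal to a), the standard result is P(same) = 1/4 + 3/4 e^(-4at) and P(one specific different base) = 1/4 - 1/4 e^(-4at). The sketch below tabulates these; `jc_transition_matrix` is a hypothetical helper name.

```python
import math

BASES = "ACGT"


def jc_transition_matrix(a: float, t: float) -> dict:
    """P[x][y] = probability that base x is base y after time t under JC.

    `a` is the off-diagonal rate of the JC rate matrix.
    """
    decay = math.exp(-4.0 * a * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return {x: {y: (p_same if x == y else p_diff) for y in BASES}
            for x in BASES}


P = jc_transition_matrix(a=0.1, t=2.0)
row_sums = [sum(P[x].values()) for x in BASES]  # each row should sum to 1
```

At t = 0 the matrix is the identity, and as t grows every entry approaches 1/4, which reflects the equal-frequency assumption of the model.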

Kimura Two Parameter (K2P)

	A	C	G	T
A	-a-2b	b	a	b
C	b	-a-2b	b	a
G	a	b	-a-2b	b
T	b	a	b	-a-2b
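Because K2P distinguishes transitions (rate a) from transversions (rate b), a distance under this model needs the two kinds of difference counted separately. The sketch below uses Kimura's distance formula, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the observed proportions of transitional and transversional differences. The function name `k2p_distance` is hypothetical, and the code assumes aligned, upper-case sequences containing only A, C, G, and T.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq_x: str, seq_y: str) -> float:
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq_x, seq_y):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    n = len(seq_x)
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log(1.0 - 2.0 * p - q) - 0.25 * math.log(1.0 - 2.0 * q)


d_k2p = k2p_distance("AAAAACCCCCGGGGGTTTTT", "ACGTACGTACGTACGTACGT")
```

For the two sequences used in the Examples section, P = 4/20 and Q = 8/20, so transitions are exactly half as frequent as transversions. That is the ratio expected when all substitutions are equally likely, and in that case the K2P distance coincides with the JC distance.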

Hasegawa, Kishino, Yano 1985 (HKY85)

	A	C	G	T
A	-m(kp_G+p_Y)	mp_C	mkp_G	mp_T
C	mp_A	-m(kp_T+p_R)	mp_G	mkp_T
G	mkp_A	mp_C	-m(kp_A+p_Y)	mp_T
T	mp_A	mkp_C	mp_G	-m(kp_C+p_R)

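Where p_R = p_A + p_G and p_Y = p_C + p_T. With equal base frequencies (every p = 1/4) this reduces to K2P, with transition rate a = mk/4 and transversion rate b = m/4.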
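The HKY85 matrix can be built mechanically from the base frequencies, the overall rate m, and the transition bias k: each off-diagonal entry is m times the frequency of the target base, multiplied by k when the change is a transition, and each diagonal entry is minus the sum of the rest of its row. The sketch below follows that recipe; `hky85_rate_matrix` is a hypothetical helper, and the frequencies in the usage lines are made-up illustrative values.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def hky85_rate_matrix(freqs: dict, m: float = 1.0, k: float = 2.0) -> dict:
    """Return Q[x][y], the HKY85 instantaneous rate from base x to base y."""
    if abs(sum(freqs[b] for b in BASES) - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x == y:
                continue
            is_transition = (x in PURINES) == (y in PURINES)
            q[x][y] = m * (k if is_transition else 1.0) * freqs[y]
        # Diagonal entry makes the row sum to zero.
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    return q


# Illustrative (made-up) frequencies, not estimates from real data.
freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
Q = hky85_rate_matrix(freqs, m=1.0, k=2.0)
```

Two quick sanity checks on any such matrix are that every row sums to zero and that freqs[x] * Q[x][y] equals freqs[y] * Q[y][x], which is the time-reversibility condition.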

Examples

Consider two distantly related sequences:

x: AAAAACCCCCGGGGGTTTTT
y: ACGTACGTACGTACGTACGT

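Fxy = (rows are the base in the first sequence, columns the base in the second; each cell is the proportion of the 20 sites with that pair)

	A	C	G	T
A	0.10	0.05	0.05	0.05
C	0.05	0.10	0.05	0.05
G	0.05	0.05	0.10	0.05
T	0.05	0.05	0.05	0.10

The cells are labelled a through p, row by row, so a, f, k, and p are the diagonal cells (sites at which the two sequences are identical).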

p distance:

d_xy = 1-(a+f+k+p) = 1-(0.1+0.1+0.1+0.1) = 0.60
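The divergence matrix and the p distance can be tallied directly from the two sequences. The following is a minimal sketch with hypothetical function names (`divergence_matrix`, `p_distance`); it assumes aligned sequences of equal length containing only A, C, G, and T.

```python
from collections import Counter

BASES = "ACGT"


def divergence_matrix(seq_x: str, seq_y: str) -> list:
    """F[i][j] = proportion of sites with base i in seq_x and base j in seq_y."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be non-empty and of equal length")
    counts = Counter(zip(seq_x, seq_y))
    n = len(seq_x)
    return [[counts[(x, y)] / n for y in BASES] for x in BASES]


def p_distance(f: list) -> float:
    """One minus the diagonal (identical-site) proportions of F."""
    return 1.0 - sum(f[i][i] for i in range(len(BASES)))


F = divergence_matrix("AAAAACCCCCGGGGGTTTTT", "ACGTACGTACGTACGTACGT")
d = p_distance(F)  # 0.6, up to floating-point rounding
```

Each diagonal cell of F is 2/20 = 0.1 and each off-diagonal cell is 1/20 = 0.05 for these sequences, which reproduces the hand calculation.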

JC:

D = 1-(a+f+k+p) = 0.60

d_xy = -3/4 ln(1 - 4/3*D) = -3/4 ln(0.2) = 1.207 (approximately)
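The JC correction is a one-line function of the observed proportion of differing sites D. The sketch below (with the hypothetical name `jc_distance`) also guards against D at or above 3/4, where the logarithm is undefined and the sequences are effectively saturated.

```python
import math


def jc_distance(d_observed: float) -> float:
    """Jukes-Cantor distance from the observed proportion of differences."""
    if not 0.0 <= d_observed < 0.75:
        raise ValueError("JC distance is only defined for 0 <= D < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d_observed)


d_jc = jc_distance(0.6)  # roughly 1.207 substitutions per site
```

The corrected distance is about twice the p distance of 0.60 here, because at this level of divergence many sites are expected to have changed more than once.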

e is the base of natural logarithms. It is the value of the exponential function exp(z) at z = 1, approximately 2.71828.

exp (ln x) = x = ln (exp x)

The Poisson distribution is an approximation of the binomial distribution, and is appropriate to use in cases where the probability of an event occurring is small, but there are many opportunities for that event to occur.
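The quality of the approximation can be checked numerically. The sketch below compares the two probability mass functions for an assumed case of n = 1000 opportunities, each with probability p = 0.002, so the Poisson mean is np = 2. These numbers are illustrative choices, not values from the notes.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k events in n independent trials, each with probability p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


n, p = 1000, 0.002   # many opportunities, small probability per opportunity
lam = n * p          # Poisson mean matching the binomial mean
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
              for k in range(10))

for k in range(5):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))
```

Note that the Poisson probability of zero events is e^(-lam), which is the same form as the probability of no substitution event along a branch.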

HMM, page 434

Huelsenbeck, J.P., and Rannala, B. 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276:227-232.