Phylogenetic methods can be used for many purposes, including analysis of morphological and several kinds of molecular data. We concentrate here on the analysis of DNA and protein sequences.
Comparisons of more than two sequences
Analysis of gene families, including functional predictions
Estimation of evolutionary relationships among organisms
The basic concepts of phylogenetic analysis are quite easy to understand, but understanding what the results mean and avoiding analytical errors can be quite difficult. For detailed coursework you can take my graduate class on the topic.
A "quick and dirty" substitute for phylogenetic analysis
Using BLAST for multiple sequence comparisons
Emphasis is on reciprocal best hits, particularly among three genomes
This is probably an OK way to identify homologs, but it does not have the power of full phylogenetic analysis
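The reciprocal-best-hit idea can be sketched in a few lines. This is a minimal illustration, not a BLAST pipeline: the two dictionaries are hypothetical best-hit tables (for example, parsed from tabular BLAST output), and the gene names are invented.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)

# Hypothetical best-hit tables for genome A vs. genome B (names are made up).
a_to_b = {"geneA1": "geneB1", "geneA2": "geneB2", "geneA3": "geneB2"}
b_to_a = {"geneB1": "geneA1", "geneB2": "geneA2"}

rbh = reciprocal_best_hits(a_to_b, b_to_a)
print(rbh)
```

Here geneA3 is dropped because its best hit, geneB2, points back to geneA2 instead. A three-genome comparison would apply the same test to each of the three genome pairs and keep only the genes that are consistent across all of them.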
Example with Everyday Objects
The basic model of phylogenetic analysis.
Nearly all methods of phylogenetic analysis share a number of fundamental assumptions. These include:
Homologous sequences are in a multiple sequence alignment.
Note that homology is an a priori assumption of most phylogenetic methods. If homology is uncertain, then the analytical results should be interpreted with great caution.
The alignment is also referred to as a data matrix
Each column in the alignment is referred to as a character.
The specific residue (nucleotide or amino acid) present in a given sequence is referred to as the character state.
They are assumed to have been derived from a single common ancestor (this statement is actually redundant; by definition homologous sequences must be derived from a common ancestor).
In most cases ancestral sequences are not known, and the ancestral states must be inferred
The ancestral sequences are assumed to have undergone mutation
Modeling mutation accurately is one of the challenges of phylogenetic analysis
They are assumed to be related by a dichotomously branching tree
A priori assumptions include (but are not necessarily limited to):
Accuracy of sequence
That the sequence itself is correct
That it was determined from the correct organism
Violations of this assumption are more common than one might suspect. Several kinds of laboratory errors can result in incorrect annotation of an otherwise legitimate sequence.
That homology has been correctly determined. This applies to both the sequences themselves and the alignment.
Paralogy can cause tremendous confusion.
The assumptions that went into making the multiple sequence alignment are among the assumptions of the phylogenetic analysis that is based on that alignment.
That sufficient similarity remains among the sequences that there is usable phylogenetic information present.
The assumptions of phylogenetic analysis described above
Other critical considerations
The information content of the sequences
Assumptions particular to the analytical method (this will constitute much of our discussion for the next few lectures)
Note that even if a gene phylogeny is correctly inferred, that phylogeny may not be helpful. For example, because of paralogy, hybridization, introgression, and horizontal gene transfer, gene phylogenies do not always correspond to the phylogeny of the genome as a whole.
The data matrix
Multiple sequence alignments as data matrices
The importance of homology assessment
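As a small illustration of the alignment as a data matrix, the sketch below treats each row as a sequence and each column as a character. The three short sequences are invented for the example.

```python
# Toy alignment (invented data): rows are sequences, columns are characters.
alignment = {
    "seq1": "ACGTAC",
    "seq2": "ACGTTC",
    "seq3": "ATGTTC",
}

names = sorted(alignment)
rows = [alignment[name] for name in names]

# All rows of a data matrix must have the same length.
if len({len(row) for row in rows}) != 1:
    raise ValueError("sequences are not aligned to a common length")

# Transpose: each string holds the character states of one column, in row order.
characters = ["".join(column) for column in zip(*rows)]

for index, states in enumerate(characters, start=1):
    print(f"character {index}: {states}")
```

The second character, for example, has state C in seq1 and seq2 and state T in seq3.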
Phylogenetic methods can be divided into three general categories
Optimality criteria vs. tree-building algorithms
Part of a larger theoretical system referred to as "Cladistics"
Emphasises shared derived character states
The idea is that monophyletic groups can be recognized because they share derived character states ("synapomorphies").
Invariant, unique ("autapomorphic"), and ancestral character states are considered to be uninformative
Search for the tree that requires the smallest number of character-state changes
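Which columns carry information for parsimony can be checked mechanically. The sketch below uses the usual counting rule for unordered characters: a site is informative only if at least two states each occur in at least two sequences. It cannot tell ancestral from derived states, which needs additional information such as an outgroup.

```python
from collections import Counter

def classify_site(column):
    """Classify one alignment column for parsimony (unordered characters)."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # States shared by two or more sequences can support a grouping.
    shared_states = [state for state, n in counts.items() if n >= 2]
    if len(shared_states) >= 2:
        return "informative"
    return "uninformative"

for column in ["AAAA", "AAAT", "AATT", "ACGT"]:
    print(column, classify_site(column))
```

The column AAAT has a state unique to one sequence, so it requires one change on every tree and cannot favor one tree over another.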
Determining the length of a tree
Minimum number of steps for a given character can be determined in one pass
We will look at a simple case with unordered characters
This tells you the tree length, but does not map the characters onto the tree
Determining a most parsimonious reconstruction requires another pass
This reconstruction will not necessarily be unique!
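The one-pass count for a single unordered character can be sketched with the Fitch procedure, used here as one standard way to do it. The tree is written as nested tuples of leaf names; the tree and the states are invented for the example.

```python
def fitch_length(tree, states):
    """Return (state_set, steps) for one unordered character on a rooted binary tree.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping leaf name to its character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_length(left, states)
    right_set, right_steps = fitch_length(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed here.
        return common, left_steps + right_steps
    # No shared state: one change is required on a branch below this node.
    return left_set | right_set, left_steps + right_steps + 1

states = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

root_1, steps_1 = fitch_length(tree_1, states)
root_2, steps_2 = fitch_length(tree_2, states)
print(steps_1, steps_2)
```

The first tree needs one step and the second needs two, so this character favors the first tree. The state set at the root of the first tree contains both G and T, which is a small instance of a reconstruction that is not unique. Assigning states to the internal nodes would take a second pass from the root toward the tips.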
The problem with uncorrected methods
Parsimony is easy to understand and can be a useful analytical method, but the method makes some assumptions that may not be immediately obvious. One of parsimony's most important assumptions is that it is relatively unusual for identical character-states to appear independently in different parts of the phylogenetic tree. In other words, it assumes that convergent evolution is a relatively rare phenomenon.
Unfortunately this is not a valid assumption for biological sequence data.
When the possible number of character states is limited, then one expects to observe convergent evolution. Because DNA has only four possible character states, two unrelated DNA sequences would be expected to have the same nucleotide present in roughly 25% of all positions. Two random aligned sequences would be expected to share somewhat more than 25% sequence identity (why?).
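One possible part of the answer can be checked numerically. If the two sequences are independent and each base i occurs with frequency p_i, the expected identity is the sum of p_i squared, which equals 25% only when all four bases are equally frequent. The skewed frequencies below are invented. A second contributor, not modeled here, is that an alignment procedure places gaps so as to increase the number of matches.

```python
def expected_identity(freqs):
    """Expected fraction of matching positions for two independent random sequences."""
    return sum(p * p for p in freqs.values())

equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}  # invented AT-rich composition

print(expected_identity(equal))
print(expected_identity(skewed))
```

With equal frequencies the expectation is 0.25. Any departure from equal frequencies raises it.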
Because of this, under some conditions parsimony methods will be inconsistent
Although amino acid data have more character states than DNA and are therefore probably less
Models of DNA Sequence Evolution
Jukes-Cantor (JC)
All substitutions are equally likely
All nucleotides occur with equal frequency
Kimura Two Parameter (K2P)
Transitions and transversions can occur at different rates
All nucleotides occur with equal frequency
     A             C             G             T
A    -             Transversion  Transition    Transversion
C    Transversion  -             Transversion  Transition
G    Transition    Transversion  -             Transversion
T    Transversion  Transition    Transversion  -
In the evolution of real sequences transitions are typically observed more often than transversions.
Example of a substitution probability matrix consistent with the K2P model.
     A     C     G     T
A    0.6   0.1   0.2   0.1
C    0.1   0.6   0.1   0.2
G    0.2   0.1   0.6   0.1
T    0.1   0.2   0.1   0.6
These values represent the probability of the corresponding event occurring within a unit of time, t.
The values in the diagonals are selected such that each row adds up to one. Each row has to add up to one because the substitution matrix takes into account all possible events within the model.
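The row-sum property of the example matrix above can be verified directly. The sketch also multiplies the matrix by itself to get the probabilities over two units of time, assuming the same matrix applies independently in each unit (a Markov assumption that the notes do not state explicitly).

```python
BASES = "ACGT"

# Example K2P-style matrix from the notes; rows are "from", columns are "to".
P = [
    [0.6, 0.1, 0.2, 0.1],
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.2, 0.1, 0.6],
]

def matmul(X, Y):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

row_sums = [sum(row) for row in P]
P2 = matmul(P, P)  # probabilities after two units of time

print(row_sums)
for base, row in zip(BASES, P2):
    print(base, [round(value, 2) for value in row])
```

After two steps the probability that a site starting as A is still (or again) A drops to 0.42, and each row of the two-step matrix still adds up to one.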
Felsenstein 1984 and Hasegawa, Kishino, and Yano 1985 (F84/HKY85)
Transitions and transversions occur at different rates
The four nucleotides can occur with different frequencies
General Time Reversible
Each of the six possible substitutions occurs at a different rate, but rates are always symmetrical, i.e., the rate for A being substituted by C is equal to the rate for C being substituted by A.
Nucleotides can occur with different frequencies.
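A rate matrix with these two properties can be sketched as follows. In the usual parameterization the symmetrical quantity is an exchangeability parameter for each pair of bases, and the instantaneous rate from i to j is that parameter multiplied by the frequency of the target base j. All numerical values below are invented for illustration.

```python
BASES = "ACGT"

# Invented base frequencies (must sum to one).
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

# Invented exchangeabilities, one per unordered pair of bases (six in total).
exch = {
    ("A", "C"): 1.0, ("A", "G"): 4.0, ("A", "T"): 0.5,
    ("C", "G"): 0.8, ("C", "T"): 5.0, ("G", "T"): 1.0,
}

def rate(i, j):
    """Instantaneous rate of change from base i to base j."""
    if i == j:
        # Diagonal entries make each row of the rate matrix sum to zero.
        return -sum(rate(i, k) for k in BASES if k != i)
    key = tuple(sorted((i, j)))  # same parameter for i->j and j->i
    return exch[key] * pi[j]

Q = [[rate(i, j) for j in BASES] for i in BASES]

# Time reversibility: the flow from i to j equals the flow from j to i.
reversible = all(
    abs(pi[i] * rate(i, j) - pi[j] * rate(j, i)) < 1e-12
    for i in BASES for j in BASES
)
print(reversible)
```

Setting all six exchangeabilities equal and all four frequencies to 0.25 recovers a model in which all substitutions are equally likely.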
Modeling site-to-site rate variation
Invariant sites model
Pairwise distances can be aggregated into a phylogenetic tree
Search for the tree that minimizes discrepancies among pairwise distances
May or may not use an explicit model of sequence evolution
How the distances are calculated and how the tree is found can be mixed and matched
To know what method is being used, you have to know both how the distance matrix was constructed, and how the tree was determined
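The first half of a distance analysis, computing the pairwise distances, can be sketched with and without a model. The uncorrected p-distance is simply the proportion of differing sites. The corrected version below uses the standard Jukes-Cantor formula, chosen here only as the simplest model-based correction. The sequences are invented.

```python
import math

def p_distance(s1, s2):
    """Proportion of differing sites, skipping positions with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)

def jc_distance(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

s1 = "ACGTACGTAC"  # invented sequences
s2 = "ACGTACGTTT"

p = p_distance(s1, s2)
d = jc_distance(p)
print(round(p, 4), round(d, 4))
```

The corrected distance is larger than the observed proportion because it allows for multiple substitutions at the same site. Either kind of distance matrix could then be passed to any tree-finding procedure.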
A model of sequence evolution can be used to relate the data to a hypothesis (typically a tree topology).
Search for the tree that maximizes the likelihood function
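The simplest possible case of a likelihood function is two sequences joined by a single branch. The sketch below assumes the Jukes-Cantor model, counts of 100 aligned sites and 20 differences (invented numbers), and a plain grid search in place of a proper optimizer. A real analysis would sum over unknown ancestral states on a full tree, which is not shown here.

```python
import math

def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of branch length d (substitutions per site) under Jukes-Cantor."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base at both ends
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # Each site also carries the 0.25 equilibrium frequency of its starting base.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

# Grid of candidate branch lengths from 0.001 to 2.000.
grid = [i / 1000.0 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: jc_log_likelihood(d, 100, 20))
print(best_d)
```

The branch length that maximizes the likelihood lands close to 0.233, which matches the Jukes-Cantor corrected distance for 20% observed differences.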
The idea is to find the tree that is most likely given the data and the model
Typically uses a Markov chain Monte Carlo (MCMC) algorithm
Estimates probabilities for branch lengths and tree topologies
Properties of analytical methods