Base Compositional Bias
- The models we have discussed so far assume that the data are stationary,
i.e., that the same model of sequence substitution applies across the whole
tree. This assumption is probably violated often, to varying degrees, in real
data.
- A well-documented form of non-stationarity is lineage-specific base compositional
bias, often called G-C bias.
- Base composition and codon usage vary considerably among lineages
- In some cases base composition is thought to be an adaptive trait
- Thermophilic bacteria have high GC content in their ribosomal RNA
genes and at sites that are free to vary in protein coding genes.
This is probably because G-C base pairs, which form three hydrogen
bonds rather than the two of A-T (A-U) pairs, help to stabilize the
nucleic acid at high temperatures.
- Base composition also probably varies over time via a random walk.
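A quick way to see whether compositional bias might be a concern is to tabulate base frequencies per taxon. The following is a minimal sketch, assuming plain DNA strings; the function names and the toy sequences are hypothetical and only serve as illustration.

```python
from collections import Counter


def base_composition(seq):
    """Return the frequencies of A, C, G and T among unambiguous bases."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases in sequence")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(seq):
    """Return the combined frequency of G and C."""
    freqs = base_composition(seq)
    return freqs["G"] + freqs["C"]


# Hypothetical toy sequences, not real data.
taxa = {
    "taxon_A": "ATGCGCGGCCGCTAGGCC",
    "taxon_B": "ATGATATTAAATTAGATA",
}
for name, seq in taxa.items():
    print(name, round(gc_content(seq), 3))
```

Large differences in GC content among taxa suggest that a stationary model may fit poorly, although this tabulation is not itself a formal test.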
- Data transformation can decrease the sensitivity of the analysis
to base compositional bias
- Transversion analysis (Carl Woese)
- Recode nucleotide data as purine (R) vs. pyrimidine (Y)
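- Only information from transversion events is used; transitions
(A <-> G, C <-> T) become invisible after recoding.
- Because A & G are purines, and C & T are pyrimidines, each class
contains one G/C base and one A/T base. A shift in GC content therefore
changes purine vs. pyrimidine frequencies much less, and transversions
are expected to be less strongly influenced by selection for base
composition.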
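The recoding described above can be sketched in a few lines. This is an illustrative sketch only: `ry_recode` and `transversion_differences` are hypothetical helper names, and gaps or ambiguity codes are simply passed through unchanged.

```python
# Map purines (A, G) to R and pyrimidines (C, T, U) to Y.
RY_MAP = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"})


def ry_recode(seq):
    """Recode a nucleotide sequence as purine (R) vs. pyrimidine (Y)."""
    return seq.upper().translate(RY_MAP)


def transversion_differences(seq1, seq2):
    """Count aligned sites where the two sequences differ after RY recoding."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    r1, r2 = ry_recode(seq1), ry_recode(seq2)
    return sum(
        1 for a, b in zip(r1, r2)
        if a in "RY" and b in "RY" and a != b
    )


print(ry_recode("ACGT"))  # RYRY
print(transversion_differences("AACC", "GATC"))  # 0: only transitions differ
print(transversion_differences("AACC", "CATG"))  # 2: two transversions
```

The recoded alignment can then be analyzed as two-state data, at the cost of discarding whatever signal the transitions carried.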
- Peptide analysis
- Degeneracy in the genetic code means that amino acid composition
can remain relatively constant even if the nucleotide sequence is under
selection for base composition
- Translate the nucleotide sequence to an amino acid sequence and analyze
the amino acid alignment instead
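To illustrate how degeneracy can hide compositional differences, here is a minimal translation sketch. It assumes the standard genetic code and sequences that are already in frame; the `translate` helper and the two toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate an in-frame nucleotide sequence; unknown codons become X."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Two hypothetical sequences that differ only at third codon positions.
at_rich = "GCTCTTGGT"
gc_rich = "GCCCTGGGC"
print(translate(at_rich), translate(gc_rich))  # ALG ALG
```

Both toy sequences encode the same peptide even though their GC content differs, which is the property that peptide analysis relies on. Genes that use a non-standard code (for example, many mitochondrial genes) would need a different table.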
- Analytical methods have been developed that are more tolerant of non-stationarity
- This is an active area of research in phylogenetic theory.
- LogDet (Pete Lockhart, Jim Lake)
- A determinant is a quantity from matrix algebra that gives the
(signed) volume of a parallelepiped in n-dimensional space
- The basic idea is to use the determinant of a pairwise transformation
matrix (the 4 x 4 table of how often each base in one sequence is
aligned with each base in the other) as a measure of distance
- LogDet is a scaled determinant method that produces values that
are more directly comparable to standard genetic distances.
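The determinant-based distance can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes one commonly cited scaled form, d = -(1/4) ln( det F / sqrt(prod(pi_x) * prod(pi_y)) ), where F is the 4 x 4 matrix of joint base proportions for the two sequences and pi_x, pi_y are their base frequencies. The function names are hypothetical, sites with gaps or ambiguity codes are skipped, and published programs may scale or correct the distance differently.

```python
import math

BASES = "ACGT"


def divergence_matrix(seq1, seq2):
    """Return F, where F[i][j] is the proportion of sites with base i
    in seq1 aligned to base j in seq2."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    index = {b: i for i, b in enumerate(BASES)}
    counts = [[0] * 4 for _ in range(4)]
    n = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
            n += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return [[c / n for c in row] for row in counts]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


def logdet_distance(seq1, seq2):
    """Scaled log-determinant distance (assumed form, see lead-in)."""
    f = divergence_matrix(seq1, seq2)
    det_f = determinant(f)
    freqs1 = [sum(row) for row in f]  # base frequencies in seq1
    freqs2 = [sum(f[i][j] for i in range(4)) for j in range(4)]  # seq2
    scale = math.sqrt(math.prod(freqs1) * math.prod(freqs2))
    if det_f <= 0 or scale == 0:
        raise ValueError("distance undefined for this pair of sequences")
    return -0.25 * math.log(det_f / scale)


# Hypothetical toy alignment, not real data.
x = "AAAACCCCGGGGTTTT"
y = "AAAGCCCTGGGATTTC"
print(round(logdet_distance(x, y), 4))
```

With this scaling, two identical sequences get a distance of zero, and the value is roughly on the scale of substitutions per site. The distance is undefined when the determinant is not positive, which can happen with short or very divergent sequences.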