Base Compositional Bias

  1. The models we have discussed so far assume that the data are stationary, i.e., that the same model of sequence substitution (and hence the same expected base composition) applies across the whole tree. Real data probably violate this assumption often, to varying degrees.
  2. A well documented form of non-stationarity is lineage specific base compositional bias, often called G-C bias.
  3. Base composition and codon usage vary considerably among lineages.
    1. In some cases base composition is thought to be an adaptive trait
      1. Thermophilic bacteria have high GC content in their ribosomal RNA genes and at sites that are free to vary in protein coding genes. This is probably because the stronger bond in GC base-pairing helps to stabilize the nucleic acid at high temperatures.
    2. Base composition also probably varies over time via a random walk.
  4. Data transformation can decrease the sensitivity of the analysis to base compositional bias
    1. Transversion analysis (Carl Woese)
      1. Recode nucleotide data as purine (R) vs. pyrimidine (Y)
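      2. Only information from transversion events (purine to pyrimidine or the reverse) is used; transitions become invisible after recoding.
      3. Because A & G are both purines, and C & T are both pyrimidines, a shift in base composition that proceeds through transitions (A/G or C/T exchanges) does not change the recoded data, so transversions are expected to be less strongly influenced by selection for base composition.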
    2. Peptide analysis
      1. Degeneracy in the genetic code means that amino acid composition can remain relatively constant even if the nucleotide sequence is under selection for base composition, because much of the change can occur at synonymous sites.
      2. Translate the nucleotide sequences to amino acid sequences and analyze those instead.
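
The two recodings above can be sketched in a few lines of Python. This is a minimal illustration, not a tool from the course: the function names `ry_recode` and `translate` are hypothetical, the standard genetic code is assumed, and ambiguity codes and gaps are simply mapped to `?` or `X`.

```python
# Illustrative helpers only; names and handling of unknown characters are assumptions.

RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry_recode(seq):
    """Recode nucleotides as purine (R) or pyrimidine (Y); anything else becomes '?'."""
    return "".join(RY_MAP.get(base, "?") for base in seq.upper())


# Standard genetic code, written in the conventional TCAG ordering
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate in frame 1; codons with unknown characters become 'X'.

    Any trailing partial codon is ignored.
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    # GCA and GCG both encode alanine and differ by a transition (A vs. G),
    # so the difference in GC content disappears under either recoding.
    for s in ("ATGGCA", "ATGGCG"):
        print(s, ry_recode(s), translate(s))
```

The recoded strings can then be analyzed with any two-state (for R/Y) or amino acid model. Both recodings discard information, so the gain in robustness to compositional bias comes at the cost of fewer informative characters.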
  5. Analytical methods have been developed that are more tolerant of non-stationarity
    1. This is an active area of research in phylogenetic theory.
    2. LogDet (Pete Lockhart, Jim Lake)
      1. A determinant is a quantity from matrix algebra that gives the (signed) volume of a parallelepiped in n-dimensional space, namely the one spanned by the rows of the matrix.
      2. The basic idea is to use the determinant of a pairwise divergence matrix (how often each base in one sequence is aligned with each base in the other) as a measure of distance.
      3. LogDet is a scaled determinant method that produces values that are more directly comparable to standard genetic distances.
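
As a rough sketch of the idea, one commonly cited form of the scaled distance for sequences x and y is d = -(1/4) [ln det F - (1/2) ln(det Px det Py)], where F is the 4 x 4 matrix of aligned-base proportions and Px, Py are diagonal matrices of the base frequencies in each sequence. The Python below assumes that form; the function names are hypothetical, sites with gaps or ambiguity codes are skipped, and published implementations may differ in details such as the treatment of invariant sites.

```python
import math

BASES = "ACGT"


def divergence_matrix(seq_x, seq_y):
    """4x4 matrix F: F[i][j] is the proportion of sites with base i in x and base j in y."""
    index = {b: i for i, b in enumerate(BASES)}
    counts = [[0.0] * 4 for _ in range(4)]
    n = 0
    for a, b in zip(seq_x.upper(), seq_y.upper()):
        if a in index and b in index:  # skip gaps and ambiguity codes
            counts[index[a]][index[b]] += 1
            n += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return [[c / n for c in row] for row in counts]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


def logdet_distance(seq_x, seq_y):
    """Scaled LogDet distance under the form assumed in the lead-in."""
    f = divergence_matrix(seq_x, seq_y)
    det_f = determinant(f)
    if det_f <= 0:
        raise ValueError("LogDet is undefined when det(F) <= 0")
    freqs_x = [sum(row) for row in f]                                # row sums
    freqs_y = [sum(f[i][j] for i in range(4)) for j in range(4)]     # column sums
    log_pi = sum(math.log(p) for p in freqs_x + freqs_y)
    return -0.25 * (math.log(det_f) - 0.5 * log_pi)


if __name__ == "__main__":
    x = "AAAACCCCGGGGTTTT"
    y = "AAAGCCCCGGGGTTTT"
    print(logdet_distance(x, x), logdet_distance(x, y))
```

With this scaling, two identical sequences get a distance of zero, which is what makes the values comparable to ordinary genetic distances. The distance is undefined when the determinant is zero or negative, which can happen with short or highly divergent sequences.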