/Alignment Statistics

Estimating Alignment Quality

What is a good alignment?

Ideally it would reflect homology, which would require knowledge of the history of indel mutations. This is very rarely possible.

As a stand-in for information about the history of evolution, alignment algorithms rely on patterns of similarity.

How are these patterns expected to vary?

Random sequences

A reasonable first approximation would be to say that an alignment is "good" if it shows greater similarity than would be expected if compared to an alignment of random sequences.

But what does random mean in this context?

If gaps can be created at no cost, then an arbitrarily long alignment can be created with a very high fraction of matching characters, hence the gap penalty.

The goal is to calculate the probability that two random sequences aligned against each other would achieve a score as high as the observed score.

Normal Distribution

The most familiar distribution.

What are its characteristics? What assumptions underly the normal distribution?

Area under curve is 1

Symmetrical around mean, (expectation value) at 0

Area under curve above the mean is 0.5, area below is the same. Variance, sigma² is 1 ?

Gumbel Extreme Value Distribution

Used when the value of interest has been pre-selected to maximize some score.

Y = exp [-x-e^-x]

Characteristics

Asymmetric, and skewed to the higher values.

Mode (expectation value) is is the Euler-Mascheroni constant = 0.57722...

Variance of x = Pi²/6 = 1.6449

Characteristic value is u, scaling factor is lambda

Typical test is probability of obtaining a score of at least x with random data.

Because of the skewness of the curve, x must be higher to be significant. This makes sense; the values being compared have been selected to be high values.

Expected score of an ungapped alignment

For a typical scoring matrix, the score that is likely to be achieved at the P=0.05 level is given by

S = log₂(nm)

For example, two 250 amino acid sequences would be expected to yield an alignment with a score of at least 16 bits.

--> find a log base 2 table

GCG has a function that makes it easy to calculate the score of a query sequence aligned against a randomized target sequence with the same composition as the target sequence of interest. Use this to get a sense for alignments that are found at random.

Gapped alignments

Gap penalties are approximations at best

BLAST

p(S>=x) = 1-exp(-e^-lambda(x-u))

u = [log(Km'n')]/lambda

K and lambda are parameters calculated from scoring matrix

m' and n' are effective sequence lengths

Actual length minus length of alignment expected at random

E = 1-e^-p(s>x)D

D - database size

Bits - units of log base 2; this is a common measure in information theory

K - A constant that represents the abundance of bases and their match-scores

Mount, D. W. 2000. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.