Estimating Alignment Quality
What is a good alignment?
Ideally it would reflect homology, which would require knowledge of the history of indel mutations. This is very rarely possible.
As a stand-in for information about the history of evolution, alignment algorithms rely on patterns of similarity.
How are these patterns expected to vary?
Random sequences
A reasonable first approximation would be to say that an alignment is "good" if it shows greater similarity than would be expected if compared to an alignment of random sequences.
But what does random mean in this context?
If gaps can be created at no cost, then an arbitrarily long alignment can be created with a very high fraction of matching characters, hence the gap penalty.
The goal is to calculate the probability that two random sequences aligned against each other would achieve a score as high as the observed score.
Normal Distribution
The most familiar distribution.
What are its characteristics? What assumptions underly the normal distribution?
Area under curve is 1
Symmetrical around mean, (expectation value) at 0
Area under curve above the mean is 0.5, area below is the same. Variance, sigma2 is 1 ?
Gumbel Extreme Value Distribution
Used when the value of interest has been pre-selected to maximize some score.
Y = exp [-x-e -x]
Characteristics
Asymmetric, and skewed to the higher values.
Mode (expectation value) is is the Euler-Mascheroni constant = 0.57722...
Variance of x = Pi2/6 = 1.6449
Characteristic value is u, scaling factor is lambda
Typical test is probability of obtaining a score of at least x with random data.
Because of the skewness of the curve, x must be higher to be significant. This makes sense; the values being compared have been selected to be high values.
Expected score of an ungapped alignment
For a typical scoring matrix, the score that is likely to be achieved at the P=0.05 level is given by
S = log2(nm)
For example, two 250 amino acid sequences would be expected to yield an alignment with a score of at least 16 bits.
--> find a log base 2 table
GCG has a function that makes it easy to calculate the score of a query sequence aligned against a randomized target sequence with the same composition as the target sequence of interest. Use this to get a sense for alignments that are found at random.
Gapped alignments
Gap penalties are approximations at best
BLAST
p(S>=x) = 1-exp(-e-lambda(x-u))
u = [log(Km'n')]/lambda
K and lambda are parameters calculated from scoring matrix
m' and n' are effective sequence lengths
Actual length minus length of alignment expected at random
E = 1-e-p(s>x)D
D - database size
Bits - units of log base 2; this is a common measure in information theory
K - A constant that represents the abundance of bases and their match-scores
Mount, D. W. 2000. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.