Heuristic Alignment Algorithms

BLAST - Basic Local Alignment Search Tool

BLAST is a pairwise local alignment search tool that is designed to operate maore quickly than exact methods, but without a guarantee of finding the best possible alignment.

At present BLAST is the preferred tool for searching large sequence databases such as GenBank.

Searching a database with a query sequence can be considered to be a problem in local alignment.

Local alignment tools seek a region of alignment between two sequences where the match between the sequences is better than some threshold (which is often set to be the expected match between random sequences of similar length and composition).

It is possible to calculate the expected match score for two random sequences that match the composition of the query and target sequences.

Several variants of BLAST are now available, reflecting both progressive improvement of the algorithm and tuning for specific purposes.

"Classic" BLAST (Altschul et al. 1990)

Attempts to find the maximal segment pair (MSP) for two sequences, defined as locally optimal if the score cannot be improved either by lengthening or shortening the segment pair.

Uses a matrix of similarity scores (BLOSUM, PAM, etc.)

Most effective with polypeptide sequences.

Inspired by dynamic programming algorithms such as the Needleman-Wunsch and Smith-Waterman algorithms.

These algorithms are guaranteed to find the best possible alignment (given their assumptions), but do so at the cost of evaluating a matrix of size m+1, n+1, where m and n are the lengths of the two sequences being aligned.

BLAST is not the first or the only attempt to find a heuristic approach to this problem, but it is currently the most widely used search algorithm for protein-coding sequences (FASTA can perform better with some nucleotide sequences).

BLAST also uses explicit statistical measures to improve the efficiency and precision of the search.

Three primary steps in the algorithm:

  1. Compile a list of high-scoring words
    1. Find the list of all words (typically 3 AA or 11NT) in the query sequence that would score above some threshold T
    2. Discard those words that score below T (these are assumed to carry too little information to be useful starting seeds)
  2. Scan the database for hits
    1. When T is high, the search will be rapid, but potentially informative matches will be missed.
  3. Extend the hits
    1. Attempt to extend this match to see if it is part of a longer segment that scores above the MSP score S
    2. Report only those hits that yield a score above S

    From the score S it is also possible to calculate an expectation score E, which is an estimate of how many local alignments of at least this score would be expected given the characteristics of the query sequence and database.

    The original BLAST did not permit gaps, so it would find relatively short regions of similarity, and it was often necessary to extend the alignment manually or with a second alignment tool.

Improvements to BLAST

BLAST has gone through several revisions since first published.

Modify word pair criterion to require two hits within a defined distance of each other

BLAST can screen a database for short HSPs more rapidly than a exact algorithm can screen the database, but this is still a time-consuming part of the search process.

It would help if some HSPs could be discarded quickly.

When a HSP corresponds to a genuine region of alignment, there is a high probability that there will be another HSP located nearby.

With this observation, it is possible to establish a new criterion: search only those HSPs that have a second HSP within a finite distance.

Permit gaps (Gapped BLAST)

The classic BLAST algorithm did not permit insertion of gaps.

Use an iteratively generated, position specific scoring matrix (a profile) to calculate scores (PSI-BLAST)

Pre-processing

Filter low complexity regions

PSI-Blast

Position-specific interative blast

Pros and cons

FASTA

A different algorithm, somewhat older than BLAST, but still very effective.

May outperform BLAST with nucleotide sequences

 


Altschul, S. F. and Gish, W. 1996. Local alignment statistics. Methods Enzymol. 266: 460-480.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. 1990. Basic Local Alignment Search Tool. Journal of Molecular Biology 215: 403-410.

Altschul, S. F., Madden, T. L., Schþffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.

Bioinformatics Home
Syllabus
Links
Reading