By the end of 2002 the GenBank database held over 28 × 10^9 base pairs of DNA sequence data. Some of this has been annotated, but much of it either has no annotations or is incorrectly annotated. How can one find sequences that may be of interest if they have not been annotated? One way is to look for sequences that are similar to a known sequence. Several algorithms have been developed that can search the database for sequences similar to a query sequence.
Among the most important algorithms used to search sequence databases at present (2003) are a family of algorithms based on BLAST, the "Basic Local Alignment Search Tool." BLAST performs particularly well with protein-coding sequences. A second, slightly older algorithm, FASTA, may perform better with non-coding DNA sequences.
Searching a large sequence database is a difficult problem because there are many possible ways in which the query sequence might align with the database. To speed up this process BLAST first looks for short regions ("words") where the query and target sequences match exactly or nearly so, and then examines the sequence that adjoins these regions to see whether the match can be extended into a longer, high-scoring alignment. The extended alignment does not have to match perfectly.
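The seed-and-extend idea described above can be sketched in a few lines of Python. This is a toy illustration only, not NCBI's implementation: the function names, the scoring values (match, mismatch) and the x_drop cutoff are assumptions chosen for the example, and the extension is ungapped.

```python
def find_seeds(query, target, word_size):
    """Return (query_pos, target_pos) pairs where a word matches exactly."""
    # Index every word of the target by its starting position.
    index = {}
    for j in range(len(target) - word_size + 1):
        index.setdefault(target[j:j + word_size], []).append(j)
    # Look up every word of the query in that index.
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, target, qi, tj, word_size,
                match=1, mismatch=-3, x_drop=5):
    """Extend a seed in both directions without gaps.

    Returns (best_score, query_start, query_end) for the best segment found.
    Extension stops once the running score falls more than x_drop below
    the best score seen so far.
    """
    score = best = word_size * match
    best_start, best_end = qi, qi + word_size

    # Extend to the right of the seed.
    q, t = qi + word_size, tj + word_size
    while q < len(query) and t < len(target) and best - score <= x_drop:
        score += match if query[q] == target[t] else mismatch
        q += 1
        t += 1
        if score > best:
            best, best_end = score, q

    # Extend to the left, starting again from the best score so far.
    score = best
    q, t = qi - 1, tj - 1
    while q >= 0 and t >= 0 and best - score <= x_drop:
        score += match if query[q] == target[t] else mismatch
        if score > best:
            best, best_start = score, q
        q -= 1
        t -= 1

    return best, best_start, best_end


query = "TTACGTACGTGG"
target = "CCACGTACGTAA"
for qi, tj in find_seeds(query, target, word_size=4):
    print(qi, tj, extend_seed(query, target, qi, tj, word_size=4))
```

Building an index of words first is what makes the search fast: only the few places where a word matches are examined in detail, rather than every possible alignment.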
The first step in understanding this process is to become familiar with the empirical properties of searching databases with BLAST. The objective of this exercise is to use variants of BLAST to search GenBank and to study how they behave under different conditions.
Consider the following DNA sequence:
ATTTGGAGCATCATGCCTGCAAACTCCGAGAAGGAGCACCTCTCCATCGT
GATTTGCGGCCATGTCGACAGTGGCAAGAGCACCACAACAGGGCGGCTCA
TCTTCGAGCTCGGTGGCCTTCCAGAGCGCGAACTTGACAAGCTGAAGCAG
GAGGCTGAGCGTCTTGGGAAAGGTTCTTTCGCCTTTGCATTCTACATGGA
CCGGCAGAAGGAGGAGCGTGAGCGTGGGGTGACCATCGCTTGCACCACGA
AGGAGTTCTACACCGAGAAGTGGCACTACACAATCATTGATGCACCGGGC
CACCGTGATTTCATCAAGAACATGATCACGGGTGCATCCCAGGCTGATGT
CGCACTCATCATGGTTCCCGCAGACGGAAACTTCACGACAGCAATCGCCA
AGGGCAACCACAAGGCGGGGGAAATCCAGGGCCAGACCAGGCAGCATTCC
CGGCTCATCAACTTGCTTGGCGTGAAGCAGATCTGCATTGGCGTGAACAA
GATGGACTGCGACACGGCGGCATACAAGCAGGCCCGTTATGATGAGATTG
CAAATGAGATGAAGAGCATGCTCGTGAANGTCGGGTGGAAGAAGGACTTT
ATTCGAGAAAACACACCCGTGATGCCCATCT
This is a DNA sequence which has been obtained by arbitrary screening of a cDNA library. We would like to learn more about the sequence. One easy way of getting insight into a sequence is to find out whether or not it resembles sequences that have already been reported in other studies. To do this, we will use the sequence above as a query and use BLAST to compare it to the GenBank database maintained by NCBI (the National Center for Biotechnology Information, a branch of the NIH National Library of Medicine). The actual analysis will be run on a massively parallel supercomputer operated by NCBI as a service to the research community. There are several ways to submit searches to the BLAST server; we will start with the web interface.
Note! It is essential that you understand how different computers interact to perform the analyses that you are carrying out. When you use a web browser to connect to a web site, you are initiating a client/server interaction. Your desktop computer is the client; the computer that is running the web server software is the server (or host). In this case you will be running a computationally intensive task on the host computer, so the apparent speed with which the analysis runs will be a function of the load on the host computer (among other factors).
First, copy the sequence. Then go to the NCBI web site (http://www.ncbi.nlm.nih.gov/; this is also given in the class "links" page), and follow the link for BLAST on the NCBI home page, and then the link for Standard nucleotide-nucleotide BLAST [blastn]. In the space provided, paste the sequence and then click on the button that says BLAST!
The page will be replaced with a page called "formatting BLAST." Notice that it provides you with a blast ID number, an estimate of how long it will take for the results to be returned, and some formatting options.
While you are waiting for your BLAST results to be returned, open up another browser window and explore the NCBI home page. There are many useful resources provided by NCBI, and you will be visiting this site frequently. It pays off to know one's way around it. You should also read the BLAST overview (http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html) and other information linked to the BLAST page.
After waiting for a seemly period of time, go back to the "formatting BLAST" page and click on the FORMAT button. The results of your blast search will be displayed on a new web page. There is information on how to cite this analysis in scientific publications and on the nature of your search, followed by a set of colored lines that illustrate the results of the search, and then text describing the results of the search, and below that more text showing examples of the best matches.
Mouse over the colored lines and notice how the display changes. Look at how this information correlates with the text further down the page, and notice that there are links to the sequences which the query sequence matched. Take some time here and try to look at all of the features on this web page. If you understand these resources well it will save you a lot of time in the future.
What inferences about this sequence can you make from this information?
What is the identity of the sequence?
What gene do you think it encodes?
What organism do you think it comes from?
How reliable do you think this inference is? Why?
Hint: look at the bit score, at the e-value, and at the individual matches (notice that there are links you can follow).
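The bit score and the e-value mentioned in the hint are related: for a bit score S', a query of length m and a database of total length n, the expected number of chance alignments scoring at least that well is approximately E = m × n × 2^(-S'). The sketch below simply evaluates that formula. The function name and the example query length are assumptions made for illustration, and BLAST itself uses "effective" lengths that are somewhat shorter than the raw ones, so the numbers it reports will not match exactly.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate e-value: E = m * n * 2**(-S') for bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative values only: a 600-base query (hypothetical) searched
# against the 28 x 10^9 bases quoted for GenBank at the end of 2002.
for bits in (30, 50, 100, 200):
    e_value = expected_chance_hits(bits, 600, 28e9)
    print(f"bit score {bits:>3}: E is about {e_value:.3g}")
```

Each additional bit of score halves the number of matches expected by chance, which is why a modest difference in bit score corresponds to a very large difference in e-value.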
Recall that the sequence was from a cDNA library. That means that it is probably a protein-coding sequence. BLAST is more sensitive to subtle patterns in amino acid sequences than in nucleotide sequences, so it can be helpful to try a search that takes advantage of the information that this is a protein-coding sequence. We don't know which strand or reading frame encodes the protein, so we will want to search a translation of the sequence in all six possible reading frames (three on each strand) against a protein database.
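To make "all six possible reading frames" concrete, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It is an illustration of the idea, not the code BLAST runs; the function names and the choice to write any codon containing an ambiguous base (such as the N in the query above) as X are assumptions of this example.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, peptide in six_frame_translations("ATTTGGAGCATCATGCCTGCAAAC").items():
    print(label, peptide)
```

A frame that really encodes protein usually shows a long stretch without stop codons (*), while the other frames tend to be interrupted by stops.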
Because you are working with a nucleotide sequence, you will need to perform a translated search. Return to the BLAST home page (http://www.ncbi.nlm.nih.gov/BLAST/) and under Translated BLAST Searches select Nucleotide query - Protein db [blastx].
Notice that there are a number of other options you can select, but don't change any of them.
Submit the search request, and chill out learning more from the site until the results are returned.
Note: BLAST searches submitted via the web site are placed in a queue and given a priority that depends on the number of searches you submit at the same time. If you submit a series of searches from the same computer, each search will take progressively longer. If you want to submit multiple searches, it is best not to use the web interface. We will submit searches via email later in the semester, but if you are anxious to try this now, send an email consisting of the single word HELP to blast@ncbi.nlm.nih.gov.
How do the results differ from the blastn search?
What inferences can you make from the different results in the two searches?
What is the identity of the sequence?
What gene do you think it encodes?
What organism do you think it comes from?
How reliable do you think this inference is? Why?
Why do nucleotide and amino acid searches behave so differently? How do these two types of data differ in the way in which they carry information? Remember that each amino acid is encoded by three nucleotides, so an amino acid sequence has only one-third as many characters as its corresponding nucleotide sequence.
What percent sequence identity would you expect in an alignment (without gaps) of two random DNA sequences?
What about two random amino acid sequences?
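Once you have worked out your own answers to the two questions above, you can check them empirically. The sketch below aligns pairs of random sequences without gaps and reports the fraction of identical positions. It assumes every letter of the alphabet is equally likely, which is a simplification: real DNA and protein sequences have uneven composition. The function name and the sequence length are arbitrary choices for this example.

```python
import random

DNA_ALPHABET = "ACGT"
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_identity(alphabet, length, rng):
    """Fraction of identical positions between two random sequences."""
    first = rng.choices(alphabet, k=length)
    second = rng.choices(alphabet, k=length)
    matches = sum(a == b for a, b in zip(first, second))
    return matches / length


rng = random.Random(2003)  # fixed seed so the run is repeatable
for name, alphabet in (("DNA", DNA_ALPHABET), ("protein", PROTEIN_ALPHABET)):
    identity = random_identity(alphabet, 100_000, rng)
    print(f"{name}: {100 * identity:.1f}% identity")
```

Compare the two numbers with your predictions, and consider what they imply about how much identity is needed before a match is surprising in each kind of search.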
Consider the different options, including parameters, that can be set from the BLAST page. Can you determine what effect each of these will have? Some control the way in which the BLAST results are formatted, while others control how the algorithm itself will function.
Change the word size from 11 to 7 and repeat the BLASTN search. Are the results identical to the word size 11 search? How do the two searches differ? What happens if you use a word size of 15?
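One way to build intuition for the word-size question above is to count how many exact word matches two related sequences share at different word sizes. The sketch below does this for the first 50 bases of the query and a copy in which every tenth base has been changed. The helper names and the mutation scheme are assumptions made for illustration; the real search also involves extension and scoring steps that are not modeled here.

```python
def shared_word_count(query, target, word_size):
    """Count query positions whose word also occurs somewhere in target."""
    target_words = {target[j:j + word_size]
                    for j in range(len(target) - word_size + 1)}
    return sum(query[i:i + word_size] in target_words
               for i in range(len(query) - word_size + 1))


def mutate_every(sequence, step):
    """Return a copy with every step-th base replaced by a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    bases = list(sequence)
    for i in range(step - 1, len(bases), step):
        bases[i] = swap.get(bases[i], "A")
    return "".join(bases)


if __name__ == "__main__":
    original = "ATTTGGAGCATCATGCCTGCAAACTCCGAGAAGGAGCACCTCTCCATCGT"
    diverged = mutate_every(original, 10)  # 90% identical to the original
    for word_size in (7, 11, 15):
        hits = shared_word_count(original, diverged, word_size)
        print(f"word size {word_size:>2}: {hits} shared words")
```

A search can only find a match if at least one word is shared, so the word size sets a trade-off between sensitivity (small words find more distant matches) and speed (small words produce many more chance hits to examine).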
Additional unknown sequences are available from past homework assignments linked to the class home page. Pick one of these sequences and repeat the searches listed above. What observations can you make about how to use BLAST most effectively?
Running BLAST from a command line interface
NCBI makes available a BLAST client, blastcl3, that can be used to launch BLAST searches from a local computer without using a web interface. Although this takes somewhat more thought than using the web interface, it is much easier to automate, and consequently is preferred for analyses of multiple sequences.
A second BLAST client, NetBLAST, is part of the GCG analytical package. We will use this later in the semester.
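As a rough illustration of why a command-line client is easier to automate, the Python sketch below writes each query to a FASTA file and builds one command line per sequence. The commands are only printed, not run. The flags shown (-p program, -d database, -i input file, -o output file) are an assumption based on the conventions of NCBI's blastall program, and the sequence names and file names are hypothetical; check the README distributed with blastcl3 before relying on them.

```python
def to_fasta(name, sequence, width=60):
    """Format a sequence as FASTA text with lines of at most width characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def blast_command(program, database, query_path, output_path):
    """Build an argument list for blastcl3 (blastall-style flags assumed)."""
    return ["blastcl3", "-p", program, "-d", database,
            "-i", query_path, "-o", output_path]


# Hypothetical query names and (truncated) sequences for illustration.
queries = {
    "unknown1": "ATTTGGAGCATCATGCCTGCAAACTCCGAGAAGGAGCACCTCTCCATCGT",
    "unknown2": "GATTTGCGGCCATGTCGACAGTGGCAAGAGCACCACAACAGGGCGGCTCA",
}

for name, sequence in queries.items():
    query_path = name + ".fasta"
    with open(query_path, "w") as handle:
        handle.write(to_fasta(name, sequence))
    # "nt" is assumed here as the name of the nucleotide database.
    command = blast_command("blastn", "nt", query_path, name + ".blastn.txt")
    print(" ".join(command))
```

The same loop works for two sequences or two hundred, which is the practical advantage over pasting each one into a web form.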