Working with GenBank

We have used BLAST to find matches to a query sequence in the GenBank database that is maintained by NCBI. In this exercise we will use Entrez to explore GenBank and look at other resources provided by NCBI. Entrez is NCBI's integrated search and retrieval system, and it provides powerful tools for finding and exploring the information in these databases.

First, simply find a gene of interest. If there are very few similar sequences, this is very easy. For example, it would seem that not much of anyone really cares about the (very pretty) green alga Draparnaldia.

Go to the NCBI web site and use Entrez to search GenBank with the name Draparnaldia.

To do this you can just go to NCBI, type draparnaldia in the search box, be sure that GenBank is selected in the pop-up menu, and click on go.

What have you found?

What gene sequences are available for Draparnaldia?

Follow the link and examine the GenBank file. This is a hyperlinked version of the GenBank flat file format.

Your textbook has information on the flat file format and other formats used by GenBank. A great deal of additional information is available on the NCBI website. It is very important that you become comfortable reading these files and understanding the information in them.

Notice that there are links on this page. Follow the link in the Medline field. What does it show you? Return to the flat file.

There are two sequences shown at the bottom of the file. How are they related to each other? Which of these is primary data, and which is secondary? What does that mean?

Unfortunately, the size of the database makes most such searches much more complex than finding Draparnaldia sequences. Try searching for genes from the plant Lycopersicon. Notice that you are now using Entrez set to search the nucleotide database. If you want to limit your search to the nucleotide database from the outset, you can select Entrez from the menu bar on the NCBI home page, and then select nucleotide from the Entrez menu bar.

How did your search compare with the search on Draparnaldia?

Is there a tufA sequence available for Lycopersicon? How can you find out?

This is not as easy as it sounds. Try searching on lycopersicon AND tufA

What did you find? Why?

When you perform an unrestricted search, Entrez will retrieve a record if the search words appear anywhere in that record. Consequently, a successful search needs to be constructed carefully so that it finds only the records you really want.

You can limit your search to specific fields by entering the search term in the format
term [field] OPERATOR term [field]

Entrez permits the use of Boolean operators: AND, OR, and NOT are all supported. The Boolean operator must be entered in capital letters, but the search terms themselves are not case sensitive. For more information on how Entrez handles Boolean operators, you can click on Entrez nucleotide help, and look for the section on refining your search.
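The query format above can be sketched in a few lines of Python. This is only an illustration of how the pieces fit together, not part of Entrez itself: the helper names fielded and combine are made up for this example, and the field tags Organism and Gene Name are assumed to be the ones you want (check the Entrez help for the tags available in the database you are searching).

```python
def fielded(term, field=None):
    """Return a search term, optionally restricted to one field tag."""
    # Quote multi-word terms so they are treated as a single phrase.
    text = f'"{term}"' if " " in term else term
    return f"{text}[{field}]" if field else text


def combine(operator, *terms):
    """Join terms with a Boolean operator, forcing it to capital letters."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {op} ".join(terms)


# Hypothetical example: restrict each word to the field it belongs in.
query = combine(
    "and",
    fielded("Lycopersicon", "Organism"),
    fielded("tufA", "Gene Name"),
)
print(query)  # Lycopersicon[Organism] AND tufA[Gene Name]
```

The printed string can be pasted into the Entrez search box. Compare the results with those from the unrestricted search on the same two words.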

What other elongation factors are available in GenBank? Search on elongation factor.

Is this better or worse? Why?

When you use two or more words in a search, they are combined by default with AND.

This search will show you several qualitatively different types of sequence. Identify some of the categories of sequence you have found.

You can limit your search to specific types of sequences. Click on the button that says limits, look at what your options are, click on the check box that says all of the above, and then repeat the search.

How has this affected the results of your search?

This is better, but it still isn't giving us the information that we want.

Entrez also maintains a history of your searches, and allows you to combine and modify previous searches. Click on the button that says history, and examine the information that appears. What you want to do is combine the search on Lycopersicon with the limited search on elongation factor. You can do this by referring to the elements in your history by their number. Compose a search, and notice that there is a new button that says preview. Once your search has been composed, click on preview and see what happens. If your search is composed correctly you should have roughly 20 sequences. Once your search is composed correctly, perform the search and look at the files retrieved.
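To see what combining history entries means, here is a minimal Python sketch. It assumes that history entries are referred to as #1, #2, and so on, and it uses a made-up history table: the numbers in your own session will almost certainly differ, and any limits attached to a search are not modelled here. The function name expand is hypothetical.

```python
import re

# Hypothetical search history: entry number -> the query that was run.
# Your own history will have different numbers and contents.
history = {
    1: "Lycopersicon",
    2: "elongation factor",
}


def expand(expression, history):
    """Replace each #n in the expression with the query it stands for."""
    def lookup(match):
        number = int(match.group(1))
        if number not in history:
            raise ValueError(f"no history entry #{number}")
        return f"({history[number]})"

    return re.sub(r"#(\d+)", lookup, expression)


combined = "#1 AND #2"
print(expand(combined, history))  # (Lycopersicon) AND (elongation factor)
```

The combined search retrieves only the records that appear in both of the earlier result sets, which is why it returns far fewer sequences than either search alone.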

How could you improve this search?

The gene encoding the protein triose-phosphate isomerase (TPI) has been used in several studies of intron evolution. The distribution of introns in mosquito TPI genes has been of particular interest. We would like to identify which TPI gene sequences have been determined for the insect order Diptera (flies).

How many TPI genes from Diptera can you find?

How many of these are from flies other than Drosophila?

How does the intron content of these genes compare?

Dotplot on the web:

A nice online implementation of dotplot is available at:

http://entelechon.de/bionautics/dotplot.php3

The structure of a GenBank file

LOCUS: Includes a unique identifier (often the accession number), and information on the length of the sequence, its data type, and modification date
DEFINITION: Provides brief information about the biological attributes of the record. It should include the genus and specific epithet of the source organism, the product name and gene abbreviation, and information about the nature of the sequence.
ACCESSION: This is a unique number associated with the record. This number will always be associated with the record, and is the number that you should record to keep track of which records you have worked with.
NID and VERSION: These identifiers help keep track of changes that have been made to a record. For a number of reasons, a record may be modified. In this case, the accession number will remain the same, but tracing the history of changes is possible with information from these lines.
KEYWORDS: Keywords should be selected from a list of standardized terms. Unfortunately this has not always been the case, and the keywords entry is not always useful.
SOURCE ORGANISM: Contains information about what organism the sequence was determined from, as well as taxonomic information about that organism. This is far more complex than it might seem at first, but GenBank is making a concerted effort to have this information be consistent and useful. The GenBank taxonomy browser is a very useful tool available on the GenBank web site.
REFERENCE: Information on where the sequence was published. Some records will have several references associated with them. All GenBank entries are required to have at least one reference.
FEATURES: The feature table includes a number of sub-headings, and often has extensive information about the biology of the sequence. In the NCBI model, features refer to specific parts of the sequence, while descriptors refer to the entire sequence. The feature table can contain a great deal of information, and if carefully constructed by the original author can be tremendously helpful. Unfortunately, feature tables can also contain useless or incorrect information, and some complex sequences are barely annotated.
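The layout described above can be explored with a short Python sketch that splits a flat file into its top-level sections. It relies on one assumption about the format: top-level keywords begin in the first column, and continuation lines are indented. The record below is a made-up toy example (TOY00001 is not a real accession), and the function name split_sections is hypothetical. For real work a tested parser is a better choice than this sketch.

```python
# A made-up toy record laid out like a GenBank flat file.
# It is NOT a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY00001     24 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example record, not a real GenBank entry.
ACCESSION   TOY00001
VERSION     TOY00001.1
KEYWORDS    .
SOURCE      Example organism
  ORGANISM  Example organism
            Eukaryota; Viridiplantae.
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggctaaag aaaaatttga ccgt
//
"""


def split_sections(record):
    """Group the lines of a flat file under their top-level keywords."""
    sections = {}
    current = None
    for line in record.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line and not line[0].isspace():
            # A keyword in the first column starts a new section.
            keyword, _, rest = line.partition(" ")
            current = keyword
            sections[current] = [rest.strip()]
        elif current is not None:
            # Indented lines continue the current section.
            sections[current].append(line.strip())
    return sections


sections = split_sections(TOY_RECORD)
print(list(sections))

# The sequence lives under ORIGIN; keep only the letters.
sequence = "".join(ch for ch in "".join(sections["ORIGIN"]) if ch.isalpha())
print(len(sequence))  # 24
```

Note that ORGANISM ends up inside the SOURCE section because it is indented, which matches the way the two are described together above.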

Additional Reading

Gibas and Jambeck, Chapter 6.

Baxevanis & Ouellette, Chapter 5.

http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/helpdoc.html
