Converting Genbank Files to GCG

Molecular Systematics

Frequently Asked Questions

GCG

Using cut and paste to download genbank files and converting these files to GCG format

What other specialty software is available on UMBI?

PAUP

Importing PHYLIP trees into PAUP*

I have found a bunch of sequences in genbank and I want to get them into NEXUS format. How do I do that?

USING GCG

I have found a file of interest by searching genbank with a web browser, but when I try to use fetch to get the sequence, it indicates the accession number can't be found. Is there another way to convert a genbank file to GCG format?

This is probably because the sequences are in "GenBank Updates" rather than "GenBank", and therefore may not be in the version of GenBank that is installed on the UMBI computer. What you can do is download the sequences directly from GenBank by either a) cutting and pasting (do this only if the sequences are short) or b) using Netscape to save them to a file, and then using "Fetch" or some other FTP tool to move them to the UMBI box. Note that there are two programs called 'fetch'. One is a Macintosh program that is used for transferring files via FTP. The second is part of the GCG package, and is used to retrieve files from GenBank.

Cutting and pasting is a simple method of moving a genbank file from a web browser to GCG when you can't use the GCG tool fetch, but you will have to master some more unix commands. If you issue the following command:

cat > newfile

when you press carriage return the computer will start accepting input from the keyboard, so if you then paste, the unix box will put the characters you paste into the file 'newfile'. When you are done with that, press ^d (control-d, which is the unix end-of-file character). At that point you should get the prompt back. Alternatively, download the file to your mac or PC, and then use FTP to move the file to the GCG unix host.

The sequences will then be in GenBank format, *not* GCG format. You can convert them from GenBank to GCG format with a GCG tool called "fromgenbank".

The whole session should look like this (use 'more' to look at the contents of the file you have created):

***********

prompt> cat > newfile

PASTE THE GENBANK FILE IN HERE (IN THIS CASE LOCUS MICSPCOX)

^d

prompt> more newfile

CONTENTS OF THE GENBANK FILE (LOCUS MICSPCOX)

prompt> fromgenbank

FromGenBank reformats one or more sequences in the flat file format of the    
GenBank database into individual sequence files in GCG format.

Reformat what GenBank data file?  newfile

     micspcox.seq   381 bp.

  reformatted: temp
  total files: 1
  total bases: 381

prompt> mv micspcox.seq Coleo-mt-coxII.seq

**********

Note that fromgenbank does not change 'newfile'. Instead, it writes the reformatted sequence to a new file whose name is based on the locus name, in this case micspcox.seq. You will probably want to rename the file created by fromgenbank, which is what the last command shown above (mv) does.
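Because the output file is named after the locus and not after the input file, it can help to list the locus names in a downloaded file before you run fromgenbank. Here is a minimal Python sketch of that idea. The function names are made up for illustration, and the naming rule (locus name in lower case plus ".seq") is only inferred from the one example above, not taken from the GCG documentation.

```python
def locus_names(genbank_text):
    """Return the locus name from every LOCUS line in a GenBank flat file."""
    names = []
    for line in genbank_text.splitlines():
        fields = line.split()
        # Each record starts with a line whose first word is LOCUS;
        # the second word on that line is the locus name.
        if len(fields) >= 2 and fields[0] == "LOCUS":
            names.append(fields[1])
    return names


def expected_seq_files(genbank_text):
    """Guess the file names fromgenbank will create."""
    # Assumption: lower-cased locus name plus ".seq", as in the example above.
    return [name.lower() + ".seq" for name in locus_names(genbank_text)]


# Heavily truncated stand-in for a real record, used only for illustration.
sample = """LOCUS       MICSPCOX     381 bp
ORIGIN
//
"""

print(expected_seq_files(sample))
```

To check a real download, read the file first, for example expected_seq_files(open("newfile").read()).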

The cut and paste approach will only work with relatively small files. If you decide to download a huge file (e.g., the complete yeast genome) use FTP to move the file around. Reformatting the file with fromgenbank is the same in both cases.

CFD 1.18.98

Clustal and how to run long analyses on shared UNIX computers

I have installed Clustalw (another alignment program) on the UNIX cluster and Malign (a simultaneous alignment and tree-building program) on UMBI. If you are logged in as pbio699k, these programs will be available with the commands clustalw and malign, respectively. Documentation is in the "bin" directory:

~pbio699k/bin/clustalwdir/clustalw1.7/clustalw.doc

~pbio699k/bin/maligndir/MALIGN.TXT

If you perform *any* analyses that take more than a few minutes to execute, be sure to run them under "nice":

nice clustalw

For longer runs, where you will need to log out before the run is finished, use "nohup":

(nice nohup malign PARAMFILE < INPUTFILE > OUTFILE) >& STDERR.OUT &

We will discuss what that complex command line means in a few weeks.

CFD 3.2.98

How to import trees calculated with PHYLIP into PAUP

Trees generated by PHYLIP (and related programs, such as fastDNAml and PROTML) can be imported into PAUP* if the taxon names are identical in both the PAUP* (NEXUS) data file and the PHYLIP tree file. If the taxon names are not identical, you will have to use search and replace to change the taxon names in one of the two files.

Like PAUP*, PHYLIP stores trees as sets of nested parentheses, with taxa separated by commas. A branch length follows each taxon (OTU) or group of taxa, and is separated from it by a colon. The tree ends with a semicolon:

((taxon1: 0.209428,taxon2: 0.060451): 0.064360,(taxon3: 0.085505,(taxon4: 0.099318,taxon5: 0.038405): 0.013970)):0.0;
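Two things commonly go wrong with a tree file: the parentheses do not balance, or the taxon names do not match the data file. Both can be checked before you start PAUP*. The following is a rough Python sketch, not a full tree parser. The function names and the taxon list are made up for illustration, the tree reuses the names and branch lengths of the example, and the sketch assumes taxon names contain no spaces, quotes, or punctuation.

```python
import re


def is_balanced(tree):
    """True if every '(' in the tree string has a matching ')'."""
    depth = 0
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def taxon_names(tree):
    """Collect the names that directly follow '(' or ','.

    Branch lengths follow ':' and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", tree)


tree = ("((taxon1: 0.209428,taxon2: 0.060451): 0.064360,"
        "(taxon3: 0.085505,(taxon4: 0.099318,taxon5: 0.038405): 0.013970));")

# Hypothetical list standing in for the taxon names in the NEXUS data file.
nexus_taxa = {"taxon1", "taxon2", "taxon3", "taxon4", "taxon5"}

print("balanced:", is_balanced(tree))
# Names that appear in only one of the two files; an empty list means they agree.
print("mismatched:", sorted(set(taxon_names(tree)) ^ nexus_taxa))
```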

Because the basic structure of the tree file is the same in both programs, a "wrapper" can be added to the tree file to make it acceptable to paup*:

#nexus
begin trees;
utree example=((taxon1: 0.209428,taxon2: 0.060451): 0.064360,(taxon3: 0.085505,(taxon4:0.099318,taxon5: 0.038405): 0.013970)):0.0;
end;
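If you have many tree files, typing the wrapper by hand gets tedious. Here is a minimal Python sketch that adds it for you. The function name and the default tree name are made up for illustration, and the sketch assumes that the text holds a single tree and that no taxon name contains spaces.

```python
def phylip_tree_to_nexus(newick, tree_name="example"):
    """Wrap one PHYLIP tree in a NEXUS trees block that PAUP* can read."""
    # Remove all whitespace, in case the tree is spread over several lines.
    tree = "".join(newick.split())
    if not tree.endswith(";"):
        tree += ";"
    return "#nexus\nbegin trees;\nutree {}={}\nend;\n".format(tree_name, tree)


# Small made-up tree, broken over two lines as a long tree might be.
phylip_text = "((taxon1: 0.2,taxon2: 0.1): 0.05,\n(taxon3: 0.3,taxon4: 0.4));\n"

print(phylip_tree_to_nexus(phylip_text))
```

To convert a file, read it in and write the result out under a new name, so that the original PHYLIP tree file is left untouched.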