Using cut and paste to download genbank files and converting these files to GCG format
What other specialty software is available on UMBI?
Importing PHYLIP trees into PAUP*
I have found a bunch of sequences in genbank and I want to get them into NEXUS format. How do I do that?
I have found a file of interest by searching genbank with a web browser, but when I try to use fetch to get the sequence, it indicates the accession number can't be found. Is there another way to convert a genbank file to GCG format?
This is probably because the sequences are in "Genbank Updates" rather than "Genbank", and therefore may not be in the version of genbank installed on the UMBI computer. You can instead download the sequences directly from genbank, either a) by cutting and pasting (do this only if the sequences are short) or b) by using netscape to save them to a file and then using "fetch" or some other FTP tool to move them to the UMBI box. Note that there are two programs called 'fetch'. One is a Macintosh program used for transferring files via FTP. The other is part of the GCG package and is used to retrieve files from genbank.
Cutting and pasting is a simple method of moving a genbank file from a web
browser to GCG when you can't use the GCG tool fetch, but you will have to
master some more unix commands. If you issue the following command:
cat > newfile
and press Return, the computer will start accepting input from the keyboard,
so whatever you then paste will go into the file 'newfile'. When you are done,
press ^d (control-d, which is the unix end-of-file character) at the start of
a new line. At that point you should get the prompt back. Alternatively,
download the file to your mac or PC, and then use FTP to move the file to the
GCG unix host.
The sequences will then be in Genbank format, *not* GCG format. You can convert
them from genbank to GCG format with a GCG tool called "fromgenbank". The whole
session should look like this (use 'more' to look at the contents of the file
you have created):
***********
prompt> cat > newfile
PASTE THE GENBANK FILE IN HERE (IN THIS CASE LOCUS MICSPCOX)
^d
prompt> more newfile
CONTENTS OF THE GENBANK FILE (LOCUS MICSPCOX)
prompt> fromgenbank
FromGenBank reformats one or more sequences in the flat file format of the
GenBank database into individual sequence files in GCG format.
Reformat what GenBank data file? newfile
micspcox.seq 381 bp.
reformatted: temp
total files: 1
total bases: 381
prompt> mv micspcox.seq Coleo-mt-coxII.seq
**********
Note that fromgenbank does not modify 'newfile'. It writes the reformatted sequence to a new file whose name is based on the locus name, in this case micspcox.seq. You will probably want to rename that file, which is what the last command shown above (mv) does.
The cut and paste approach will only work with relatively small files. If you decide to download a huge file (e.g., the complete yeast genome) use FTP to move the file around. Reformatting the file with fromgenbank is the same in both cases.
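Because the output name comes from the LOCUS line, it can help to look at that line before running fromgenbank. The following is a minimal sketch, assuming a Bourne-type shell with awk, and assuming (from the one example above) that the output name is the locus name in lower case plus ".seq". The function name locus_to_seqname is made up for illustration and is not part of GCG.

```shell
# Hypothetical helper: print the file name we expect fromgenbank to create
# for each LOCUS line in a genbank flat file.
# Assumption: name = lower-cased locus name + ".seq", as in the example above.
locus_to_seqname() {
    awk '$1 == "LOCUS" { print tolower($2) ".seq" }' "$1"
}
```

Typing "locus_to_seqname newfile" after defining the function prints one expected file name per sequence in 'newfile'.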
CFD 1.18.98
What other specialty software is available on UMBI?
I have installed Clustalw (another alignment program) on the UNIX cluster and Malign (a simultaneous alignment and tree building program) on UMBI. If you are logged in as pbio699k, these programs are available with the commands clustalw and malign, respectively. Documentation is in the "bin" directory:
~pbio699k/bin/clustalwdir/clustalw1.7/clustalw.doc
~pbio699k/bin/maligndir/MALIGN.TXT
If you perform *any* analyses that take more than a few minutes to execute, be sure to run them under "nice":
nice clustalw
For longer runs, where you will need to log out before the run is finished, use "nohup":
(nice nohup malign PARAMFILE < INPUTFILE > OUTFILE) >& STDERR.OUT &
We will discuss what that complex command line means in a few weeks.
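The command above uses csh syntax (">&"). For anyone whose login shell is a Bourne-type shell (sh, ksh, bash), a roughly equivalent form is sketched below. This is only an illustration: "sort" stands in for malign so that the example runs anywhere, and the file names are the placeholders from the command above.

```shell
# Bourne-shell sketch of the csh command above.
# Assumption: "sort" is a stand-in for the real program (e.g. malign PARAMFILE).
printf 'taxon2\ntaxon1\n' > INPUTFILE

# nice  : run at low priority
# nohup : keep running after you log out
# <  >  : read INPUTFILE, write normal output to OUTFILE
# 2>    : write error messages to STDERR.OUT
# &     : run in the background so you get the prompt back
nice nohup sort < INPUTFILE > OUTFILE 2> STDERR.OUT &

wait    # only so this demonstration finishes; normally you would just log out
```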
CFD 3.2.98
Importing PHYLIP trees into PAUP*
Trees generated by phylip (and related programs, such as fastDNAml and PROTML) can be imported into paup* if the taxon names are identical in the paup* (NEXUS) data file and the phylip tree file. If the names differ, you will have to use search and replace to make them match.
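The search and replace can be done in any text editor. As one possible approach, here is a small sketch using sed, assuming a Bourne-type shell and a sed that accepts -E (GNU and BSD sed both do). The function name rename_taxon and the taxon names in the usage line are invented for illustration. The pattern only changes a name that starts a line or follows "(" or ",", and that is immediately followed by ":", so a name that is part of a longer name is left alone. Names containing characters that are special to sed (such as "." or "/") would need escaping.

```shell
# Hypothetical helper: rename one taxon in a phylip tree file.
# $1 = old name, $2 = new name, $3 = tree file; the result goes to stdout.
rename_taxon() {
    sed -E "s/(^|[(,])$1:/\1$2:/g" "$3"
}
```

For example, "rename_taxon Hsap Homo_sapiens treefile > treefile.renamed" writes a copy of the tree in which Hsap has become Homo_sapiens.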
Like paup*, phylip stores trees as sets of nested parentheses, with taxa separated by commas. Each OTU name, and each closing parenthesis of a group, is followed by a colon and a branch length. The tree ends with a semicolon:
((taxon1: 0.209428,taxon2: 0.060451): 0.064360,(taxon3: 0.085505,(taxon4:
0.099318,taxon5: 0.038405): 0.013970):0.0);
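Since a tree is a set of nested parentheses, a dropped parenthesis (easy to lose when editing or pasting) makes the file unreadable. A quick sanity check is sketched below, assuming a Bourne-type shell. The function name parens_balanced is made up for illustration. Equal counts are necessary but not sufficient for a valid tree, so this catches only the most common damage.

```shell
# Hypothetical helper: succeed only if the tree file contains as many
# "(" characters as ")" characters.
parens_balanced() {
    open=$(( $(tr -cd '(' < "$1" | wc -c) ))
    close=$(( $(tr -cd ')' < "$1" | wc -c) ))
    [ "$open" -eq "$close" ]
}
```

It can be used as "parens_balanced treefile && echo ok", where treefile is whatever your tree file is called.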
Because the basic structure of the tree file is the same in both programs, a "wrapper" can be added to the tree file to make it acceptable to paup*:
#nexus
begin trees;
utree example=((taxon1: 0.209428,taxon2: 0.060451): 0.064360,(taxon3: 0.085505,(taxon4:0.099318,taxon5: 0.038405): 0.013970):0.0);
end;
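Adding the wrapper by hand is easy for one tree. If you do it often, it can be scripted. The following is a minimal sketch, assuming a Bourne-type shell and a tree file that holds a single tree. The function name phylip_to_nexus_tree is invented for illustration. It joins the wrapped lines of the tree into one line and prints the same four-line wrapper shown above.

```shell
# Hypothetical helper: wrap a phylip tree file in a NEXUS trees block.
# $1 = phylip tree file (one tree), $2 = name to give the tree
phylip_to_nexus_tree() {
    echo "#nexus"
    echo "begin trees;"
    # tr removes the line breaks so the whole tree sits on one line
    printf 'utree %s=%s\n' "$2" "$(tr -d '\n' < "$1")"
    echo "end;"
}
```

For example, "phylip_to_nexus_tree treefile example > example.tre" writes the wrapped tree to a new file (both file names here are placeholders).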