BSCI 380 Laboratory

Exercises
·Unix Introduction
·BLAST
·PERL
·Genbank
·BLAST, GCG
·GCG
·Seqlab
·Synthesis
·MSA
·Paup
·Phylogeny
·Examine

·An editor primer
·A GCG cheatsheet
·Flat2fasta homework
·Dynamic Programming homework
·High scoring words homework
·GCG homework
·Seqlab homework
·Mystery sequence homework
·Paup homework

First exposure to GCG and more BLAST

Familiarity with the Unix family of operating systems.
The ability to work with the Blast tools.
Facility with an Unix editor.

Objectives

Start accessing the Accelerys GCG package. Using these tools acquire sequence information from NCBI and perform a BLAST search using the Unix command line.

Introduction

GCG is a venerable piece of software which provided perhaps the first comprehensive array of tools for the scientist seeking to perform bioinformatic analyses. Even though it is showing its age, it still provides a tremendous array of tools for the research in a unified interface.

Starting GCG

GCG is a piece of software which runs on locus, and so before we start experimenting with gcg, we will need to log in via ssh. Here is what happens when I log in:

Logging into Locus

$ ssh abelew@locus.umiacs.umd.edu abelew@locus.umiacs.umd.edu's password: DCE credentials established. Last login: Tue Oct 5 01:06:43 2004 from pcp07718798pcs. Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001 1:18am up 313 days, 12:06, 4 users, load average: 0.02, 0.01, 0.02 Welcome to the WISCONSIN PACKAGE Version 10.3-UNIX Installed on solaris Copyright (c) 1982 - 2001, Accelrys Inc. A wholly owned subsidiary of Pharmacopeia, Inc. All rights reserved. Published research assisted by this software should cite: Wisconsin Package Version 10.3, Accelrys Inc., San Diego, CA Databases available: GenBank Release 126.0 (10/2001) GenPept Release 126.0 (10/2001) EMBL (Abridged) Release 68.0 (09/2001) PIR-Protein Release 69.0 (06/2001) NRL_3D Release 28.0 (01/2001) SWISS-PROT Release 36.0 (12/1998) SP-TREMBL Release 18.0 (10/2001) PROSITE Release 16.48 (10/2001) Pfam Release 6.6 (08/2001) Restriction Enzymes (REBASE) (11/2001) Technical support see: http://www.accelrys.com/support/ Online help: % genhelp or http://www.accelrys.com/support/bio/genhelp/ When you log in, however, gcg will not spill out its helpful startup text. In order for that to occur during login, I made the following change to my .login file, as you may recall, (ba|tc)sh keep settings in the .(ba|tc)shrc file to be kept over multiple invocations of the shell; however for those programs one wishes to start only upon the initial startup, one should edit the .login so that it contains the following: $ more .login source /opt/gcg/gcgstartup source /opt/gcg/genetics Read over the files /opt/gcg/gcgstartup and /opt/gcg/genetics to get a sense of what is going on. Now let us have some fun.
Use emacs and/or vi in order to make some customizations to your shell settings. Change the PATH environment variable to include some useful but missing directories like "/usr/openwin/bin< /usr/local/bin /opt/clustalx1.83"

Getting sequence into GCG

Gcg uses yet another format for holding sequence information. Check the user's manual of genhelp for how to make sequences into the gcg format. Grab a sequence from here and perform something like the following:

Logging into Locus

$  vi test.fasta
>example
IWSIMPANSE KEHLSIVICG HVDSGKSTTT GRLIFELGGL PERELDKLK
EAERLGKGSF AFAFYMDRQK EERERGVTIA CTTKEFYTEK WHYTIIDAPG
HRDFIKNMIT GASQADVALI MVPADGNFTT AIAKGNHKAG EIQGQTRQHS
RLINLLGVKQ ICIGVNKMDC DTAAYKQARY DEIANEMKSM LVXVGWKKDF
IRENTPVMPI

If you are uncertain about how to create this file, please refer back to the
editor tutorial.

$  ls
test.fasta
$  fromfasta test.fasta
 example.pep  210 aa.

 Finished FROMFASTA with 1 files written.
 210 sequence characters were reformatted.

FromFastA reformats one or more sequences from FastA format into
single sequence files suitable for use in GCG.

$  ls
example.pep  test.fasta
$  more example.pep
!!AA_SEQUENCE 1.0
EXAMPLE TRANSLATE of: example.seq
example.pep  Length: 210  October 5, 2004 11:21  Type: P  Check: 2967  ..
       1  IWSIMPANSE KEHLSIVICG HVDSGKSTTT GRLIFELGGL PERELDKLK
      51  EAERLGKGSF AFAFYMDRQK EERERGVTIA CTTKEFYTEK WHYTIIDAPG
     101  HRDFIKNMIT GASQADVALI MVPADGNFTT AIAKGNHKAG EIQGQTRQHS
     151  RLINLLGVKQ ICIGVNKMDC DTAAYKQARY DEIANEMKSM LVXVGWKKDF
     201  IRENTPVMPI
$  mv example.pep test.pep

Now that we have some information in the correct format we can play with it using the tools provided by gcg. reformat is another program which may be used to import information into gcg. In order to make changes to some sequence information, use seqed. Using seqed is very similar to using vi, so spend a few minutes working with seqed while reading the genhelp documentation. Some important commands include: fetch, assemble, and reformat. (net)Fetch acquires sequence from the local database or from others, assemble can be used to link parts of sequences or sequences from different files into a single sequence and reformat is used to maintain a sequence file's identity within gcg.
Gcg maintains a timestamp checksum of the current state of an input file, so if you edit your sequence file outside of gcg, you will need to run reformat to return it to the 'good graces' of gcg:

Reformat Example

$ vi test.pep

Using vi, add some information to the comment field, which is before the '..' 
in GCG formatted sequence files.

$ vi test.pep
!!AA_SEQUENCE 1.0
EXAMPLE TRANSLATE of: example.seq                                      |
HERE IS A NEW COMMENT THAT I ADDED BEFORE THE '..' RIGHT ............. v
example.pep  Length: 210  October 5, 2004 11:21  Type: P  Check: 2967  ..
       1  IWSIMPANSE KEHLSIVICG HVDSGKSTTT GRLIFELGGL PERELDKLK
....
$  reformat test.pep
$  more test.pep
!!AA_SEQUENCE 1.0
EXAMPLE TRANSLATE of: example.seq                                      |
HERE IS A NEW COMMENT THAT I ADDED BEFORE THE '..' RIGHT ............. v
example.pep  Length: 210  October 5, 2004 11:23  Type: P  Check: 2969  ..
       1  IWSIMPANSE KEHLSIVICG HVDSGKSTTT GRLIFELGGL PERELDKLK
....

Pay close attention to the time on the fourth line.

Downloading sequences with gcg

We now know how to import our own sequences into gcg, but what if we want to work with all the tufa sequences in the existing local database?

Reformat Example

$ fetch *tufa* battufa.gb_ba ... $ ls attufa.gb_pl crtufa.gb_pl ettufagen.gb_in sttufa.gb_ba test.fasta test.pep battufa.gb_ba ecotufa.gb_ba gmtufa.gb_pl test.blastp test.netblastp tobtufa.gb_pl coocptufa.gb_pl ectufa.gb_pl gmtufa2.gb_pl test.blastpgp test.out tobtufa2.gb_pl

Or we can get sequences from ncbi by accession number. Check out the genhelp documentation for netfetch and get some sequences from ncbi.

Using Blast from within gcg

Now that we have some information inside gcg, let us perform a blast search without having to go to ncbi.

A Locus Blast search

$  which blast2
blast2:          aliased to ( set noglob; $GCGUTILDIR/blast2 -programname=blast !* )

set noglob tells the shell to not accept * as referring to more than one file
$GCGUTILDIR is an environment variable set for us by the source gcgstartup

$  blast2 test.pep
BLAST searches one or more nucleic acid or protein databases
for sequences similar to one or more query sequences of any
type. BLAST can produce gapped alignments for the matches it
finds.

                  Begin (* 1 *) ?
                End (*   210 *) ?

 Search for query in what sequence database:

   1) pir     p Protein Information Resource
   2) nrl_3d  p NRL_3D Protein Sequence-Structure Database
   3) swplus  p SWISS-PROT + SP-TREMBL
   4) genembl n GenBank + EMBL (HTGs Removed)
   5) htg     n High Throughput Genomes (HTG from GenBank and EMBL)
   6) genpept p GenPept (Translated GenBank)

 Please choose one (* 1 *):  6

 Ignore hits expected to occur by chance more than (* 10.0 *) times?

 Limit the number of sequences in my output to (* 500 *) ?

 What should I call the output file (* test.blastp *) ?

This search was done using a copy of the ncbi databases which are local to the machine locus. As a result they do not have all the same pieces of sequence information as the databases at ncbi. To amend this potential problem, gcg also has netblast (which does not appear to work!); so instead try out the blastcl3, which does not accept the gcg format, so we will instead use our original fasta formatted file. Compare your experience with that from using the ncbi web site blast.

Using NCBI's Blast locally

$  blastcl3 -i test.fasta -o test.out -p blastp
$  more test.out
BLASTP 2.2.9 [May-01-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

lots of output...

How are the outputs from a local blast database different from those located at ncbi? Why? Make a backup copy of your input sequence (cp test.fasta test.fasta.keep) and try out the gcg version of fasta

GCG's fasta

$  fasta test.pep
FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST.

... lots of output
$  more test.fasta
!!SEQUENCE_LIST 1.0

(Peptide) FASTA of: test.pep  from: 1 to: 210  October 5, 2004 11:47

EXAMPLE TRANSLATE of: example.seq
What the heck is this thing?

 TO: PIR:*  Sequences:    232,624  Symbols:    80,607,033  Word Size: 2

 Databases searched:
   NBRF, Release 68.0, Released on 31Mar2001, Formatted on 24Oct2001

 Scoring matrix: GenRunData:blosum50.cmp
 Variable pamfactor used
 Gap creation penalty: 12  Gap extension penalty: 2

Histogram Key:
 Each histogram symbol represents 348 search set sequences
 Each inset symbol represents 9 search set sequences
 z-scores computed from opt scores
 ... lots of information

What does the header information mean?

A Reintroduction to Blast

As we discussed in class, the matrix used for calculating the scores of an alignment defines the hit found in the alignment and its suitability. Thus one may want to ask the questions: What if the assumptions made by an amino acid substitution matrix are violated by the query sequence? What if one segment of a sequence is under stronger selection than another segment? Altschul and friends sought to amend these problems with the Position Specific Iterative version of Blast. The matrix created by this algorithm is position specific; thus each amino acid receives its own series of substition odds. These individual position matrices are in turn modified after each iteration of the algorithm by choosing which hits of the database to contribute to the position specific substitution matrix.

Here are a few sequences to experiment with; try the following:

Check out the blast documentation to read up on psi-blast.
Perform a 'normal' blast search of the sequences. What families of proteins appear?
Perform a psiblast upon the same sequences. What appears in the first iteration? Second?
Perform separate iterations of psiblast, one in which only the best matches are included in the second, third, nth iterations; another in which only the worst matches are included in the second and third iterations. What happens? How many hits appear in each iteration? How specific are the hits in each case? Why?

Further Exercises

This exercise uses sequences of triose phosphate isomerase (TPI) from mosquitoes assumptions an example.
Review the genhelp document for fetch
Fetch the assorted mosquito TPI sequences of interest.
Supply an informative filename for each sequence, including: culex-tpi.gbk

Reformat Example

$ fetch L07390 -out=culex-tpi.gbk
Assemble the Culex TPI gene

$  assemble

Now perform the following operations:

Type in the name of the source sequence.
Enter the beginning and end points for the first exon
select "add another segment"
enter the endpoints for the second exon
write the assembled sequence out to a new file, perhaps "culex-tpi.seq"

If you are going to use your sequences for multiple sequence alignment, check to be sure that your other sequences are all about the same length, and do not include any intervening or flanking seqeunces. Multiple sequence alignment programs often work best when the sequences are all about the same length.

You can also run assemble with command-line syntax. This is helpful for processing large numbers of sequences, because scripts can be used to automate the process. Consult genhelp for the command-line syntax.

A Unix command refresher

It has been a couple of weeks since your crash course with using Unix. Take a moment to refresh your memory.

Reformat Example

$  man ls
For help on any unix command, use the manual.

$  apropos ls
ldapadd [ldapmodify] (1)  - LDAP modify entry and LDAP add entry tools
ldapmodify           (1)  - LDAP modify entry and LDAP add entry tools
lev_comp             (6)  - NetHack special levels compiler
lfind [lsearch]      (3)  - linear search of an array
lock [unimplemented] (2)  - unimplemented system calls
ls                   (1)  - list directory contents
lsattr               (1)  - list file attributes on a Linux second extended file system
lsearch              (3)  - linear search of an array
lsearch              (n)  - See if a list contains a particular element
lseek                (2)  - reposition read/write file offset
lseek64              (3)  - reposition 64-bit read/write file offset
lset                 (n)  - Change an element in a list
lsetxattr [setxattr] (2)  - set an extended attribute value
lsmod                (8)  - list loaded modules
 ....
If you don't know the name of a command, apropos can help
you find it.
$  ls -aList every file, even those which begin with .

$  ls -FC
List files with special characters to mark directories etc

$  cd
Change directories, by default taking one 'home'.

$  cp oldfile newfile
$  cp olddirectory/oldfile newdirectory/newfile
Copy files around the local computer.

$  sftp username@hostname
Copy files from one computer to another.

Performing a pairwise alignment in GCG

GCG provides implementations of the most common dynamic programming algorithms for pairwise sequence alignment.

gap (Needleman-Wunsch algorithm) will find the globally optimal alignment of two sequences of similar length. gap will always completely align the two sequences, so you must know in advance that the sequences should be aligned along their full length.

bestfit (Smith-Waterman algorithm) will find the optimal local alignment of two sequences. bestfit is most effective for aligning two sequences that are known to be homologous, but are somewhat different in length.

fasta (Pearson-Lipman algorithm) uses heuristic methods to find matches between a query sequence and a sequence database. Performs well with nucleotide data, but generally slower than blast.

blast (BLAST algorithm) uses fast heuristic methods to find matches between a query sequence and a local sequence database. Several variants of blast are available for different purposes. blast can rapidly search large databases.

Use genhelp to review the syntax for each command and perform some alignments using the mosquito TPI genes previously downloaded.

Created: Wed Sep 15 00:58:22 EDT 2004 by Charles F. Delwiche
Last modified: Mon Nov 8 15:49:44 EST 2004 by Ashton Trey Belew.