|
First exposure to GCG and more BLAST
Prerequisites
|
Objectives
|
Starting GCG
|
Importing Sequences
|
Downloading Sequences
|
GCG and Blast
|
A Reintroduction to Blast
|
Further Exercises
|
Unix Redux
|
GCG pairwise alignment
|
Prerequisites
Objectives
Start accessing the Accelerys GCG package. Using these tools acquire
sequence information from NCBI and perform a BLAST search using the Unix
command line.
Introduction
GCG is a venerable piece of software which provided perhaps the first
comprehensive array of tools for the scientist seeking to perform
bioinformatic analyses. Even though it is showing its age, it still
provides a tremendous array of tools for the research in a unified
interface.
Starting GCG
GCG is a piece of software which runs on locus, and so before we
start experimenting with gcg, we will need to log in via
ssh. Here is what happens when I log
in:
Logging into Locus |
$ ssh abelew@locus.umiacs.umd.edu
abelew@locus.umiacs.umd.edu's password:
DCE credentials established.
Last login: Tue Oct 5 01:06:43 2004 from pcp07718798pcs.
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
1:18am up 313 days, 12:06, 4 users, load average: 0.02, 0.01, 0.02
Welcome to the WISCONSIN PACKAGE
Version 10.3-UNIX
Installed on solaris
Copyright (c) 1982 - 2001, Accelrys Inc.
A wholly owned subsidiary of Pharmacopeia, Inc. All rights reserved.
Published research assisted by this software should cite:
Wisconsin Package Version 10.3, Accelrys Inc., San Diego, CA
Databases available:
GenBank Release 126.0 (10/2001)
GenPept Release 126.0 (10/2001)
EMBL (Abridged) Release 68.0 (09/2001)
PIR-Protein Release 69.0 (06/2001)
NRL_3D Release 28.0 (01/2001)
SWISS-PROT Release 36.0 (12/1998)
SP-TREMBL Release 18.0 (10/2001)
PROSITE Release 16.48 (10/2001)
Pfam Release 6.6 (08/2001)
Restriction Enzymes (REBASE) (11/2001)
Technical support see: http://www.accelrys.com/support/
Online help: % genhelp or http://www.accelrys.com/support/bio/genhelp/
$ more .login
source /opt/gcg/gcgstartup
source /opt/gcg/genetics
|
Getting sequence into GCG
Gcg uses yet another format for holding sequence information. Check the
user's manual of genhelp for how to make sequences into the gcg format.
Grab a sequence from here
and perform something like the following:
Logging into Locus |
$ vi test.fasta
>example
IWSIMPANSE KEHLSIVICG HVDSGKSTTT GRLIFELGGL PERELDKLK
EAERLGKGSF AFAFYMDRQK EERERGVTIA CTTKEFYTEK WHYTIIDAPG
HRDFIKNMIT GASQADVALI MVPADGNFTT AIAKGNHKAG EIQGQTRQHS
RLINLLGVKQ ICIGVNKMDC DTAAYKQARY DEIANEMKSM LVXVGWKKDF
IRENTPVMPI
$ ls
test.fasta
$ fromfasta test.fasta
example.pep 210 aa.
Finished FROMFASTA with 1 files written.
210 sequence characters were reformatted.
$ ls
example.pep test.fasta
$ more example.pep
!!AA_SEQUENCE 1.0
EXAMPLE TRANSLATE of: example.seq
example.pep Length: 210 October 5, 2004 11:21 Type: P Check: 2967 ..
1 IWSIMPANSE KEHLSIVICG HVDSGKSTTT GRLIFELGGL PERELDKLK
51 EAERLGKGSF AFAFYMDRQK EERERGVTIA CTTKEFYTEK WHYTIIDAPG
101 HRDFIKNMIT GASQADVALI MVPADGNFTT AIAKGNHKAG EIQGQTRQHS
151 RLINLLGVKQ ICIGVNKMDC DTAAYKQARY DEIANEMKSM LVXVGWKKDF
201 IRENTPVMPI
$ mv example.pep test.pep
|
Now that we have some information in the correct format we can play with
it using the tools provided by gcg. reformat is another program which may
be used to import information into gcg. In order to make changes to some
sequence information, use seqed. Using seqed is very similar to using vi,
so spend a few minutes working with seqed while reading the genhelp
documentation. Some important commands include: fetch,
assemble, and reformat. (net)Fetch acquires sequence from
the local database or from others, assemble can be used to link parts of
sequences or sequences from different files into a single sequence and
reformat is used to maintain a sequence file's identity within gcg.
Gcg maintains a timestamp checksum of the current state of an input file,
so if you edit your sequence file outside of gcg, you will need to run
reformat to return it to the 'good graces' of gcg:
Reformat Example |
$ vi test.pep
$ vi test.pep
!!AA_SEQUENCE 1.0
EXAMPLE TRANSLATE of: example.seq |
HERE IS A NEW COMMENT THAT I ADDED BEFORE THE '..' RIGHT ............. v
example.pep Length: 210 October 5, 2004 11:21 Type: P Check: 2967 ..
1 IWSIMPANSE KEHLSIVICG HVDSGKSTTT GRLIFELGGL PERELDKLK
....
$ reformat test.pep
$ more test.pep
!!AA_SEQUENCE 1.0
EXAMPLE TRANSLATE of: example.seq |
HERE IS A NEW COMMENT THAT I ADDED BEFORE THE '..' RIGHT ............. v
example.pep Length: 210 October 5, 2004 11:23 Type: P Check: 2969 ..
1 IWSIMPANSE KEHLSIVICG HVDSGKSTTT GRLIFELGGL PERELDKLK
....
|
Downloading sequences with gcg
We now know how to import our own sequences into gcg, but what if we want
to work with all the tufa sequences in the existing local database?
Reformat Example |
$ fetch *tufa*
battufa.gb_ba
...
$ ls
attufa.gb_pl crtufa.gb_pl ettufagen.gb_in sttufa.gb_ba test.fasta test.pep
battufa.gb_ba ecotufa.gb_ba gmtufa.gb_pl test.blastp test.netblastp tobtufa.gb_pl
coocptufa.gb_pl ectufa.gb_pl gmtufa2.gb_pl test.blastpgp test.out tobtufa2.gb_pl
|
Or we can get sequences from ncbi by accession number. Check out the
genhelp documentation for netfetch and get some sequences from ncbi.
Using Blast from within gcg
Now that we have some information inside gcg, let us perform a blast
search without having to go to ncbi.
A Locus Blast search |
$ which blast2
blast2: aliased to ( set noglob; $GCGUTILDIR/blast2 -programname=blast !* )
$ blast2 test.pep
BLAST searches one or more nucleic acid or protein databases
for sequences similar to one or more query sequences of any
type. BLAST can produce gapped alignments for the matches it
finds.
Begin (* 1 *) ?
End (* 210 *) ?
Search for query in what sequence database:
1) pir p Protein Information Resource
2) nrl_3d p NRL_3D Protein Sequence-Structure Database
3) swplus p SWISS-PROT + SP-TREMBL
4) genembl n GenBank + EMBL (HTGs Removed)
5) htg n High Throughput Genomes (HTG from GenBank and EMBL)
6) genpept p GenPept (Translated GenBank)
Please choose one (* 1 *): 6
Ignore hits expected to occur by chance more than (* 10.0 *) times?
Limit the number of sequences in my output to (* 500 *) ?
What should I call the output file (* test.blastp *) ?
|
This search was done using a copy of the ncbi databases which are local
to the machine locus. As a result they do not have all the same pieces of
sequence information as the databases at ncbi. To amend this potential
problem, gcg also has netblast (which does not appear to work!); so
instead try out the blastcl3, which does not accept the gcg format, so we
will instead use our original fasta formatted file. Compare your
experience with that from using the ncbi web site blast.
Using NCBI's Blast locally |
$ blastcl3 -i test.fasta -o test.out -p blastp
$ more test.out
BLASTP 2.2.9 [May-01-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
lots of output...
|
How are the outputs from a local blast database different from those
located at ncbi? Why? Make a backup copy of your input sequence (cp
test.fasta test.fasta.keep) and try out the gcg version of fasta
GCG's fasta |
$ fasta test.pep
FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST.
... lots of output
$ more test.fasta
!!SEQUENCE_LIST 1.0
(Peptide) FASTA of: test.pep from: 1 to: 210 October 5, 2004 11:47
EXAMPLE TRANSLATE of: example.seq
What the heck is this thing?
TO: PIR:* Sequences: 232,624 Symbols: 80,607,033 Word Size: 2
Databases searched:
NBRF, Release 68.0, Released on 31Mar2001, Formatted on 24Oct2001
Scoring matrix: GenRunData:blosum50.cmp
Variable pamfactor used
Gap creation penalty: 12 Gap extension penalty: 2
Histogram Key:
Each histogram symbol represents 348 search set sequences
Each inset symbol represents 9 search set sequences
z-scores computed from opt scores
... lots of information
|
What does the header information mean?
A Reintroduction to Blast
As we discussed in class, the matrix used for calculating the scores
of an alignment defines the hit found in the alignment and its
suitability. Thus one may want to ask the questions: What if the
assumptions made by an amino acid substitution matrix are violated by
the query sequence? What if one segment of a sequence is under
stronger selection than another segment? Altschul and friends sought
to amend these problems with the Position Specific Iterative version
of Blast. The matrix created by this algorithm is position specific;
thus each amino acid receives its own series of substition odds.
These individual position matrices are in turn modified after each
iteration of the algorithm by choosing which hits of the database to
contribute to the position specific substitution matrix.
Here are a few sequences
to experiment with; try the following:
- Check out the
blast documentation
to read up on psi-blast.
- Perform a 'normal' blast search of the sequences. What families
of proteins appear?
- Perform a psiblast upon the same sequences. What appears in the
first iteration? Second?
- Perform separate iterations of psiblast, one in which only the
best matches are included in the second, third, nth iterations;
another in which only the worst matches are included in the second
and third iterations. What happens? How many hits appear in each
iteration? How specific are the hits in each case? Why?
Further Exercises
This exercise uses sequences of triose phosphate isomerase (TPI) from
mosquitoes assumptions an example.
Review the genhelp document for fetch
Fetch the assorted mosquito TPI sequences of interest.
Supply an informative filename for each sequence, including:
culex-tpi.gbk
Reformat Example |
$ fetch L07390 -out=culex-tpi.gbk
$ assemble
|
Now perform the following operations:
- Type in the name of the source sequence.
- Enter the beginning and end points for the first exon
- select "add another segment"
- enter the endpoints for the second exon
- write the assembled sequence out to a new file,
perhaps "culex-tpi.seq"
If you are going to use your sequences for multiple sequence alignment,
check to be sure that your other sequences are all about the same length,
and do not include any intervening or flanking seqeunces. Multiple
sequence alignment programs often work best when the sequences are all
about the same length.
You can also run assemble with command-line syntax. This is helpful for
processing large numbers of sequences, because scripts can be used to
automate the process. Consult genhelp for the command-line syntax.
A Unix command refresher
It has been a couple of weeks since your crash course with using Unix.
Take a moment to refresh your memory.
Reformat Example |
$ man ls
$ apropos ls
ldapadd [ldapmodify] (1) - LDAP modify entry and LDAP add entry tools
ldapmodify (1) - LDAP modify entry and LDAP add entry tools
lev_comp (6) - NetHack special levels compiler
lfind [lsearch] (3) - linear search of an array
lock [unimplemented] (2) - unimplemented system calls
ls (1) - list directory contents
lsattr (1) - list file attributes on a Linux second extended file system
lsearch (3) - linear search of an array
lsearch (n) - See if a list contains a particular element
lseek (2) - reposition read/write file offset
lseek64 (3) - reposition 64-bit read/write file offset
lset (n) - Change an element in a list
lsetxattr [setxattr] (2) - set an extended attribute value
lsmod (8) - list loaded modules
....
$ ls -aList every file, even those which begin with .
$ ls -FC
$ cd
$ cp oldfile newfile
$ cp olddirectory/oldfile newdirectory/newfile
$ sftp username@hostname
|
Performing a pairwise alignment in GCG
GCG provides implementations of the most common dynamic programming
algorithms for pairwise sequence alignment.
gap (Needleman-Wunsch algorithm) will find the globally optimal
alignment of two sequences of similar length. gap will always
completely align the two sequences, so you must know in advance that the
sequences should be aligned along their full length.
bestfit (Smith-Waterman algorithm) will find the optimal local
alignment of two sequences. bestfit is most effective for aligning
two sequences that are known to be homologous, but are somewhat different
in length.
fasta (Pearson-Lipman algorithm) uses heuristic methods to find
matches between a query sequence and a sequence database. Performs well
with nucleotide data, but generally slower than blast.
blast (BLAST algorithm) uses fast heuristic methods to find matches
between a query sequence and a local sequence database. Several variants
of blast are available for different purposes. blast can
rapidly search large databases.
Use genhelp to review the syntax for each command and perform some
alignments using the mosquito TPI genes previously downloaded.
Created: Wed Sep 15 00:58:22 EDT 2004 by Charles F. Delwiche
Last modified: Mon Nov 8 15:49:44 EST 2004
by Ashton Trey Belew.
|