Synthesize previous work!

Prerequisites
Objectives

Synthesize the previous pieces of information into the ability to perform a single, coherent analysis.

Introduction

We have used GCG, Perl, and Unix to open the door to complex searches for information. Today our goal is to build on those pieces and bring them together to produce more complete analyses.

Fixing Netfetch

Netfetch is an extremely useful tool for quickly downloading sequences into GCG. However, as we have noticed, its output is flawed. To get around this, I wrote a shell script called fixnetfetch; another option would be to modify your script that translates from the NCBI flat file format to FASTA.
My fixnetfetch script looks like this:

fixnetfetch

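#!/bin/sh
# Usage: fixnetfetch FILE   (FILE is edited in place)
filename=${1}
# Delete everything from the "!!" line through the line starting with "{",
# turn the leading "sequence " into the GCG ".." divider, and drop the closing "}".
# Braces stay unescaped: in basic regular expressions \{ opens an interval.
sed '/^!!/,/^{/d;s/^sequence /../;/^}$/d' "${filename}" > "${filename}.tmp"
mv "${filename}.tmp" "${filename}"
# Let GCG's reformat rewrite the cleaned file as a proper GCG sequence file.
reformat "${filename}"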
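For readers more comfortable with Python than sed, the three editing rules in the script can be sketched as below. This is only an illustration, not a replacement for the script: the function name fix_netfetch_lines is hypothetical, the sample input is made up purely to exercise the rules (it is not real Netfetch output), and the final GCG reformat step is not reproduced.

```python
def fix_netfetch_lines(lines):
    """Apply the same three rules as the sed command to a list of lines."""
    cleaned = []
    in_header = False  # True while inside the "!!" ... "{" range being deleted
    for line in lines:
        if in_header:
            # The deleted range ends with (and includes) the next line starting with "{".
            if line.startswith("{"):
                in_header = False
            continue
        if line.startswith("!!"):
            in_header = True
            continue
        if line.rstrip("\n") == "}":
            # Drop a line that consists only of a closing brace.
            continue
        if line.startswith("sequence "):
            # Replace the leading "sequence " with the ".." divider.
            line = ".." + line[len("sequence "):]
        cleaned.append(line)
    return cleaned


# Hypothetical input, invented only to show each rule firing once.
sample = ["!!MADE-UP HEADER\n", "more header\n", "{\n", "sequence ACGT\n", "}\n"]
print("".join(fix_netfetch_lines(sample)), end="")
```

Keeping the range deletion as an explicit flag mirrors how sed treats an address range: the start line and every line up to and including the end line are dropped before any other rule is considered.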
Some Practice

As you go through these exercises, keep a log of what you are doing using the tools on your workstation and answer all the relevant questions. When you have finished exploring these sequences, print out your log and turn it in.

Acquire the sequence for accession L07390.
What is the scientific name of the organism from which this sequence was isolated?
What is its common name?
What gene is (partially) encoded by this sequence?
Is there an intron in the sequence?
If so, where does it begin and end?
Where do the exons begin and end?
This sequence is part of a study used to help validate a particular hypothesis. What is this hypothesis? Do you agree? Why or why not?
What manipulation would you have to perform on this DNA sequence if you were going to translate it to get the amino acid sequence shown?
What is the %GC content of the entire sequence?
What is the %GC of the coding sequence (the two exons)?
To calculate %GC use (#G+#C)/(#A+#C+#G+#T)*100.
What does the GC content of the sequence tell you?
What might it mean if a region of the sequence has a radically different GC content?
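The %GC formula above is easy to check with a few lines of code. The following is a minimal sketch in Python; the function names gc_percent and windowed_gc, and the default window and step sizes, are assumptions chosen for illustration. Ambiguous bases such as N are left out of both counts, which matches the denominator in the formula. The second function reports %GC in overlapping windows, one simple way to look for regions whose composition differs from the rest of the sequence.

```python
def gc_percent(seq):
    """Return %GC as (#G + #C) / (#A + #C + #G + #T) * 100."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return gc / total * 100


def windowed_gc(seq, window=100, step=50):
    """Return (start, %GC) pairs for overlapping windows; start is 1-based.

    A window containing no A, C, G, or T at all makes gc_percent raise ValueError.
    """
    results = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        results.append((start + 1, gc_percent(chunk)))
    return results


print(gc_percent("AACCGGTT"))                     # 50.0
print(windowed_gc("GGGGAAAA", window=4, step=4))  # [(1, 100.0), (5, 0.0)]
```

To get the %GC of the coding sequence, slice out the exon coordinates you found, join the pieces, and pass the result to gc_percent.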

BLAST the sequence against nr and find the closest matches from the model mosquito, Anopheles gambiae, and from Caenorhabditis elegans. Compare the BLAST alignment between the test sequence and the Anopheles sequence to the Smith-Waterman alignment produced by GCG's bestfit. How well do the statistics reported by the two algorithms agree?
How well do they agree when the test sequence is aligned with the C. elegans sequence instead?
Perform a dotplot of the two mosquito sequences. Does it tell you anything interesting?
Compare the mosquito and worm sequences in the same manner.
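If you want to see what a dotplot is doing under the hood, the idea can be sketched in a few lines of Python. This is a deliberately simplified version that marks exact word matches only; it is not how GCG computes its plot, and the function name dotplot and the word parameter are assumptions made for this example.

```python
def dotplot(seq_a, seq_b, word=3):
    """Return rows of text with '*' where a word of seq_a equals a word of seq_b."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    rows = []
    for j in range(len(seq_b) - word + 1):        # one row per word of seq_b
        row = ""
        for i in range(len(seq_a) - word + 1):    # one column per word of seq_a
            same = seq_a[i:i + word] == seq_b[j:j + word]
            row += "*" if same else "."
        rows.append(row)
    return rows


# A sequence compared with itself gives an unbroken main diagonal.
for row in dotplot("ACGT", "ACGT", word=2):
    print(row)
# *..
# .*.
# ..*
```

Similar regions show up as diagonal runs of stars, a repeat shows up as more than one diagonal, and a gap in a diagonal marks a stretch that is present in one sequence but not the other.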

Now imagine the following scenario:
A graduate student friend of yours keeps a terribly disorganized laboratory notebook. The student has found the following sequence among their records but has no idea what it is or where it came from. What information can you provide that would help this student determine the nature and identity of this sequence? You should use GCG to analyze the sequence, but don't forget about the other skills you have developed in class (hint, hint). GCG programs that you might find particularly useful include MAP, CODONPREFERENCE, and FRAMES.

>unknown.seq

CCTTGATAAGTGCGTACGNCNAGGTTTTCCNATTCANANGTTNTAAAANG
ACGGCCAGTGAATNGTAATACGACTCACTATAGGGCGAATTGGGTACCGG
GCCCCCCCTCGAGGTCGACGGTATCGATAAGCTTGATatgAGTTCCAATC
TAAAAAATAATGAATATAAAGAAGGATGTTATCCTTTGTTCTTTTTTGAA
AATTTCTACGTAAAAGTCTCTATTAATACTGCTTATTATATACTTAAGAC 
AGACAAAAGAAAAAAAGATAAACAAATAAAATCCAATTTATTAAAAAAAA
ACAATCAAATGATACCTTTCATTTTCTACTTATTAAACGATTAATTCGTA
AGATAAGGCAACAAAATTTTAACTGGAGTGAATCATCTAGATTGTATGAT 
TTTTCTAATAAAATAGAACCTAATTATAAATATGAATATAATAGAATTAA
GTTATTTTATATTTTATTAATAGAGAATTTGATATTTTTAGTATTACGAT
TCTTATGGGAACAAAAACAAGAGAAAAAGAATGATTTTTCTCTTTTCATT
AAAAAATCTATTCAATTTGCTTTTCCTTTTTTAGAACATAAAATGAGTAA 
TTCTGCTTCAATAATAGAAGGACAACTTTGTTTTTCTTATACAACTAGAA 
AGCTTAATTTTTTGCTTTTTTTTCTTTACAAGAGAATCCGCGATACTGTC 
TTTATAAATTTACTAAAAAAAATATTCAAGTTTAATAAATTACTTTTAAG 
GAAAGAGAATTATTTCAATGTAAATTATTTCAATGTAATGTCTAAAATCA 
GATTATTGGACTTATTAGCAAACTTATATGGAAATGAATTTGATTCTTTT 
TTTGTTTACAATATTTTAAAAATACATAACTTAAATTGTCTTTTTTTGCC 
ATATAAATCTATAGAAGATTATTCTTTACTACAAAAACACAATATTATTA 
TTAATAGTAATAGTTATAAAAATCAAATAAATATATCTTCTTTTTCTTGG 
TTAATTATCAATTTTATATATTCCATATACGGACACATTTTCTATATACG 
CCGTGGCATTTCATTTCTAATAATCCTTAAACTAGGACGAGGTTTCTCTC 
GATTTTGGAAATTTAATTGTGTCAAATTTATACAATTGAAATTAGAATCT 
AATCGTTCTTTTTATTTAATACAGTCACGGTTTGTTTTACGTCAAAGTTC 
GTTATTCTTAGGGTATAAAATTATAAATAGGTTTTGGCAAAAAAAACTAA 
AAATTAAAGCATCTTCTTGGTCTTTTTTTGTTTTTTTAAAAGATCGAAAA 
ATATCTTCAGAAATACCAATTGATAATCTTATTACTAATTTAACTGTAAT 
TAATTTATGTAATAAAAAAGGTTATCCAATTCATAAAGCTTCGTGGTCTA 
CATTTAGTGATCAACAAATTATAAAAATTTATAATAAAGTGTGGAATGAA 
TTATTTTTGTATTATTGTGGATCTTCGAATCGTTCTATTTTAACTCAAAT 
TCAGTATATTTTAGAATTTTCATGTATTAAAACTTTAGCTTTTAAACATA 
AATCTAATATTAGATTGGCATGGGAGCAATACAGAAAAGATGTGTCATTA 
TCCAACTTAGAAAGCGATATAGATTATTTTGGTAAAATCTCATATAATTT 
TCCTTCTTTATTTCAAAAAAAAAACTTTTTTTGGCTTTTAGGAATTTCTA 
GAATTGATCATCCAAATTCTTTTATTATTGAGTCATATTCAAGAATACAT 
GAGGAAAGCCGCTTGCATtgaATCGAATTCCTGCAGCCCGGGGGATCCAC 
TAGTTCTAGAGCGGCCGCCACCGCGGTGGAGCTCCAGCTTTTGTTCCCTT 
TAGTGAGGGTTAATTTCGAGCTTGGCGTAATCATGGTCATAGCTGTTTCC 
TGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATACGAGCCGGAA 
CATAA 
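As a cross-check on what FRAMES shows you, open reading frames can also be listed with a short script. The sketch below is a rough Python analogue and makes no claim to reproduce the FRAMES algorithm: the function names and the min_codons threshold are assumptions, only ATG is treated as a start codon, and coordinates on the "-" strand are given relative to the reverse complement rather than the original sequence.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement; characters other than ACGTN pass through unchanged."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for ATG...stop stretches on one strand.

    Positions are 1-based; end includes the stop codon.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOPS:
                # Count the codons before the stop, including the ATG.
                if (pos - start) // 3 >= min_codons:
                    yield frame + 1, start + 1, pos + 3
                start = None


def six_frame_orfs(seq, min_codons=50):
    """Return (strand, frame, start, end) tuples for both strands."""
    forward = [("+",) + orf for orf in orfs_in_strand(seq, min_codons)]
    reverse = [("-",) + orf
               for orf in orfs_in_strand(reverse_complement(seq), min_codons)]
    return forward + reverse


# Tiny made-up test sequence: one ORF (ATG AAA TTT TAG) in forward frame 3.
print(six_frame_orfs("CCATGAAATTTTAGCC", min_codons=3))  # [('+', 3, 3, 14)]
```

Because the sequence is converted to upper case first, lower-case letters in the input are treated the same as upper-case ones. An open frame that runs off the end of the sequence without a stop codon is not reported.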
    

Created: Wed Sep 15 00:58:22 EDT 2004 by Charles F. Delwiche
Last modified: Mon Nov 8 15:49:44 EST 2004 by Ashton Trey Belew.