BSCI 348s Comparative Bioinformatics
Homework 2002
Assignment 1 (25 points, due Monday, October 7, 2002)
Your completed assignment should fit on two 8.5x11 sheets of paper (single sided). The matrix from part 1 should fit on one page, and the remaining parts should fit on a single page. Use a text editor to prepare the assignment if possible, but hand-written responses are acceptable if neatly written.
Part 1 (10 points):
a) Manually apply the Smith-Waterman Algorithm without using a separate gap extension penalty (i.e., as we did it in class) to determine the best alignment for the following amino acid sequences. Use a BLOSUM62 matrix for your match scores, and a gap penalty of 8. Be sure that your matrix shows the three calculations for each cell and the traceback, and be sure that you show the alignment as well as the matrix you used to calculate it.
Sequence #1: IYGWPALK
Sequence #2: YGPALGK
You may use GCG to check your work ("bestfit" implements the Smith-Waterman algorithm, "gap" the Needleman-Wunsch algorithm).
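The cell-by-cell procedure in part (a) can be sketched in Python as a way to check a hand-built matrix. This is only a minimal sketch: the function names are hypothetical, and the toy scoring function (+5 for a match, -4 for a mismatch) is a stand-in chosen for illustration, not BLOSUM62. To check the assignment you would pass a function that looks up BLOSUM62 values instead.

```python
def toy_score(a, b):
    """Stand-in scoring function (NOT BLOSUM62): +5 match, -4 mismatch."""
    return 5 if a == b else -4


def smith_waterman(seq1, seq2, score=toy_score, gap=8):
    """Local alignment with a single (linear) gap penalty.

    Returns (best_score, aligned_seq1, aligned_seq2).
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            # The three calculations for each cell, plus the floor of zero.
            diag = H[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1])
            up = H[i - 1][j] - gap      # gap in seq2
            left = H[i][j - 1] - gap    # gap in seq1
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]):
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, top, bottom = smith_waterman("GWWPAL", "GWPAL", gap=2)
    print(best)
    print(top)
    print(bottom)
```

When several cells tie for the best score, or a cell can be reached by more than one move, this sketch keeps the first maximum and prefers the diagonal move, so other equally good alignments may exist.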
Part 2 (15 points):
Use GCG to apply both the Smith-Waterman and Needleman-Wunsch algorithms to the following three pairs of sequences, and show the alignments. For this and subsequent alignments, use a gap creation penalty of 8 and a gap extension penalty of 2 (i.e., the GCG defaults).
b)
Sequence #3: IYGWPALKALE
Sequence #4: YGPALGKAI
c)
Sequence #5: IYGWSALKALE
Sequence #6: YGSALGKAI
d)
Sequence #7: IYGWSALKALE
Sequence #8: YGSLGKAI
e) How do sequences 5&6 differ from sequences 3&4? Did this affect the alignments that you found using the Smith-Waterman algorithm? What is responsible for this effect?
f) How do sequences 7&8 differ from sequences 3&4? How and why does this alignment differ from the previous alignments?
g) What properties do you feel constitute a "good" alignment? Given these criteria, which of the alignments is the best?
h) Under what circumstances would you recommend using the Needleman-Wunsch algorithm?
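Part 2 replaces the single gap penalty of Part 1 with separate creation and extension penalties. The short Python sketch below shows how the total cost of one gap grows with its length under the two schemes. The function names are hypothetical, and whether the extension penalty is charged for the first gapped position varies between programs, so that choice is exposed as a flag rather than asserted for GCG; check the program documentation for the convention it uses.

```python
def linear_gap_cost(length, gap=8):
    """Single penalty per gapped position, as in Part 1."""
    return gap * length


def affine_gap_cost(length, creation=8, extension=2, charge_first_position=True):
    """Creation penalty once per gap, plus an extension penalty per position.

    charge_first_position is an assumption to be checked against the
    alignment program in use: some programs charge the extension penalty
    for every gapped position, others only for positions after the first.
    """
    if length <= 0:
        return 0
    extended = length if charge_first_position else length - 1
    return creation + extension * extended


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, linear_gap_cost(n), affine_gap_cost(n))
```

With the penalties given above, a gap of three positions costs 24 under the single penalty and 14 (or 12, under the other convention) under the creation/extension scheme, so long gaps are penalized far less heavily in Part 2.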
Homework #2, Due Friday Nov. 2, 2002. 25 points.
The following sequence, which can also be found in GCG format on locus at /fs/bsci348s/bioinf/hwk/hwk2.seq, is a single sequence read from an amplified stretch of DNA. Please tell me as much as you can about this sequence. Among the questions you should address are:
- Does this sequence represent a protein-coding sequence?
- What gene does it represent?
- What organism might it have come from? How specifically can you answer this question?
- Does it have any frame-shift errors/mutations?
- Is there anything unusual about the sequence?
- What other inferences can you make about this sequence?
Model 4000 - Sample Notepad (LI-COR Form)
[IMAGE INFORMATION]
:Name of Image: WoloZ1FRxn5 (Jafari)
[SAMPLE INFORMATION]
:Sample Filename: WoloZ1FRxn5
:Sample Number:
:Type of Enzyme:
:Type of Primer:
:Type of Template:
:Date of Reaction:
:Type of Reaction (Manual/Robot):
[ANALYSIS INFORMATION]
:Sequenced By: lw
:Number of Bases Sequenced:
:Number of Errors Made:
:How Sequenced (Auto/SemiAuto/Manual): 200 semi, rest auto
:Sequencing Comments: should be checked,esp. near end
..
GGGCGAATNGGGCTCTAGATGCATGCTCGAGCGGCCGCCAGTGTGATGGA
TATCTGCAGAATTCGGCTTGGCCTGCAGAAACCTCACTTTAATATTGGTA
CAATTGGACACGNAGACCACGGCAAAACTACCCNNACGGCTGCCATTACC
AAAGTGCTAGCCGAAAAGGGACTTTCAGAAGCTCGNTCATTTGATTCAAT
TGACTCGGCTCCYGAAGAAAAAGAACGCgGtATTACAATTAATACAGCAC
ACgtAGAATATTCYACAGCAAATCGtCACTATGCACACggAGATTGtCCa
GGgCACGCTGACTATGTTAAAaACATGGtTACAGGtGCTGCACAAATGGA
TGGGGCTATATTAGTGGttGCtTCAACAGAtGGtCCYATGCCTCAaACTC
GgGAGCATATCCYACTTGCTCGNCAAGTTGGgGTACCYCGNATTGtGGNA
TTCAtGAACAAAGTTGACATGGTTGATGACGaAGAGTTGTTAGACTTAGT
TGAAATAGAAATtCGtGAATTATTGAATAAAtATGaTTYCCCcGGgGATG
aAATTCCAAtCAtTCGgGGtTCTSCTTTAGGNGGatTGAATGGNGATGCA
GCTTGGGGNAGAAAAaAa
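One way to start on the protein-coding and frame-shift questions is to translate the read in all six frames and compare the number of stop codons per frame. The Python sketch below does this under stated assumptions: it uses the standard genetic code (the source organism may use a different one), it translates any codon containing an ambiguity code such as N, Y or S as X, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# source organism uses the standard code, which may not be true).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes and their complements.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(dna):
    """Reverse complement, keeping IUPAC ambiguity codes."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one frame; codons with ambiguity codes become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def six_frame_stop_counts(dna):
    """Count stop codons ('*') in each of the six reading frames."""
    counts = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            counts[f"{strand}{frame + 1}"] = translate(seq, frame).count("*")
    return counts


if __name__ == "__main__":
    read = "ATGGCCNNNTAA"  # replace with the sequence from the assignment
    print(translate(read))
    print(six_frame_stop_counts(read))
```

A frame that stays open for part of the read and then fills with stops, while a neighboring frame opens up at the same point, is a pattern worth examining for a frame shift, particularly where the base calls are lower case or ambiguous.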
Homework #3, Due Wednesday Nov. 13, 2002. 25 points.
The following sequences can also be found in GCG format on locus at /fs/bsci348s/bioinf/hwk/ as hwk3a.seq and hwk3b.seq. As with homework #2, your assignment is to perform a thorough analysis of these sequences and report on the results. The two sequences are not believed to be related, so it is probably not worth your while to try to evaluate them jointly. Your report should be roughly 2-3 pages in length, and should include an annotated map for each sequence as well as a brief discussion of the analyses that you performed to reach these conclusions.
Sequence #3a:
This is a sequence obtained in a study of Polylepis, a high-altitude plant in the Rosaceae (rose family) section Sanguisorbeae. It was obtained by cloning a PCR product amplified with primers intended to anneal in the rRNA operon.
>HWK3A Polylepis Sequence
TGCATCCAACGCGTTGGGAGCTCTCCCATATGGTCGACCTGCAGGCGGCCGCGAATTCAC
TAGTGATTGTCCACTGAACCTTATCATTTAGAGGAATGAGAAGTCGTAACAAGGTTTCCG
TAGTTTAACCTAAGAAATGATCATTGTCGAAACCTGCCTAGCAGAACCTCGCGGCCACTC
GTCCCCTCATCCTGGGAAAGGAACGTCCCGAGCGTCGCACCTCGGTGCCTCCTCCCGACT
GACCCTCCCGGGCGTACTGAACATCAGCGTGAATTACGCCAGGGAACTTGAATGAAAGAG
CGTCTCCCCCACCAGTCTCCGGAGATGGTGTTCGTGCGGACAATTTCGTTGCCTTCCATA
TGTCTAAACACTCTCAGCAACGAATATCTCGGCTCTCGCATTGATGAAGAACGTAGCGAA
ATGCGATAATTGGTGTGAATTGCAAAATCCCGTGAACTATTGAGTATTTGAACGCAAATT
GCGCCCGAAGCCATTAGGCCTAGGGCACGTCTGCCTAAGTGTTACACGTCGTTGCCCCCC
CGACCCCTTCAGGGGTCGGACGGGAAGGATGATGGCCTCCCATGTTCTCTGTCAAACGGC
TGGCATAAATACCAAGTCCCCGGCAACCAACGGTGGTTGTGAGACCTCGGTGTCCTGTCG
TGCGCACGCGTCTGTCAGGGGCCCTCATGATGGGGTTTGCCCGCTGAATTTAAGCATATC
AATAAGCGGAGGAAATGCATCCAACGCGTTGGGAGCTCTCCCATATGGTCGACCTGCAGG
CGGCCGCGAATTCACTAGTGATTGTCCACTGAACCTTATCATTTAGAGGAATGAGAAGTC
GTAACAAGGTTTCCGTAGTTTAACCTAAGAAATGATCATTGTCGAAACCTGCCTAGCAGA
ACCTCGCGGCCACTCGTCCCCTCATCCTGGGAAAGGAACGTCCCGAGCGTCGCACCTCGG
TGCCTCCTCCCGACTGACCCTCCCGGGCGTACTGAACATCAGCGTGAATTACGCCAGGGA
ACTTGAATGAAAGAGCGTCTCCCCCACCAGTCTCCGGAGATGGTGTTCGTGCGGACAATT
TCGTTGCCTTCCATATGTCTAAACACTCTCAGCAACGAATATCTCGGCTCTCGCATTGAT
GAAGAACGTAGCGAAATGCGATAATTGGTGTGAATTGCAAAA
Sequence #3b:
This sequence is a single read from an EST project studying a dinoflagellate.
>HWK3B Dinoflagellate Sequence
GGGGAGTATCCGCAGCAGTGCCAGCTACAAACTTCTGATCTCGTCATATTGTGCCAAAGC
NTGCTGCTCTTTGATCAACTCAACCTTGTTCTGCAAGATGATGATGTGCTTCAAGCGCAT
GATCTCCAAGGCTGCCAAGTGTTCTGAGGTCTGAGGTTGTGGACAGGTCTCATTGCCGGC
AATGAGCAAAAGCGCTGCATCCATGACAGCAGCTCCATTAAGCATTGTGGCCATGAGAAC
GCAAATCAAGTTAAAAAAAATTAAAGAATAAATTATTATTATAATAAAACATTTAACTTT
TTTGTTATTTTAGACTATGTCTTCATTTATAAATGTAGGATTGAAAAAATTTTTGTTTAC
ACTACTCATTTATAGTCGTTGAACGTTATTAATATCCATTACATATACTAATATCGCTGC
AAATCTATCTTTAAGATAATTCTTGCAATTAACCCTATGACAGATCTATTCGCTTATGTC
GTGTCCGGGACAATCGACGAATGACACATGCCTCACCAGCCTCCAGTCGCCACCCTCTTC
GCATGGAAAGACATCTTCAGTCTTGCTACCACGAGATGTATAGCGGCCAGGGTTCTGCAT
CTCTTCATTGTCACACTTGTAGATCTTGGCATTCGCATAGCCCAACTTGATGGTGATGTT
ACGCTCCTTCTCCTGCTTGAACTTGACAGTGTGAATACCGCTAAGGCCCTTGACCACCGT
GGACTTACCATGTGCCACATGGCCAATAGTGCCAATATTCATCGTAGCCTGTCTGCTGAT
GACCTCAGGTGTGAGAGGCGTCAGCTTTTTCACATCCAGCTTTGTCAAGTCCTGCACTGC
AAGGCCATCATCAACTTCTGCTTCCTCCTCCGCCATCGCACCAGGGACCAGAGTCTCCTA
AGAGGCGTTTCCCGTCTTTTTGTGTCCTG
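When building an annotated map, a sliding-window profile of base composition is one simple way to spot regions that differ from the rest of a sequence. The Python sketch below is an illustration only: the function name, window size and step are arbitrary assumptions, and ambiguous bases such as N are left out of each window's count.

```python
def gc_windows(dna, window=60, step=30):
    """Return (1-based start, GC fraction) for successive windows.

    Only unambiguous bases (A, C, G, T) are counted; a window with no
    unambiguous bases is reported as 0.0.
    """
    dna = dna.upper()
    profile = []
    last_start = max(len(dna) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = dna[start:start + window]
        called = [base for base in chunk if base in "ACGT"]
        gc = sum(base in "GC" for base in called)
        fraction = gc / len(called) if called else 0.0
        profile.append((start + 1, fraction))
    return profile


if __name__ == "__main__":
    example = "GGCCGGCCAATTAATT"  # replace with a sequence from the assignment
    for start, fraction in gc_windows(example, window=8, step=4):
        print(start, round(fraction, 2))
```

Regions whose composition departs sharply from their neighbors are natural places to put boundaries on the map and to run separate database searches.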
Homework #4, Due Saturday Dec. 21, 2002. 25 points.
Consider the following multiple sequence alignment in GCG's "MSF" format:
These data can be imported directly into PAUP* (for Phylogenetic Analysis Using Parsimony and any combination of other methods or none), and can also be manipulated in GCG or ClustalX. For your convenience, a copy of this file is also located on locus at /fs/bsci348s/bioinf/hwk/homework4.msf.
These sequences correspond roughly to those used to determine the tree in Figure 2 of Kolell, K. J., and Crawford, D. L. 2002. Evolution of Sp transcription factors. Mol. Biol. Evol. 19:216-222. Kolell and Crawford worked with amino-acid sequences, while these sequences are nucleotide sequences. There are, however, some other differences between these data and those used by Kolell and Crawford, and one sequence has been added to the dataset, while some others have been omitted.
1) Determine what each of these sequences is (i.e., what organism it is from and what its tentative identification is based on the annotations in GenBank), and what portion of the sequence in GenBank the homework sequence represents. Report this information in the following format:
Sequence AY057451 = Fundulus Sp3, 1700 bp corresponding to positions 321-2121 of AY057451
You may find that it helps a great deal to read the genhelp entries on "reformat" (particularly using reformat with msf format files) and "netblast".
2) The alignment was made by pileup, which is a somewhat outdated alignment tool. Can the alignment be improved? What distinguishes a good alignment from a poor one? Remember that these are protein-coding sequences.
3) Compare the results of the three phylogenetic analyses listed below. For these analyses you may use the alignment provided. Does the method of phylogenetic analysis affect how these sequences would be interpreted? On the basis of your analyses, do you believe that all of the GenBank files are annotated correctly? What change in the annotations would you recommend, and why?
a) a heuristic search using parsimony as the optimality criterion.
b) a heuristic search using minimum evolution as the optimality criterion and Jukes-Cantor distances (minimum evolution is one of several distance optimality criteria). If you wish you could also try a distance search using maximum likelihood distances and the model specified below.
c) a heuristic search using maximum likelihood as the optimality criterion, and the GTR model with both gamma and invariant sites correction for site-to-site rate variation. You should use the following parameters:
Base frequencies:
A 0.253473
C 0.305427
G 0.256295
T 0.184805
Rate matrix R:
AC 1.18667
AG 2.54000
AT 1.14272
CG 0.86606
CT 3.25480
GT 1.00000
P_inv 0.0684388
Shape 1.217872
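For the distance analysis in 3b, the Jukes-Cantor correction converts the observed proportion of differing sites p between two aligned sequences into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The Python sketch below illustrates that formula only; it is not how PAUP* computes distances internally, the function names are hypothetical, and skipping any column that contains a gap or ambiguity code is an assumption made here for simplicity.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among columns where both bases are A/C/G/T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance; infinite when p reaches the 0.75 saturation point."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    print(jukes_cantor("ACGTACGT", "ACGTACGA"))
```

The corrected distance is always at least as large as p, and the gap between the two widens as sequences diverge, which is why uncorrected distances underestimate change between distantly related sequences.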