BSCI 380 Laboratory

An Introduction to Phylogenetic Inference and Paup

Prerequisites

Ability to work in the Unix environment
Facility with a sequence alignment editor
Experience with multiple sequence alignment

Introduction

PAUP* is a major analytical tool in phylogenetic analysis. It makes available a very wide variety of analytical methods in a single environment, and can be operated via window/mouse, command-line, or scripts.
The PAUP* web site has considerable useful information.
PAUP* is a commercial program (available from Sinauer Associates). It is quite a good value, but if you wish to use free software you will have to use another package. A very flexible alternative to PAUP* is phylip, which also includes a wide range of analytical methods. The phylip web site also has an excellent summary of phylogenetic software available elsewhere on the web. Consult it if you are looking for a specific analytical capability.
Two main elements determine how long a phylogenetic search will take: the type of search through the space of possible trees, and the optimality criterion used to choose the best tree in that space. We will examine different optimality criteria and search methods.
What precisely is the "space of possible trees"? It is the set of all distinct trees for a given number of taxa, and its size grows extremely quickly. For unrooted, fully bifurcating trees of n taxa, the number of trees is the product of (2i - 5) for i running from 3 to n, that is 1 x 3 x 5 x ... x (2n - 5). Ten taxa already give 2,027,025 possible trees.
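To get a feel for how quickly tree space grows, the count of unrooted bifurcating trees can be computed directly. The following is a minimal Python sketch; the function name count_unrooted_trees is chosen here for illustration and is not part of paup.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted, fully bifurcating trees for n_taxa labeled taxa.

    This is the product of (2i - 5) for i from 3 up to n_taxa.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    count = 1
    for i in range(3, n_taxa + 1):
        count *= 2 * i - 5
    return count


if __name__ == "__main__":
    # Print the size of tree space for a few data set sizes.
    for n in (4, 8, 12, 20, 26):
        print(n, "taxa:", count_unrooted_trees(n), "unrooted trees")
```

Running it for the 8-12 taxa used later in this exercise shows why exhaustive searches are only practical for small data sets.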

Getting Started with Paup

When starting to work with Paup, you will first need to execute a NEXUS file. This is yet another sequence file format that carries metadata; its data block contains a multiple sequence alignment. A NEXUS file looks something like this:

#NEXUS
BEGIN DATA;
dimensions ntax=26 nchar=1303;
format missing=?
symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
interleave datatype=DNA gap= -;

matrix
Zamia          --------------------------------------------------
Cycas          ------------------------------------------CGTGTTTA

The segment above shows the gap characters that pad the beginning of a multiple sequence alignment. Readseq (installed on both the owl cluster and locus) is a widely used utility for converting sequence files from one format to another, including NEXUS.
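If Readseq is not at hand, a small alignment can also be written out by hand. The Python sketch below builds a minimal, non-interleaved NEXUS data block from already aligned sequences. The function name to_nexus and the example sequences are made up for illustration, and the sketch assumes taxon names contain no spaces or punctuation.

```python
from typing import Dict


def to_nexus(alignment: Dict[str, str], datatype: str = "DNA") -> str:
    """Format aligned sequences (taxon name -> sequence) as a NEXUS data block."""
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    # Pad names so the sequences line up in a column.
    width = max(len(name) for name in alignment) + 2
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        "dimensions ntax=%d nchar=%d;" % (len(alignment), nchar),
        "format datatype=%s missing=? gap=-;" % datatype,
        "matrix",
    ]
    for name, seq in alignment.items():
        lines.append(name.ljust(width) + seq)
    lines.extend([";", "END;"])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical toy alignment, not real data.
    example = {"Zamia": "----CGTGTTTA", "Cycas": "ACGTCGTGTTTA"}
    print(to_nexus(example))
```

Real alignments are usually long enough that an interleaved layout, as in the example above, is easier to read; this sketch writes each sequence on a single line for simplicity.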
Before using paup for phylogenetic inference, create or acquire an alignment. I suggest going to the protein family database (Pfam), http://pfam.wustl.edu, and browsing for a protein family of interest. Take a moment to notice the % identity, the structures provided, the descriptions of the protein families, and the information provided about how the hidden Markov models for a family were built. When downloading the alignment for a family, keep in mind that our class is only three hours long, and phylogenetic inference on even a score of taxa can take days or indeed months. You may therefore want to download only the seed alignment, and even cut taxa out of it, before loading it into paup.
When working with paup, recall that there are three main optimality criteria used for phylogenetic analysis, each with strengths and weaknesses: parsimony, distance, and likelihood. I suggest spending a moment considering these strengths and weaknesses before continuing; if you are uncertain about them, ask your neighbors or look in the text or online.

Different ways of searching for trees

To compare different methods of searching tree space, we will hold the optimality criterion constant and change only the search type, from exhaustive to branch-and-bound to heuristic, to get a sense of their relative speed.
Start with an exhaustive search of a limited number of the taxa acquired from Pfam (8-12 taxa will take approximately 2-5 minutes on our machines), using parsimony as the constant optimality criterion. Select Exhaustive Search in the Analysis tab, examine the options, and perform the search. Inspect the tree length histogram and note how long the search takes.
Repeat the same search using a branch-and-bound search. Like an exhaustive search, it is guaranteed to find the optimal tree, but it does so without examining every tree geometry. Note again the tree and the time taken.
Finally, perform the same search using a heuristic search. Heuristic searches are not guaranteed to find the optimal tree geometry, so note especially the time taken and the geometry and length of the resulting tree.

A Parsimony Search

As the name "Phylogenetic Analysis Using Parsimony" suggests, paup's default analytical method consists of a parsimony search. If you are using the graphical user interface for paup, examine the parsimony options, noting that there are five subscreens of options which one may change. Consider each of these; if you change them, take note of the 'defaults' button in paup which allows one to return to the default settings.
It is possible to exclude ambiguously aligned characters, or characters for which there is too little information, by going to Data and then Include/Exclude characters.
When you are satisfied with the conditions of your parsimony search, start the search and note how long it takes.
pars.paup is a paup script file which may be used to run paup on the command line in order to perform a parsimony search. It illustrates how paup uses comments as well as how one may change options in the command line version of paup.
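To make the parsimony criterion concrete, the Python sketch below computes the length of one fixed tree with the Fitch algorithm, counting the minimum number of state changes for each alignment column and summing them. This is an illustration of the idea only, not paup's implementation. The names fitch and tree_length, the nested-tuple tree representation, and the toy alignment are all assumptions made for this example, and every symbol (including a gap) is treated as an ordinary unordered state.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one character on a binary tree.

    tree is either a taxon name (a leaf) or a pair (left_subtree, right_subtree).
    states maps each taxon name to its state for this character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change needed.
        return shared, left_changes + right_changes
    # No state in common: one change is required on this part of the tree.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the Fitch change counts over every column of the alignment."""
    nchar = len(next(iter(alignment.values())))
    total = 0
    for col in range(nchar):
        column = {name: seq[col] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


if __name__ == "__main__":
    # Hypothetical four-taxon alignment with three characters.
    aln = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
    print(tree_length((("A", "B"), ("C", "D")), aln))  # groups A with B
    print(tree_length((("A", "C"), ("B", "D")), aln))  # groups A with C
```

A parsimony search amounts to evaluating this kind of length for many candidate trees and keeping the shortest; the search strategies above differ only in which candidates they evaluate.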

A Distance Search

Distance methods reduce the multiple sequence alignment to a set of pairwise distances between taxa and then seek the tree that best fits these distances. They can therefore provide a good middle ground between accuracy and speed. Change the optimality criterion to distance in the Analysis tab, and recall that distance methods require two steps:
1. creating a matrix of distances from the multiple sequence alignment.
2. searching for a tree which best matches a given set of distances.
Perform a search and once again note the time it takes.
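The first of the two steps, building the distance matrix, can be sketched in a few lines of Python. The example below computes the proportion of differing sites (the p-distance) for each pair of sequences and applies the Jukes-Cantor correction, which is appropriate only for nucleotide data. The function names and the toy alignment are assumptions for illustration; paup offers many other distance measures, and the tree-fitting step is not shown.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping sites with a gap or missing data."""
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a in "-?" or b in "-?":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differences."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Corrected distance for every ordered pair of taxa, keyed by (name, name)."""
    names = list(alignment)
    return {
        (a, b): jukes_cantor(p_distance(alignment[a], alignment[b]))
        for a in names
        for b in names
    }


if __name__ == "__main__":
    # Hypothetical toy alignment, not real data.
    aln = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAAC-A"}
    for pair, dist in sorted(distance_matrix(aln).items()):
        print(pair, round(dist, 4))
```

The corrected distance is always at least as large as the raw p-distance, because it tries to account for multiple substitutions at the same site.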

A Likelihood Search

Likelihood methods are often the most accurate and also the most computationally demanding. The two primary likelihood-based approaches are Bayesian inference, which uses a Monte Carlo search, and maximum likelihood inference (strangely, I do not believe any maximum likelihood methods use a Monte Carlo search). Paup implements only maximum likelihood inference, selectable once again in the Analysis tab.
ml.paup is a paup script file which may be used to run paup on the command line in order to perform a maximum likelihood search.

Bootstrapping

A bootstrap provides a measure of confidence in the groupings of a phylogenetic tree inferred from a given multiple sequence alignment. As the name suggests, a bootstrap uses only the information already present in the data to provide this metric. In paup the bootstrap is nonparametric: each replicate is built by randomly sampling characters (alignment columns), with replacement, from the multiple sequence alignment. Each resampled data set undergoes a new phylogenetic inference, and the results over many replicates are compared to the original inference to give the percentage of replicates that support each part of it. Perform a bootstrap of your maximum parsimony search using each of the three types of branch swapping, TBR (tree bisection and reconnection), SPR (subtree pruning and regrafting), and NNI (nearest neighbor interchange), to get a sense of how the support for your initial tree and the run time vary.
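The resampling step of a nonparametric bootstrap can be sketched in Python as follows. Columns are drawn with replacement, and the same columns are used for every taxon so that each resampled character stays intact. The tree inference itself is left to a caller-supplied function; infer_clades is a hypothetical placeholder that is assumed to return the set of clades (as frozensets of taxon names) found for a data set. None of these names come from paup.

```python
import random
from collections import Counter


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    nchar = len(next(iter(alignment.values())))
    columns = [rng.randrange(nchar) for _ in range(nchar)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_support(alignment, infer_clades, n_reps=100, seed=1):
    """Percentage of replicates in which each clade is recovered.

    infer_clades is a placeholder for a real tree search; it must accept an
    alignment and return an iterable of clades (frozensets of taxon names).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_reps):
        replicate = bootstrap_replicate(alignment, rng)
        counts.update(set(infer_clades(replicate)))
    return {clade: 100.0 * n / n_reps for clade, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy alignment, not real data.
    aln = {"A": "ACGTAC", "B": "ACGTAA", "C": "TCGAAA"}
    print(bootstrap_replicate(aln, random.Random(7)))
```

Because every replicate requires a full tree search, the run time of a bootstrap is roughly the run time of one search multiplied by the number of replicates, which is why the choice of branch swapping matters so much here.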

Pseudo Random Numbers

If you automate analyses with any random component, it is very important that you provide a unique random number seed for each distinct random analysis.
True random number generators are rare.
Most programs use "pseudo-random number generators".
Pseudo-random number generators produce a sequence of numbers that seems to be random, and is effectively random for our purposes, but that is predictable if you know the seed number that was used.
It does not matter what seed number you use in PAUP (in phylip, the number must be odd), but it must be a different number for each run in which you want a different random sequence.
If you want to repeat a random run exactly, you can do so if you know the seed you used, so record the seed in your notes on your analyses.
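The behavior of seeds is easy to demonstrate with Python's standard random module, used here only as a stand-in for the generator inside any analysis program. The function name draw is chosen for illustration.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random integers from a generator started with seed."""
    rng = random.Random(seed)
    return [rng.randrange(1000) for _ in range(n)]


if __name__ == "__main__":
    print(draw(12345))  # one run
    print(draw(12345))  # same seed: exactly the same numbers again
    print(draw(54321))  # different seed: a different sequence
```

The same seed always reproduces the same sequence, which is what makes a recorded seed sufficient to repeat an analysis, and also why reusing one seed across runs that are meant to be independent defeats the purpose.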

Created: Wed Sep 15 00:58:22 EDT 2004 by Charles F. Delwiche
Last modified: Mon Nov 8 15:49:44 EST 2004 by Ashton Trey Belew.