Exercises
·Unix Introduction
·BLAST
·PERL
·Genbank
·BLAST, GCG
·GCG
·Seqlab
·Synthesis
·MSA
·Paup
·Phylogeny
·Examine


·An editor primer
·A GCG cheatsheet
·Flat2fasta homework
·Dynamic Programming homework
·High scoring words homework
·GCG homework
·Seqlab homework
·Mystery sequence homework
·Paup homework

Perl: The Pathologically Eclectic Rubbish Lister

Introduction   |   Data Types   |   Important Functions   |   Libraries   |   Regular Expressions   |   Conclusion   |   Further Practice   |   Homework   |   Links   |  
Prerequisites
Introduction

"There is more than one way to do it."
	  -- Larry Wall
    
Perl is a language which follows the above mantra in every aspect. It provides a garbage collector and type system while allowing the writer to (de)allocate memory or overload data types at will. It allows one to write nearly anywhere in the functional/procedural/oo spectrum. It provides one with the structure to write beautiful code while leaving the complexity to write obfuscated absurdity.
As this document continues, some functions or libraries will be in bold, one may wish to read the inline documentation for these libraries with the 'perldoc' command. For a function like open, one would perform "perldoc -f open" while for a library like CGI one would type "perldoc CGI," but the most important command one might use is "perldoc perl."

Getting Started

Many Perl programs start with some permutation of the following lines

#!/usr/bin/perl -w
use strict;
The first line starts with the Unix specific 'she-bang' #! (which you probably remember from working with the unix shell), which tells the computer to execute a new process using whatever follows as an interpreter and feed that interpreter everything from the second line until the end of the file. The -w argument for Perl tells it to print lots warnings for common programming errors. (See perlrun for more information.) A good alternative would be to do '#!/usr/bin/env perl' in case perl is not located in /usr/bin; /usr/bin/env should find it for you.
use strict; is a precompiler directive which forces the programmer to follow lots of good programming practices like setting the scope of all variables or not allowing one to use variables until they have been specified. In more complicated pieces of code, this has the added benefit of speeding up code slightly.

Data Types and some provided variables

The best ways to learn about Perl's data types is to play with them, but reading perlintro will not hurt.
A scalar is defined as a single piece of information. A scalar may be a number, character, string, or memory address of something more complicated. In Perl, all scalars are delineated with the '$' character (comments start with '#'):

Scalars and Arrays

Code Listing

## My first program?
my $output = 'Hello world';   ### This is not very original
## We will talk about $variable{keyword} soon
print "$output,
as written by $ENV{USER}\n";
The 'my' keyword tells Perl the scope of $output, which is to say that it 
defines the lifetime of $output.  Write your own Hello world program,
experiment with the quotations to learn the difference between single and double
quotes.
For example: '$output' versus "$output"
An array is a list of scalars.  Thus they may be anything from a list of
numbers, strings, or memory addresses of other things including other scalars,
arrays, etc.  Example:

## All arrays start with @
my @animals = ('parrot', 'python', 'camel');
An array is a piece of data which contains a list of other data.  This data
might include scalars (like $output above), other arrays, or even references
(pointers) to scalars or arrays, or even more complicated pieces of data.

## Arrays are 0 indexed, just like in C; if you want a value
## from an array, remember that it is a scalar
print "I like @animals, but especially $animals[0]\n";
@animals prints the entire list, while $animals[0] prints the first entry in the
list.

$animals[4] = 'elephant';
## If the 5th element of animals is 'elephant,' what is the fourth?
print "What is the fourth element: $animals[3]\n";
In Perl it is quite easy to change the elements of existing arrays or add new 
elements.  Take the above code and run it so that errors and output are 
redirected into two separate files.  If you do not recall how to do so, check
back to the previous introduction to unix.

In the above code fragment I created an array and an array reference (often called a list). There are many ways to add more information to an array and access the information kept in an array, some of them are above.
Add some code to your program to keep a list of the accession numbers of three genes which interest you and print them. In the highly unlikely event you do not have three genes immediately in mind, go to genbank in order to explore and also to find their accession numbers.

Hashes

A hash is a series of keys and values. Each key is unique and points to a single scalar (but we know that scalars may be complex). Thus hashes provide an easy way to keep data organized. Hashes provide some interesting advantages over arrays: if one has an array of 10,000,000 elements and one wants to read the 9,990,999th element, Perl must iterate through all of the array until it arrives at the correct element. The hashing functions of Perl can potentially find a particular element out of 10,000,000 in far fewer iterations. For example:

Code Listing

## Hashes are delineated with the '%' and key value pairs are separated
## with commas
my $keyname = 'orange';  ## Just another scalar
my %colors = ( green => 'kiwi',
## and may have strings for values
  yellow => 'banana',
## or lists... 
  red => [ 'strawberry', 'apple' ],
## keys may also be arbitrarily complex, but be careful
 'bigger key' => { funky => 'town', },
## keys may also be variables
  "$keyname" => 'citrus',
);
## Add another value:
$colors{blue} = 'blueberry';
The above code is quite complex.  Lets break it down one line at a time.
The actual hash consists of everything from the first '(' until the final ');'
Each key value pair is separated by either '=>' or a comma (I prefer the more
visible '=>').
The hash above consists of four elements.  The first key listed is 'green' and
holds the value 'kiwi.'  This is followed by yellow and banana.  Red is more 
interesting, it contains a list of red fruits.  Finally, 'bigger key' contains a
hash reference.  More on those shortly.
One is not limited to using static names as $keyname illustrated.
The final line is a separate declaration which adds a key for the color blue.
Therefore, the hash contains four scalars, a list, and a hash reference.

my %log = ();  ## Creating an empty hash
my $number = 1;
while ($number < "10000000.0") { ## It is so easy to switch between integers and
  ## floating numbers in Perl
  $log{$number} = log($number);  ## Fill out this hash with natural logs
}
print "$log{9465306}";

        

I leave it as an exercise for the reader to figure out how to turn this into an array of natural logarithms and print out the 9.5 millionth.

Functions

The function is the fundamental means to of defining ways to manipulate data. For those who like the object oriented sauce, functions are also the means of writing in an object oriented style. Run perldoc perlboot for more information.

Code Listing


## A function to calculate the number of hydrogen bonds in a sequence
my $sequence = 'ATGCGCTTAAAGAAAGGGTAA';
sub Count_Hydrogen {
  ## sub makes the function
  my $sequence = shift;  ## among its other uses, shift may be used to create
  ## the arguments for a function
  my @seq = split(//, $sequence);  ## split is a quick way to make an array from
  ## a string
  my $bonds = 0;
  foreach my $base (@seq) {  ## this is an idiom to iterate over everything in a
    ## list
    if ($base eq 'A' or $base eq 'T' or $base eq 'U') {
    ## if statements, very useful.
      $bonds = $bonds + 2;
    }
    elsif ($base eq 'C' or $base eq 'G') {
      $bonds = $bonds + 3;
    }
    else {
      die("Bases can only be A,U,T,G,C!");
    }
  }
  return($bonds);  ## most functions have to return something
}

References

As a programming biologist, it may prove important to manipulate large lists of sequences or perhaps or massive hashes. Imagine pushing a list of 356,000,000,000+ characters through many separate manipulations and functions. What will happen if one creates ten functions and passes this complete list to all of them? Perl is quite happy to push this complete list around. Before it finishes, the computer may strangle the user and jump out of the nearest window. There is a better way.

Code Listing


## Start with hashes!  Lets explore the multiple methods of creating hashes and
## hash refs.
## Two ways of making a hash
my %restriction = ( BamHI => 'G^GATCC' ,
                    NotI => 'GC^GGCCGC',
	                RsaI => 'GT^AC',  ##  Least restrictive enzyme _ever_
  );
## Ready to make a reference to %restriction!?
my $restriction_ref = \%restriction;
my %more = ();  ## Start with an empty hash
$more{XhoI} = 'C^TCGAG';  ## Now it has 1 entry
$more{AluI} = 'AG^CT';    ## Maybe not the least
### Now lets make some references to hashes instead.
my $codons = {  ## Note the use of the dollar sign and the curly brace instead
			    ## of parenthesis.
  Alanine => [ 'GCA', 'GCC', 'GCG', 'GCU' ],  ## haha a reference in a 
## reference!
  Methionine => 'AUG',
  Proline => [ 'CCA', 'CCC', 'CCG', 'CCU' ], };
my $more_codons = {};  ## Also empty
$more_codons->{Lysine} = [ 'AAA', 'AAG' ];
$more_codons->{Isoleucine} = [ 'AUA', 'AUC', 'AUU' ];
$more_codons->{Tryptophan} = 'UGG';
### Now lets print some pieces of our references
print "Methioine is: $codons->{Methionione}
Tryptophan is: $more_codons->{Tryptophan}\n";
### That was easy, and now...
print "The second Isoleucine codon (ordered how!?) is:
$more_codons->{Isoleucine}->[1]\n";
### Print them all.
foreach my $amino ( keys %{$codons} ) {  ## %{$hash} is a way to dereference
  if (ref($codons->{$amino}) eq 'ARRAY') {  ## Test for a reference to an array
	print "The codons for $amino are: @{$codons->{$amino}})\n"; ## @{$hash} is 
##    another deref
  }
  else {
    print "The codon for $amino are: $codons->{$amino}\n";
  }
}
## Before I forget, what if we want to know the location of a reference?
print "What is $codons\n";  ## This will print something like: HASH(0x81683c8)
## Random aside: On a 64 bit computer this number looks slightly different.
Yeah, references are not the simplest concept in the world.  They are powerful.
The key points to remember are:
the escape character (\) will turn a variable into a reference.
pulling out a piece of a reference is done with the arrow '->' operator
pulling out the whole thing is done with @{$something} for arrays %{$something}
for hashes and $$something for scalars.

## Now let us make an array reference, first an array
my @nucleotides = ('A','T','G','C');
## Easy way to make a reference
my $nuc_ref = \@nucleotides;
## Another way
my $slippery = [ 'AAAAAAU', 'AAAUUUG' ];
## Here too the notation is slightly different.
## Two ways to acquire the elements of a reference:
foreach my $nuc (@{$nuc_ref}) {
  print "The nucleotide is $nuc\n";
}
## OR
print "Tell me $slippery->[1]\n";  ## The 
    second
element.

Add some code to your program which uses the accession numbers you found a few minutes ago and keys them to their sequences in genbank. Print the sequences sorted by accession.

Some basic and important functions

Every function here is accessible through Perl's online documentation. Use it via perldoc -f functionname

Code Listing

split
my $sentence = 'Violence is the last refuge of the incompetent.';
## Create an array out of every space separated word in the sentence.
my @word_list = split(/ /, $sentence);
## Now copy and sort it...
my @word_copy = sort(@word_list);
## print the first and last words...
my $num_elements = scalar(@word_list);
print "First word: $word_list[0], last word: $word_list[$#word_list]
first word alphabetically: $word_copy[0] and last: $word_copy[$#word_copy]
how many elements?  $num_elements\n";
One nice thing about printing in Perl as opposed to other languages includes
the fact that it allows one to have newlines in the middle of the code.

for(each) loops
for my $index (0 .. $#word_list) {
  print "The $index element is $word_list[$index]\n";
}
## OR, we can do this:
foreach my $element (@word_list) {
  print "The element is $element\n";
}
The .. operator is a nice shortcut for iterating from point a to point b
$#word_list returns the number of the last element of an array
When you do not care about keeping the number of the elements
of a list, foreach is quite useful

if statements
my $input = <STDIN>;
chomp $input;
my %passwords = ( superuser => 'rootpass',
  normaluser => 'userpass',
 );
if ($input eq $passwords{superuser}) {
  print "Welcome, super user!\n";
}
elsif ($input eq $passwords{normaluser}) {
  print "Welcome, normal user!\n";
}
else {
  die("You are a bad bad man.");
}
STDIN is a mechanism for accessing keyboard input among other things

while loops
my $counter = 10;
while ($counter >= 0) {
  print "You have $counter seconds!\n";
  $counter = $counter - 1;
  sleep(1);
}
die("Boom!");

open and close
open(FILEHANDLE, "sequence1.fasta") or
			  die("I could not open sequence1.fasta $!");
while(my $line = <FILEHANDLE>) {
  if ($line =~ m/^\>/) { print "Comment: $line\n" }
  else { print "Data: $line\n"; }
}
close(FILEHANDLE);
open(PROGHANDLE, "ps aux |") or
			  die("I could not run the ps program. $!);
my %pids = ();
while(<PROGHANDLE>) {
  my $line = $_;
  chomp $line;
  next if ($line =~ m/^USER/);
  my @information = split(/\t/, $line);
  print "The command, run by user: $information[0] is\
$information[$#information]\n";
  $pid{$information[1]} = $information[$#information];
}
close(PROGHANDLE);
Run perldoc -f open to learn about this one

Using External Libraries

According to Larry Wall, one of the biggest virtues of a programmer is laziness. In that spirit it is often helpful to be able to use someone else's work in your own programs. Furthermore, the Perl community has banded together to facilitate such laziness with CPAN, the Comprehensive Perl Archive Network. Using the resources it provides, it is easy to immediately query just what functionality is available in external modules. One must be able to use such functionality, so here are a couple of examples:

Code Listing


## Here is an object oriented approach to writing interactive web pages
## The use directive compiles in the CGI module before running the
## rest of the script
use CGI;
## new is a function (or method) built into the CGI package, it
## creates an instance of the CGI object so that we may use/modify it
my $page = new CGI;
## In this case the -> is telling Perl to use the header() method provided by
## CGI, thus if some other piece of code also defines header() the two will
## not cause trouble with one another.
print $page->header;
print $page->start_html("A page!");
print $page->end_html;


Here is a non object oriented example:

Code Listing


use Data::Dumper;
## First make a data structure
my $boo = {'first' => 1, 'second' => 2};
## Now we can set some variables used by the Data Dumper
$Data::Dumper::Terse = 1;          # don't output names where feasible
$Data::Dumper::Indent = 0;         # turn off all pretty print
## Finally, use the Data Dumper to print our data structure
print Dumper($boo), "\n";


And a more practical example:

Code Listing


## BPlite is a parser written to make accessing the output
## from BLAST easier
use Bio::Tools::BPlite;
## We need to feed it some input
open(INPUT, "<some_blast_output.txt");
## The \* means we are feading BPlite a pointer to the filehandle
my $report = new Bio::Tools::BPlite(-fh=>\*INPUT);
## nextSbjct is a method provided by Bio::Tools::BPlite
while (my $subject = $report->nextSbjct) {
## As are name, score, bits, P, and sbjctSeq.
        my $name = $subject->name;
        my $hsp = $subject->nextHSP();
	my $score = $hsp->score();
	my $bitscore = $hsp->bits();
	my $pvalue = $hsp->P();
	my $subject_sequence = $hsp->sbjctSeq();
## Once we have some data, print it!
        print "Some stats from $name:  Score: $score,
Bitscore: $bitscore, P value: $pvalue,
sequence: $subject_sequence\n";
}


Regular Expressions

Regular expressions consist of a whole family of methods to take some input and rules and return some output. The input consists of a string of characters while the rules provide a series of elements to match and/or substitute. Regular expressions are simultaneously one of the most useful and difficult topics when dealing with large quantities of text. Here are a couple of examples, read perlre for more.

Code Listing

## My apologies to Lewis Carroll
my $string = 'Beware the Jubjub bird, and shun the frumious Bandersnatch!';
## =~ tells Perl to bind a scalar value to a pattern match
## In this case, the m/ tells Perl that it will perform a match with no other
## funny buisness.  Thus it should return any input which has the search string
## 'andersn' in it 
## slashes (/) are commonly used to delineate the beginning and end of rules.

if ($string =~ m/andersn/) {  ## Found in the word Bandersnatch!
  print "I found the string in $string\n";
}
else {
  print "I did not find the string in $string\n";
}
## You are allowed to use variables in your regular expressions.

my $search = 'funky';
if ( m/$search/) {
  print "I found the string 'funky' in $string\n";
}
else {
  print "I did not find the string 'funky' in $string\n";
}
##  copy the string to $output

my $output = $string;
## And perform a translation from all a...z to A...Z
$output =~ tr/a-z/A-Z/;
print "Now it is $output\n";
## Copy over $output back with the original string
$output = $string;
$output =~ s/^([^ ]*) *([^ ]*)/$2 $1/;
## Perform something a bit more complicated...
## The ^ character specifies the beginning of line or match
## Any * character tells Perl to match 0 or more occurrences of the pattern
## Any + matches on 1 or more
## {} can be used to specify a maximum and/or minimum number of matches
## Parentheses () make logical groups, each group fills 
## pre-named variables $1 .. $n
## (a (b) (c)) -- $1 gets all of a b c, $2 gets only b, $3 gets c
## Square brackets make a character based group or 'class', thus
## [a-z] consists of the group of characters from a ... z, in this case the group 
## consists of some text followed by 0 or more spaces, separated by spaces
## Each one of these groups gets a $1, $2 etc.
## The match rule terminates at the second /, then the print rule tells Perl
## to print the second word followed by a space, then the first word...

print "Now output is: $output\n";
my $last =  "The food is under the bar in the barn.";
if ( m/foo(.*)bar/ ) {
     print "got _$1_\n";
}
## Will this print out:
#  got _d is under the _
# or
#  got _d is under the bar in the _
# ?

Conclusion

Hopefully this introduction to Perl provides a sense of how quickly one might manipulate large quantities of text. Furthermore Perl allows the user to instantly create arbitrarily complex data structures for those many occasions when the biologist needs to gather information, munge it, and spit it back out.
Perl also has a fairly impressive list of modules designed to facilitate working with sequence data.

Further Practice

Interesting problems with which to practice your new Perl skills:
1. Create a hash which will allow you to translate from amino acids to codons and vice versa.
2. Write a program which uses this hash to translate a nucleotide sequence from NCBI.
3. Create the program "revcomp" which translates a sequence into its reverse complement.

The homework

Write a program which reads the NCBI flat file format for a sequence and writes out a file in the fasta format. The output file should contain a useful subset of the information provided by the flat file including: size, locus (if available), organism, accession, and description.

Links

Some other links and tutorials in no particular order


Created: Wed Sep 15 00:58:22 EDT 2004 by Charles F. Delwiche
Last modified: Mon Nov 8 15:49:44 EST 2004 by Ashton Trey Belew.