|
Perl: The Pathologically Eclectic Rubbish Lister
Introduction
|
Data Types
|
Important Functions
|
Libraries
|
Regular Expressions
|
Conclusion
|
Further Practice
|
Homework
|
Links
|
Prerequisites
Introduction
"There is more than one way to do it."
-- Larry Wall
Perl is a language which follows the above mantra in every aspect. It
provides a garbage collector and type system while allowing the writer to
(de)allocate memory or overload data types at will. It allows one to write
nearly anywhere in the functional/procedural/oo spectrum. It provides one
with the structure to write beautiful code while leaving the complexity to
write obfuscated absurdity.
As this document continues, some functions or libraries will be in
bold, one may wish to read the inline documentation for these
libraries with the 'perldoc' command. For a function like open, one would
perform "perldoc -f open" while for a library like CGI one would type
"perldoc CGI," but the most important command one might use is "perldoc
perl."
Getting Started
Many Perl programs start with some permutation of the following lines
#!/usr/bin/perl -w
use strict;
The first line starts with the Unix specific 'she-bang' #! (which you probably
remember from working with the unix shell), which tells the computer to execute a
new process using whatever follows as an interpreter and feed that interpreter
everything from the second line until the end of the file. The -w argument for Perl
tells it to print lots warnings for common programming errors. (See perlrun
for more information.) A good alternative would be to do '#!/usr/bin/env perl' in
case perl is not located in /usr/bin; /usr/bin/env should find it for you.
use strict; is a precompiler directive which forces the programmer to follow lots of
good programming practices like setting the scope of all variables or not allowing one
to use variables until they have been specified. In more complicated pieces of code,
this has the added benefit of speeding up code slightly.
Data Types and some provided variables
The best ways to learn about Perl's data types is to play with them, but reading
perlintro will not hurt. A scalar is defined as a single piece of
information. A scalar may be a number, character, string, or memory address of
something more complicated. In Perl, all scalars are delineated with the '$'
character (comments start with '#'):
Scalars and Arrays
Code Listing |
## My first program?
my $output = 'Hello world'; ### This is not very original
## We will talk about $variable{keyword} soon
print "$output,
as written by $ENV{USER}\n";
## All arrays start with @
my @animals = ('parrot', 'python', 'camel');
## Arrays are 0 indexed, just like in C; if you want a value
## from an array, remember that it is a scalar
print "I like @animals, but especially $animals[0]\n";
$animals[4] = 'elephant';
## If the 5th element of animals is 'elephant,' what is the fourth?
print "What is the fourth element: $animals[3]\n";
|
In the above code fragment I created an array and an array reference (often called
a list). There are many ways to add more information to an array and access the
information kept in an array, some of them are above.
Add some code to your program to keep a list of the accession numbers
of three genes which interest you and print them. In the highly unlikely
event you do not have three genes immediately in mind, go to genbank in order to
explore and also to find their accession numbers.
Hashes
A hash is a series of keys and values. Each key is unique and points to a single
scalar (but we know that scalars may be complex). Thus hashes provide an easy way to
keep data organized. Hashes provide some interesting advantages over arrays: if
one has an array of 10,000,000 elements and one wants to read the 9,990,999th element,
Perl must iterate through all of the array until it arrives at the correct element.
The hashing functions of Perl can potentially find a particular element out of
10,000,000 in far fewer iterations. For example:
Code Listing |
## Hashes are delineated with the '%' and key value pairs are separated
## with commas
my $keyname = 'orange'; ## Just another scalar
my %colors = ( green => 'kiwi',
## and may have strings for values
yellow => 'banana',
## or lists...
red => [ 'strawberry', 'apple' ],
## keys may also be arbitrarily complex, but be careful
'bigger key' => { funky => 'town', },
## keys may also be variables
"$keyname" => 'citrus',
);
## Add another value:
$colors{blue} = 'blueberry';
my %log = (); ## Creating an empty hash
my $number = 1;
while ($number < "10000000.0") { ## It is so easy to switch between integers and
## floating numbers in Perl
$log{$number} = log($number); ## Fill out this hash with natural logs
}
print "$log{9465306}";
|
I leave it as an exercise for the reader to figure out how to turn this
into an array of natural logarithms and print out the 9.5 millionth.
Functions
The function is the fundamental means to of defining ways to manipulate
data. For those who like the object oriented sauce, functions are also the means
of writing in an object oriented style. Run perldoc perlboot for
more information.
Code Listing |
## A function to calculate the number of hydrogen bonds in a sequence
my $sequence = 'ATGCGCTTAAAGAAAGGGTAA';
sub Count_Hydrogen {
## sub makes the function
my $sequence = shift; ## among its other uses, shift may be used to create
## the arguments for a function
my @seq = split(//, $sequence); ## split is a quick way to make an array from
## a string
my $bonds = 0;
foreach my $base (@seq) { ## this is an idiom to iterate over everything in a
## list
if ($base eq 'A' or $base eq 'T' or $base eq 'U') {
## if statements, very useful.
$bonds = $bonds + 2;
}
elsif ($base eq 'C' or $base eq 'G') {
$bonds = $bonds + 3;
}
else {
die("Bases can only be A,U,T,G,C!");
}
}
return($bonds); ## most functions have to return something
}
|
References
As a programming biologist, it may prove important to manipulate large lists of
sequences or perhaps or massive hashes. Imagine pushing a list of 356,000,000,000+
characters through many separate manipulations and functions. What will happen if
one creates ten functions and passes this complete list to all of them? Perl is
quite happy to push this complete list around. Before it finishes, the computer may
strangle the user and jump out of the nearest window. There is a better way.
Code Listing |
## Start with hashes! Lets explore the multiple methods of creating hashes and
## hash refs.
## Two ways of making a hash
my %restriction = ( BamHI => 'G^GATCC' ,
NotI => 'GC^GGCCGC',
RsaI => 'GT^AC', ## Least restrictive enzyme _ever_
);
## Ready to make a reference to %restriction!?
my $restriction_ref = \%restriction;
my %more = (); ## Start with an empty hash
$more{XhoI} = 'C^TCGAG'; ## Now it has 1 entry
$more{AluI} = 'AG^CT'; ## Maybe not the least
### Now lets make some references to hashes instead.
my $codons = { ## Note the use of the dollar sign and the curly brace instead
## of parenthesis.
Alanine => [ 'GCA', 'GCC', 'GCG', 'GCU' ], ## haha a reference in a
## reference!
Methionine => 'AUG',
Proline => [ 'CCA', 'CCC', 'CCG', 'CCU' ], };
my $more_codons = {}; ## Also empty
$more_codons->{Lysine} = [ 'AAA', 'AAG' ];
$more_codons->{Isoleucine} = [ 'AUA', 'AUC', 'AUU' ];
$more_codons->{Tryptophan} = 'UGG';
### Now lets print some pieces of our references
print "Methioine is: $codons->{Methionione}
Tryptophan is: $more_codons->{Tryptophan}\n";
### That was easy, and now...
print "The second Isoleucine codon (ordered how!?) is:
$more_codons->{Isoleucine}->[1]\n";
### Print them all.
foreach my $amino ( keys %{$codons} ) { ## %{$hash} is a way to dereference
if (ref($codons->{$amino}) eq 'ARRAY') { ## Test for a reference to an array
print "The codons for $amino are: @{$codons->{$amino}})\n"; ## @{$hash} is
## another deref
}
else {
print "The codon for $amino are: $codons->{$amino}\n";
}
}
## Before I forget, what if we want to know the location of a reference?
print "What is $codons\n"; ## This will print something like: HASH(0x81683c8)
## Random aside: On a 64 bit computer this number looks slightly different.
## Now let us make an array reference, first an array
my @nucleotides = ('A','T','G','C');
## Easy way to make a reference
my $nuc_ref = \@nucleotides;
## Another way
my $slippery = [ 'AAAAAAU', 'AAAUUUG' ];
## Here too the notation is slightly different.
## Two ways to acquire the elements of a reference:
foreach my $nuc (@{$nuc_ref}) {
print "The nucleotide is $nuc\n";
}
## OR
print "Tell me $slippery->[1]\n"; ## The element.
|
Add some code to your program which uses the accession numbers you
found a few minutes ago and keys them to their sequences in genbank.
Print the sequences sorted by accession.
Some basic and important functions
Every function here is accessible through Perl's online documentation.
Use it via perldoc -f functionname
Code Listing |
split
my $sentence = 'Violence is the last refuge of the incompetent.';
## Create an array out of every space separated word in the sentence.
my @word_list = split(/ /, $sentence);
## Now copy and sort it...
my @word_copy = sort(@word_list);
## print the first and last words...
my $num_elements = scalar(@word_list);
print "First word: $word_list[0], last word: $word_list[$#word_list]
first word alphabetically: $word_copy[0] and last: $word_copy[$#word_copy]
how many elements? $num_elements\n";
for(each) loops
for my $index (0 .. $#word_list) {
print "The $index element is $word_list[$index]\n";
}
## OR, we can do this:
foreach my $element (@word_list) {
print "The element is $element\n";
}
if statements
my $input = <STDIN>;
chomp $input;
my %passwords = ( superuser => 'rootpass',
normaluser => 'userpass',
);
if ($input eq $passwords{superuser}) {
print "Welcome, super user!\n";
}
elsif ($input eq $passwords{normaluser}) {
print "Welcome, normal user!\n";
}
else {
die("You are a bad bad man.");
}
while loops
my $counter = 10;
while ($counter >= 0) {
print "You have $counter seconds!\n";
$counter = $counter - 1;
sleep(1);
}
die("Boom!");
open and close
open(FILEHANDLE, "sequence1.fasta") or
die("I could not open sequence1.fasta $!");
while(my $line = <FILEHANDLE>) {
if ($line =~ m/^\>/) { print "Comment: $line\n" }
else { print "Data: $line\n"; }
}
close(FILEHANDLE);
open(PROGHANDLE, "ps aux |") or
die("I could not run the ps program. $!);
my %pids = ();
while(<PROGHANDLE>) {
my $line = $_;
chomp $line;
next if ($line =~ m/^USER/);
my @information = split(/\t/, $line);
print "The command, run by user: $information[0] is\
$information[$#information]\n";
$pid{$information[1]} = $information[$#information];
}
close(PROGHANDLE);
|
Using External Libraries
According to Larry Wall, one of the biggest virtues of a programmer is
laziness. In that spirit it is often helpful to be able to use someone else's
work in your own programs. Furthermore, the Perl community has banded
together to facilitate such laziness with
CPAN, the Comprehensive Perl Archive
Network. Using the resources it provides, it is easy to immediately query just
what functionality is available in external modules. One must be able to use
such functionality, so here are a couple of examples:
Code Listing |
## Here is an object oriented approach to writing interactive web pages
## The use directive compiles in the CGI module before running the
## rest of the script
use CGI;
## new is a function (or method) built into the CGI package, it
## creates an instance of the CGI object so that we may use/modify it
my $page = new CGI;
## In this case the -> is telling Perl to use the header() method provided by
## CGI, thus if some other piece of code also defines header() the two will
## not cause trouble with one another.
print $page->header;
print $page->start_html("A page!");
print $page->end_html;
|
Here is a non object oriented example:
Code Listing |
use Data::Dumper;
## First make a data structure
my $boo = {'first' => 1, 'second' => 2};
## Now we can set some variables used by the Data Dumper
$Data::Dumper::Terse = 1; # don't output names where feasible
$Data::Dumper::Indent = 0; # turn off all pretty print
## Finally, use the Data Dumper to print our data structure
print Dumper($boo), "\n";
|
And a more practical example:
Code Listing |
## BPlite is a parser written to make accessing the output
## from BLAST easier
use Bio::Tools::BPlite;
## We need to feed it some input
open(INPUT, "<some_blast_output.txt");
## The \* means we are feading BPlite a pointer to the filehandle
my $report = new Bio::Tools::BPlite(-fh=>\*INPUT);
## nextSbjct is a method provided by Bio::Tools::BPlite
while (my $subject = $report->nextSbjct) {
## As are name, score, bits, P, and sbjctSeq.
my $name = $subject->name;
my $hsp = $subject->nextHSP();
my $score = $hsp->score();
my $bitscore = $hsp->bits();
my $pvalue = $hsp->P();
my $subject_sequence = $hsp->sbjctSeq();
## Once we have some data, print it!
print "Some stats from $name: Score: $score,
Bitscore: $bitscore, P value: $pvalue,
sequence: $subject_sequence\n";
}
|
Regular Expressions
Regular expressions consist of a whole family of methods to take some
input and rules and return some output. The input consists of a string of
characters while the rules provide a series of elements to match and/or
substitute. Regular expressions are simultaneously one of the most useful
and difficult topics when dealing with large quantities of text.
Here are a couple of examples, read perlre for more.
Code Listing |
## My apologies to Lewis Carroll
my $string = 'Beware the Jubjub bird, and shun the frumious Bandersnatch!';
if ($string =~ m/andersn/) { ## Found in the word Bandersnatch!
print "I found the string in $string\n";
}
else {
print "I did not find the string in $string\n";
}
my $search = 'funky';
if ( m/$search/) {
print "I found the string 'funky' in $string\n";
}
else {
print "I did not find the string 'funky' in $string\n";
}
my $output = $string;
## And perform a translation from all a...z to A...Z
$output =~ tr/a-z/A-Z/;
print "Now it is $output\n";
## Copy over $output back with the original string
$output = $string;
$output =~ s/^([^ ]*) *([^ ]*)/$2 $1/;
print "Now output is: $output\n";
my $last = "The food is under the bar in the barn.";
if ( m/foo(.*)bar/ ) {
print "got _$1_\n";
}
|
Conclusion
Hopefully this introduction to Perl provides a sense of how quickly
one might manipulate large quantities of text. Furthermore Perl
allows the user to instantly create arbitrarily complex data structures
for those many occasions when the biologist needs to gather information,
munge it, and spit it back out.
Perl also has a fairly impressive list of modules designed to facilitate
working with sequence data.
Further Practice
Interesting problems with which to practice your new Perl skills:
1. Create a hash which will allow you to translate from amino acids to codons and vice versa.
2. Write a program which uses this hash to translate a nucleotide sequence from NCBI.
3. Create the program "revcomp" which translates a sequence into its reverse complement.
The homework
Write a program which reads the NCBI flat file format for a sequence and
writes out a file in the fasta format. The output file should contain a useful
subset of the information provided by the flat file including: size, locus (if
available), organism, accession, and description.
Links
Some other links and tutorials in no particular order
|