An Introduction to Unix like Operating Systems

Prerequisites   |   Introduction   |   The Shell   |   The Filesystem   |   Pipes   |   Job Control   |   Shell Scripting
Prerequisites
  • Patience.
  • A Computer with a Unix like operating system.
  • A Bourne shell (in order to perform the exercises as listed).

A note about documentation: As these documents progress, a number of commands will appear in bold face. In every instance the reader should be able to use the command as provided, or to make use of the online documentation in order to become competent with it.
These documents sometimes assume the user is sitting before an Apple computer. The examples given in this text follow the format provided by the Gentoo Documentation.
All examples in this document assume the reader is using the 'bash' shell.

Introduction

As the quantity of biological information continues to increase at rates greater than Moore's Law, it becomes increasingly important for a modern biologist to make use of the most efficient tools available. The Unix family of operating systems grew out of a single individual's desire to make use of an otherwise deprecated computer; as a result these systems have a legacy of simplicity and speed.

In the 1960s, AT&T's Bell Labs joined in building an operating system, known as Multics, to run on mainframes; after a short time Bell Labs pulled out of the project. One of the original developers, Ken Thompson, then needed work, and so he "found a small computer (a Digital Equipment Corp. PDP-7) on which he began developing space related programs (satellite orbit predictors, lunar calendars, space war games, etc.)." [*] Shortly thereafter he rewrote the system so that multiple people could work on it simultaneously. Brian Kernighan saw this and suggested the name Unics, which was quickly shortened to UNIX. The system found steady adoption within Bell Labs; the programming language C was written for UNIX (and later used to re-write UNIX itself). The system was simple enough and small enough that programmers all over the country ported UNIX to other architectures and added new features. Over time the UNIX trademark passed from company to company, until now there is no single system which can properly be called 'UNIX'; instead each implementation is a flavor of UNIX. To learn more about the history of the UNIX family of operating systems, please explore: http://www.levenez.com/unix/

The Shell

The shell is the fundamental method of interacting with a Unix like operating system. Neal Stephenson wrote an interesting exposition about why this is the case, entitled "In the Beginning Was the Command Line."
When the shell is first opened, it may provide the user with something like this:

Code Listing: A Shell

$ 

This is the shell. It provides the user a simple and consistent interface to most commands on the system. Running a shell command takes the form "COMMAND ARGUMENTS INPUT." There are a few caveats: case matters; the arguments generally take the form "--switch option" or "-s option"; and most commands have simple built-in help accessible via "--help" or "-h" which enumerates the available options. The most important command for a new user is the manual, man(1).

Code Listing

$ man man
man(1)                                 Manual pager utils                                 man(1)

NAME
       man - an interface to the on-line reference manuals

SYNOPSIS
       man  [-c|-w|-tZ]  [-H[browser]]  [-T[device]]  [-adhu7V]  [-i|-I]  [-m  system[,...]] [-L
       locale] [-p string] [-C file] [-M path] [-P pager] [-r prompt] [-S list]  [-e  extension]
       [[section] page ...] ...
       man  -l [-7] [-tZ] [-H[browser]] [-T[device]] [-p string] [-P pager] [-r prompt] file ...
       man -k [apropos options] regexp ...
       man -f [whatis options] page ...
(continued)

The beginning of the manual.

There are many different types of shells, including but not limited to: sh, bash, ash, csh, tcsh, ksh, zsh, psh; each shell has its own syntax, variable system, and oddities, but they all share at the very least three extremely important streams: STDIN, STDOUT, and STDERR. These streams carry the shell's input, output, and errors. For example:

Code Listing

$ ls
exercise1.shtml  header.html  index.html  index.shtml  stylesheet.css  template.shtml
I typed the ls(1) command in order to view the document I 
am currently editing.
$ ls > ls.output
$
The '>' tells the shell to redirect its STDOUT to the file
named ls.output.
$ ls 
exercise1.shtml  header.html  index.html  index.shtml  ls.output  stylesheet.css  template.shtml
Thus the new file, ls.output appeared.
$ more ls.output 
exercise1.shtml
header.html  
index.html  
index.shtml  
stylesheet.css  
template.shtml
more(1) is a command used to read simple text.  
In this case I am reading the output from my earlier 'ls' command.
$ ls -l funkytown index.html 2>ls.error 1>ls.output
$
This example is more difficult.  I am using the '-l'
option of ls(1) and asking it to give me a listing of two files, one
named 'funkytown' (which does not exist) and another 'index.html,' which does.

$ more ls.error
funkytown: No such file or directory
The file ls.error contains the text of the error
reported by ls when we asked it to list a nonexistent file.
$ more ls.output
-rw-r--r--  1 trey staff   57 Jan 31 18:47 index.html
While ls.output now contains considerably more 
information about a single file.  Reading the line left to right it tells the
following:  The permissions of the file, the number of links pointing to its
inode, the user which owns the file, the group which owns the file, its size,
the date and time the file was last modified, and finally its name.  (More on
that later)

As a challenge: find which three single letter alphabetical arguments are _not_ valid switches for ls.

There are two types of variables when working with a Unix shell: shell (sometimes called instance) variables and environment variables. By convention environment variables consist of all capital letters while shell variables do not. Environment variables are passed from the process which sets them down to every subprocess it starts. Shell variables, on the other hand, exist only during the lifetime of the shell which set them.

Code Listing

$ echo "Hello $USERNAME"
Hello trey
echo(1) does just that.  In this case I asked echo
to print the USERNAME environment variable.
$ echo $nonesuch
$
$nonesuch does not exist and so nothing is printed.
$ test='a new variable'
$
I asked the shell to create a new instance variable
test with the value 'a new variable'
$ echo $test
a new variable
Now the shell returns the value of test.  The '$' tells
the shell that the following text is a variable.  One may also use 'echo ${test}'.
$ bash
$
I started a new bash shell; shell variables such
as test are not passed down to it.
$ echo $test
$
This new shell knows nothing about test
$ exit
$
So I logged out of that shell...
$ export TEST='an environment variable'
$
in order to create a new environment variable...
$ echo $TEST
an environment variable
echo does just what we expect...
$ bash
$
so I started another new shell...
$ echo $TEST
an environment variable
and this time the value of TEST got passed down to the
new shell, and indeed it will be passed on to every process I start from this
shell.

If you are using the [t]c-shell, the required commands to do the same thing are: set and setenv respectively.

On the computer I am using, the command prompt looks like this:

(19:38:08)trey@sedition:~/docs/>
and is defined by the environment variable PS1; look up in bash(1) how to change this variable to something more interesting than just '$.'
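As a sketch, one possible prompt is given below; the particular escape sequences chosen here are only one possibility (they are all documented in bash(1) under PROMPTING):

```shell
# \t = current time, \u = user name, \h = host name, \w = working directory.
# Exporting PS1 changes the prompt of the current shell immediately.
export PS1='(\t)\u@\h:\w> '
```

Placing such a line in ~/.bashrc makes the change permanent for every new shell.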

The Filesystem

Unix like operating systems have a hierarchical filesystem with a single root called '/' (unlike other systems). Every accessible disk, floppy, pseudo-filesystem (to be explained later), etc. is grafted or 'mount(8)ed' upon this tree. There are few rules regarding the layout of files in this tree, but many customs. Here are a few: etc is for configuration information, bin for binaries, sbin for super user binaries, var for variable information (logs, spools, etc.), lib for libraries, tmp for temporary information, and share for information shared among programs (things like images, icons, or data files). Every segment of the filesystem has its own permissions, viewable via the 'ls' command. Use the manual to find the correct switches to ls to see the permissions of the files in your home directory. Follow along for a short tour of a Unix system: (Note: users of Apple OS X will find that the names of the assorted directories are changed to their non-abbreviated cognates: Library instead of lib, Applications instead of bin, etc...)

Open a new Shell

$ pwd
/home/trey
pwd stands for 'print working directory' and tells the
user the current location in the filesystem hierarchy.  A newly opened shell by
default places the user in his/her home directory.
$ cd /
$
change directory to the 'root' of the filesystem.
$ ls
boot  dev  floppy  initrd  lib         proc  sbin  tmp  var
bin      etc  home    key     lost+found  mnt    root  sys   usr  vmlinuz
There is a tremendous amount of information available 
here.  However, we can both provide more information and make it clearer with
the following:
$ alias ls='ls -F'
$
$ ls
boot/  dev/  floppy/  initrd/  lib/         proc/  sbin/  tmp/  var/
bin/      etc/  home/    lost+found/  mnt/    root/  sys/   usr/  vmlinuz@
The root directory of this computer contains the elements
mentioned above.  All entries which end in '/' are directories while those which
end with '@' are symbolic links to something else on the system.
'dev,' 'proc,' and 'sys' are pseudo-filesystems.  They take up no
space on disk but instead provide interfaces for accessing the hardware on the
system (dev stands for devices), accessing information about each process
currently active on the system (proc), and viewing/changing operating system
variables (sys).  lost+found is a special housekeeping directory maintained in
case the system finds a corrupted file system, in which case the data of the
affected files may be written into lost+found.  mnt contains a set of
directories intended for mounting new hard drives or storage devices on the
system.
$ ls -l vmlinuz
lrwxrwxrwx  1 root root 13 May 28  2004 vmlinuz -> /boot/vmlinuz
As we can see, vmlinuz points to another file which
lives in /boot.  The fifth field contains the size of the link, which, not
coincidentally, is the number of characters in the target path '/boot/vmlinuz'.
The permissions of symbolic links are also interesting: by themselves they
imply that any user has full access to the file, which is not true; the
permissions of the target are what actually matter.
$ ls -l /boot/vmlinuz
lrwxrwxrwx  1 root root 19 Jan 29 18:20 /boot/vmlinuz -> vmlinuz-2.6.10-ac11
$ ls -l /boot/vmlinuz-2.6.10-ac11
-rw-r-----  1 root root 1411930 Jan 29 18:20 /boot/vmlinuz-2.6.10-ac11
The first column provides us with information regarding
the permissions of the file.  The symbolic links above started with 'l', while
this entry starts with '-', signifying a regular file rather than a link,
directory, socket, character device, etc.  The next 9 characters are intended
to be read in triplets.  The first 'rw-' means that the owner of the file
(which according to the third column is 'root') is allowed to read and write
to this file, but not execute it.  The group of users who own the file (also
root according to the fourth column) may only read the file, while the rest of
the users on the computer may do nothing '---'.  The second column shows how
many filenames point to the same inode on disk.  An inode provides a map from
the name of the file to the list of individual blocks on the physical media
which actually contain the information.  It is possible for multiple files to
point to the same inode (in which case a change to one file is a change to 
the other).  The next two columns provide the user and group ownership.  Next
we have the size of the file, then the last modification date and last the
filename.

Some of the most important commands for dealing with the filesystem include chmod(1), which allows a user to change these permissions for any file he/she owns. Perform the following:

Open a new Shell

$ man chmod > chmod.man
$
$ more chmod.man
CHMOD(1)                                  User Commands                                 CHMOD(1)

NAME
       chmod - change file access permissions

SYNOPSIS
       chmod [OPTION]... MODE[,MODE]... FILE...
       chmod [OPTION]... OCTAL-MODE FILE...
       chmod [OPTION]... --reference=RFILE FILE...

DESCRIPTION
       This  manual  page  documents the GNU version of chmod.  chmod changes the permissions of
       each given file according to mode, which can  be  either  a  symbolic  representation  of
...
more(1) is the pager, which allows the user to read
the contents of a file.
$ cp chmod.man chmod2.man
$
I made a copy of chmod.man with cp(1)
$ mv chmod.man chmod2.man
$
It is possible to overwrite an existing file with another
by means of the mv(1) command.
$ rm chmod2.man
$
Now chmod2.man and chmod.man are gone from the
system
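Since chmod itself never actually appears in the listing above, here is a brief sketch of its two argument styles; the file name 'notes.txt' is made up for the example:

```shell
touch notes.txt              # create an empty file to experiment on
chmod 600 notes.txt          # octal mode: owner may read/write, nobody else anything
chmod u=rw,go= notes.txt     # the same permissions written symbolically
ls -l notes.txt              # the first column should now read -rw-------
```

Both invocations produce identical permissions; the octal form sets all nine bits at once, while the symbolic form can also add (+) or remove (-) individual bits.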

Pipes

At this point in the document it is assumed that the reader is making great use of the man(1)ual. The following discussion will involve these commands: awk(1), a programming language and tool; find(1), used to find files; grep(1), the [g]lobal [r]egular [e]xpression [p]rint tool; head(1)/tail(1), which print the beginning/end of a file; sort(1), which sorts its input; uniq(1), which filters out adjacent duplicate lines; and xargs(1), which builds command lines from its input.
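Several of these commands are only really useful once chained together. As a quick sketch (assuming a conventional /etc/passwd exists on the system):

```shell
# Print the login shells in use on this system, with a count of each:
# awk extracts the 7th ':'-separated field, sort groups duplicate lines
# together, uniq -c counts each run, and the final sort ranks the counts.
awk -F: '{print $7}' /etc/passwd | sort | uniq -c | sort -rn
```

Note that uniq only collapses adjacent duplicates, which is why the first sort must come before it.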

An extremely important attribute of Unix-like systems is the ability to 'pipe' the output from one program as the input to another; for example:

Open a new Shell

$ ps aux 
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   1580   512 ?        S    Jan28   0:00 init [2]
root         2  0.0  0.0      0     0 ?        SN   Jan28   0:00 [ksoftirqd/0]
root         3  0.0  0.0      0     0 ?        S<   Jan28   0:00 [events/0]
root         4  0.0  0.0      0     0 ?        S<   Jan28   0:00 [khelper]
root        16  0.0  0.0      0     0 ?        S<   Jan28   0:00 [kblockd/0]
root        29  0.0  0.0      0     0 ?        S    Jan28   0:00 [khubd]
root        94  0.0  0.0      0     0 ?        S    Jan28   0:00 [pdflush]
root        95  0.0  0.0      0     0 ?        S    Jan28   0:00 [pdflush]
root        97  0.0  0.0      0     0 ?        S<   Jan28   0:00 [aio/0]
root        96  0.0  0.0      0     0 ?        S    Jan28   0:02 [kswapd0]
root       685  0.0  0.0      0     0 ?        S    Jan28   0:03 [kseriod]
root       739  0.0  0.0      0     0 ?        S    Jan28   0:00 [kjournald]
root      2221  0.0  0.0      0     0 ?        S    Jan28   0:00 [kapmd]
root      2744  0.0  0.1   1584   576 ?        Ss   Jan28   1:17 /usr/sbin/apmd -P /etc/apm/apmd_proxy --proxy-timeout 30
root      2869  0.0  1.2   7704  6304 ?        Ss   Jan28   0:00 /usr/bin/X11/xfs -daemon
...
There are quite a few more things happening on my computer
right now, but I am curious about the 'init' process.
$ ps aux | grep init
root         1  0.0  0.0   1580   512 ?        S    Jan28   0:00 init [2]
trey     24545  0.0  0.0   1624   468 pts/8    R+   16:18   0:00 grep init
For the curious, the init(8) process works to spawn
every following process on the computer.
$ ps aux | grep init | grep -v grep
root         1  0.0  0.0   1580   512 ?        S    Jan28   0:00 init [2]
The previous command is a little silly.  It is using a
whole new process just to filter out a single line.

Two ways to perform killall(1)-like functionality:

Open a new Shell

$ ps aux | grep 'defunct' | awk '{print $2}' | xargs kill -9	  {} ';'
bash: kill: (2331) - No such process
A process "x" is labeled 'defunct' if it has been killed
but is not able to properly send a final status code to its parent "w."  This
may happen in the case when "x" has to do some cleanup work before exiting,
but while performing these final tasks, the parent dies.  In most situations,
this newly orphaned process "x" will be taken in by the mother of all processes
on the computer, 'init,' and then be able to exit normally.  But I digress:
ps provides a list of all the processes on the computer.  Given this list, grep keeps only the 'defunct' entries, awk pulls out the second column ($2), and that list of process ids is fed to xargs, which in turn builds a command that looks something like: kill -9 123 ; kill -9 144 ; ...

Why does the previous example return an error while still working properly? (hint, try each piece of the command one at a time: ps aux,
then ps aux | grep defun etc...)

Open a new Shell

$ kill -9 `ps aux | grep 'defunct' | awk '{print $2}'`
This example makes use of the `` notation, which provides
the output of the given commands as a text string for the rest of the
command.
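A smaller, self-contained illustration of the `` notation may help; the modern $( ) form shown alongside it is equivalent and may be nested:

```shell
# The shell runs the command between the backticks first, then
# substitutes its output into the surrounding command line.
echo "There are `ls / | wc -l` entries in the root directory"
echo "There are $(ls / | wc -l) entries in the root directory"
```

Both lines print the same sentence; the count itself will of course vary from system to system.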

Job Control

At the beginning of this discussion, Unix was described as multitasking: the operating system is capable of switching quickly from one task to another. When working from within the shell one accesses this functionality through 'job control.'

Open a new Shell

$ man ls &
[1] 24714
Placing the '&' at the end of a command places it in the
background.  The shell then reports both the job id (1) and the process id
(24714).  Every process id is unique to the system while every job id is
unique to that particular shell.
$ man emacs &
[2] 24717
A new process id and job id is created.
$ ps aux | grep 24717
trey     24717  0.1  0.2   2152  1276 pts/8    T    16:47   0:00 man emacs
trey     24783  0.0  0.0   1624   472 pts/8    R+   16:47   0:00 grep 24717
Check on the process
$ man vi &
[5] 24786
$ man find &
[6] 24803
$ fg %-
VIM(1)                                                                                    VIM(1)

NAME
       vim - Vi IMproved, a programmers text editor
...
The % is used to select a job, not a process id.  The
'-' refers to the previous job: the current job is %6 (man find), so '%-'
here selects %5, man vi.
$ Ctrl-Z
[5]+  Stopped                 man vi
Control-Z is used to stop the currently running process.
Keep in mind that stopping and killing a process are two separate operations.
Killing a process removes it from the computer while stopping a process merely
freezes it in place.
$ jobs
[1]   Running                 xmms /media/music/albums/me_first_gimmie_gimmies/unknown/ &
[2]   Stopped                 man ls
[3]   Stopped                 man emacs
[4]   Stopped                 man mv
[5]+  Stopped                 man vi
[6]-  Stopped                 man find
jobs provides a list of the jobs controlled by the current
shell.
$ fg %6
FIND(1)                                                                                  FIND(1)

NAME
       find - search for files in a directory hierarchy
...
A job may also be selected explicitly by its number; '%6' is man find.
$ Ctrl-Z
[6]+  Stopped                 man find
The Control Z sequence immediately stops the currently
running process dead in its tracks and places the user back in the shell.
$ fg %2
EMACS(1)                                                                                EMACS(1)

NAME
       emacs - GNU project Emacs
...
$ kill -9 %3
[2]   Stopped                 man emacs
$ (hit return)
[2]   Killed                  man emacs
You may have noticed a short pause before the
system reported that the process was killed.
$ disown %2
bash: warning: deleting stopped job 2 with process group 5963
After disowning a job the shell forgets about it: unlike
other jobs it will not be killed when you log off, so one must eventually
kill it explicitly by process id.

Another aspect of job control deals with the problem that when one logs off from a computer, one's processes are by default killed by the computer in what is called the 'hangup.' nohup(1) is a command which tells the computer to keep a process alive even if it receives a hangup signal. The example provided uses paup, a program with which we will become extremely familiar. Its purpose is to examine a multiple sequence alignment and find the phylogenetic tree which best fits this alignment. Running paup may take many months to examine a single dataset. Therefore it is good to use nohup with paup to make sure the job does not die prematurely.

Open a new Shell

$ nohup paup -f paupscript &
nohup: appending output to `nohup.out'
[1] 3807
As before, using '&' runs the job in the background
and causes the system to report the job id (1) and process id of the paup job.
nohup also reports that all the output from paup will go into the file
'nohup.out'.
$ ps aux | grep 3807
trey      3807  0.0  0.1   1816   812 ?        Ss   Feb12   2:04 /usr/local/bin/paup
trey      3908  0.0  0.0   1624   468 pts/7    R+   00:10   0:00 grep 3807
You may recall that ps(1) prints a list of
currently running processes and grep searches for a text string.  If you are
using a SysV system like Solaris, you may want something like this:
$ ps -ef | grep 3807
trey      3807  0.0  0.1   1816   812 ?        Ss   Feb12   2:04 /usr/local/bin/paup
trey      3908  0.0  0.0   1624   468 pts/7    R+   00:10   0:00 grep 3807
I mention Solaris because we will shortly
be using that operating system.

The problem with nohup is that it requires that you think of it before running your job. If you use the bash shell, this is not a problem:

Open a new Shell

$ paup -f paupscript 2>paup.output 1>&2

I remembered to send the output and errors of my job
to the file paup.output.  Unfortunately I must log off in a moment!
$ Ctrl-Z
[1]+  Stopped      paup
Ok, now my job is stopped, great.
$ bg
$
Now it is running in the background, but if I log off
it will still be killed.
$ disown %1
bash: warning: deleting stopped job 1 with process group 3936
That sounds bad, but it is not, look:
$ ps aux | grep 3936
trey      3936  0.0  1.0   9376  5280 pts/7    T    03:22   0:00 paup
trey      3942  0.0  0.0   1624   468 pts/7    R+   03:24   0:00 grep 3936
My paup job is still running.  And if I log off and back
on to the system, it will continue to run.

Shell Scripting

It turns out that the Bourne shell is a self-contained and complete programming language with an Algol-like syntax. Below is an example of a shell script which I use to start up the windowing system on my computer:

Open a new Shell


1   #!/usr/bin/env bash
2   export GDK_USE_XFT=1
3   . ~/.bashrc
4   export WINDOW_MANAGER=metacity
5   IDENTITIES="${HOME}/.ssh/identity ${HOME}/.ssh/id_dsa"
6   ATTEMPTS="gnome-session startkde sawfish wmaker afterstep fvwm"
7   HOSTTYPE=`uname`
8   if [ "${HOSTTYPE}" == "SunOS" ]; then
9     ### The following is only upon my Sun with the funky keyboard.
10    xmodmap -e 'keycode 127  = Alt_L Meta_L'
11    xmodmap -e 'keycode 129  = Mode_switch'
12    xmodmap -e 'clear Mod1'
13    xmodmap -e 'add Mod1 = Alt_L'
14    xmodmap -e 'clear Mod2'
15    xmodmap -e 'add Mod2 = Mode_switch'
16    xmodmap -e 'remove Mod4 = Mode_switch'
17    xmodmap -e 'remove Mod4 = Mode_switch'
18    xmodmap -e 'clear Mod4'
19  fi
20  for i in ${ATTEMPTS}
21          do
22          PLACE=`which ${i} | grep -v ' '`
23          ## $PLACE checks to see if the window manager exists
24          ## the grep -v ' ' helps annoying systems which tell
25          ## you if the program does not exist
26          if [ "${PLACE}" ]
27                  then
28                  SUCCESS=1  ## Useful only if none of these exist
29                  ## SSH_AGENT_PID will only be set on computers which
30                  ## run ssh-agent for you
31                  if [ "${SSH_AGENT_PID}" ]
32                    then
33                    ## If you use pam-ssh, this will work for you so that
34                    ## It doesn't try to export two copies of your keys
35                    if [ -f ~/.ssh/agent-${HOSTNAME}-\:0 ]
36                      then
37                      exec ${i}
38                    else
39                    ## Otherwise, you will have to run an xterm and
40                    ## run ssh-add yourself
41                    ## The following echo statements are useful
42                    ##  if you want to use cron
43                    ## with ssh, just make sure your cron jobs have a
44                    ## . ~/.ssh/agent-`hostname` in them
45                    ##  and that the agent is still running
46                    ## when the cron job attempts to run
47                      echo "SSH_AUTH_SOCK=${SSH_AUTH_SOCK}; \
48  export SSH_AUTH_SOCK" > ~/.ssh/agent-${HOSTNAME}
49                      echo "SSH_AGENT_PID=${SSH_AGENT_PID}; \
50  export SSH_AGENT_PID" >> ~/.ssh/agent-${HOSTNAME}
51                      xterm -e ssh-add ${IDENTITIES} &
52                      exec ${i}
53                    fi
54                  else
55                  ## This computer does not run ssh-agent for me
56                  ## So I will do that _and_ ssh-add myself
57                  ## The tee makes a ~/.ssh/agent-`hostname` for me
58                    eval `ssh-agent | tee > ~/.ssh/agent-${HOSTNAME}`
59                    xterm -e ssh-add ${IDENTITIES} &
60                    exec ${i}
61                  fi
62          ## The chosen windowmanager $i does not exist
63          else
64              continue
65          fi
66  done
67  ##  If none of my window managers exist
68  ## Actually, this if statement isn't really needed since
69  ## I am doing exec calls everywhere above, but who cares?
70  if [ ${SUCCESS} ]
71        then
72          continue
73          else
74          eval `ssh-agent | tee > ~/.ssh/agent-${HOSTNAME}`
75          exec xterm
76  fi

There are some interesting elements to this script. Line 1 shows one of many ways to start a shell to interpret a script: env(1) allows one to run any given program in its own environment, searching through the system's path to find it. Line 3 makes use of the '.' command, which executes an input file (analogous to the [t]csh source command) and sets up some other variables for the rest of the script. Lines 8-19 show the peculiar bash if; then; else; fi syntax. Line 20 illustrates a for; do; done loop. For loops are especially important and tricky: the given commands are performed for every space delimited entry of the given list. You may also note that variables can be evaluated as either $VARIABLE or ${VARIABLE}; this allows one to do something like echo ${VARIABLE}Now, where echo $VARIABLENow would fail. exec(1) is important to this script, as are shell redirections (as shown above) and grep(1). tee(1) is a less commonly used command: it duplicates STDOUT so that one may both evaluate a command and save its output for later. Backticks (``) are an especially interesting convention: they execute a command within the running shell and substitute its output into the command line.
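The ${VARIABLE} point deserves a tiny demonstration; the variable name FRUIT is made up for the example:

```shell
FRUIT=apple
echo "${FRUIT}s"   # the braces delimit the name, so this prints 'apples'
echo "$FRUITs"     # here the shell looks up a variable named FRUITs,
                   # which is unset, so this prints an empty line
```

This is why defensive shell scripts tend to brace every variable reference that is followed immediately by more text.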

Conclusion

Here is an unordered list of potentially useful links:

Practice

Use mkdir(1) to create a directory in which to keep some amino acid sequences, then click here to download them to your computer. Use mv(1) to get them into your sequences directory. Examine the permissions and sizes of the files and change their permissions so that only you may read or write to the files.

Use one of the default Unix editors: emacs(1), vi(1), pico(1) in order to add the current date and time to the sequence files previously downloaded.

Use a pipe or two, cal(1), and grep(1) to find the day of the week of your day of birth for every month of the year you were born; perhaps use cut(1) to print out only the column of the calendar containing your birthday.

(Do this _after_ reading through the Blast exercises.) Write a short shell script which allows some cursory examination of the sequences you downloaded earlier. Some ideas include using a script to con cat(1)enate the files into a single file so that clustalx may make an alignment of them, using blastcl3 in order to perform individual local alignments of the files, and using wc(1) to examine their relative lengths. (I suggest lines 20-22 of the above script for ideas)

Check this output from a script an anonymous student wrote:

PS        UUUUUUAGUU......11-......222222.......222222..11.CUGAAUAAGA      0,p:5,s:79,b:11
PS        UUUUUUAGUUU.......-......11.1....1.11...UUACGGGUAC               0,p:1,s:85,b:5
PS        UUUUUUAGU111111...-......111111.........UUACGGGUAC               0,p:4,s:82,b:8
PS        UUUUUUAGU.......11-......222222.......222222..11.CUGAAUAAGA      0,p:5,s:79,b:11
PS        UUUUUUAGUU........-......11.1....1.11...UUACGGGUAC               0,p:1,s:85,b:5
PF  AGCGCUUUUUUUAGUUUUUACAAC-AAAAGAGUGAGAGAUGACGUUUUACGGGUACUGAAUAAGAUCCCG YCR032W=BPH1
NT  :|  | .::: :  ::  ::|::  ::::|:| .:|   :|:||:.:::    ::  .:::: |::
SO  UCAAGUGAAACAAGAAUUAUGUUCAUUUUCUCCUUCGGAACUGCAGAAUAAAAAUUCUUUAUACUAUAAA YBR182C
There are approximately 300 similar stanzas in the output file, with different lines which start with 'PF' and 'SO'. These lines end with the name of a gene in the yeast genome which may be of significance. How might one quickly get a list of all of these potentially interesting genes?

Open a new Shell

$ grep SO output_file.txt | awk '{print $3}' | sort | uniq
YCR020W
YDL045W
YDL067C
YDL130W
YDL160C
YDL232W
YDR003W
YDR034C
YDR034W
YDR079C
.... for another 50+ lines
For the curious, PF stands for programmed ribosomal
frameshift while SO stands for small ORF; the peculiar notation above is a
comparison of a set of small ORFs to a set of putative programmed ribosomal
frameshift sequences.

Created: Wed Sep 15 00:58:22 EDT 2004 by Charles F. Delwiche
Last modified: Mon Nov 8 15:49:44 EST 2004 by Ashton Trey Belew.