PIIM: Population genetic Inference In Metagenomics

Philip Johnson

Documentation for version 1.0

Contents:

About
File list
Unpacking/compiling
Input format
Running piim
Running simulate_data
License (GNU General Public License)

About

PIIM uses the site-frequency spectrum to estimate the population genetic parameters θ=2Nu (where u is the per-site mutation rate) and R=Nr (where r is the exponential growth rate) from metagenomic-style data (i.e. genome-level population data of variable sample depth and quality). For details on the method, see: Johnson, PLF and Slatkin M. 2006. "Inference of population genetic parameters in metagenomics: a clean look at messy data." Genome Research 16, 1320-1327.

Questions or bug reports may be e-mailed to me (plfjohnson) at my institution (@berkeley.edu).

Files

The PIIM distribution includes this documentation and source code for the following programs:

piim: performs the maximum likelihood estimation incorporating quality scores into the likelihood function
simulate_data: simulates metagenomic-style data under a range of scenarios -- particularly useful for calculating the empirical distribution of the composition likelihood ratio test for growth.
ms2spectrum.pl: used by simulate_data if using m.s. (see below)
ace2xml.pl: converts from the ACE file format (used by PHRAP, CONSED ) to the XML used by PIIM (added as of 6 Aug 2007)

Compiling the source code requires the two external dependencies listed below. Depending on the command-line options, running simulate_data may require Richard Hudson's program m.s. along with PERL to run the script ms2spectrum.pl.

Unpacking & compiling

If you downloaded the source package:
1. Install the GNU Scientific Library (http://www.gnu.org/software/gsl/)
2. Install libxml++ (http://libxmlplusplus.sourceforge.net/)
3. Unpack the archive by typing:
```
tar xvzf piim.tar.gz
```
4. Change to the newly-created 'piim' directory and compile:
```
cd piim
make release
```
If you downloaded a binary package:
1. Unpack the archive by typing:
```
tar xvzf piim.tar.gz
```
2. While the GSL and libxml++ libraries have already been linked into the binaries you downloaded, you may still be missing some other, system-level libraries. Change to the newly-created 'piim' directory and see if the output of the "ldd" command reports anything "not found":
```
cd piim
ldd piim
```
3. If all libraries are present, the program will probably run. If not, find a linux guru to help you install them.

Preparing your input data

New: The PERL script ace2xml.pl will convert from the output format used by the phrap assembler to the XML format used by PIIM.

Unfortunately, there are as many assembly output formats as assembly programs. As a result, I chose to implement a simple XML format that includes only the information required for PIIM: namely, for each site (column) in the assembly, the list of quality scores and nucleotide category for each base. This specification is formalized in "piim.dtd", but a simple example should suffice:

<Contig>
  <Column pos="1">
    <Chr score="6"/>
    <Chr score="26"/>
    <Chr score="25" minor=""/>
  </Column>
  <Column pos="2">
    <Chr score="30"/>
    <Chr score="40" minor=""/>
    <Chr score="35" other="1"/>
    <Chr score="29" other="1"/>
    <Chr score="37" other="2"/>
  </Column>
</Contig>

Here, the chromosomes without an attribute have the major (or ancestral) nucleotide, the ones with the "minor" attribute have the minor (or derived) nucleotide while the "other" attributes are used to signal either the 3rd (other="1") or 4th (other="2") nucleotides. As discussed in the Methods section of the paper cited above, chromosomes from sites with 3 or 4 different nucleotides are split into two groups so they can be treated as "biallelic" by the likelihood function.

Running piim

IMPORTANT: This model makes the assumption that adjacent sites have independent evolutionary histories (i.e. undergo free recombination). Depending on the recombination rate in your microbial population, this assumption may be inapplicable. Non-independence will increase the variance of the parameter estimates and potentially lead to bias.

Running './piim' by itself will produce a short list of the command-line parameters. The typical usage would be something like:

   ./piim -i my_input_data.xml -p -f growth

Here, we take our input file (my_input_data.xml) and search for the maximum likelihood parameters (-p) to the frequency spectrum under exponential population growth (-f growth). The output simply consists of the maximum likelihood parameters and the log likelihood of the data under these parameters:

   LogLikelihood: -195549.67 / Growth w/ 0.014967083	48.701707

Details about parameters:

-i <filename> → [required] Reads input date from the supplied filename.
-p → use the Nelder-Mead simplex algorithm (as implemented in the GNU Scientific Library) to find the maximum likelihood parameters. If this parameter is not provided, the program will either read the parameters from the input file (if found there) or will use the defaults.
-f <neutral|growth> → which frequency spectrum model to use.
- neutral == constant population + neutral with respect to natural selection.
- growth == exponentially growing population + neutral with respect to natural selection. If this parameter is not supplied, the program will either read from the input file (if found there) or will use the default.
-a → use all of the spectrum instead of folding it.
-o → produce an output XML file containing the input data plus the results of the most time-consuming step of the calculation (basically the results of eqn 3 from the paper). This file can then be fed back in as input to the program, perhaps to find the parameters to a different frequency spectrum. However, this second run will be much quicker as a result.
-g → if you don't trust the maximum likelihood point given to you by the '-p' algorithm, try this – it performs an exhaustive search over a grid across parameter space.
-1 → output the empirical Bayes probability of a site being truly fixed for apparent singleton sites.
-2 → output the empirical Bayes probability of a site being truly fixed for apparent doubleton sites.
-3 → output the empirical Bayes probability of a site being truly fixed for apparent tripleton sites.

Running simulate_data

As with piim, running './simulate_data' by itself will produce a short list of the command-line parameters. Typical usage:

   ./simulate_data -n 100000 -s score_dist.dat -d depth_dist.dat -f growth -p '0.01  10'  > my_simulated_data.xml

Here we ask for 100000 sites (-n 100000) using the supplied score and depth distributions (-s score_dist.dat -d depth_dist.dat) under exponential population growth (-f growth) with parameters θ=0.01 and R=10 (-p '0.01 10'). Output consists of xml-formatted data suitable for input into piim.

Details about parameters:

-n <# sites> → [required] number of sites to generate
-f <neutral|growth|ms> → [required] which frequency spectrum model to use.
- neutral == constant population + neutral with respect to natural selection (takes 1 parameter: θ).
- growth == exponentially growing population + neutral with respect to natural selection (two parameters: θ, R).
- ms == use Richard Hudson's m.s. to generate a sample of population mutation, exponential growth and specified amount of recombination (three parameters: θ, R, ρ). Note: m.s. is run once for *each* depth -- so you probably want your depth distribution to only include one depth to avoid (unintentional) independent samples.
-p '<parameters>' → [required] Parameters to frequency spectrum model -- use order specified above and enclose in quotes!
-s <filename> → File supplying score distribution; default is 'score_dist.dat'. Format: <score> <probability>.
-d <filename> → File supplying depth distribution; default is depth_dist.dat'. Format: <depth> <probability>.

License

  Copyright 2006 The Regents of the University of California.  All
  rights reserved.
 
  IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT,
  INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
  LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
  DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY
  OF SUCH DAMAGE.
 
  REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING
  DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS
  IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT,
  UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 
  This software is distributed under the terms of Version 2 of the
  GNU General Public License as published by the Free Software
  Foundation.  If you have not received a copy of Version 2 of the
  GNU General Public License with this program, you may download the
  GPL from the following url: http://www.gnu.org/copyleft/gpl.html.