PIIM: Population genetic Inference In Metagenomics

Philip Johnson

Documentation for version 1.0

Contents:

About

PIIM uses the site-frequency spectrum to estimate the population genetic parameters θ=2Nu (where u is the per-site mutation rate) and R=Nr (where r is the exponential growth rate) from metagenomic-style data (i.e. genome-level population data of variable sample depth and quality). For details on the method, see: Johnson, PLF and Slatkin M. 2006. "Inference of population genetic parameters in metagenomics: a clean look at messy data." Genome Research 16, 1320-1327.

Questions or bug reports may be e-mailed to me (plfjohnson) at my institution (@berkeley.edu).



Files

The PIIM distribution includes this documentation and source code for the following programs:

Compiling the source code requires the two external dependencies listed below. Depending on the command-line options, running simulate_data may require Richard Hudson's program m.s. along with PERL to run the script ms2spectrum.pl.

Unpacking & compiling


Preparing your input data

New: The PERL script ace2xml.pl will convert from the output format used by the phrap assembler to the XML format used by PIIM.

Unfortunately, there are as many assembly output formats as assembly programs. As a result, I chose to implement a simple XML format that includes only the information required for PIIM: namely, for each site (column) in the assembly, the list of quality scores and nucleotide category for each base. This specification is formalized in "piim.dtd", but a simple example should suffice:

<Contig>
  <Column pos="1">
    <Chr score="6"/>
    <Chr score="26"/>
    <Chr score="25" minor=""/>
  </Column>
  <Column pos="2">
    <Chr score="30"/>
    <Chr score="40" minor=""/>
    <Chr score="35" other="1"/>
    <Chr score="29" other="1"/>
    <Chr score="37" other="2"/>
  </Column>
</Contig>
Here, the chromosomes without an attribute have the major (or ancestral) nucleotide, the ones with the "minor" attribute have the minor (or derived) nucleotide while the "other" attributes are used to signal either the 3rd (other="1") or 4th (other="2") nucleotides. As discussed in the Methods section of the paper cited above, chromosomes from sites with 3 or 4 different nucleotides are split into two groups so they can be treated as "biallelic" by the likelihood function.

Running piim

IMPORTANT: This model makes the assumption that adjacent sites have independent evolutionary histories (i.e. undergo free recombination). Depending on the recombination rate in your microbial population, this assumption may be inapplicable. Non-independence will increase the variance of the parameter estimates and potentially lead to bias.

Running './piim' by itself will produce a short list of the command-line parameters. The typical usage would be something like:

   ./piim -i my_input_data.xml -p -f growth

Here, we take our input file (my_input_data.xml) and search for the maximum likelihood parameters (-p) to the frequency spectrum under exponential population growth (-f growth). The output simply consists of the maximum likelihood parameters and the log likelihood of the data under these parameters:

   LogLikelihood: -195549.67 / Growth w/ 0.014967083	48.701707

Details about parameters:


Running simulate_data

As with piim, running './simulate_data' by itself will produce a short list of the command-line parameters. Typical usage:
   ./simulate_data -n 100000 -s score_dist.dat -d depth_dist.dat -f growth -p '0.01  10'  > my_simulated_data.xml
Here we ask for 100000 sites (-n 100000) using the supplied score and depth distributions (-s score_dist.dat -d depth_dist.dat) under exponential population growth (-f growth) with parameters θ=0.01 and R=10 (-p '0.01  10'). Output consists of xml-formatted data suitable for input into piim.

Details about parameters:


License

  Copyright 2006 The Regents of the University of California.  All
  rights reserved.
 
  IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT,
  INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
  LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
  DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY
  OF SUCH DAMAGE.
 
  REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING
  DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS
  IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT,
  UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 
  This software is distributed under the terms of Version 2 of the
  GNU General Public License as published by the Free Software
  Foundation.  If you have not received a copy of Version 2 of the
  GNU General Public License with this program, you may download the
  GPL from the following url: http://www.gnu.org/copyleft/gpl.html.