PIIM: Population genetic Inference In Metagenomics
Philip Johnson
Documentation for version 1.0
Contents:
About
PIIM uses the site-frequency spectrum to estimate the population
genetic parameters θ=2Nu
(where u is the per-site mutation rate) and R=Nr (where
r is the exponential growth rate) from metagenomic-style data
(i.e. genome-level population data of variable sample depth and
quality). For details on the method, see: Johnson, PLF and Slatkin M. 2006. "Inference of population genetic parameters in metagenomics: a clean look at messy data." Genome Research 16, 1320-1327.
Questions or bug reports may be e-mailed to me (plfjohnson) at my institution (@berkeley.edu).
Files
The PIIM distribution includes this documentation and source code for the following programs:
- piim: performs the maximum likelihood estimation incorporating quality scores into the likelihood function
- simulate_data: simulates metagenomic-style data under a range of scenarios -- particularly useful for calculating the empirical distribution of the composition likelihood ratio test for growth.
- ms2spectrum.pl: used by simulate_data if using m.s. (see below)
- ace2xml.pl: converts from the ACE file format (used by PHRAP, CONSED) to the XML used by PIIM (added as of 6 Aug 2007)
Compiling the source code requires the two external dependencies listed below. Depending on the command-line options, running simulate_data may require Richard Hudson's program m.s. along with PERL to run the script ms2spectrum.pl.
Unpacking & compiling
- If you downloaded the source package:
- Install the GNU Scientific Library (http://www.gnu.org/software/gsl/)
- Install libxml++ (http://libxmlplusplus.sourceforge.net/)
- Unpack the archive by typing:
tar xvzf piim.tar.gz
- Change to the newly-created 'piim' directory and compile:
cd piim
make release
- If you downloaded a binary package:
- Unpack the archive by typing:
tar xvzf piim.tar.gz
- While the GSL and libxml++ libraries have already been linked into the binaries you downloaded, you may still be missing some other, system-level libraries. Change to the newly-created 'piim' directory and see if the output of the "ldd" command reports anything "not found":
cd piim
ldd piim
- If all libraries are present, the program will probably run. If not, find a linux guru to help you install them.
New: The PERL script ace2xml.pl will convert from the output format used by the phrap assembler to the XML format used by PIIM.
Unfortunately, there are as many assembly output formats as assembly programs. As a result, I chose to implement a simple XML format that includes only the information required for PIIM: namely, for each site (column) in the assembly, the list of quality scores and nucleotide category for each base. This specification is formalized in "piim.dtd", but a simple example should suffice:
<Contig>
<Column pos="1">
<Chr score="6"/>
<Chr score="26"/>
<Chr score="25" minor=""/>
</Column>
<Column pos="2">
<Chr score="30"/>
<Chr score="40" minor=""/>
<Chr score="35" other="1"/>
<Chr score="29" other="1"/>
<Chr score="37" other="2"/>
</Column>
</Contig>
Here, the chromosomes without an attribute have the major (or ancestral) nucleotide, the ones with the "minor" attribute have the minor (or derived) nucleotide while the "other" attributes are used to signal either the 3rd (other="1") or 4th (other="2") nucleotides. As discussed in the Methods section of the paper cited above, chromosomes from sites with 3 or 4 different nucleotides are split into two groups so they can be treated as "biallelic" by the likelihood function.
Running piim
IMPORTANT: This model makes the assumption that adjacent
sites have independent evolutionary histories (i.e. undergo free
recombination). Depending on the recombination rate in your microbial
population, this assumption may be inapplicable. Non-independence
will increase the variance of the parameter estimates and potentially
lead to bias.
Running './piim' by itself will produce a short list of the
command-line parameters. The typical usage would be something like:
./piim -i my_input_data.xml -p -f growth
Here, we take our input file (my_input_data.xml) and search for
the maximum likelihood parameters (-p) to the frequency spectrum
under exponential population growth (-f growth). The output simply
consists of the maximum likelihood parameters and the log likelihood
of the data under these parameters:
LogLikelihood: -195549.67 / Growth w/ 0.014967083 48.701707
Details about parameters:
- -i <filename> → [required] Reads input date from the
supplied filename.
- -p → use the Nelder-Mead simplex algorithm (as implemented in
the GNU Scientific Library) to find the maximum likelihood
parameters. If this parameter is not provided, the program will
either read the parameters from the input file (if found there) or
will use the defaults.
- -f <neutral|growth> → which frequency spectrum model to
use.
- neutral == constant population + neutral with respect to
natural selection.
- growth == exponentially growing population +
neutral with respect to natural selection. If this parameter is not
supplied, the program will either read from the input file (if found
there) or will use the default.
- -a → use all of the spectrum instead of folding it.
- -o → produce an output XML file containing the input data
plus the results of the most time-consuming step of the calculation
(basically the results of eqn 3 from the paper). This file can then
be fed back in as input to the program, perhaps to find the
parameters to a different frequency spectrum. However, this second
run will be much quicker as a result.
- -g → if you don't trust the maximum likelihood point given to
you by the '-p' algorithm, try this – it performs an exhaustive
search over a grid across parameter space.
- -1 → output the empirical Bayes probability of a site being
truly fixed for apparent singleton sites.
- -2 → output the empirical Bayes probability of a site being
truly fixed for apparent doubleton sites.
- -3 → output the empirical Bayes probability of a site being
truly fixed for apparent tripleton sites.
Running simulate_data
As with piim, running './simulate_data' by itself will produce a short list of the command-line parameters. Typical usage:
./simulate_data -n 100000 -s score_dist.dat -d depth_dist.dat -f growth -p '0.01 10' > my_simulated_data.xml
Here we ask for 100000 sites (-n 100000) using the supplied score and depth distributions (-s score_dist.dat -d depth_dist.dat) under exponential population growth (-f growth) with parameters θ=0.01 and R=10 (-p '0.01 10'). Output consists of xml-formatted data suitable for input into piim.
Details about parameters:
- -n <# sites> → [required] number of sites to generate
- -f <neutral|growth|ms> → [required] which frequency
spectrum model to use.
- neutral == constant population + neutral with
respect to natural selection (takes 1 parameter: θ).
- growth == exponentially growing population + neutral with respect to
natural selection (two parameters: θ, R).
- ms == use Richard Hudson's m.s. to generate a sample of population mutation, exponential growth and specified amount of recombination (three parameters: θ, R, ρ). Note: m.s. is run once for *each* depth -- so you probably want your depth distribution to only include one depth to avoid (unintentional) independent samples.
- -p '<parameters>' → [required] Parameters to frequency spectrum model -- use order specified above and enclose in quotes!
- -s <filename> → File supplying score distribution; default is 'score_dist.dat'. Format: <score> <probability>.
- -d <filename> → File supplying depth distribution; default is depth_dist.dat'. Format: <depth> <probability>.
License
Copyright 2006 The Regents of the University of California. All
rights reserved.
IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT,
INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.
REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING
DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS
IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT,
UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
This software is distributed under the terms of Version 2 of the
GNU General Public License as published by the Free Software
Foundation. If you have not received a copy of Version 2 of the
GNU General Public License with this program, you may download the
GPL from the following url: http://www.gnu.org/copyleft/gpl.html.