PIIM: Population genetic Inference In Metagenomics

Philip Johnson

Documentation for version 2

Contents:

About
Changes
File list
Unpacking/compiling
Input format
Running piim
License (GNU General Public License)

About

PIIM uses the site-frequency spectrum to estimate the population genetic parameters θ=2Nu (where u is the per-site mutation rate) and R=Nr (where r is the exponential growth rate). Version 2 adds the ability to look at the patterns of pairs of sites to estimate ρ=2Nc (where c is the per-site recombination rate). Note that the ρ estimation routine assumes a constant-size population (i.e. R=0). For details on the method, see:

Johnson, PLF and Slatkin M. 2006. "Inference of population genetic parameters in metagenomics: a clean look at messy data." Genome Research 16, 1320-1327.
Johnson, PLF and Slatkin M. 2009. "Inference of microbial recombination rates from metagenomic data." PLoS Genetics (in press).

Questions or bug reports may be e-mailed to me (plfjohnson) at my institution (@berkeley.edu; soon to be @emory.edu).

Changes

Version 2.0.3 (10 April 2013)

ace2xml.pl now handles case if a RD record starts on a line immediately following a AF record (Thanks to Matthew DeMaere for reporting & patching).

Version 2.0.2 (13 March 2013)

Check that read length in bases matches read length in q-scores (and throw error if not).
Bug fix: last q-score in every read had been ignored, meaning that reads were treated as one base shorter than they actually were.
Quality scores can now be input as ASCII-33 (fastq-style)
Obey -trust-obs for frequency spectrum inference.
Added -version command line option
Added θ=0.1 lookup table to distribution

Version 2.0.1

Added option to compile without tr1/unordered_map for older compilers lacking that header.

Version 2.0

adds recombination estimation (ρ)
new XML format, incompatible with the previous version of PIIM

Files

The PIIM distribution includes this documentation and source code for the following programs:

piim: performs the maximum likelihood estimation incorporating quality scores into the likelihood function
lk_n50_t0.01_allsmooth: table of likelihood values for maximum sample size of n=50 and θ=0.01. This was generated from a similar file distributed by Gil McVean as part of the LDhat package.
lk_n50_t0.1_allsmooth: table of likelihood values for maximum sample size of n=50 and θ=0.1. This was generated from a similar file distributed by Gil McVean as part of the LDhat package.
ace2xml.pl: converts from the ACE file format (used by PHRAP, CONSED ) to the XML used by PIIM

Unpacking & compiling

If you downloaded the source package:

Install the GNU Scientific Library (http://www.gnu.org/software/gsl/)
Install headers for libxml2 (probably you need to install a package like "libxml2-dev")
Unpack the archive by typing:
tar xvzf piim.tar.gz

Change to the newly-created 'piim' directory and compile:
cd piim make release

If you have an older compiler (i.e. < gcc 4), it may not contain the C++ technical release 1 unordered_map library and thus the last line above may fail with an error message. You can work around this by running the following command instead (although the resulting piim binary will take perhaps twice as long to rung on data as the default tr1 version):
make release OPTFLAGS='-DSYSTEM_MISSING_TR1'
If you downloaded a binary package:
1. Unpack the archive by typing:
```
tar xvzf piim.tar.gz
```
2. While the GSL and libxml2 libraries have already been linked into the binaries you downloaded, you may still be missing some other, system-level libraries. Change to the newly-created 'piim' directory and see if the output of the "ldd" command reports anything "not found":
```
cd piim
ldd piim
```
3. If all libraries are present, the program will probably run. If not, find a linux guru to help you install them.

Preparing your input data

Unfortunately, there are as many assembly output formats as assembly programs. As a result, I chose to implement a simple XML format that includes all the information required for PIIM: reads, assembly, information about read pairs and quality scores. The PERL script ace2xml.pl will convert from the output format used by the phrap assembler to the XML format used by PIIM. If you need the gory details, the XML specification is formalized in "piim_r.dtd", and an example can be found in "example.xml".

Please note that ace2xml.pl must find "FF_Index.pm"; the easiest way to make sure this happens is to keep the two files in the same directory. Running './ace2xml.pl' by itself will produce a short description of its usage. Typical usage looks like:

   ./ace2xml.pl -i my_assembly.ace -q read_qualities.qfa

The quality file should be a single fasta-style file with qualities instead of bases.

Running piim

Running './piim_r' by itself will produce a short list of the command-line parameters. The typical usage would be something like:

   ./piim_r -i example.xml -l lk_n50_t0.01_allsmooth -joint-est

Here, we take our input file (example.xml) and search for the maximum likelihood recombination rate using the likelihood in the specified table. Not that the maximum read depth in your input data should be no more than the maximum n in the likelihood table (otherwise those positions will be ignored and a warning issued). The output is the maximum likelihood parameter estimates separated by tabs (θ, ρ, R [growth rate], λ [gene-conversion parameter]):

  0.009375	0.0031875	0	500

Note that the λ parameter is fixed at 500 (never estimated) and the growth parameter can only estimated if using the -freqspec version of piim (in which case ρ is NOT estimated).

By default, piim uses the Nelder-Mead simplex algorithm to efficiently search parameter space. However, my implementation requires the caching all pairs of SNPs and uses the hard disk extensively. An alternative is to calculate the entire likelihood surface at fixed grid points in a single pass (option "-surface") and then interpolate to find the MLE. The former (default) is generally faster if you are not running anything else on the same computer that uses the hard disk extensively. However, the latter ("-surface") will be faster if you have a multi-core computer and are trying to run multiple instances simultaneously.

Details about command line parameters:

-i <filename> → [required] Reads input date from the supplied filename.
-lk <filename> → [required] Reads likelihood values from a table using the supplied filename.
-surface → calculate the likelihood surface in a single pass at a fixed array of grid point; MLE is calculated by interpolation from surface.
-model <gc|crossover> → use a gene-conversion type model (default) or a cross-over type model
-theta <> → specify θ to use when estimating ρ
-joint-est → First estimate θ using the frequency spectrum, then estimate ρ using that value of theta. Perhaps this would be better called "sequential-est".
-start <> → Only estimate starting from this position (default at beginning of assembly)
-stop <> → Only estimate up to this position (default at end of assembly)
Efficiency & heuristics
-bad-q <> → quality scores below this level are considered "bad" (default is 20)
-maxbad <> → allow no more than "maxbad" bases below the "bad" level in a given pair of sites by discarding the lowest qualities until we meet this criteria (default 5)
-small <> → ignore configurations with probability less that "small" times as likely as the most probable configuration (default 1e-4)
-mindist <> → only make pairs for every "mindist" site (default 5)
-site-incr <> → only make pairs for sites separated by at least "site-incr" bases (default 10)
-trust-obs → ignore quality scores.. not recommended unless you're really confident in your data! But it is a lot faster.
Version 1.0 options
-freqspec <growth|neutral> → essentially asks for version 1.0 of PIIM, which looks at the single-site frequency spectrum. NOTE: this older model assumes adjacent sites have independent evolutionary histories (i.e. undergo free recombination). Depending on the recombination rate in your microbial population, this assumption may be inapplicable. Non-independence will increase the variance of the parameter estimates and potentially lead to bias.
-fs-load <filename> → only relevant to -freqspec or -joint-est; load cached probabilities incorporating error from file
-fs-save <filename> → only relevant to -freqspec or -joint-est; save probabilities incorporating error to file for caching (if you wanted to rerun the same data through multiple times)

License

  Copyright 2009 The Regents of the University of California.  All
  rights reserved.
 
  IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT,
  INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
  LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
  DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY
  OF SUCH DAMAGE.
 
  REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING
  DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS
  IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT,
  UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 
  This software is distributed under the terms of Version 2 of the
  GNU General Public License as published by the Free Software
  Foundation.  If you have not received a copy of Version 2 of the
  GNU General Public License with this program, you may download the
  GPL from the following url: http://www.gnu.org/copyleft/gpl.html.