We develop statistical and mathematical models to make sense of large-scale population genomic data at multiple levels. New types of data from collaborators inspire new types of theory and vice versa. Population genetics provides an incredible tool for uncovering the past and inferring the key times and places at which natural selection acted or demography changed.
Specific research areas:
- Investigating the dynamics of adaptive immune systems through the lens of evolution in:
- vertebrates, where the diversity of the T cell repertoire in a single individual evolves on both short timescales (during and shortly after infections/vaccinations) and long timescales (during aging).
- microbes, where the CRISPR adaptive immune system is widespread and apparently highly effective, yet viral pathogens (e.g., phages) persist.
- Inference of error and contamination in ancient DNA — how can we properly account for these challenges when making inferences based on population genetic theory? How can we use population genetic approaches to detect (or ideally, rule out) the presence of contamination?
- Understanding mutation rate evolution at both the level of the hominid phylogeny and the level of pathogens moving through our populations. How much variation do we see across the human genome? How does our epidemiological response to a pathogen affect its evolutionary trajectory?
All software is distributed under the GNU General Public License.
- PIIM: Population Inference In Metagenomics, with recombination (version 2)
This program calculates maximum likelihood estimates of θ=2Nu (where u is the per-site mutation rate) and ρ=2Nc (where c is the per-site rate of initiation of recombination). emt also reproduces the frequency-spectrum functionality from the previous version to estimate R=Nr (where r is the exponential growth rate).
Input data is genome-level population data of variable sample depth and quality (e.g. metagenomic data).
For details on the method, see:
Johnson, PLF and Slatkin M. 2009. "Inference of microbial recombination rates from metagenomic data." PLoS Genetics.
Previous version can be found here.
R package for approximate stochastic simulation of continuous-time Markov processes. It implements the adaptive tau-leaping algorithm of Cao et al. (2007), The Journal of Chemical Physics.
Think of differential equations forced to take integer values and allowing for stochastic effects at low numbers. Similar in spirit to GillespieSSA but a bazillion times faster (± a zillion) thanks to being implemented in C instead of pure R.
Download from CRAN.
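To make the "differential equations forced to take integer values" idea concrete, here is a minimal Python sketch of tau leaping for a simple birth-death process. It is an illustration only, under loud assumptions: it uses a fixed step size rather than the adaptive step selection of Cao et al., it is not the package's code or API, and the names `poisson` and `tau_leap` and their parameters are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def tau_leap(x0, birth, death, tau, t_end, seed=0):
    """Fixed-step tau leaping for a birth-death process.

    birth and death are per-capita rates. In each step of length tau, the
    number of events of each type is drawn from a Poisson distribution whose
    mean is (rate * current count * tau), so counts stay integer-valued.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    path = [(t, x)]
    while t < t_end and x > 0:
        births = poisson(rng, birth * x * tau)
        deaths = poisson(rng, death * x * tau)
        # Crude guard: a fixed step can overshoot below zero, which is one of
        # the problems that adaptive step selection is designed to avoid.
        x = max(x + births - deaths, 0)
        t += tau
        path.append((t, x))
    return path


if __name__ == "__main__":
    for t, x in tau_leap(x0=50, birth=1.0, death=1.1, tau=0.01, t_end=5.0)[::100]:
        print(f"{t:.2f}\t{x}")
```

The fixed step keeps the sketch short. An adaptive scheme instead picks each step so that rates change little within it, and falls back to exact event-by-event simulation when counts are low.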
Useful tools, BibTeX styles, etc.
Sometimes I feel like I spend most of my time shuffling data about and fighting with computers, so I've written many a tool to make my life easier. Perhaps these will be useful to someone else. I use Linux, so most tools will run on Mac without trouble, but Windows could be a headache.
All tools are distributed under the GNU General Public License. Give me a shout if you find a bug or if you find a tool particularly useful. The extent of documentation varies, but everything displays at least a brief usage statement if you run it without parameters.
- FASTA manipulation
- Scripts for manipulating fasta files in descending order of bugfreeness / awesomeness:
- FaIndex.pm -- Perl module that creates an index of sequences in fasta file(s) and uses it to extract subregions. Disk access is via memory mapping, which is extremely fast. Requires the File::Map package from CPAN.
- fa_extract_many -- very quickly extracts regions from fasta files using the above FaIndex.pm module (will look for module in standard directories and in ~/bin).
- fa_wrap -- wrap fasta sequence to specified width
- fa_length -- list sequence ids and lengths
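For readers unfamiliar with what these operations involve, the following Python sketch shows the kind of work that listing sequence lengths and re-wrapping sequence lines entails. It is not the code of fa_length or fa_wrap (whose implementation and options are not shown here). The function names and the default width of 60 are assumptions made for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def fasta_lengths(handle):
    """Return (sequence id, length) pairs; the id is the first word of the header."""
    result = []
    for header, seq in read_fasta(handle):
        fields = header.split()
        result.append((fields[0] if fields else "", len(seq)))
    return result


def wrap_fasta(handle, width=60):
    """Return the FASTA text with sequence lines re-wrapped to the given width."""
    out = []
    for header, seq in read_fasta(handle):
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    demo = ">seq1 test\nACGTAC\nGT\n>seq2\nAAA\n"
    print(fasta_lengths(io.StringIO(demo)))
    print(wrap_fasta(io.StringIO(demo), width=4), end="")
```

This version reads each sequence fully into memory, which is fine for small files. An indexed, memory-mapped approach such as the one FaIndex.pm describes avoids that cost for large genomes.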
- Improvements to standard bioinformatic tools
- UCSC liftOver is a great tool for mapping coordinates between genome assemblies, but the command line version is ridiculously slow if you have many isolated coordinates. Two scripts: sortChains performs a one-time inelegant, slow sort of the over.chain files. Then you can use fastLift to perform fast coordinate conversion on the sorted chain file.
- ms patch that:
- outputs the position of segregating sites with higher precision (8 instead of 4 decimal places)
- changes the random number seed to use /dev/random instead of a file. The seed file can be Bad News if you're running in parallel on a shared filesystem.
To apply, run patch -p0 < ms.patch from the directory that contains "msdir".
- LDhat patch that:
- adds -oSites and -oLoc command line options to convert (original hardcoded filenames "sites.txt" and "locs.txt")
- fixes a 1-byte read off the end of an array in "pairdip.c"
- silences a few compiler warnings
To apply, run patch -p0 < LDhat.patch from the directory that contains the "LDhat" directory.
- Flat file manipulation
- FF_Index.pm -- A clever (if I do say so myself) Perl module for indexing flat files for quick data retrieval. Crucially, this is easy to use and creates a separate index file instead of mucking about with the original file.
- groupby -- approximates "group by" functionality of SQL, but takes tab-delimited flat files with one line per record (must already be sorted according to grouping keys).
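The reason the input must already be sorted is that a streaming "group by" only compares each record with the one before it, so it never needs to hold more than one group in memory. The Python sketch below illustrates that idea using summation as the aggregate. It is not the groupby script itself, whose options are not shown here. The function name `group_sum` and its parameters are hypothetical.

```python
import io
from itertools import groupby


def group_sum(lines, key_cols, value_col):
    """Sum value_col within each run of records sharing the same key columns.

    lines is an iterable of tab-delimited records, one per line, already
    sorted by the key columns. Columns are 0-based indices.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())

    def key_of(row):
        return tuple(row[c] for c in key_cols)

    # itertools.groupby starts a new group whenever the key changes, which
    # matches SQL "group by" only when equal keys are adjacent (sorted input).
    for key, group in groupby(rows, key=key_of):
        yield key, sum(float(row[value_col]) for row in group)


if __name__ == "__main__":
    data = "chr1\tA\t1\nchr1\tA\t2.5\nchr1\tB\t4\nchr2\tA\t1\n"
    for key, total in group_sum(io.StringIO(data), key_cols=[0, 1], value_col=2):
        print("\t".join(key), total, sep="\t")
```

If the input were not sorted, the same key would appear in several separate groups instead of raising an error, so sorting first (for example with the standard sort utility) is the caller's responsibility.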
- Queueing scripts
- Condor provides an elegant queueing system for running programs on a cluster of machines (either dedicated compute nodes or temporarily unused desktops). However, the supplied interface makes submitting jobs a pain. Submitting should be as easy as supplying the exact same command line that you would use if executing locally, i.e.:
./my_program -f some_options > my_output
qsub './my_program -f some_options > my_output'
I have a suite of scripts that does exactly this for Condor.
- devEMF is an R package that provides an EMF (enhanced metafile) graphics driver to make producing EMF graphics as easy as EPS/PDF/PNG/etc. EMF is a vector based format, so it will always look good no matter how much you enlarge it. I wrote this driver out of frustration with both LibreOffice and Microsoft Office's lousy importation of EPS graphics (they both import EMF files seamlessly).
- BibTeX style files (bst) for biology journals
Philip Johnson (PI)
Assistant professor in the Department of Biology, excited about all the projects below and more that he hasn't found the time to work on. His background is a mix of biophysics, computational biology, theoretical population genetics and mathematical immunology. For the gory details, see his CV.
Arvind Jaya Shankar (CBBG grad student)
Co-advised by Sridhar Hannenhalli at the NIH-NCI in the Cancer Data Science Laboratory. Interested in topics at the intersection of cancer genomics and immunology.
Thomas Pranzatelli (CBBG grad student)
Co-advised by John Chiorini at the NIH/NIDCR. Thomas specializes in gene expression regulation through changes in chromatin accessibility to transcription factors. He currently works on an autoimmune disease of the salivary glands. Other scientific interests of his include evolution of life history strategies and information-theoretic approaches to aging.
Nick Rachmaninoff (CBBG grad student)
Co-advised by John Tsang at the NIH/NIAID. Using transcriptomics, proteomics, and epigenomics, Nick is interested in studying how human immune systems respond to perturbations ranging from monogenic immune disorders to childhood development.
Shauna Rasband (BEES grad student)
Co-advised by Michael Braun at the Smithsonian National Museum of Natural History. Broadly interested in systematics/genomics/ecology applied to avians.
Wei Xiao (CBBG grad student)
Wei is intrigued by the ecology and evolution of microbial immune systems, with a focus on CRISPR.
Hao Yiu (BEES grad student)
Hao seeks to use empirical and theoretical methods to understand adaptive immune systems in the context of evolution.
- Jake Weissman (PhD fall 2019), Simons Foundation postdoc at UCLA
- Aidan Bissell-Siders (undergrad), 2017-2019
- Vinay Velovolu (undergrad), 2018-2019
- Rohan Laljani (undergrad), 2017-2019
- Brian Liu (undergrad), 2016-2017
Interested in applying quantitative methods to biological problems? Contact Philip! Potential graduate students should look into applying to the BEES or CBBG concentration areas within the BISI