
Singlecell lineage tracking strategies enabled by recent experimental
technologies have produced significant insights into cell fate decisions, but
lack the quantitative framework necessary for rigorous statistical analysis of
mechanistic models describing cell division and differentiation. In this paper,
we develop such a framework with corresponding momentbased parameter
estimation techniques for continuoustime, multitype branching processes. Such
processes provide a probabilistic model of how cells divide and differentiate,
and we apply our method to study hematopoiesis, the mechanism of blood cell
production. We derive closedform expressions for higher moments in a general
class of such models. These analytical results allow us to efficiently estimate
parameters of much richer statistical models of hematopoiesis than those used
in previous statistical studies. After validating the methodology in simulation
studies, we apply our estimator to hematopoietic lineage tracking data from
rhesus macaques. Our analysis provides a more complete understanding of cell
fate decisions during hematopoiesis in nonhuman primates, which may be more
relevant to human biology and clinical strategies than previous findings from
murine studies. For example, in addition to previously estimated hematopoietic
stem cell selfrenewal rate, we are able to estimate fate decision
probabilities and to compare structurally distinct models of hematopoiesis
using cross validation. These estimates of fate decision probabilities and our
model selection results should help biologists compare competing hypotheses
about how progenitor cells differentiate. The methodology is transferrable to a
large class of stochastic compartmental models and multitype branching models,
commonly used in studies of cancer progression, epidemiology, and many other
fields.

By providing a framework of accounting for the shared ancestry inherent to
all life, phylogenetics is becoming the statistical foundation of biology. The
importance of model choice continues to grow as phylogenetic models continue to
increase in complexity to better capture micro and macroevolutionary processes.
In a Bayesian framework, the marginal likelihood is how data update our prior
beliefs about models, which gives us an intuitive measure of comparing model
fit that is grounded in probability theory. Given the rapid increase in the
number and complexity of phylogenetic models, methods for approximating
marginal likelihoods are increasingly important. Here we try to provide an
intuitive description of marginal likelihoods and why they are important in
Bayesian model testing. We also categorize and review methods for estimating
marginal likelihoods of phylogenetic models. In doing so, we use simulations to
evaluate the performance of one such method based on approximateBayesian
computation (ABC) and find that it is biased as predicted by theory.
Furthermore, we review some applications of marginal likelihoods to
phylogenetics, highlighting how they can be used to learn about models of
evolution from biological data. We conclude by discussing the challenges of
Bayesian model choice and future directions that promise to improve the
approximation of marginal likelihoods and Bayesian phylogenetics as a whole.

Modern biological techniques enable very dense genetic sampling of unfolding
evolutionary histories, and thus frequently sample some genotypes multiple
times. This motivates strategies to incorporate genotype abundance information
in phylogenetic inference. In this paper, we synthesize a stochastic process
model with standard sequencebased phylogenetic optimality, and show that tree
estimation is substantially improved by doing so. Our method is validated with
extensive simulations and an experimental singlecell lineage tracing study of
germinal center B cell receptor affinity maturation.

B cells develop high affinity receptors during the course of affinity
maturation, a cyclic process of mutation and selection. At the end of affinity
maturation, a number of cells sharing the same ancestor (i.e. in the same
"clonal family") are released from the germinal center, their amino acid
frequency profile reflects the allowed and disallowed substitutions at each
position. These clonalfamilyspecific frequency profiles, called "substitution
profiles", are useful for studying the course of affinity maturation as well as
for antibody engineering purposes. However, most often only a single sequence
is recovered from each clonal family in a sequencing experiment, making it
impossible to construct a clonalfamilyspecific substitution profile. Given
the public release of many highquality large B cell receptor datasets, one may
ask whether it is possible to use such data in a prediction model for
clonalfamilyspecific substitution profiles. In this paper, we present the
method "Substitution Profiles Using Related Families" (SPURF), a penalized
tensor regression framework that integrates information from a rich assemblage
of datasets to predict the clonalfamilyspecific substitution profile for any
single input sequence. Using this framework, we show that substitution profiles
from similar clonal families can be leveraged together with simulated
substitution profiles and germline gene sequence information to improve
prediction. We fit this model on a large public dataset and validate the
robustness of our approach on an external dataset. Furthermore, we provide a
commandline tool in an opensource software package
(https://github.com/krdav/SPURF) implementing these ideas and providing easy
prediction using our prefit models.

Birthdeath processes track the size of a univariate population, but many
biological systems involve interaction between populations, necessitating
models for two or more populations simultaneously. A lack of efficient methods
for evaluating finitetime transition probabilities of bivariate processes,
however, has restricted statistical inference in these models. Researchers rely
on computationally expensive methods such as matrix exponentiation or Monte
Carlo approximation, restricting likelihoodbased inference to small systems,
or indirect methods such as approximate Bayesian computation. In this paper, we
introduce the birth(death)/birthdeath process, a tractable bivariate extension
of the birthdeath process. We develop an efficient and robust algorithm to
calculate the transition probabilities of birth(death)/birthdeath processes
using a continued fraction representation of their Laplace transforms. Next, we
identify several exemplary models arising in molecular epidemiology,
macroparasite evolution, and infectious disease modeling that fall within this
class, and demonstrate advantages of our proposed method over existing
approaches to inference in these models. Notably, the ubiquitous stochastic
susceptibleinfectiousremoved (SIR) model falls within this class, and we
emphasize that computable transition probabilities newly enable direct
inference of parameters in the SIR model. We also propose a very fast method
for approximating the transition probabilities under the SIR model via a novel
branching process simplification, and compare it to the continued fraction
representation method with application to the 17th century plague in Eyam.
Although the two methods produce similar maximum a posteriori estimates, the
branching process approximation fails to capture the correlation structure in
the joint posterior distribution.

Stochastic epidemic models describe the dynamics of an epidemic as a disease
spreads through a population. Typically, only a fraction of cases are observed
at a set of discrete times. The absence of complete information about the time
evolution of an epidemic gives rise to a complicated latent variable problem in
which the state space size of the epidemic grows large as the population size
increases. This makes analytically integrating over the missing data infeasible
for populations of even moderate size. We present a data augmentation Markov
chain Monte Carlo (MCMC) framework for Bayesian estimation of stochastic
epidemic model parameters, in which measurements are augmented with
subjectlevel disease histories. In our MCMC algorithm, we propose each new
subjectlevel path, conditional on the data, using a timeinhomogeneous
continuoustime Markov process with rates determined by the infection histories
of other individuals. The method is general, and may be applied, with minimal
modifications, to a broad class of stochastic epidemic models. We present our
algorithm in the context of multiple stochastic epidemic models in which the
data are binomially sampled prevalence counts, and apply our method to data
from an outbreak of influenza in a British boarding school.

Stochastic mapping is a simulationbased method for probabilistically mapping
substitution histories onto phylogenies according to continuoustime Markov
models of evolution. This technique can be used to infer properties of the
evolutionary process on the phylogeny and, unlike parsimonybased mapping,
conditions on the observed data to randomly draw substitution mappings that do
not necessarily require the minimum number of events on a tree. Most stochastic
mapping applications simulate substitution mappings only to estimate the mean
and/or variance of two commonly used mapping summaries: the number of
particular types of substitutions (labeled substitution counts) and the time
spent in a particular group of states (labeled dwelling times) on the tree.
Fast, simulationfree algorithms for calculating the mean of stochastic mapping
summaries exist. Importantly, these algorithms scale linearly in the number of
tips/leaves of the phylogenetic tree. However, to our knowledge, no such
algorithm exists for calculating higherorder moments of stochastic mapping
summaries. We present one such simulationfree dynamic programming algorithm
that calculates prior and posterior mapping variances and scales linearly in
the number of phylogeny tips. Our procedure suggests a general framework that
can be used to efficiently compute higherorder moments of stochastic mapping
summaries without simulations. We demonstrate the usefulness of our algorithm
by extending previously developed statistical tests for rate variation across
sites and for detecting evolutionarily conserved regions in genomic sequences.

We present a locally adaptive nonparametric curve fitting method that
operates within a fully Bayesian framework. This method uses shrinkage priors
to induce sparsity in orderk differences in the latent trend function,
providing a combination of local adaptation and global control. Using a scale
mixture of normals representation of shrinkage priors, we make explicit
connections between our method and kth order Gaussian Markov random field
smoothing. We call the resulting processes shrinkage prior Markov random fields
(SPMRFs). We use Hamiltonian Monte Carlo to approximate the posterior
distribution of model parameters because this method provides superior
performance in the presence of the high dimensionality and strong parameter
correlations exhibited by our models. We compare the performance of three prior
formulations using simulated data and find the horseshoe prior provides the
best compromise between bias and precision. We apply SPMRF models to two
benchmark data examples frequently used to test nonparametric methods. We find
that this method is flexible enough to accommodate a variety of data generating
models and offers the adaptive properties and computational tractability to
make it a useful addition to the Bayesian nonparametric toolbox.

We introduce phylodyn, an R package for phylodynamic analysis based on gene
genealogies. The package main functionality is Bayesian nonparametric
estimation of effective population size fluctuations over time. Our
implementation includes several Markov chain Monte Carlobased methods and an
integrated nested Laplace approximationbased approach for phylodynamic
inference that have been developed in recent years. Genealogical data describe
the timed ancestral relationships of individuals sampled from a population of
interest. Here, individuals are assumed to be sampled at the same point in time
(isochronous sampling) or at different points in time (heterochronous
sampling); in addition, sampling events can be modeled with preferential
sampling, which means that the intensity of sampling events is allowed to
depend on the effective population size trajectory. We assume the coalescent
and the sequentially Markov coalescent processes as generative models of
genealogies. We include several coalescent simulation functions that are useful
for testing our phylodynamics methods via simulation studies. We compare the
performance and outputs of various methods implemented in phylodyn and outline
their strengths and weaknesses. R package phylodyn is available at
https://github.com/mdkarcher/phylodyn.

Phylodynamics seeks to estimate effective population size fluctuations from
molecular sequences of individuals sampled from a population of interest. One
way to accomplish this task formulates an observed sequence data likelihood
exploiting a coalescent model for the sampled individuals' genealogy and then
integrating over all possible genealogies via Monte Carlo or, less efficiently,
by conditioning on one genealogy estimated from the sequence data. However,
when analyzing sequences sampled serially through time, current methods
implicitly assume either that sampling times are fixed deterministically by the
data collection protocol or that their distribution does not depend on the size
of the population. Through simulation, we first show that, when sampling times
do probabilistically depend on effective population size, estimation methods
may be systematically biased. To correct for this deficiency, we propose a new
model that explicitly accounts for preferential sampling by modeling the
sampling times as an inhomogeneous Poisson process dependent on effective
population size. We demonstrate that in the presence of preferential sampling
our new model not only reduces bias, but also improves estimation precision.
Finally, we compare the performance of the currently used phylodynamic methods
with our proposed model through clinicallyrelevant, seasonal human influenza
examples.

The antibody repertoire of each individual is continuously updated by the
evolutionary process of B cell receptor mutation and selection. It has recently
become possible to gain detailed information concerning this process through
highthroughput sequencing. Here, we develop modern statistical molecular
evolution methods for the analysis of B cell sequence data, and then apply them
to a very deep shortread data set of B cell receptors. We find that the
substitution process is conserved across individuals but varies significantly
across gene segments. We investigate selection on B cell receptors using a
novel method that sidesteps the difficulties encountered by previous work in
differentiating between selection and motifdriven mutation; this is done
through stochastic mapping and empirical Bayes estimators that compare the
evolution of inframe and outofframe rearrangements. We use this new method
to derive a perresidue map of selection, which provides a more nuanced view of
the constraints on framework and variable regions.

Branching processes are a class of continuoustime Markov chains (CTMCs) with
ubiquitous applications. A general difficulty in statistical inference under
partially observed CTMC models arises in computing transition probabilities
when the discrete state space is large or uncountable. Classical methods such
as matrix exponentiation are infeasible for large or countably infinite state
spaces, and samplingbased alternatives are computationally intensive,
requiring a large integration step to impute over all possible hidden events.
Recent work has successfully applied generating function techniques to
computing transition probabilities for linear multitype branching processes.
While these techniques often require significantly fewer computations than
matrix exponentiation, they also become prohibitive in applications with large
populations. We propose a compressed sensing framework that significantly
accelerates the generating function method, decreasing computational cost up to
a logarithmic factor by only assuming the probability mass of transitions is
sparse. We demonstrate accurate and efficient transition probability
computations in branching process models for hematopoiesis and transposable
element evolution.

Despite seasonal cholera outbreaks in Bangladesh, little is known about the
relationship between environmental conditions and cholera cases. We seek to
develop a predictive model for cholera outbreaks in Bangladesh based on
environmental predictors. To do this, we estimate the contribution of
environmental variables, such as water depth and water temperature, to cholera
outbreaks in the context of a disease transmission model. We implement a method
which simultaneously accounts for disease dynamics and environmental variables
in a SusceptibleInfectedRecoveredSusceptible (SIRS) model. The entire system
is treated as a continuoustime hidden Markov model, where the hidden Markov
states are the numbers of people who are susceptible, infected, or recovered at
each time point, and the observed states are the numbers of cholera cases
reported. We use a Bayesian framework to fit this hidden SIRS model,
implementing particle Markov chain Monte Carlo methods to sample from the
posterior distribution of the environmental and transmission parameters given
the observed data. We test this method using both simulation and data from
Mathbaria, Bangladesh. Parameter estimates are used to make shortterm
predictions that capture the formation and decline of epidemic peaks. We
demonstrate that our model can successfully predict an increase in the number
of infected individuals in the population weeks before the observed number of
cholera cases increases, which could allow for early notification of an
epidemic and timely allocation of resources.

Phylodynamics focuses on the problem of reconstructing past population size
dynamics from current genetic samples taken from the population of interest.
This technique has been extensively used in many areas of biology, but is
particularly useful for studying the spread of quickly evolving infectious
diseases agents, e.g.,\ influenza virus. Phylodynamics inference uses a
coalescent model that defines a probability density for the genealogy of
randomly sampled individuals from the population. When we assume that such a
genealogy is known, the coalescent model, equipped with a Gaussian process
prior on population size trajectory, allows for nonparametric Bayesian
estimation of population size dynamics. While this approach is quite powerful,
large data sets collected during infectious disease surveillance challenge the
stateoftheart of Bayesian phylodynamics and demand computationally more
efficient inference framework. To satisfy this demand, we provide a
computationally efficient Bayesian inference framework based on Hamiltonian
Monte Carlo for coalescent process models. Moreover, we show that by splitting
the Hamiltonian function we can further improve the efficiency of this
approach. Using several simulated and real datasets, we show that our method
provides accurate estimates of population size dynamics and is substantially
faster than alternative methods based on elliptical slice sampler and
Metropolisadjusted Langevin algorithm.

Continuoustime birthdeathshift (BDS) processes are frequently used in
stochastic modeling, with many applications in ecology and epidemiology. In
particular, such processes can model evolutionary dynamics of transposable
elements  important genetic markers in molecular epidemiology. Estimation of
the effects of individual covariates on the birth, death, and shift rates of
the process can be accomplished by analyzing patient data, but inferring these
rates in a discretely and unevenly observed setting presents computational
challenges. We propose a mutlitype branching process approximation to BDS
processes and develop a corresponding expectation maximization (EM) algorithm,
where we use spectral techniques to reduce calculation of expected sufficient
statistics to low dimensional integration. These techniques yield an efficient
and robust optimization routine for inferring the rates of the BDS process, and
apply more broadly to multitype branching processes where rates can depend on
many covariates. After rigorously testing our methodology in simulation
studies, we apply our method to study intrapatient time evolution of IS6110
transposable element, a frequently used element during estimation of
epidemiological clusters of Mycobacterium tuberculosis infections.

When estimating a phylogeny from a multiple sequence alignment, researchers
often assume the absence of recombination. However, if recombination is
present, then tree estimation and all downstream analyses will be impacted,
because different segments of the sequence alignment support different
phylogenies. Similarly, convergent selective pressures at the molecular level
can also lead to phylogenetic tree incongruence across the sequence alignment.
Current methods for detection of phylogenetic incongruence are not equipped to
distinguish between these two different mechanisms and assume that the
incongruence is a result of recombination or other horizontal transfer of
genetic information. We propose a new recombination detection method that can
make this distinction, based on synonymous codon substitution distances.
Although some power is lost by discarding the information contained in the
nonsynonymous substitutions, our new method has lower false positive
probabilities than the comparable recombination detection method when the
phylogenetic incongruence signal is due to convergent evolution. We apply our
method to three empirical examples, where we analyze: 1) sequences from a
transmission network of the human immunodeficiency virus, 2) tlpB gene
sequences from a geographically diverse set of 38 Helicobacter pylori strains,
and 3) Hepatitis C virus sequences sampled longitudinally from one patient.

Phylogenetic stochastic mapping is a method for reconstructing the history of
trait changes on a phylogenetic tree relating species/organisms carrying the
trait. Stateoftheart methods assume that the trait evolves according to a
continuoustime Markov chain (CTMC) and work well for small state spaces. The
computations slow down considerably for larger state spaces (e.g. space of
codons), because current methodology relies on exponentiating CTMC
infinitesimal rate matrices  an operation whose computational complexity
grows as the size of the CTMC state space cubed. In this work, we introduce a
new approach, based on a CTMC technique called uniformization, that does not
use matrix exponentiation for phylogenetic stochastic mapping. Our method is
based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the
distribution of trait histories conditional on the trait data observed at the
tips of the tree. The computational complexity of our MCMC method grows as the
size of the CTMC state space squared. Moreover, in contrast to competing matrix
exponentiation methods, if the rate matrix is sparse, we can leverage this
sparsity and increase the computational efficiency of our algorithm further.
Using simulated data, we illustrate advantages of our MCMC algorithm and
investigate how large the state space needs to be for our method to outperform
matrix exponentiation approaches. We show that even on the moderately large
state space of codons our MCMC method can be significantly faster than
currently used matrix exponentiation methods.

Continuoustime linear birthdeathimmigration (BDI) processes are frequently
used in ecology and epidemiology to model stochastic dynamics of the population
of interest. In clinical settings, multiple birthdeath processes can describe
disease trajectories of individual patients, allowing for estimation of the
effects of individual covariates on the birth and death rates of the process.
Such estimation is usually accomplished by analyzing patient data collected at
unevenly spaced time points, referred to as panel data in the biostatistics
literature. Fitting linear BDI processes to panel data is a nontrivial
optimization problem because birth and death rates can be functions of many
parameters related to the covariates of interest. We propose a novel
expectationmaximization (EM) algorithm for fitting linear BDI models with
covariates to panel data. We derive a closedform expression for the joint
generating function of some of the BDI process statistics and use this
generating function to reduce the Estep of the EM algorithm, as well as
calculation of the Fisher information, to onedimensional integration. This
analytical technique yields a computationally efficient and robust optimization
algorithm that we implemented in an opensource R package. We apply our method
to DNA fingerprinting of Mycobacterium tuberculosis, the causative agent of
tuberculosis, to study intrapatient time evolution of IS6110 copy number, a
genetic marker frequently used during estimation of epidemiological clusters of
Mycobacterium tuberculosis infections. Our analysis reveals previously
undocumented differences in IS6110 birthdeath rates among three major lineages
of Mycobacterium tuberculosis, which has important implications for
epidemiologists that use IS6110 for DNA fingerprinting of Mycobacterium
tuberculosis.

The goal of phylodynamics, an area on the intersection of phylogenetics and
population genetics, is to reconstruct population size dynamics from genetic
data. Recently, a series of nonparametric Bayesian methods have been proposed
for such demographic reconstructions. These methods rely on prior
specifications based on Gaussian processes and proceed by approximating the
posterior distribution of population size trajectories via Markov chain Monte
Carlo (MCMC) methods. In this paper, we adapt an integrated nested Laplace
approximation (INLA), a recently proposed approximate Bayesian inference for
latent Gaussian models, to the estimation of population size trajectories. We
show that when a genealogy of sampled individuals can be reliably estimated
from genetic data, INLA enjoys high accuracy and can replace MCMC entirely. We
demonstrate significant computational efficiency over the stateoftheart MCMC
methods. We illustrate INLAbased population size inference using simulations
and genealogies of hepatitis C and human influenza viruses.

Changes in population size influence genetic diversity of the population and,
as a result, leave a signature of these changes in individual genomes in the
population. We are interested in the inverse problem of reconstructing past
population dynamics from genomic data. We start with a standard framework based
on the coalescent, a stochastic process that generates genealogies connecting
randomly sampled individuals from the population of interest. These genealogies
serve as a glue between the population demographic history and genomic
sequences. It turns out that only the times of genealogical lineage
coalescences contain information about population size dynamics. Viewing these
coalescent times as a point process, estimating population size trajectories is
equivalent to estimating a conditional intensity of this point process.
Therefore, our inverse problem is similar to estimating an inhomogeneous
Poisson process intensity function. We demonstrate how recent advances in
Gaussian processbased nonparametric inference for Poisson processes can be
extended to Bayesian nonparametric estimation of population size dynamics under
the coalescent. We compare our Gaussian process (GP) approach to one of the
state of the art Gaussian Markov random field (GMRF) methods for estimating
population trajectories. Using simulated data, we demonstrate that our method
has better accuracy and precision. Next, we analyze two genealogies
reconstructed from real sequences of hepatitis C and human Influenza A viruses.
In both cases, we recover more believed aspects of the viral demographic
histories than the GMRF approach. We also find that our GP method produces more
reasonable uncertainty estimates than the GMRF method.

Birthdeath processes (BDPs) are continuoustime Markov chains that track the
number of "particles" in a system over time. While widely used in population
biology, genetics and ecology, statistical inference of the instantaneous
particle birth and death rates remains largely limited to restrictive linear
BDPs in which perparticle birth and death rates are constant. Researchers
often observe the number of particles at discrete times, necessitating data
augmentation procedures such as expectationmaximization (EM) to find maximum
likelihood estimates. The Estep in the EM algorithm is available in
closedform for some linear BDPs, but otherwise previous work has resorted to
approximation or simulation. Remarkably, the Estep conditional expectations
can also be expressed as convolutions of computable transition probabilities
for any general BDP with arbitrary rates. This important observation, along
with a convenient continued fraction representation of the Laplace transforms
of the transition probabilities, allows novel and efficient computation of the
conditional expectations for all BDPs, eliminating the need for approximation
or costly simulation. We use this insight to derive EM algorithms that yield
maximum likelihood estimation for general BDPs characterized by various rate
models, including generalized linear models. We show that our Laplace
convolution technique outperforms competing methods when available and
demonstrate a technique to accelerate EM algorithm convergence. Finally, we
validate our approach using synthetic data and then apply our methods to
estimation of mutation parameters in microsatellite evolution.

Inference problems with incomplete observations often aim at estimating
population properties of unobserved quantities. One simple way to accomplish
this estimation is to impute the unobserved quantities of interest at the
individual level and then take an empirical average of the imputed values. We
show that this simple imputation estimator can provide partial protection
against model misspecification. We illustrate imputation estimators' robustness
to model specification on three examples: mixture modelbased clustering,
estimation of genotype frequencies in population genetics, and estimation of
Markovian evolutionary distances. In the final example, using a representative
model misspecification, we demonstrate that in nondegenerate cases, the
imputation estimator dominates the plugin estimate asymptotically. We conclude
by outlining a Bayesian implementation of the imputationbased estimation.