• We describe a strategy for constructing codes for DNA-based information storage by serial composition of weighted finite-state transducers. The resulting state machines can integrate correction of substitution errors; synchronization by interleaving watermark and periodic marker signals; conversion from binary to ternary, quaternary or mixed-radix sequences via an efficient block code; encoding into a DNA sequence that avoids homopolymer, dinucleotide, or trinucleotide runs and other short local repeats; and detection/correction of errors (including local duplications, burst deletions, and substitutions) that are characteristic of DNA sequencing technologies. We present software implementing these codes, available at github.com/ihh/dnastore, with simulation results demonstrating that the generated DNA is free of short repeats and can be accurately decoded even in the presence of substitutions, short duplications and deletions.
  • We present an extension of Felsenstein's algorithm to indel models defined on entire sequences, without the need to condition on one multiple alignment. The algorithm makes use of a generalization from probabilistic substitution matrices to weighted finite-state transducers. Our approach may equivalently be viewed as a probabilistic formulation of progressive multiple sequence alignment, using partial-order graphs to represent ensemble profiles of ancestral sequences. We present a hierarchical stochastic approximation technique which makes this algorithm tractable for alignment analyses of reasonable size.
  • Neutral models which assume ecological equivalence between species provide null models for community assembly. In Hubbell's Unified Neutral Theory of Biodiversity (UNTB), many local communities are connected to a single metacommunity through differing immigration rates. Our ability to fit the full multi-site UNTB has hitherto been limited by the lack of a computationally tractable and accurate algorithm. We show that a large class of neutral models with this mainland-island structure but differing local community dynamics converge in the large population limit to the hierarchical Dirichlet process. Using this approximation we developed an efficient Bayesian fitting strategy for the multi-site UNTB. We can also use this approach to distinguish between neutral local community assembly given a non-neutral metacommunity distribution and the full UNTB where the metacommunity too assembles neutrally. We applied this fitting strategy to both tropical trees and a data set comprising 570,851 sequences from 278 human gut microbiomes. The tropical tree data set was consistent with the UNTB but for the human gut neutrality was rejected at the whole community level. However, when we applied the algorithm to gut microbial species within the same taxon at different levels of taxonomic resolution, we found that species abundances within some genera were almost consistent with local community assembly. This was not true at higher taxonomic ranks. This suggests that the gut microbiota is more strongly niche constrained than macroscopic organisms, with different groups adopting different functional roles, but within those groups diversity may at least partially be maintained by neutrality.We also observed a negative correlation between body mass index and immigration rates within the family Ruminococcaceae.
  • Continuous-time linear birth-death-immigration (BDI) processes are frequently used in ecology and epidemiology to model stochastic dynamics of the population of interest. In clinical settings, multiple birth-death processes can describe disease trajectories of individual patients, allowing for estimation of the effects of individual covariates on the birth and death rates of the process. Such estimation is usually accomplished by analyzing patient data collected at unevenly spaced time points, referred to as panel data in the biostatistics literature. Fitting linear BDI processes to panel data is a nontrivial optimization problem because birth and death rates can be functions of many parameters related to the covariates of interest. We propose a novel expectation--maximization (EM) algorithm for fitting linear BDI models with covariates to panel data. We derive a closed-form expression for the joint generating function of some of the BDI process statistics and use this generating function to reduce the E-step of the EM algorithm, as well as calculation of the Fisher information, to one-dimensional integration. This analytical technique yields a computationally efficient and robust optimization algorithm that we implemented in an open-source R package. We apply our method to DNA fingerprinting of Mycobacterium tuberculosis, the causative agent of tuberculosis, to study intrapatient time evolution of IS6110 copy number, a genetic marker frequently used during estimation of epidemiological clusters of Mycobacterium tuberculosis infections. Our analysis reveals previously undocumented differences in IS6110 birth-death rates among three major lineages of Mycobacterium tuberculosis, which has important implications for epidemiologists that use IS6110 for DNA fingerprinting of Mycobacterium tuberculosis.
  • Modeling sequence evolution on phylogenetic trees is a useful technique in computational biology. Especially powerful are models which take account of the heterogeneous nature of sequence evolution according to the "grammar" of the encoded gene features. However, beyond a modest level of model complexity, manual coding of models becomes prohibitively labor-intensive. We demonstrate, via a set of case studies, the new built-in model-prototyping capabilities of XRate (macros and Scheme extensions). These features allow rapid implementation of phylogenetic models which would have previously been far more labor-intensive. XRate's new capabilities for lineage-specific models, ancestral sequence reconstruction, and improved annotation output are also discussed. XRate's flexible model-specification capabilities and computational efficiency make it well-suited to developing and prototyping phylogenetic grammar models. XRate is available as part of the DART software package: http://biowiki.org/DART .
  • The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes.