• Evolution has fascinated quantitative and physical scientists for decades: how can the random process of mutation, recombination, and duplication of genetic information generate the diversity of life? What determines the rate of evolution? Are there quantitative laws that govern and constrain evolution? Is evolution repeatable or predictable? Historically, the study of evolution involved classifying and comparing species, typically based on morphology. In addition to phenotypes on the organismal and molecular scales, we now use whole-genome sequencing not only to uncover the differences between species but also to characterize genetic diversity within species in unprecedented detail. This diversity can be compared to predictions of quantitative models of evolutionary dynamics. Here, we review key theoretical models of population genetics and evolution along with examples of data from lab evolution experiments, longitudinal sampling of viral populations, microbial communities and the studies of immune repertoires. In all these systems, evolution is shaped by often variable biological and physical environments. While these variable environments can be modeled implicitly in cases such as host-pathogen co-evolution, the dynamic environment and emerging ecology often cannot be ignored. Integrating dynamics on different scales, both in terms of observation and theoretical models, remains a major challenge towards a better understanding of evolution.
  • HIV-1 infection currently cannot be cured because the virus persists as integrated proviral DNA in long-lived cells despite years of suppressive antiretroviral therapy (ART). To characterize establishment, turnover, and evolution of viral DNA reservoirs we deep-sequenced the p17gag region of the HIV-1 genome from samples obtained after 3-18 years of suppressive ART from 10 patients. For each of these patients, whole genome deep-sequencing data of HIV-1 RNA populations before onset of ART were available from 6-12 longitudinal plasma samples spanning 5-8 years of untreated infection. This enabled a detailed analysis of the dynamics and origin of proviral DNA during ART. A median of 14% (range 0-42%) of the p17gag DNA sequences were overtly defective due to G-to-A hypermutation. The remaining sequences were remarkably similar to previously observed RNA sequences and showed no evidence of evolution over many years of suppressive ART. Most sequences from the DNA reservoirs were very similar to viruses actively replicating in plasma (RNA sequences) shortly before start of ART. The results do not support persistent HIV-1 replication as a mechanism to maintain the HIV-1 reservoir during suppressive therapy. Rather, the data indicate that viral DNA variants are turning over as long as patients are untreated and that suppressive ART halts this turnover.
  • Human seasonal influenza viruses evolve rapidly, enabling the virus population to evade immunity and re-infect previously infected individuals. Antigenic properties are largely determined by the surface glycoprotein hemagglutinin (HA) and amino acid substitutions at exposed epitope sites in HA mediate loss of recognition by antibodies. Here, we show that antigenic differences measured through serological assay data are well described by a sum of antigenic changes along the path connecting viruses in a phylogenetic tree. This mapping onto the tree allows prediction of antigenicity from HA sequence data alone. The mapping can further be used to make predictions about the makeup of the future seasonal influenza virus population, and we compare predictions between models with serological and sequence data. To make timely model output readily available, we developed a web browser based application that visualizes antigenic data on a continuously updated phylogeny.
  • Many microbial populations rapidly adapt to changing environments with multiple variants competing for survival. To quantify such complex evolutionary dynamics in vivo, time-resolved and genome-wide data including rare variants are essential. We performed whole-genome deep sequencing of HIV-1 populations in 9 untreated patients, with 6-12 longitudinal samples per patient spanning 5-8 years of infection. We show that patterns of minor diversity are reproducible between patients and mirror global HIV-1 diversity, suggesting a universal landscape of fitness costs that control diversity. Reversions towards the ancestral HIV-1 sequence are observed throughout infection and account for almost one third of all sequence changes. Reversion rates depend strongly on conservation. Frequent recombination limits linkage disequilibrium to about 100 bp in most of the genome, but strong hitch-hiking due to short-range linkage limits diversity.
  • Given a sample of genome sequences from an asexual population, can one predict its evolutionary future? Here we demonstrate that the branching patterns of reconstructed genealogical trees contain information about the relative fitness of the sampled sequences and that this information can be used to predict successful strains. Our approach is based on the assumption that evolution proceeds by accumulation of small-effect mutations, does not require species-specific input and can be applied to any asexual population under persistent selection pressure. We demonstrate its performance using historical data on seasonal influenza A/H3N2 virus. We predict the progenitor lineage of the upcoming influenza season with near optimal performance in 30% of cases and make informative predictions in 16 out of 19 years. Beyond providing a tool for prediction, our ability to make informative predictions implies persistent fitness variation among circulating influenza A/H3N2 viruses.
  • Cytotoxic T-lymphocytes (CTLs) recognize viral protein fragments displayed by major histocompatibility complex (MHC) molecules on the surface of virally infected cells and generate an anti-viral response that can kill the infected cells. Virus variants whose protein fragments are not efficiently presented on infected cells or whose fragments are presented but not recognized by CTLs therefore have a competitive advantage and spread rapidly through the population. We present a method that allows a more robust estimation of these escape rates from serially sampled sequence data. The proposed method accounts for competition between multiple escapes by explicitly modeling the accumulation of escape mutations and the stochastic effects of rare multiple mutants. Applying our method to serially sampled HIV sequence data, we estimate rates of HIV escape that are substantially larger than those previously reported. The method can be extended to complex escapes that require compensatory mutations. We expect our method to be applicable in other contexts such as cancer evolution where time series data is also available.
  • In sexual populations, selection operates neither on the whole genome, which is repeatedly taken apart and reassembled by recombination, nor on individual alleles that are tightly linked to the chromosomal neighborhood. The resulting interference between linked alleles reduces the efficiency of selection and distorts patterns of genetic diversity. Inference of evolutionary history from diversity shaped by linked selection requires an understanding of these patterns. Here, we present a simple but powerful scaling analysis identifying the unit of selection as the genomic "linkage block" with a characteristic length determined in a self-consistent manner by the condition that the rate of recombination within the block is comparable to the fitness differences between different alleles of the block. We find that an asexual model with the strength of selection tuned to that of the linkage block provides an excellent description of genetic diversity and the site frequency spectra when compared to computer simulations. This linkage block approximation is accurate for the entire spectrum of strength of selection and is particularly powerful in scenarios with many weakly selected loci. The latter limit allows us to characterize coalescence, genetic diversity, and the speed of adaptation in the infinitesimal model of quantitative genetics.
  • Pervasive natural selection can strongly influence observed patterns of genetic variation, but these effects remain poorly understood when multiple selected variants segregate in nearby regions of the genome. Classical population genetics fails to account for interference between linked mutations, which grows increasingly severe as the density of selected polymorphisms increases. Here, we describe a simple limit that emerges when interference is common, in which the fitness effects of individual mutations play a relatively minor role. Instead, molecular evolution is determined by the variance in fitness within the population, defined over an effectively asexual segment of the genome (a "linkage block"). We exploit this insensitivity in a new "coarse-grained" coalescent framework, which approximates the effects of many weakly selected mutations with a smaller number of strongly selected mutations with the same variance in fitness. This approximation generates accurate and efficient predictions for the genetic diversity that cannot be summarized by a simple reduction in effective population size. However, these results suggest a fundamental limit on our ability to resolve individual selection pressures from contemporary sequence data alone, since a wide range of parameters yield nearly identical patterns of sequence variability.
  • Intrapatient HIV-1 evolution is dominated by selection on the protein level in the arms race with the adaptive immune system. When cytotoxic CD8+ T-cells or neutralizing antibodies target a new epitope, the virus often escapes via nonsynonymous mutations that impair recognition. Synonymous mutations do not affect this interplay and are often assumed to be neutral. We analyze longitudinal intrapatient data from the C2-V5 part of the envelope gene (env) and observe that synonymous derived alleles rarely fix even though they often reach high frequencies in the viral population. We find that synonymous mutations that disrupt base pairs in RNA stems flanking the variable loops of gp120 are more likely to be lost than other synonymous changes, hinting at a direct fitness effect of these stem-loop structures in the HIV-1 RNA. Computational modeling indicates that these synonymous mutations have a (Malthusian) selection coefficient of the order of -0.002 and that they are brought up to high frequency by hitchhiking on neighboring beneficial nonsynonymous alleles. The patterns of fixation of nonsynonymous mutations estimated from the longitudinal data and comparisons with computer models suggest that escape mutations in C2-V5 are only transiently beneficial, either because the immune system is catching up or because of competition between equivalent escapes.
  • To learn about the past from a sample of genomic sequences, one needs to understand how evolutionary processes shape genetic diversity. Most population genetic inference is based on frameworks assuming adaptive evolution is rare. But if positive selection operates on many loci simultaneously, as has recently been suggested for many species including animals such as flies, a different approach is necessary. In this review, I discuss recent progress in characterizing and understanding evolution in rapidly adapting populations where random associations of mutations with genetic backgrounds of different fitness, i.e., genetic draft, dominate over genetic drift. As a result, neutral genetic diversity depends weakly on population size, but strongly on the rate of adaptation or more generally the variance in fitness. Coalescent processes with multiple mergers, rather than Kingman's coalescent, are appropriate genealogical models for rapidly adapting populations with important implications for population genetic inference.
  • The genetic diversity of a species is shaped by its recent evolutionary history and can be used to infer demographic events or selective sweeps. Most inference methods are based on the null hypothesis that natural selection is a weak or infrequent evolutionary force. However, many species, particularly pathogens, are under continuous pressure to adapt in response to changing environments. A statistical framework for inference from diversity data of such populations is currently lacking. Toward this goal, we explore the properties of genealogies in a model of continual adaptation in asexual populations. We show that lineages trace back to a small pool of highly fit ancestors, in which almost simultaneous coalescence of more than two lineages frequently occurs. While such multiple mergers are unlikely under the neutral coalescent, they create a unique genetic footprint in adapting populations. The site frequency spectrum of derived neutral alleles, for example, is non-monotonic and has a peak at high frequencies, whereas Tajima's D becomes more and more negative with increasing sample size. Since multiple merger coalescents emerge in many models of rapid adaptation, we argue that they should be considered as a null model for adapting populations.
  • The analysis of the evolutionary dynamics of a population with many polymorphic loci is challenging since a large number of possible genotypes needs to be tracked. In the absence of analytical solutions, forward computer simulations are an important tool in multi-locus population genetics. The run time of standard algorithms to simulate sexual populations increases as 8^L with the number L of loci, or with the square of the population size N. We have developed algorithms that allow simulating large populations with a run time that scales as 3^L. The algorithm is based on an analog of the Fast Fourier Transform (FFT) and allows for arbitrary fitness functions (i.e. any epistasis) and genetic maps. The algorithm is implemented as a collection of C++ classes and a Python interface.
  • Human immunodeficiency virus (HIV-1 or simply HIV) induces a persistent infection, which in the absence of treatment leads to AIDS and death in almost all infected individuals. HIV infection elicits a vigorous immune response starting about 2-3 weeks post infection that can lower the amount of virus in the body, but which cannot eradicate the virus. How HIV establishes a chronic infection in the face of a strong immune response remains poorly understood. It has been shown that HIV is able to rapidly change its proteins via mutation to evade recognition by virus-specific cytotoxic T lymphocytes (CTLs). Typically, an HIV-infected patient will generate 4-12 CTL responses specific for parts of viral proteins called epitopes. Such CTL responses lead to strong selective pressure to change the viral sequences encoding these epitopes so as to avoid CTL recognition. Here we review experimental data on HIV evolution in response to CTL pressure, mathematical models developed to explain this evolution, and highlight problems associated with the data and previous modeling efforts. We show that estimates of the strength of the epitope-specific CTL response depend on the method used to fit models to experimental data and on the assumptions made regarding how mutants are generated during infection. We illustrate that allowing CTL responses to decay over time may improve the fit to experimental data and provides higher estimates of the killing efficacy of HIV-specific CTLs. We also propose a novel method for simultaneously estimating the killing efficacy of multiple CTL populations specific for different epitopes of HIV using stochastic simulations. Lastly, we show that current estimates of the efficacy at which HIV-specific CTLs clear virus-infected cells can be improved by more frequent sampling of viral sequences and by combining data on sequence evolution with experimentally measured CTL dynamics.
  • In sexual populations, recombination reshuffles genetic variation and produces novel combinations of existing alleles, while selection amplifies the fittest genotypes in the population. If recombination is more rapid than selection, populations consist of a diverse mixture of many genotypes, as is observed in many populations. In the opposite regime, which is realized for example in facultatively sexual populations that outcross in only a fraction of reproductive cycles, selection can amplify individual genotypes into large clones. Such clones emerge when the fitness advantage of some of the genotypes is large enough that they grow to a significant fraction of the population despite being broken down by recombination. The occurrence of this "clonal condensation" depends, in addition to the outcrossing rate, on the heritability of fitness. Clonal condensation leads to a strong genetic heterogeneity of the population which is not adequately described by traditional population genetics measures, such as linkage disequilibrium. Here we point out the similarity between clonal condensation and the freezing transition in the Random Energy Model of spin glasses. Guided by this analogy we explicitly calculate the probability, Y, that two individuals are genetically identical as a function of the key parameters of the model. While Y is the analog of the spin-glass order parameter, it is also closely related to the rate of coalescence in population genetics: two individuals that are part of the same clone have a recent common ancestor.
  • Selective sweeps are typically associated with a local reduction of genetic diversity around the adaptive site. However, selective sweeps can also quickly carry neutral mutations to observable population frequencies if they arise early in a sweep and hitchhike with the adaptive allele. We show that the interplay between mutation and exponential amplification through hitchhiking results in a characteristic frequency spectrum of the resulting novel haplotype variation that depends only on the ratio of the mutation rate and the selection coefficient of the sweep. Based on this result, we develop an estimator for the selection coefficient driving a sweep. Since this estimator utilizes the novel variation arising from mutations during a sweep, it does not rely on preexisting variation and can also be applied to loci that lack recombination. Compared with standard approaches that infer selection coefficients from the size of dips in genetic diversity around the adaptive site, our estimator requires much shorter sequences but sampled at high population depth in order to capture low-frequency variants; given such data, it consistently outperforms standard approaches. We investigate analytically and numerically how the accuracy of our estimator is affected by the decay of the sweep pattern over time as a consequence of random genetic drift and discuss potential effects of recombination, soft sweeps, and demography. As an example for its use, we apply our estimator to deep sequencing data from HIV populations.
  • The accumulation of deleterious mutations is driven by rare fluctuations which lead to the loss of all mutation-free individuals, a process known as Muller's ratchet. Even though Muller's ratchet is a paradigmatic process in population genetics, a quantitative understanding of its rate is still lacking. The difficulty lies in the nontrivial nature of fluctuations in the fitness distribution which control the rate of extinction of the fittest genotype. We address this problem using the simple but classic model of mutation-selection balance with deleterious mutations all having the same effect on fitness. We show analytically how fluctuations among the fittest individuals propagate to individuals of lower fitness and have a dramatically amplified effect on the bulk of the population at a later time. If a reduction in the size of the fittest class reduces the mean fitness only after a delay, selection opposing this reduction is also delayed. This delayed restoring force speeds up Muller's ratchet. We show how the delayed response can be accounted for using a path integral formulation of the stochastic dynamics and provide an expression for the rate of the ratchet that is accurate across a broad range of parameters.
  • The vast majority of mutations are deleterious, and are eliminated by purifying selection. Yet in finite asexual populations, purifying selection cannot completely prevent the accumulation of deleterious mutations due to Muller's ratchet: once lost by stochastic drift, the most-fit class of genotypes is lost forever. If deleterious mutations are weakly selected, Muller's ratchet turns into a mutational "meltdown" leading to a rapid degradation of population fitness. Evidently, the long term stability of an asexual population requires an influx of beneficial mutations that continuously compensate for the accumulation of the weakly deleterious ones. Here we propose that the stable evolutionary state of a population in a static environment is a dynamic mutation-selection balance, where accumulation of deleterious mutations is on average offset by the influx of beneficial mutations. We argue that this state exists for any population size $N$ and mutation rate $U$. Assuming that beneficial and deleterious mutations have the same fitness effect $s$, we calculate the fraction of beneficial mutations, $\epsilon$, that maintains the balanced state. We find that a surprisingly low $\epsilon$ suffices to maintain stability, even in small populations in the face of high mutation rates and weak selection. This may explain the maintenance of mitochondria and other asexual genomes, and has implications for the expected statistics of genetic diversity in these populations.
  • We study a protein-DNA target search model with explicit DNA dynamics applicable to in vitro experiments. We show that the DNA dynamics plays a crucial role for the effectiveness of protein "jumps" between sites distant along the DNA contour but close in 3D space. A strongly binding protein that searches by 1D sliding and jumping alone explores the search space less redundantly when the DNA dynamics is fast on the timescale of protein jumps than in the opposite "frozen DNA" limit. We characterize the crossover between these limits using simulations and scaling theory. We also rationalize the slow exploration in the frozen limit as a subtle interplay between long jumps and long trapping times of the protein in "islands" within random DNA configurations in solution.
  • Adaptation often involves the acquisition of a large number of genomic changes which arise as mutations in single individuals. In asexual populations, combinations of mutations can fix only when they arise in the same lineage, but for populations in which genetic information is exchanged, beneficial mutations can arise in different individuals and be combined later. In large populations, when the product of the population size N and the total beneficial mutation rate U_b is large, many new beneficial alleles can be segregating in the population simultaneously. We calculate the rate of adaptation, v, in several models of such sexual populations and show that v is linear in NU_b only in sufficiently small populations. In large populations, v increases much more slowly as log NU_b. The prefactor of this logarithm, however, increases as the square of the recombination rate. This acceleration of adaptation by recombination implies a strong evolutionary advantage of sex.
  • The distribution and heritability of many traits depend on numerous loci in the genome. In general, the astronomical number of possible genotypes makes the system with large numbers of loci difficult to describe. Multilocus evolution, however, greatly simplifies in the limit of weak selection and frequent recombination. In this limit, populations rapidly reach Quasi-Linkage Equilibrium (QLE) in which the dynamics of the full genotype distribution, including correlations between alleles at different loci, can be parameterized by the allele frequencies. This review provides a simplified exposition of the concept and mathematics of QLE which is central to the statistical description of genotypes in sexual populations. We show how key results of Quantitative Genetics such as the generalized Fisher's "Fundamental Theorem", along with Wright's Adaptive Landscape, emerge within QLE from the dynamics of the genotype distribution. We then discuss under what circumstances QLE is applicable, and what the breakdown of QLE implies for the population structure and the dynamics of selection. Understanding of the fundamental aspects of multilocus evolution obtained through simplified models may be helpful in providing conceptual and computational tools to address the challenges arising in the studies of complex quantitative phenotypes of practical interest.
  • Large populations may contain numerous simultaneously segregating polymorphisms subject to natural selection. Since selection acts on individuals whose fitness depends on many loci, different loci affect each other's dynamics. This leads to stochastic fluctuations of allele frequencies above and beyond genetic drift, an effect known as genetic draft. Since recombination disrupts associations between alleles, draft is strong when recombination is rare. Here, we study a facultatively outcrossing population in a regime where the frequency of outcrossing and recombination, $r$, is small compared to the characteristic scale of fitness differences, $\sigma$. In this regime, fit genotypes expand clonally, leading to large fluctuations in the number of recombinant offspring genotypes. The power law tail in the distribution of the latter makes it impossible to capture the dynamics of draft by an effective neutral model. Instead, we find that the fixation time of a neutral allele increases only slowly with the population size but depends sensitively on the ratio $r/\sigma$. The efficacy of selection is reduced dramatically and alleles behave "quasi-neutrally" even for $Ns \gg 1$, provided that $|s| < s_c$, where $s_c$ depends strongly on $r/\sigma$, but only weakly on population size $N$. In addition, the anomalous fluctuations due to draft change the spectrum of (quasi-)neutral alleles from $f(\nu) \sim 1/\nu$, corresponding to drift, to $f(\nu) \sim 1/\nu^2$. Finally, draft accelerates the rate of two-step adaptations through deleterious intermediates.
  • Biochemical and regulatory interactions central to biological networks are expected to cause extensive genetic interactions or epistasis affecting the heritability of complex traits and the distribution of genotypes in populations. However, the inference of epistasis from the observed phenotype-genotype correlation is impeded by statistical difficulties, while the theoretical understanding of the effects of epistasis remains limited, in turn limiting our ability to interpret data. Of particular interest is the biologically relevant situation of numerous interacting genetic loci with small individual contributions to fitness. Here, we present a computational model of selection dynamics involving many epistatic loci in a recombining population. We demonstrate that a large number of polymorphic interacting loci can, despite frequent recombination, exhibit cooperative behavior that locks alleles into favorable genotypes leading to a population consisting of a set of competing clones. When the recombination rate exceeds a certain critical value that depends on the strength of epistasis, this "genotype selection" regime disappears in an abrupt transition, giving way to "allele selection", the regime where different loci are only weakly correlated as expected in sexually reproducing populations. We show that large populations attain highest fitness at a recombination rate just below critical. Clustering of interacting sets of genes on a chromosome leads to the emergence of an intermediate regime, where blocks of cooperating alleles lock into genetic modules. These haplotype blocks disappear in a second transition to pure allele selection. Our results demonstrate that the collective effect of many weak epistatic interactions can have dramatic effects on the population structure.
  • The evolutionary dynamics of HIV during the chronic phase of infection is driven by the host immune response and by selective pressures exerted through drug treatment. To understand and model the evolution of HIV quantitatively, the parameters governing genetic diversification and the strength of selection need to be known. While mutation rates can be measured in single replication cycles, the relevant effective recombination rate depends on the probability of coinfection of a cell with more than one virus and can only be inferred from population data. However, most population genetic estimators for recombination rates assume absence of selection and are hence of limited applicability to HIV, since positive and purifying selection are important in HIV evolution. Here, we estimate the rate of recombination and the distribution of selection coefficients from time-resolved sequence data tracking the evolution of HIV within single patients. By examining temporal changes in the genetic composition of the population, we estimate the effective recombination rate to be $r = 1.4 \times 10^{-5}$ recombinations per site and generation. Furthermore, we provide evidence that selection coefficients of at least 15% of the observed non-synonymous polymorphisms exceed 0.8% per generation. These results provide a basis for a more detailed understanding of the evolution of HIV. A particularly interesting case is evolution in response to drug treatment, where recombination can facilitate the rapid acquisition of multiple resistance mutations. With the methods developed here, more precise and more detailed studies will be possible as soon as data with higher time resolution and greater sample sizes are available.
  • Global physical properties of random media change qualitatively at a percolation threshold, where isolated clusters merge to form one infinite connected component. The precise knowledge of percolation thresholds is thus of paramount importance. For two-dimensional lattice graphs, we use the universal scaling form of the cluster size distributions to derive a relation between the mean Euler characteristic of the critical percolation patterns and the threshold density $p_c$. From this relation, we deduce a simple rule to estimate $p_c$, which is remarkably accurate. We present some evidence that similar relations might hold for continuum percolation and percolation in higher dimensions.
  • This is supplementary material for the article arXiv:0708.3250. We provide an alternative introduction of the mean Euler characteristic, additional examples and the percolation thresholds for 2-uniform lattices.
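
The influenza abstract above describes antigenic differences as a sum of antigenic changes along the path connecting two viruses in a phylogenetic tree. A minimal Python sketch of that bookkeeping is given here. The toy tree, the branch values, and the function names are all hypothetical illustrations, and the sketch leaves out anything the abstract does not state, such as how the branch contributions are fitted to serological data.

```python
# Hypothetical toy tree: child -> (parent, antigenic change on the branch above the child).
# Node names and numbers are illustrative only, not taken from any data set.
TREE = {
    "A": ("n1", 0.0),
    "B": ("n1", 1.5),
    "n1": ("root", 0.5),
    "C": ("root", 2.0),
}


def path_to_root(node, tree):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in tree:
        path.append(tree[path[-1]][0])
    return path


def antigenic_distance(a, b, tree):
    """Sum the branch contributions along the tree path connecting tips a and b."""
    anc_a = path_to_root(a, tree)
    anc_b = path_to_root(b, tree)
    anc_b_set = set(anc_b)
    # Most recent common ancestor: first ancestor of a that is also an ancestor of b.
    mrca = next(n for n in anc_a if n in anc_b_set)
    total = 0.0
    for path in (anc_a, anc_b):
        for node in path:
            if node == mrca:
                break
            total += tree[node][1]  # branch leading up from `node`
    return total
```

With this toy tree, tips A and B differ only by the branch above B, whereas A and C are separated by three branches whose contributions add up. Because the prediction for a new virus needs only its position in the tree, this additive form is what makes a sequence-only prediction possible in principle.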
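
The Muller's ratchet abstract above refers to the classic model of mutation-selection balance in which every deleterious mutation has the same fitness effect. As a rough illustration, and assuming the standard deterministic result for that model (a Poisson distribution of mutation load with mean U/s, usually attributed to Haigh), the expected class sizes can be computed as follows. The function name and the parameter values are hypothetical, and the sketch ignores the fluctuations that the abstract is actually about.

```python
import math


def class_sizes(N, U, s, k_max):
    """Expected number of individuals carrying k = 0..k_max deleterious mutations.

    Assumes the deterministic mutation-selection balance: Poisson with mean U/s.
    """
    lam = U / s
    return [N * math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_max + 1)]


# Hypothetical parameters: population size, genomic deleterious mutation rate, effect size.
sizes = class_sizes(N=1_000_000, U=0.1, s=0.01, k_max=50)
fittest_class = sizes[0]  # equals N * exp(-U/s)
```

The fittest class has expected size N exp(-U/s), which can be small even in a large population when U/s is large. That smallness is why stochastic loss of this class, one click of the ratchet, is governed by fluctuations rather than by the deterministic balance.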
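
To make the mean Euler characteristic in the two percolation entries above concrete, the following sketch evaluates it per site for site percolation on two lattices and locates its zero crossing. It assumes one common convention: each site is occupied independently with probability p, an edge counts when both of its end sites are occupied, and a face counts when all of its corner sites are occupied, so that the characteristic is vertices minus edges plus faces. Whether this matches the exact convention and estimation rule of the article is an assumption; the function names are hypothetical.

```python
def chi_square(p):
    """Mean Euler characteristic per site, square lattice (1 site, 2 edges, 1 face)."""
    return p - 2 * p**2 + p**4


def chi_triangular(p):
    """Mean Euler characteristic per site, triangular lattice (1 site, 3 edges, 2 faces)."""
    return p - 3 * p**2 + 2 * p**3


def zero_crossing(f, lo=0.1, hi=0.9, tol=1e-12):
    """Locate a sign change of f on [lo, hi] by bisection."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # sign change lies in the upper half
        else:
            hi = mid  # sign change lies in the lower half
    return 0.5 * (lo + hi)


p0_triangular = zero_crossing(chi_triangular)
p0_square = zero_crossing(chi_square)
```

Under these assumptions the triangular-lattice polynomial factorizes as p(1 - p)(1 - 2p), so its zero crossing sits at p = 1/2, which coincides with the exactly known site percolation threshold of that lattice. For the square lattice the zero crossing is at (sqrt(5) - 1)/2, about 0.618, somewhat above the numerically known site threshold of about 0.593, so the bare zero crossing is only a first approximation there.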