• Biologists have long sought a way to explain how statistical properties of genetic sequences emerged and are maintained through evolution. On the one hand, non-random structures at different scales indicate a complex genome organisation. On the other hand, single-strand symmetry has been scrutinised using neutral models in which correlations are not considered or irrelevant, contrary to empirical evidence. Different studies investigated these two statistical features separately, reaching minimal consensus despite sustained efforts. Here we unravel previously unknown symmetries in genetic sequences, which are organized hierarchically through scales in which non-random structures are known to be present. These observations are confirmed through the statistical analysis of the human genome and explained through a simple domain model. These results suggest that domain models which account for the cumulative action of mobile elements can explain simultaneously non-random structures and symmetries in genetic sequences.
  • One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success --- in particular that of their most widely used variant, Latent Dirichlet Allocation (LDA) --- and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods -- using a stochastic block model (SBM) with non-parametric priors -- we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.
  • We introduce a Monte Carlo algorithm to efficiently compute transport properties of chaotic dynamical systems. Our method exploits the importance sampling technique that favors trajectories in the tail of the distribution of displacements, where deviations from a diffusive process are most prominent. We search for initial conditions using a proposal that correlates states in the Markov chain constructed via a Metropolis-Hastings algorithm. We show that our method outperforms the direct sampling method and also Metropolis-Hastings methods with alternative proposals. We test our general method through numerical simulations in 1D (box-map) and 2D (Lorentz gas) systems.
  • We introduce and implement an importance-sampling Monte Carlo algorithm to study systems of globally-coupled oscillators. Our computational method efficiently obtains estimates of the tails of the distribution of various measures of dynamical trajectories corresponding to states occurring with (exponentially) small probabilities. We demonstrate the general validity of our results by applying the method to two contrasting cases: the driven-dissipative Kuramoto model, a paradigm in the study of spontaneous synchronization; and the conservative Hamiltonian mean-field model, a prototypical system of long-range interactions. We present results for the distribution of the finite-time Lyapunov exponent and a time-averaged order parameter. Among other features, our results show most notably that the distributions exhibit a vanishing standard deviation but a skewness that is increasing in magnitude with the number of oscillators, implying that non-trivial asymmetries and states yielding rare/atypical values of the observables persist even for a large number of oscillators.
  • We use an information-theoretic measure of linguistic similarity to investigate the organization and evolution of scientific fields. An analysis of almost 20M papers from the past three decades reveals that the linguistic similarity is related to, but different from, expert and citation-based classifications, leading to an improved view on the organization of science. A temporal analysis of the similarity of fields shows that some fields (e.g., computer science) are becoming increasingly central, but that on average the similarity between pairs has not changed in the last decades. This suggests that tendencies of convergence (e.g., multi-disciplinarity) and divergence (e.g., specialization) of disciplines are in balance.
  • The competition for the attention of users is a central element of the Internet. Crucial issues are the origin and predictability of big hits, the few items that capture a big portion of the total attention. We address these issues by analyzing 10 million time series of videos' views from YouTube. We find that the average gain of views is linearly proportional to the number of views a video already has, in agreement with usual rich-get-richer mechanisms and Gibrat's law, but this fails to explain the prevalence of big hits. The reason is that the fluctuations around the average views are themselves heavy tailed. Based on these empirical observations, we propose a stochastic differential equation with L\'evy noise as a model of the dynamics of videos. We show that this model is substantially better at estimating the probability of an ordinary item becoming a big hit, which is considerably underestimated in the traditional proportional-growth models.
  • Finding and sampling rare trajectories in dynamical systems is a difficult computational task underlying numerous problems and applications. In this paper we show how to construct Metropolis-Hastings Monte Carlo methods that can efficiently sample rare trajectories in the (extremely rough) phase space of chaotic systems. As examples of our general framework we compute the distribution of finite-time Lyapunov exponents (in different chaotic maps) and the distribution of escape times (in transient-chaos problems). Our methods sample exponentially rare states in a polynomial number of samples (in both low- and high-dimensional systems). Open-source software that implements our algorithms and reproduces our results can be found at https://github.com/jorgecarleitao/chaospp
  • We show how generalized Gibbs-Shannon entropies can provide new insights on the statistical properties of texts. The universal distribution of word frequencies (Zipf's law) implies that the generalized entropies, computed at the word level, are dominated by words in a specific range of frequencies. Here we show that this is the case not only for the generalized entropies but also for the generalized (Jensen-Shannon) divergences, used to compute the similarity between different texts. This finding allows us to identify the contribution of specific words (and word frequencies) for the different generalized entropies and also to estimate the size of the databases needed to obtain a reliable estimation of the divergences. We test our results in large databases of books (from the Google n-gram database) and scientific papers (indexed by Web of Science).
  • We investigate how textual properties of scientific papers relate to the number of citations they receive. Our main finding is that correlations are non-linear and affect most-cited and typical papers differently. For instance, we find that in most journals short titles correlate positively with citations only for the most cited papers; for typical papers the correlation is in most cases negative. Our analysis of 6 different factors, calculated both at the title and abstract level of 4.3 million papers in over 1500 journals, reveals the number of authors, and the length and complexity of the abstract, as having the strongest (positive) influence on the number of citations.
  • Quantifying the similarity between symbolic sequences is a traditional problem in Information Theory which requires comparing the frequencies of symbols in different sequences. In numerous modern applications, ranging from DNA over music to texts, the distribution of symbol frequencies is characterized by heavy-tailed distributions (e.g., Zipf's law). The large number of low-frequency symbols in these distributions poses major difficulties to the estimation of the similarity between sequences, e.g., they hinder an accurate finite-size estimation of entropies. Here we show analytically how the systematic (bias) and statistical (fluctuations) errors in these estimations depend on the sample size~$N$ and on the exponent~$\gamma$ of the heavy-tailed distribution. Our results are valid for the Shannon entropy $(\alpha=1)$, its corresponding similarity measures (e.g., the Jensen-Shannon divergence), and also for measures based on the generalized entropy of order $\alpha$. For small $\alpha$'s, including $\alpha=1$, the errors decay more slowly than the $1/N$-decay observed in short-tailed distributions. For $\alpha$ larger than a critical value $\alpha^* = 1+1/\gamma \leq 2$, the $1/N$-decay is recovered. We show the practical significance of our results by quantifying the evolution of the English language over the last two centuries using a complete $\alpha$-spectrum of measures. We find that frequent words change more slowly than less frequent words and that $\alpha=2$ provides the most robust measure to quantify language change.
  • The statistical significance of network properties is conditioned on null models which satisfy specified properties but that are otherwise random. Exponential random graph models are a principled theoretical framework to generate such constrained ensembles, but they often fail in practice, either due to model inconsistency or due to the impossibility of sampling networks from them. These problems affect the important case of networks with prescribed clustering coefficient or number of small connected subgraphs (motifs). In this paper we use the Wang-Landau method to obtain a multicanonical sampling that overcomes both these problems. We sample, in polynomial time, networks with arbitrary degree sequences from ensembles with imposed motif counts. Applying this method to social networks, we investigate the relation between transitivity and homophily, and we quantify the correlation between different types of motifs, finding that single motifs can explain up to 60% of the variation of motif profiles.
  • We consider networks in which random walkers are removed because of the failure of specific nodes. We interpret the rate of loss as a measure of the importance of nodes, a notion we denote as failure-centrality. We show that the degree of the node is not sufficient to determine this measure and that, in a first approximation, the shortest loops through the node have to be taken into account. We propose approximations of the failure-centrality which are valid for time-varying failures and we dwell on the possibility of externally changing the relative importance of nodes in a given network, by exploiting the interference between the loops of a node and the cycles of the temporal pattern of failures. In the limit of long failure cycles we show analytically that the escape in a node is larger than the one estimated from a stochastic failure with the same failure probability. We test our general formalism in two real-world networks (air-transportation and e-mail users) and show how communities lead to deviations from predictions for failures in hubs.
  • We investigate the effects of random perturbations on fully chaotic open systems. Perturbations can be applied to each trajectory independently (white noise) or simultaneously to all trajectories (random map). We compare these two scenarios by generalizing the theory of open chaotic systems and introducing a time-dependent conditionally-map-invariant measure. For the same perturbation strength we show that the escape rate of the random map is always larger than that of the noisy map. In random maps we show that the escape rate $\kappa$ and dimensions $D$ of the relevant fractal sets often depend nonmonotonically on the intensity of the random perturbation. We discuss the accuracy (bias) and precision (variance) of finite-size estimators of $\kappa$ and $D$, and show that the improvement of the precision of the estimations with the number of trajectories $N$ is extremely slow ($\propto 1/\ln N$). We also argue that the finite-size $D$ estimators are typically biased. General theoretical results are combined with analytical calculations and numerical simulations in area-preserving baker maps.
  • Zipf's law is just one out of many universal laws proposed to describe statistical regularities in language. Here we review and critically discuss how these laws can be statistically interpreted, fitted, and tested (falsified). The modern availability of large databases of written text allows for tests with an unprecedented statistical accuracy and also a characterization of the fluctuations around the typical behavior. We find that fluctuations are usually much larger than expected based on simplifying statistical assumptions (e.g., independence and lack of correlations between observations). These simplifications appear also in usual statistical tests so that the large fluctuations can be erroneously interpreted as a falsification of the law. Instead, here we argue that linguistic laws are only meaningful (falsifiable) if accompanied by a model for which the fluctuations can be computed (e.g., a generative model of the text). The large fluctuations we report show that the constraints imposed by linguistic laws on the creativity process of text generation are not as tight as one could expect.
  • We investigate chaotic dynamical systems for which the intensity of trajectories might grow without bound in time. We show that (i) the intensity grows exponentially in time and is distributed spatially according to a fractal measure with an information dimension smaller than that of the phase space, (ii) such exploding cases can be described by an operator formalism similar to the one applied to chaotic systems with absorption (decaying intensities), but (iii) the invariant quantities characterizing explosion and absorption are typically not directly related to each other, e.g., the decay rate and fractal dimensions of absorbing maps typically differ from the ones computed in the corresponding inverse (exploding) maps. We illustrate our general results through numerical simulation in the cardioid billiard mimicking a lasing optical cavity, and through analytical calculations in the baker map.
  • A clear signature of classical chaoticity in the quantum regime is the fractal Weyl law, which connects the density of eigenstates to the dimension $D_0$ of the classical invariant set of open systems. Quantum systems of interest are often {\it partially} open (e.g., cavities in which trajectories are partially reflected/absorbed). In the corresponding classical systems $D_0$ is trivial (equal to the phase-space dimension), and the fractality is manifested in the (multifractal) spectrum of R\'enyi dimensions $D_q$. In this paper we investigate the effect of such multifractality on the Weyl law. Our numerical simulations in area-preserving maps show for a wide range of configurations and system sizes $M$ that (i) the Weyl law is governed by a dimension different from $D_0=2$ and (ii) the observed dimension oscillates as a function of $M$ and other relevant parameters. We propose a classical model which considers an undersampled measure of the chaotic invariant set, explains our two observations, and predicts that the Weyl law is governed by a non-trivial dimension $D_\mathrm{asymptotic} < D_0$ in the semi-classical limit $M\rightarrow\infty$.
  • It is part of our daily social-media experience that seemingly ordinary items (videos, news, publications, etc.) unexpectedly gain an enormous amount of attention. Here we investigate how unexpected these events are. We propose a method that, given some information on the items, quantifies the predictability of events, i.e., the potential of identifying in advance the most successful items, defined as the upper bound for the quality of any prediction based on the same information. Applying this method to different data, ranging from views in YouTube videos to posts in Usenet discussion groups, we invariably find that the predictability increases for the most extreme events. This indicates that, despite the inherently stochastic collective dynamics of users, efficient prediction is possible for the most extreme events.
  • In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps' law), here we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words by simple stochastic processes in which the overall distribution of word-frequencies is fat tailed (Zipf's law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and ii) correlations in the co-occurrence of words lead to an increase in the variance and the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results in three large text databases (Google-ngram, English Wikipedia, and a collection of scientific articles).
  • It is well accepted that the adoption of innovations is described by S-curves (slow start, accelerating period, and slow end). In this paper, we analyze how much information on the dynamics of innovation spreading can be obtained from a quantitative description of S-curves. We focus on the adoption of linguistic innovations for which detailed databases of written texts from the last 200 years allow for an unprecedented statistical precision. Combining data analysis with simulations of simple models (e.g., the Bass dynamics on complex networks) we identify signatures of endogenous and exogenous factors in the S-curves of adoption. We propose a measure to quantify the strength of these factors and three different methods to estimate it from S-curves. We obtain cases in which the exogenous factors are dominant (in the adoption of German orthographic reforms and of one irregular verb) and cases in which endogenous factors are dominant (in the adoption of conventions for romanization of Russian names and in the regularization of most studied verbs). These results show that the shape of the S-curve is not universal and contains information on the adoption mechanism. (published at "J. R. Soc. Interface, vol. 11, no. 101, (2014) 1044"; DOI: http://dx.doi.org/10.1098/rsif.2014.1044)
  • In this paper we investigate how the complexity of chaotic phase spaces affects the efficiency of importance sampling Monte Carlo simulations. We focus on a flat-histogram simulation of the distribution of finite-time Lyapunov exponents in a simple chaotic system and obtain analytically that the computational effort of the simulation: (i) scales polynomially with the finite-time, a tremendous improvement over the exponential scaling obtained in usual uniform sampling simulations; and (ii) the polynomial scaling is sub-optimal, a phenomenon known as critical slowing down. We show that critical slowing down appears because of the limited possibilities to issue a local proposal on the Monte Carlo procedure in chaotic systems. These results remain valid in other methods and show how generic properties of chaotic systems limit the efficiency of Monte Carlo simulations.
  • Motivated by applications in optics and acoustics we develop a dynamical-system approach to describe absorption in chaotic systems. We introduce an operator formalism from which we obtain (i) a general formula for the escape rate $\kappa$ in terms of the natural conditionally-invariant measure of the system; (ii) an increased multifractality when compared to the spectrum of dimensions $D_q$ obtained without taking absorption and return times into account; and (iii) a generalization of the Kantz-Grassberger formula that expresses $D_1$ in terms of $\kappa$, the positive Lyapunov exponent, the average return time, and a new quantity, the reflection rate. Simulations in the cardioid billiard confirm these results.
  • There are numerous physical situations in which a hole or leak is introduced in an otherwise closed chaotic system. The leak can have a natural origin, it can mimic measurement devices, and it can also be used to reveal dynamical properties of the closed system. In this paper we provide a unified treatment of leaking systems and we review applications to different physical problems, both in the classical and quantum pictures. Our treatment is based on the transient chaos theory of open systems, which is essential because real leaks have finite size and therefore estimations based on the closed system differ essentially from observations. The field of applications reviewed is very broad, ranging from planetary astronomy and hydrodynamical flows, to plasma physics and quantum fidelity. The theory is expanded and adapted to the case of partial leaks (partial absorption/transmission) with applications to room acoustics and optical microcavities in mind. Simulations in the limaçon family of billiards illustrate the main text. Regarding billiard dynamics, we emphasize that a correct discrete time representation can only be given in terms of the so-called true-time maps, while traditional Poincar\'e maps lead to erroneous results. We generalize Perron-Frobenius-type operators so that they describe true-time maps with partial leaks.
  • We propose a flat-histogram Monte Carlo method to efficiently sample fractal landscapes such as escape time functions of open chaotic systems. This is achieved by using a random-walk step which depends on the height of the landscape via the largest Lyapunov exponent of the associated chaotic system. By generalizing the Wang-Landau algorithm, we obtain a method which simultaneously constructs the density of states (escape time distribution) and the correct step-length distribution. As a result, averages are obtained in polynomial computational time, a dramatic improvement over the exponential scaling of traditional uniform sampling. Our results are not limited by the dimensionality of the phase space and are confirmed numerically for dimensions as large as 30.
  • We study the effect of spatial heterogeneity on the collective motion of self-propelled particles (SPPs). The heterogeneity is modeled as a random distribution of either static or diffusive obstacles, which the SPPs avoid while trying to align their movements. We find that such obstacles have a dramatic effect on the collective dynamics of usual SPP models. In particular, we report the existence of an optimal (angular) noise amplitude that maximizes collective motion. We also show that while at low obstacle densities the system exhibits long-range order, in strongly heterogeneous media collective motion is quasi-long-range and exists only for noise values in between two critical noise values, with the system being disordered at both large and small noise amplitudes. Since most real systems have spatial heterogeneities, the finding of an optimal noise intensity has immediate practical and fundamental implications for the design and evolution of collective motion strategies.
  • We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words which have higher frequency and do not affect the probability of a new word to be used; and (ii) the remaining virtually infinite number of noncore-words which have lower frequency and once used reduce the probability of a new word to be used in the future. Our model relies on a careful analysis of the Google-ngram database of books published in the last centuries and its main consequence is the generalization of Zipf's and Heaps' laws to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language but not on the database. From the point of view of our model the main change on historical time scales is the composition of the specific words included in the finite list of core-words, which we observe to decay exponentially in time with a rate of approximately 30 words per year for English.
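
The dependence of the number of different words on database size, discussed in the last entry, can be illustrated with a minimal sketch. It draws tokens independently from a plain Zipf distribution and records how many distinct words have appeared so far. This is only a baseline, not the two-class core/noncore model of that paper; the function name `vocabulary_growth`, the finite vocabulary cutoff `n_types`, the exponent `gamma`, and the seed are all assumptions made for illustration.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def vocabulary_growth(n_tokens, n_types=10000, gamma=1.0, seed=0):
    """Return [V(1), ..., V(n_tokens)], the number of distinct word types
    seen after each token, for i.i.d. draws from a Zipf distribution.

    Hypothetical baseline: rank r (1..n_types) has weight r**(-gamma).
    """
    rng = random.Random(seed)
    # Unnormalised Zipf weights and their running sum (cumulative weights).
    weights = [rank ** -gamma for rank in range(1, n_types + 1)]
    cumulative = list(accumulate(weights))
    total = cumulative[-1]

    seen = set()
    growth = []
    for _ in range(n_tokens):
        # Inverse-CDF sampling: first index whose cumulative weight
        # reaches a uniform draw in [0, total).
        index = bisect_left(cumulative, rng.random() * total)
        seen.add(index)
        growth.append(len(seen))
    return growth


if __name__ == "__main__":
    curve = vocabulary_growth(10000)
    for n in (10, 100, 1000, 10000):
        print(n, curve[n - 1])
```

Under these assumptions the vocabulary grows more slowly than the number of tokens, which is the qualitative behaviour described by Heaps' law; fitting the two scaling regimes reported in the entry would require the actual model and data.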