
Biologists have long sought a way to explain how statistical properties of
genetic sequences emerged and are maintained through evolution. On the one
hand, nonrandom structures at different scales indicate a complex genome
organisation. On the other hand, single-strand symmetry has been scrutinised
using neutral models in which correlations are not considered or irrelevant,
contrary to empirical evidence. Different studies investigated these two
statistical features separately, reaching minimal consensus despite sustained
efforts. Here we unravel previously unknown symmetries in genetic sequences,
which are organized hierarchically through scales in which nonrandom
structures are known to be present. These observations are confirmed through
the statistical analysis of the human genome and explained through a simple
domain model. These results suggest that domain models which account for the
cumulative action of mobile elements can explain simultaneously nonrandom
structures and symmetries in genetic sequences.

One of the main computational and scientific challenges in the modern age is
to extract useful information from unstructured texts. Topic models are one
popular machine-learning approach which infers the latent topical structure of
a collection of documents. Despite their success (in particular that of their
most widely used variant, Latent Dirichlet Allocation, LDA) and numerous
applications in sociology, history, and linguistics, topic models are known to
suffer from severe conceptual and practical problems, e.g. a lack of
justification for the Bayesian priors, discrepancies with statistical
properties of real texts, and the inability to properly choose the number of
topics. Here we obtain a fresh view on the problem of identifying topical
structures by relating it to the problem of finding communities in complex
networks. This is achieved by representing text corpora as bipartite networks
of documents and words. By adapting existing community-detection methods,
using a stochastic block model (SBM) with non-parametric priors, we obtain a
more versatile and principled framework for topic modeling (e.g., it
automatically detects the number of topics and hierarchically clusters both the
words and documents). The analysis of artificial and real corpora demonstrates
that our SBM approach leads to better topic models than LDA in terms of
statistical model selection. More importantly, our work shows how to formally
relate methods from community detection and topic modeling, opening the
possibility of cross-fertilization between these two fields.

We introduce a Monte Carlo algorithm to efficiently compute transport
properties of chaotic dynamical systems. Our method exploits the importance
sampling technique that favors trajectories in the tail of the distribution of
displacements, where deviations from a diffusive process are most prominent. We
search for initial conditions using a proposal that correlates states in the
Markov chain constructed via a Metropolis-Hastings algorithm. We show that our
method outperforms the direct sampling method and also Metropolis-Hastings
methods with alternative proposals. We test our general method through
numerical simulations in 1D (box-map) and 2D (Lorentz gas) systems.

We introduce and implement an importance-sampling Monte Carlo algorithm to
study systems of globally-coupled oscillators. Our computational method
efficiently obtains estimates of the tails of the distribution of various
measures of dynamical trajectories corresponding to states occurring with
(exponentially) small probabilities. We demonstrate the general validity of our
results by applying the method to two contrasting cases: the driven-dissipative
Kuramoto model, a paradigm in the study of spontaneous synchronization; and the
conservative Hamiltonian mean-field model, a prototypical system of long-range
interactions. We present results for the distribution of the finite-time
Lyapunov exponent and a time-averaged order parameter. Among other features,
our results show most notably that the distributions exhibit a vanishing
standard deviation but a skewness that is increasing in magnitude with the
number of oscillators, implying that nontrivial asymmetries and states
yielding rare/atypical values of the observables persist even for a large
number of oscillators.
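
As a rough illustration of the time-averaged order parameter mentioned above, the following minimal sketch integrates the standard mean-field Kuramoto equations with a plain Euler scheme and averages the order parameter over the second half of the run. It performs direct simulation only and is not the importance-sampling algorithm of the abstract. The function names, the Gaussian frequency distribution, and all parameter values are assumptions chosen for illustration.

```python
import cmath
import math
import random


def order_parameter(phases):
    """Return (r, psi): modulus and argument of the mean of exp(i*theta)."""
    z = sum(cmath.exp(1j * th) for th in phases) / len(phases)
    return abs(z), cmath.phase(z)


def time_averaged_order(n=50, coupling=2.0, dt=0.01, steps=2000, seed=1):
    """Euler integration of d(theta_i)/dt = w_i + K*r*sin(psi - theta_i).

    Natural frequencies w_i are drawn from a Gaussian (illustrative choice).
    Returns the order parameter r averaged over the second half of the run.
    """
    rng = random.Random(seed)
    omegas = [rng.gauss(0.0, 0.1) for _ in range(n)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    total, count = 0.0, 0
    for step in range(steps):
        r, psi = order_parameter(phases)
        phases = [th + dt * (w + coupling * r * math.sin(psi - th))
                  for th, w in zip(phases, omegas)]
        if step >= steps // 2:  # discard the transient
            total += r
            count += 1
    return total / count


print(time_averaged_order())
```

Repeating the run for many seeds gives a histogram of the time-averaged order parameter; direct sampling of this kind rarely reaches the exponentially unlikely tails that the abstract targets.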

We use an information-theoretic measure of linguistic similarity to
investigate the organization and evolution of scientific fields. An analysis of
almost 20M papers from the past three decades reveals that the linguistic
similarity is related to, but different from, expert and citation-based
classifications, leading to an improved view on the organization of science. A
temporal analysis of the similarity of fields shows that some fields (e.g.,
computer science) are becoming increasingly central, but that on average the
similarity between pairs has not changed in the last decades. This suggests
that tendencies of convergence (e.g., multidisciplinarity) and divergence
(e.g., specialization) of disciplines are in balance.

The competition for the attention of users is a central element of the
Internet. Crucial issues are the origin and predictability of big hits, the few
items that capture a big portion of the total attention. We address these
issues by analyzing 10 million time series of videos' views from YouTube. We find
that the average gain of views is linearly proportional to the number of views
a video already has, in agreement with usual rich-get-richer mechanisms and
Gibrat's law, but this fails to explain the prevalence of big hits. The reason
is that the fluctuations around the average views are themselves heavy tailed.
Based on these empirical observations, we propose a stochastic differential
equation with L\'evy noise as a model of the dynamics of videos. We show how
this model is substantially better at estimating the probability of an ordinary
item becoming a big hit, which is considerably underestimated in the
traditional proportional-growth models.

Finding and sampling rare trajectories in dynamical systems is a difficult
computational task underlying numerous problems and applications. In this paper
we show how to construct Metropolis-Hastings Monte Carlo methods that can
efficiently sample rare trajectories in the (extremely rough) phase space of
chaotic systems. As examples of our general framework we compute the
distribution of finite-time Lyapunov exponents (in different chaotic maps) and
the distribution of escape times (in transient-chaos problems). Our methods
sample exponentially rare states in a polynomial number of samples (in both low-
and high-dimensional systems). Open-source software that implements our
algorithms and reproduces our results can be found at
https://github.com/jorgecarleitao/chaospp

We show how generalized Gibbs-Shannon entropies can provide new insights into
the statistical properties of texts. The universal distribution of word
frequencies (Zipf's law) implies that the generalized entropies, computed at
the word level, are dominated by words in a specific range of frequencies. Here
we show that this is the case not only for the generalized entropies but also
for the generalized (Jensen-Shannon) divergences, used to compute the
similarity between different texts. This finding allows us to identify the
contribution of specific words (and word frequencies) to the different
generalized entropies and also to estimate the size of the databases needed to
obtain a reliable estimation of the divergences. We test our results in large
databases of books (from the Google n-gram database) and scientific papers
(indexed by Web of Science).
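
To make the quantities above concrete, here is a minimal sketch of a generalized entropy of order alpha and the corresponding Jensen-Shannon-type divergence between two texts. The abstract does not state the formula. The sketch assumes the common form H_alpha(p) = (sum_i p_i^alpha - 1) / (1 - alpha), which reduces to the Shannon entropy as alpha tends to 1. The function names and the toy texts are illustrative only.

```python
import math
from collections import Counter


def generalized_entropy(p, alpha):
    """Assumed form: (sum_i p_i**alpha - 1) / (1 - alpha); Shannon for alpha = 1."""
    if alpha == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (sum(x ** alpha for x in p if x > 0) - 1.0) / (1.0 - alpha)


def generalized_jsd(p, q, alpha):
    """Entropy of the equal-weight mixture minus the mean entropy of p and q."""
    mix = [(a + b) / 2.0 for a, b in zip(p, q)]
    return generalized_entropy(mix, alpha) - 0.5 * (
        generalized_entropy(p, alpha) + generalized_entropy(q, alpha)
    )


def aligned_frequencies(tokens_a, tokens_b):
    """Relative word frequencies of two texts over their joint vocabulary."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = sorted(set(count_a) | set(count_b))
    p = [count_a[w] / len(tokens_a) for w in vocabulary]
    q = [count_b[w] / len(tokens_b) for w in vocabulary]
    return p, q


p, q = aligned_frequencies("the cat sat on the mat".split(),
                           "the dog sat on the log".split())
for alpha in (0.5, 1, 2):
    print(alpha, generalized_jsd(p, q, alpha))
```

In this form, larger alpha weights frequent words more strongly (through p_i^alpha), and smaller alpha gives more weight to the many rare words.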

We investigate how textual properties of scientific papers relate to the
number of citations they receive. Our main finding is that correlations are
nonlinear and affect most-cited and typical papers differently. For instance,
we find that in most journals short titles correlate positively with citations
only for the most cited papers; for typical papers the correlation is in most
cases negative. Our analysis of 6 different factors, calculated both at the
title and abstract level of 4.3 million papers in over 1500 journals, reveals
the number of authors, and the length and complexity of the abstract, as having
the strongest (positive) influence on the number of citations.

Quantifying the similarity between symbolic sequences is a traditional
problem in Information Theory which requires comparing the frequencies of
symbols in different sequences. In numerous modern applications, ranging from
DNA over music to texts, the distribution of symbol frequencies is
characterized by heavy-tailed distributions (e.g., Zipf's law). The large
number of low-frequency symbols in these distributions poses major difficulties
to the estimation of the similarity between sequences, e.g., they hinder an
accurate finite-size estimation of entropies. Here we show analytically how the
systematic (bias) and statistical (fluctuations) errors in these estimations
depend on the sample size~$N$ and on the exponent~$\gamma$ of the heavy-tailed
distribution. Our results are valid for the Shannon entropy $(\alpha=1)$, its
corresponding similarity measures (e.g., the Jensen-Shannon divergence), and
also for measures based on the generalized entropy of order $\alpha$. For small
$\alpha$'s, including $\alpha=1$, the errors decay slower than the $1/N$-decay
observed in short-tailed distributions. For $\alpha$ larger than a critical
value $\alpha^* = 1+1/\gamma \leq 2$, the $1/N$-decay is recovered. We show the
practical significance of our results by quantifying the evolution of the
English language over the last two centuries using a complete $\alpha$-spectrum
of measures. We find that frequent words change more slowly than less frequent
words and that $\alpha=2$ provides the most robust measure to quantify language
change.

The statistical significance of network properties is conditioned on null
models which satisfy specified properties but are otherwise random.
Exponential random graph models are a principled theoretical framework to
generate such constrained ensembles, but they often fail in practice, either
due to model inconsistency, or due to the impossibility to sample networks from
them. These problems affect the important case of networks with prescribed
clustering coefficient or number of small connected subgraphs (motifs). In this
paper we use the Wang-Landau method to obtain a multicanonical sampling that
overcomes both these problems. We sample, in polynomial time, networks with
arbitrary degree sequences from ensembles with imposed motif counts. Applying
this method to social networks, we investigate the relation between
transitivity and homophily, and we quantify the correlation between different
types of motifs, finding that single motifs can explain up to 60% of the
variation of motif profiles.

We consider networks in which random walkers are removed because of the
failure of specific nodes. We interpret the rate of loss as a measure of the
importance of nodes, a notion we denote as failure-centrality. We show that the
degree of the node is not sufficient to determine this measure and that, in a
first approximation, the shortest loops through the node have to be taken into
account. We propose approximations of the failure-centrality which are valid
for temporally varying failures and we dwell on the possibility of externally
changing the relative importance of nodes in a given network, by exploiting the
interference between the loops of a node and the cycles of the temporal pattern
of failures. In the limit of long failure cycles we show analytically that the
escape in a node is larger than the one estimated from a stochastic failure
with the same failure probability. We test our general formalism in two
real-world networks (air-transportation and e-mail users) and show how
communities lead to deviations from predictions for failures in hubs.

We investigate the effects of random perturbations on fully chaotic open
systems. Perturbations can be applied to each trajectory independently (white
noise) or simultaneously to all trajectories (random map). We compare these two
scenarios by generalizing the theory of open chaotic systems and introducing a
time-dependent conditionally-map-invariant measure. For the same perturbation
strength we show that the escape rate of the random map is always larger than
that of the noisy map. In random maps we show that the escape rate $\kappa$ and
dimensions $D$ of the relevant fractal sets often depend non-monotonically on
the intensity of the random perturbation. We discuss the accuracy (bias) and
precision (variance) of finite-size estimators of $\kappa$ and $D$, and show
that the improvement of the precision of the estimations with the number of
trajectories $N$ is extremely slow ($\propto 1/\ln N$). We also argue that the
finite-size $D$ estimators are typically biased. General theoretical results
are combined with analytical calculations and numerical simulations in
area-preserving baker maps.

Zipf's law is just one out of many universal laws proposed to describe
statistical regularities in language. Here we review and critically discuss how
these laws can be statistically interpreted, fitted, and tested (falsified).
The modern availability of large databases of written text allows for tests
with an unprecedented statistical accuracy and also a characterization of the
fluctuations around the typical behavior. We find that fluctuations are usually
much larger than expected based on simplifying statistical assumptions (e.g.,
independence and lack of correlations between observations). These
simplifications appear also in usual statistical tests so that the large
fluctuations can be erroneously interpreted as a falsification of the law.
Instead, here we argue that linguistic laws are only meaningful (falsifiable)
if accompanied by a model for which the fluctuations can be computed (e.g., a
generative model of the text). The large fluctuations we report show that the
constraints imposed by linguistic laws on the creativity process of text
generation are not as tight as one could expect.

We investigate chaotic dynamical systems for which the intensity of
trajectories may grow without bound in time. We show that (i) the intensity grows
exponentially in time and is distributed spatially according to a fractal
measure with an information dimension smaller than that of the phase space, (ii)
such exploding cases can be described by an operator formalism similar to the
one applied to chaotic systems with absorption (decaying intensities), but
(iii) the invariant quantities characterizing explosion and absorption are
typically not directly related to each other, e.g., the decay rate and fractal
dimensions of absorbing maps typically differ from the ones computed in the
corresponding inverse (exploding) maps. We illustrate our general results
through numerical simulation in the cardioid billiard mimicking a lasing
optical cavity, and through analytical calculations in the baker map.

A clear signature of classical chaoticity in the quantum regime is the
fractal Weyl law, which connects the density of eigenstates to the dimension
$D_0$ of the classical invariant set of open systems. Quantum systems of
interest are often {\it partially} open (e.g., cavities in which trajectories
are partially reflected/absorbed). In the corresponding classical systems $D_0$
is trivial (equal to the phase-space dimension), and the fractality is
manifested in the (multifractal) spectrum of R\'enyi dimensions $D_q$. In this
paper we investigate the effect of such multifractality on the Weyl law. Our
numerical simulations in area-preserving maps show for a wide range of
configurations and system sizes $M$ that (i) the Weyl law is governed by a
dimension different from $D_0=2$ and (ii) the observed dimension oscillates as
a function of $M$ and other relevant parameters. We propose a classical model
which considers an undersampled measure of the chaotic invariant set, explains
our two observations, and predicts that the Weyl law is governed by a
nontrivial dimension $D_\mathrm{asymptotic} < D_0$ in the semiclassical limit
$M\rightarrow\infty$.

It is part of our daily social-media experience that seemingly ordinary items
(videos, news, publications, etc.) unexpectedly gain an enormous amount of
attention. Here we investigate how unexpected these events are. We propose a
method that, given some information on the items, quantifies the predictability
of events, i.e., the potential of identifying in advance the most successful
items, defined as the upper bound for the quality of any prediction based on the
same information. Applying this method to different data, ranging from views in
YouTube videos to posts in Usenet discussion groups, we invariably find that
the predictability increases for the most extreme events. This indicates that,
despite the inherently stochastic collective dynamics of users, efficient
prediction is possible for the most extreme events.

In this paper we combine statistical analysis of large text databases and
simple stochastic models to explain the appearance of scaling laws in the
statistics of word frequencies. Besides the sublinear scaling of the vocabulary
size with database size (Heaps' law), here we report a new scaling of the
fluctuations around this average (fluctuation scaling analysis). We explain
both scaling laws by modeling the usage of words by simple stochastic processes
in which the overall distribution of word frequencies is fat-tailed (Zipf's
law) and the frequency of a single word is subject to fluctuations across
documents (as in topic models). In this framework, the mean and the variance of
the vocabulary size can be expressed as quenched averages, implying that: i)
the inhomogeneous dissemination of words causes a reduction of the average
vocabulary size in comparison to the homogeneous case, and ii) correlations in
the co-occurrence of words lead to an increase in the variance and the
vocabulary size becomes a non-self-averaging quantity. We address the
implications of these observations to the measurement of lexical richness. We
test our results in three large text databases (Google-ngram, English
Wikipedia, and a collection of scientific articles).
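
The sublinear growth of the vocabulary (Heaps' law) can be reproduced with a minimal sketch of the homogeneous case only: tokens are drawn independently from a fixed Zipf-like distribution, without the topic-like fluctuations across documents that the model above adds. The function name, the finite number of word types, and the exponent are illustrative assumptions.

```python
import itertools
import random


def vocabulary_growth(n_tokens, n_types=100_000, gamma=1.0, seed=0):
    """Number of distinct words seen after each token.

    Tokens are drawn independently with probability proportional to
    rank**(-gamma) (a Zipf-like law over a finite set of word types).
    """
    rng = random.Random(seed)
    weights = [rank ** (-gamma) for rank in range(1, n_types + 1)]
    cumulative = list(itertools.accumulate(weights))
    tokens = rng.choices(range(n_types), cum_weights=cumulative, k=n_tokens)
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


growth = vocabulary_growth(100_000)
print(growth[9_999], growth[-1])  # vocabulary size at 10^4 and 10^5 tokens
```

Running the function with several seeds and comparing the vocabulary sizes at a fixed number of tokens gives the spread of this homogeneous baseline, against which the larger fluctuations reported above can be contrasted.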

It is well accepted that the adoption of innovations is described by S-curves
(slow start, accelerating period, and slow end). In this paper, we analyze how
much information on the dynamics of innovation spreading can be obtained from a
quantitative description of S-curves. We focus on the adoption of linguistic
innovations for which detailed databases of written texts from the last 200
years allow for an unprecedented statistical precision. Combining data analysis
with simulations of simple models (e.g., the Bass dynamics on complex networks)
we identify signatures of endogenous and exogenous factors in the S-curves of
adoption. We propose a measure to quantify the strength of these factors and
three different methods to estimate it from S-curves. We obtain cases in which
the exogenous factors are dominant (in the adoption of German orthographic
reforms and of one irregular verb) and cases in which endogenous factors are
dominant (in the adoption of conventions for romanization of Russian names and
in the regularization of most studied verbs). These results show that the shape
of S-curves is not universal and contains information on the adoption mechanism.
(published at "J. R. Soc. Interface, vol. 11, no. 101, (2014) 1044"; DOI:
http://dx.doi.org/10.1098/rsif.2014.1044)
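
As a minimal sketch of how exogenous and endogenous factors shape an S-curve, consider the well-mixed Bass equation d(rho)/dt = (a + b*rho)(1 - rho). This is a simplification: the abstract refers to Bass dynamics on complex networks, and the symbols a (exogenous rate) and b (endogenous rate) are names assumed here. The code integrates the equation with an Euler scheme and compares the result with the known closed-form solution for rho(0) = 0.

```python
import math


def bass_closed_form(t, a, b):
    """Closed-form solution of d(rho)/dt = (a + b*rho)*(1 - rho), rho(0) = 0."""
    decay = math.exp(-(a + b) * t)
    return (1.0 - decay) / (1.0 + (b / a) * decay)


def bass_euler(a, b, t_end=20.0, dt=0.001):
    """Euler integration of the same equation; returns rho at t = 0, dt, 2*dt, ..."""
    steps = int(round(t_end / dt))
    rho = 0.0
    curve = [rho]
    for _ in range(steps):
        rho += dt * (a + b * rho) * (1.0 - rho)
        curve.append(rho)
    return curve


# Mostly endogenous adoption (b >> a): slow start, then a rapid rise.
curve = bass_euler(a=0.01, b=1.0)
print(curve[5000], bass_closed_form(5.0, 0.01, 1.0))
```

In the limit b = 0 the curve becomes 1 - exp(-a*t), which has no slow start, while for a much smaller than b it approaches a symmetric logistic curve; the shape of the curve therefore carries information about the two rates.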

In this paper we investigate how the complexity of chaotic phase spaces
affects the efficiency of importance-sampling Monte Carlo simulations. We focus
on a flat-histogram simulation of the distribution of finite-time Lyapunov
exponents in a simple chaotic system and obtain analytically that the
computational effort of the simulation (i) scales polynomially with the
finite time, a tremendous improvement over the exponential scaling obtained in
usual uniform-sampling simulations; and (ii) has a suboptimal polynomial
scaling, a phenomenon known as critical slowing down. We show that critical
slowing down appears because of the limited possibilities to issue a local
proposal in the Monte Carlo procedure in chaotic systems. These results remain
valid in other methods and show how generic properties of chaotic systems limit
the efficiency of Monte Carlo simulations.

Motivated by applications in optics and acoustics we develop a
dynamical-system approach to describe absorption in chaotic systems. We
introduce an operator formalism from which we obtain (i) a general formula for
the escape rate $\kappa$ in terms of the natural conditionally-invariant
measure of the system; (ii) an increased multifractality when compared to the
spectrum of dimensions $D_q$ obtained without taking absorption and return
times into account; and (iii) a generalization of the Kantz-Grassberger formula
that expresses $D_1$ in terms of $\kappa$, the positive Lyapunov exponent, the
average return time, and a new quantity, the reflection rate. Simulations in
the cardioid billiard confirm these results.

There are numerous physical situations in which a hole or leak is introduced
in an otherwise closed chaotic system. The leak can have a natural origin, it
can mimic measurement devices, and it can also be used to reveal dynamical
properties of the closed system. In this paper we provide a unified treatment
of leaking systems and we review applications to different physical problems,
both in the classical and quantum pictures. Our treatment is based on the
transient chaos theory of open systems, which is essential because real leaks
have finite size and therefore estimations based on the closed system differ
essentially from observations. The field of applications reviewed is very
broad, ranging from planetary astronomy and hydrodynamical flows, to plasma
physics and quantum fidelity. The theory is expanded and adapted to the case of
partial leaks (partial absorption/transmission) with applications to room
acoustics and optical microcavities in mind. Simulations in the lima\c{c}on
family of billiards illustrate the main text. Regarding billiard dynamics, we
emphasize that a correct discrete time representation can only be given in
terms of the so-called true-time maps, while traditional Poincar\'e maps lead
to erroneous results. We generalize Perron-Frobenius-type operators so that
they describe true-time maps with partial leaks.

We propose a flat-histogram Monte Carlo method to efficiently sample fractal
landscapes such as escape time functions of open chaotic systems. This is
achieved by using a random-walk step which depends on the height of the
landscape via the largest Lyapunov exponent of the associated chaotic system.
By generalizing the Wang-Landau algorithm, we obtain a method which
simultaneously constructs the density of states (escape time distribution) and
the correct step-length distribution. As a result, averages are obtained in
polynomial computational time, a dramatic improvement over the exponential
scaling of traditional uniform sampling. Our results are not limited by the
dimensionality of the phase space and are confirmed numerically for dimensions
as large as 30.

We study the effect of spatial heterogeneity on the collective motion of
self-propelled particles (SPPs). The heterogeneity is modeled as a random
distribution of either static or diffusive obstacles, which the SPPs avoid
while trying to align their movements. We find that such obstacles have a
dramatic effect on the collective dynamics of usual SPP models. In particular,
we report the existence of an optimal (angular) noise amplitude that
maximizes collective motion. We also show that while at low obstacle densities
the system exhibits long-range order, in strongly heterogeneous media
collective motion is quasi-long-range and exists only for noise values
between two critical noise values, with the system being disordered at both
large and low noise amplitudes. Since most real systems have spatial
heterogeneities, the finding of an optimal noise intensity has immediate
practical and fundamental implications for the design and evolution of
collective motion strategies.

We propose a stochastic model for the number of different words in a given
database which incorporates the dependence on the database size and historical
changes. The main feature of our model is the existence of two different
classes of words: (i) a finite number of core-words which have higher frequency
and do not affect the probability of a new word to be used; and (ii) the
remaining virtually infinite number of noncore-words which have lower frequency
and once used reduce the probability of a new word to be used in the future.
Our model relies on a careful analysis of the Google-ngram database of books
published in the last centuries and its main consequence is the generalization
of Zipf's and Heaps' law to two scaling regimes. We confirm that these
generalizations yield the best simple description of the data among generic
descriptive models and that the two free parameters depend only on the language
but not on the database. From the point of view of our model the main change on
historical time scales is the composition of the specific words included in the
finite list of core-words, which we observe to decay exponentially in time with
a rate of approximately 30 words per year for English.