
Information theory is a mathematical theory of learning with deep connections
with topics as diverse as artificial intelligence, statistical physics, and
biological evolution. Many primers on information theory paint a broad picture
with relatively little mathematical sophistication, while many others develop
specific application areas in detail. In contrast, these informal notes aim to
outline some elements of the informationtheoretic "way of thinking," by
cutting a rapid and interesting path through some of the theory's foundational
concepts and results. They are aimed at practicing systems scientists who are
interested in exploring potential connections between information theory and
their own fields. The main mathematical prerequisite for the notes is comfort
with elementary probability, including sample spaces, conditioning, and
expectations. We take the KullbackLeibler divergence as our most basic
concept, and then proceed to develop the entropy and mutual information. We
discuss some of the main results, including the Chernoff bounds as a
characterization of the divergence; Gibbs' Theorem; and the Data Processing
Inequality. A recurring theme is that the definitions of information theory
support natural theorems that sound ``obvious'' when translated into English.
More pithily, ``information theory makes common sense precise.'' Since the
focus of the notes is not primarily on technical details, proofs are provided
only where the relevant techniques are illustrative of broader themes.
Otherwise, proofs and intriguing tangents are referenced in liberallysprinkled
footnotes. The notes close with a highly nonexhaustive list of references to
resources and other perspectives on the field.

Silicon drift detectors (SDDs) revolutionized spectroscopy in fields as
diverse as geology and dentistry. For a subset of experiments at ultrafast,
xray freeelectron lasers (FELs), SDDs can make substantial contributions.
Often the unknown spectrum is interesting, carrying science data, or the
background measurement is useful to identify unexpected signals. Many
measurements involve only several discrete photon energies known a priori,
allowing single event decomposition of pileup and spectroscopic photon
counting. We designed a pulse function and demonstrated that the signal
amplitude and rise time are obtained for each pulse by fitting, thus removing
the need for pulse shaping. By avoiding pulse shaping, rise times of tens of
nanoseconds resulted in reduced pulse pileup and allowed decomposition of
remaining pulse pileup at photon separation times down to hundreds of
nanoseconds while yielding timeofarrival information with precision of 10
nanoseconds. Waveform fitting yields simultaneously high energy resolution and
high counting rates (2 orders of magnitude higher than current digital pulse
processors). We showed that pileup spectrum fitting is relatively simple and
preferable to pileup spectrum deconvolution. We developed a photon pileup
statistical model for constant intensity sources, extended it to variable
intensity sources (typical for FELs) and used it to fit a complex pileup
spectrum. We subsequently developed a Bayesian pileup decomposition method
that allows decomposing pileup of single events with up to 6 photons from 6
monochromatic lines with 99% accuracy. The usefulness of SDDs will continue
into the xray FEL era of science. Their successors, the ePixS hybrid pixel
detectors, already offer hundreds of pixels, each with similar performance to
an SDD, in a compact, robust and affordable package

Cryoelectron microscopy provides 2D projection images of the 3D electron
scattering intensity of many instances of the particle under study (e.g., a
virus). Both symmetry (rotational point groups) and heterogeneity are important
aspects of biological particles and both aspects can be combined by describing
the electron scattering intensity of the particle as a stochastic process with
a symmetric probability law and therefore symmetric moments. A maximum
likelihood estimator implemented by an expectationmaximization algorithm is
described which estimates the unknown statistics of the electron scattering
intensity stochastic process from images of instances of the particle. The
algorithm is demonstrated on the bacteriophage HK97 and the virus N$\omega$V.
The results are contrasted with existing algorithms which assume that each
instance of the particle has the symmetry rather than the less restrictive
assumption that the probability law has the symmetry.

The "backward simulation" of a stochastic process is defined as the
stochastic dynamics that trace a timereversed path from the target region to
the initial configuration. If the probabilities calculated by the original
simulation are easily restored from those obtained by backward dynamics, we can
use it as a computational tool. It is shown that the naive approach to backward
simulation does not work as expected. As a remedy, the Time Reverse Monte Carlo
method (TRMC) based on the ideas of Sequential Importance Sampling (SIS) and
Sequential Monte Carlo (SMC) is proposed and successfully tested with a
stochastic typhoon model and the Lorenz 96 model. TRMC with SMC, which contains
resampling steps, is shown to be more efficient for simulations with a larger
number of time steps. A limitation of TRMC and its relation to the Bayes
formula are also discussed.

We introduce an infinite family of quantifiers of quantum correlations beyond
entanglement which vanish on both classicalquantum and quantumclassical
states and are in onetoone correspondence with the metricadjusted skew
informations. The `quantum $f$correlations' are defined as the maximum
metricadjusted $f$correlations between pairs of local observables with the
same fixed equispaced spectrum. We show that these quantifiers are entanglement
monotones when restricted to pure states of qubitqudit systems. We also
evaluate the quantum $f$correlations in closed form for twoqubit systems and
discuss their behaviour under local commutativity preserving channels. We
finally provide a physical interpretation for the quantifier corresponding to
the average of the WignerYanaseDyson skew informations.

We present a Bayesian analysis of the epistemology of analogue experiments
with particular reference to Hawking radiation. First, we prove that such
experiments can be confirmatory in Bayesian terms based upon appeal to
'universality arguments'. Second, we provide a formal model for the scaling
behaviour of the confirmation measure for multiple distinct realisations of the
analogue system and isolate a generic saturation feature. Finally, we
demonstrate that different potential analogue realisations could provide
different levels of confirmation. Our results provide a basis both to formalise
the epistemic value of analogue experiments that have been conducted and to
advise scientists as to the respective epistemic value of future analogue
experiments.

Manifoldlearning techniques are routinely used in mining complex
spatiotemporal data to extract useful, parsimonious data
representations/parametrizations; these are, in turn, useful in nonlinear model
identification tasks. We focus here on the case of time series data that can
ultimately be modelled as a spatially distributed system (e.g. a partial
differential equation, PDE), but where we do not know the space in which this
PDE should be formulated. Hence, even the spatial coordinates for the
distributed system themselves need to be identified  to emerge from  the data
mining process. We will first validate this emergent space reconstruction for
time series sampled without space labels in known PDEs; this brings up the
issue of observability of physical space from temporal observation data, and
the transition from spatially resolved to lumped (orderparameterbased)
representations by tuning the scale of the data mining kernels. We will then
present actual emergent space discovery illustrations. Our illustrative
examples include chimera states (states of coexisting coherent and incoherent
dynamics), and chaotic as well as quasiperiodic spatiotemporal dynamics,
arising in partial differential equations and/or in heterogeneous networks. We
also discuss how datadriven spatial coordinates can be extracted in ways
invariant to the nature of the measuring instrument. Such gaugeinvariant data
mining can go beyond the fusion of heterogeneous observations of the same
system, to the possible matching of apparently different systems.

We study whether a depth two neural network can learn another depth two
network using gradient descent. Assuming a linear output node, we show that the
question of whether gradient descent converges to the target function is
equivalent to the following question in electrodynamics: Given $k$ fixed
protons in $\mathbb{R}^d,$ and $k$ electrons, each moving due to the attractive
force from the protons and repulsive force from the remaining electrons,
whether at equilibrium all the electrons will be matched up with the protons,
up to a permutation. Under the standard electrical force, this follows from the
classic Earnshaw's theorem. In our setting, the force is determined by the
activation function and the input distribution. Building on this equivalence,
we prove the existence of an activation function such that gradient descent
learns at least one of the hidden nodes in the target network. Iterating, we
show that gradient descent can be used to learn the entire network one node at
a time.

This chapter provides a selfcontained introduction to the use of Bayesian
inference to extract largescale modular structures from network data, based on
the stochastic blockmodel (SBM), as well as its degreecorrected and
overlapping generalizations. We focus on nonparametric formulations that allow
their inference in a manner that prevents overfitting, and enables model
selection. We discuss aspects of the choice of priors, in particular how to
avoid underfitting via increased Bayesian hierarchies, and we contrast the task
of sampling network partitions from the posterior distribution with finding the
single point estimate that maximizes it, while describing efficient algorithms
to perform either one. We also show how inferring the SBM can be used to
predict missing and spurious links, and shed light on the fundamental
limitations of the detectability of modular structures in networks.

The problem of information fusion from multiple datasets acquired by
multimodal sensors has drawn significant research attention over the years. In
this paper, we focus on a particular problem setting consisting of a physical
phenomenon or a system of interest observed by multiple sensors. We assume that
all sensors measure some aspects of the system of interest with additional
sensorspecific and irrelevant components. Our goal is to recover the variables
relevant to the observed system and to filter out the nuisance effects of the
sensorspecific variables. We propose an approach based on manifold learning,
which is particularly suitable for problems with multiple modalities, since it
aims to capture the intrinsic structure of the data and relies on minimal prior
model knowledge. Specifically, we propose a nonlinear filtering scheme, which
extracts the hidden sources of variability captured by two or more sensors,
that are independent of the sensorspecific components. In addition to
presenting a theoretical analysis, we demonstrate our technique on real
measured data for the purpose of sleep stage assessment based on multiple,
multimodal sensor measurements. We show that without prior knowledge on the
different modalities and on the measured system, our method gives rise to a
datadriven representation that is well correlated with the underlying sleep
process and is robust to noise and sensorspecific effects.

We introduce a statistical physics inspired supervised machine learning
algorithm for classification and regression problems. The method is based on
the invariances or stability of predicted results when known data is
represented as expansions in terms of various stochastic functions. The
algorithm predicts the classification/regression values of new data by
combining (via voting) the outputs of these numerous linear expansions in
randomly chosen functions. The few parameters (typically only one parameter is
used in all studied examples) that this model has may be automatically
optimized. The algorithm has been tested on 10 diverse training data sets of
various types and feature space dimensions. It has been shown to consistently
exhibit high accuracy and readily allow for optimization of parameters, while
simultaneously avoiding pitfalls of existing algorithms such as those
associated with class imbalance. We very briefly speculate on whether spatial
coordinates in physical theories may be viewed as emergent "features" that
enable a robust machine learning type description of data with generic low
order smooth functions.

Considerable efforts were made in recent years in devising optimization
algorithms for influence maximization in networks. Here we ask: "When do we
need optimization?" We use results from statistical mechanics and direct
simulations on ER networks, smallworld networks, powerlaw networks and a
dataset of realworld networks to characterize the parameterspace region where
optimization is required. We show that in both synthetic and realworld
networks this optimization region is due to a well known physical phase
transition of the network, and that it vanishes as a powerlaw with the network
size. We then show that also from a utilitymaximization perspective (when
considering the costs of the optimization process), for large networks standard
optimization is profitable only in a vanishing parameter region near the phase
transition. Finally, we introduce a novel constanttime optimization approach,
and demonstrate it through a simple algorithm that manages to give similar
results to standard optimization methods in terms of the influencedset size,
while improving the results in terms of the net utility.

This paper has incorporated the stochasticity into the Newell car following
model. Three stochastic driving factors have been considered: (i) Driver's
acceleration is bounded. (ii) Driver's deceleration includes stochastic
component, which is depicted by a deceleration with the randomization
probability that is assumed to increase with the speed. (iii) Vehicles in the
jam state have a larger randomization probability. Two simulation scenarios are
conducted to test the model. In the first scenario, traffic flow on a circular
road is investigated. In the second scenario, empirical traffic flow patterns
in the NGSIM data induced by a rubberneck bottleneck is studied, and the
simulated traffic oscillations and synchronized traffic flow are consistent
with the empirical patterns. Moreover, two experiments of model calibration and
validation are conducted. The first is to calibrate and validate using
experimental data, which illustrates that the concave growth pattern has been
quantitatively simulated. The second is to calibrate and cross validate
vehicles' trajectories using NGSIM data, which exhibits that the car following
behaviors of single vehicles can be well described. Therefore, our study
highlights the importance of speed dependent stochasticity in traffic flow
modeling, which cannot be ignored as in most carfollowing studies.

Biologists have long sought a way to explain how statistical properties of
genetic sequences emerged and are maintained through evolution. On the one
hand, nonrandom structures at different scales indicate a complex genome
organisation. On the other hand, singlestrand symmetry has been scrutinised
using neutral models in which correlations are not considered or irrelevant,
contrary to empirical evidence. Different studies investigated these two
statistical features separately, reaching minimal consensus despite sustained
efforts. Here we unravel previously unknown symmetries in genetic sequences,
which are organized hierarchically through scales in which nonrandom
structures are known to be present. These observations are confirmed through
the statistical analysis of the human genome and explained through a simple
domain model. These results suggest that domain models which account for the
cumulative action of mobile elements can explain simultaneously nonrandom
structures and symmetries in genetic sequences.

Many realworld systems are profitably described as complex networks that
grow over time. Preferential attachment and node fitness are two simple growth
mechanisms that not only explain certain structural properties commonly
observed in realworld systems, but are also tied to a number of applications
in modeling and inference. While there are statistical packages for estimating
various parametric forms of the preferential attachment function, there is no
such package implementing nonparametric estimation procedures. The
nonparametric approach to the estimation of the preferential attachment
function allows for comparatively finergrained investigations of the
`richgetricher' phenomenon that could lead to novel insights in the search to
explain certain nonstandard structural properties observed in realworld
networks. This paper introduces the R package PAFit, which implements
nonparametric procedures for estimating the preferential attachment function
and node fitnesses in a growing network, as well as a number of functions for
generating complex networks from these two mechanisms. The main computational
part of the package is implemented in C++ with OpenMP to ensure scalability to
largescale networks. We first introduce the main functionalities of PAFit
through simulated examples, and then use the package to analyze a collaboration
network between scientists in the field of complex networks. The results
indicate the joint presence of `richgetricher' and `fitgetricher' phenomena
in the collaboration network. The estimated attachment function is observed to
be nearlinear, which we interpret as meaning that the chance an author gets a
new collaborator is proportional to their current number of collaborators.
Furthermore, the estimated author fitnesses reveal a host of familiar faces
from the complex networks community among the field's topmost fittest network
scientists.

So far, the problem of unmixing large or multitemporal hyperspectral datasets
has been specifically addressed in the remote sensing literature only by a few
dedicated strategies. Among them, some attempts have been made within a
distributed estimation framework, in particular relying on the alternating
direction method of multipliers (ADMM). In this paper, we propose to study the
interest of a partially asynchronous distributed unmixing procedure based on a
recently proposed asynchronous algorithm. Under standard assumptions, the
proposed algorithm inherits its convergence properties from recent
contributions in nonconvex optimization, while allowing the problem of
interest to be efficiently addressed. Comparisons with a distributed
synchronous counterpart of the proposed unmixing procedure allow its interest
to be assessed on synthetic and real data. Besides, thanks to its genericity
and flexibility, the procedure investigated in this work can be implemented to
address various matrix factorization problems.

Trends change rapidly in today's world, prompting this key question: What is
the mechanism behind the emergence of new trends? By representing realworld
dynamic systems as complex networks, the emergence of new trends can be
symbolized by vertices that "shine." That is, at a specific time interval in a
network's life, certain vertices become increasingly connected to other
vertices. This process creates new highdegree vertices, i.e., network stars.
Thus, to study trends, we must look at how networks evolve over time and
determine how the stars behave. In our research, we constructed the largest
publicly available network evolution dataset to date, which contains 38,000
realworld networks and 2.5 million graphs. Then, we performed the first
precise widescale analysis of the evolution of networks with various scales.
Three primary observations resulted: (a) links are most prevalent among
vertices that join a network at a similar time; (b) the rate that new vertices
join a network is a central factor in molding a network's topology; and (c) the
emergence of network stars (highdegree vertices) is correlated with
fastgrowing networks. We applied our learnings to develop a flexible
networkgeneration model based on largescale, realworld data. This model
gives a better understanding of how stars rise and fall within networks, and is
applicable to dynamic systems both in nature and society.

In this paper, we investigate the crossing statistics of speckle patterns
formed in Fresnel diffraction region by a laser beam scattering through a
nanofluid. We extend $zerocrossing$ statistics to assess dynamical properties
of nanofluid. According to joint probability density function of laser beam
fluctuation and its time derivative, theoretical framework for Gaussian and
nonGaussian regimes are revisited. We count number of crossings not only at
{\it zero} level but also for all available thresholds to determine the average
speed of moving particles. Using probabilistic framework in determining
crossing statistics, {\it a priori} Gaussianity is not essentially considered,
therefore even in presence of deviation from Gaussian fluctuation, this
modified approach is capable to compute relevant quantities such as mean value
of speed more precisely. Generalized total crossing which represents the
weighted summation of crossings for all thresholds to quantify small deviation
from Gaussian statistics is introduced. This criterion can also manipulate the
contribution of noises and trends to infer reliable physical quantities. The
characteristic time scale for having successive crossings at a given threshold
is defined. In our experimental setup, we find that increasing sample
temperature leads to more consistency between Gaussian and perturbative
nonGaussian predictions. The maximum number of crossing does not necessarily
occur at mean level indicating that we should take into account other levels in
addition to {\it zero} level to achieve more accurate assessments.

We introduce a pipeline including multifractal detrended crosscorrelation
analysis (MFDXA) modified by either singular value decomposition or the
adaptive method to examine the statistical properties of the pulsar timing
residual ($PTR$) induced by a gravitational wave (GW) signal. We propose a new
algorithm, the socalled irregularMFDXA, to deal with irregular data
sampling. Inspired by the quadrupolar nature of the spatial crosscorrelation
function of a gravitational wave background, a new crosscorrelation function,
$\bar{\sigma}_{\times}$, derived from irregularMFDXA will be introduced. We
show that, this measure reveals the quadrupolar signature in the $PTRs$ induced
by stochastic GWB. We propose four strategies based on the $y$intercept of
fluctuation functions, the generalized Hurst exponent, and the width of the
singularity spectrum to determine the dimensionless amplitude and powerlaw
exponent of the characteristic strain spectrum as
$\mathcal{H}_c(f)\sim\mathcal{A}_{yr}(f/f_{yr})^{\zeta}$ for stochastic GWB.
Using the value of Hurst exponent, one can clarify the type of GWs. We apply
our pipeline to explore 20 millisecond pulsars observed by Parkes Pulsar Timing
Array. The computed scaling exponents confirm that all data are classified into
a nonstationary class implying the universality feature. The value of the Hurst
exponent is in the range $H\in [0.56,0.87]$. The $q$dependency of the
generalized Hurst exponent demonstrates that the observed $PTRs$ have
multifractal behavior, and the source of this multifractality is mainly
attributed to the correlation of data which is another universality of the
observed datasets. Multifractal analysis of available $PTRs$ datasets reveals
an upper bound on the dimensionless amplitude of the GWB, $\mathcal{A}_{yr}<
2.0\times 10^{15}$.

We introduce dynamic nested sampling: a generalisation of the nested sampling
algorithm in which the number of "live points" varies to allocate samples more
efficiently. In empirical tests the new method significantly improves
calculation accuracy compared to standard nested sampling with the same number
of samples; this increase in accuracy is equivalent to speeding up the
computation by factors of up to ~72 for parameter estimation and ~7 for
evidence calculations. We also show that the accuracy of both parameter
estimation and evidence calculations can be improved simultaneously. In
addition, unlike in standard nested sampling, more accurate results can be
obtained by continuing the calculation for longer. Popular standard nested
sampling implementations can be easily adapted to perform dynamic nested
sampling, and several dynamic nested sampling software packages are now
publicly available.

I review recent works showing that information geometry is a useful framework
to characterize quantum coherence and entanglement. Quantum systems exhibit
peculiar properties which cannot be justified by classical physics, e.g.
quantum coherence and quantum correlations. Once confined to thought
experiments, they are nowadays created and manipulated by exerting an exquisite
experimental control of atoms, molecules and photons. It is important to
identify and quantify such quantum features, as they are deemed to be key
resources to achieve supraclassical performances in computation and
communication protocols. The information geometry viewpoint elucidates the
advantage provided by quantum superpositions in phase estimation. Also, it
enables to link measures of coherence and entanglement to observables, which
can be evaluated in a laboratory by a limited number of measurements.

In order to isolate effects of nonstationarity from effects due to
nonlinearity and nonGaussianity, a doubly stochastic advectiondiffusiondecay
model (DSADM) is proposed. The model (defined on the 1D circular spatial
domain) is hierarchical: it is a linear stochastic partial differential
equation whose coefficients are transformed spatiotemporal random fields that
by themselves satisfy their own stochastic partial differential equations with
constant coefficients. The model generates conditionally Gaussian random fields
that have complex spatiotemporal covariances with the tunable degree of
nonstationarity in space and time. In numerical experiments with hybrid
ensemble filters and DSADM as the "model of truth", it is shown that the degree
of nonstationarity affects the optimal weights of ensemble vs. climatological
covariances in EnVar and the optimal weights of ensemble vs. timesmoothed
recent past covariances in the Hierarchical Bayes Ensemble Filter (HBEF) by
Tsyrulnikov and Rakitko, 2017. The stronger is the nonstationarity, the less
useful is the static covariance matrix and the more beneficial are the
timesmoothed recent past covariances as the building block of the filter's
analysis covariance matrix. A new hybridHBEF filter (HHBEF), which combines
EnVar and HBEF, is proposed. HHBEF is shown to outperform EnKF, EnVar, and HBEF
in nonstationary filtering regimes.

The data torrent unleashed by current and upcoming astronomical surveys
demands scalable analysis methods. Many machine learning approaches scale well,
but separating the instrument measurement from the physical effects of
interest, dealing with variable errors, and deriving parameter uncertainties is
often an afterthought. Classic forwardfolding analyses with Markov Chain
Monte Carlo or Nested Sampling enable parameter estimation and model
comparison, even for complex and slowtoevaluate physical models. However,
these approaches require independent runs for each data set, implying an
unfeasible number of model evaluations in the Big Data regime. Here I present a
new algorithm, collaborative nested sampling, for deriving parameter
probability distributions for each observation. Importantly, the number of
physical model evaluations scales sublinearly with the number of data sets,
and no assumptions about homogeneous errors, Gaussianity, the form of the model
or heterogeneity/completeness of the observations need to be made.
Collaborative nested sampling has immediate application in speeding up analyses
of large surveys, integralfieldunit observations, and Monte Carlo
simulations.

We consider the problem of detecting a random walk on a graph, based on
observations of the graph nodes. When visited by the walk, each node of the
graph observes a signal of elevated mean, which we assume can be different
across different nodes. Outside of the path of the walk, and also in its
absence, nodes measure only noise. Assuming the NeymanPearson setting, our
goal then is to characterize detection performance by computing the error
exponent for the probability of a miss, under a constraint on the probability
of false alarm. Since exact computation of the error exponent is known to be
difficult, equivalent to computation of the Lyapunov exponent, we approximate
its value by finding a tractable lower bound. The bound reveals an interesting
detectability condition: the walk is detectable whenever the entropy of the
walk is smaller than one half of the expected signaltonoise ratio. We derive
the bound by extending the notion of Markov types to GaussMarkov types. These
are sequences of stateobservation pairs with a given number of nodetonode
transition counts and the same average signal values across nodes, computed
from the measurements made during the times the random walk was visiting each
node's respective location. The lower bound has an intuitive interpretation:
among all GaussMarkov types that are asymptotically feasible in the absence of
the walk, the bound finds the most typical one under the presence of the walk.
Finally, we show by a sequence of judicious problem reformulations that
computing the bound reduces to solving a convex optimization problem, which is
a result in its own right.

Astronomical data often suffer from noise and incompleteness. We extend the
common mixturesofGaussians density estimation approach to account for
situations with a known sample incompleteness by simultaneous imputation from
the current model. The method, called GMMis, generalizes existing
ExpectationMaximization techniques for truncated data to arbitrary truncation
geometries and probabilistic rejection processes, as long as they can be
specified and do not depend on the density itself. The method accounts for
independent multivariate normal measurement errors for each of the observed
samples and recovers an estimate of the errorfree distribution from which both
observed and unobserved samples are drawn. It can perform a separation of a
mixturesofGaussian signal from a specified background distribution whose
amplitude may be unknown. We compare GMMis to the standard Gaussian mixture
model for simple test cases with different types of incompleteness, and apply
it to observational data from the NASA Chandra Xray telescope. The python code
is released as an opensource package at https://github.com/pmelchior/pyGMMis