
Glioblastoma multiforme (GBM) is an aggressive form of human brain cancer
that is under active study in the field of cancer biology. Its rapid
progression and the relative time cost of obtaining molecular data make other
readilyavailable forms of data, such as images, an important resource for
actionable measures in patients. Our goal is to utilize information given by
medical images taken from GBM patients in statistical settings. To do this, we
design a novel statisticthe smooth Euler characteristic transform
(SECT)that quantifies magnetic resonance images (MRIs) of tumors. Due to its
welldefined inner product structure, the SECT can be used in a wider range of
functional and nonparametric modeling approaches than other previously proposed
topological summary statistics. When applied to a cohort of GBM patients, we
find that the SECT is a better predictor of clinical outcomes than both
existing tumor shape quantifications and common molecular assays. Specifically,
we demonstrate that SECT features alone explain more of the variance in GBM
patient survival than gene expression, volumetric features, and morphometric
features. The main takeaways from our findings are thus twofold. First, they
suggest that images contain valuable information that can play an important
role in clinical prognosis and other medical decisions. Second, they show that
the SECT is a viable tool for the broader study of medical imaging informatics.

Algorithmic composition of music has a long history and with the development
of powerful deep learning methods, there has recently been increased interest
in exploring algorithms and models to create art. We explore the utility of
state space models, in particular hidden Markov models (HMMs) and variants, in
composing classical piano pieces from the Romantic era and consider the models'
ability to generate new pieces that sound like they were composed by a human.
We find that the models we explored are fairly successful at generating new
pieces that have largely consonant harmonies, especially when trained on
original pieces with simple harmonic structure. However, we conclude that the
major limitation in using these models to generate music that sounds like it
was composed by a human is the lack of melodic progression in the composed
pieces. We also examine the performance of the models in the context of music
theory.

Linear mixed models (LMMs) are used extensively to model dependecies of
observations in linear regression and are used extensively in many application
areas. Parameter estimation for LMMs can be computationally prohibitive on big
data. Stateoftheart learning algorithms require computational complexity
which depends at least linearly on the dimension $p$ of the covariates, and
often use heuristics that do not offer theoretical guarantees. We present
scalable algorithms for learning highdimensional LMMs with sublinear
computational complexity dependence on $p$. Key to our approach are novel dual
estimators which use only kernel functions of the data, and fast computational
techniques based on the subsampled randomized Hadamard transform. We provide
theoretical guarantees for our learning algorithms, demonstrating the
robustness of parameter estimation. Finally, we complement the theory with
experiments on large synthetic and real data.

We present a new Gaussian process (GP) regression model where the covariance
kernel is indexed or parameterized by a sufficient dimension reduction subspace
of a reproducing kernel Hilbert space. The covariance kernel will be lowrank
while capturing the statistical dependency of the response to the covariates,
this affords significant improvement in computational efficiency as well as
potential reduction in the variance of predictions. We develop a fast
ExpectationMaximization algorithm for estimating the parameters of the
subspaceinduced Gaussian process (SIGP). Extensive results on real data show
that SIGP can outperform the standard full GP even with a low rank$m$, $m\leq
3$, inducing subspace.

The Markov Chain Monte Carlo method is the dominant paradigm for posterior
computation in Bayesian analysis. It is common to control computation time by
making approximations to the Markov transition kernel. Comparatively little
attention has been paid to computational optimality in these approximating
Markov Chains, or when such approximations are justified relative to obtaining
shorter paths from the exact kernel. We give simple, sharp bounds for uniform
approximations of uniformly mixing Markov chains. We then suggest a notion of
optimality that incorporates computation time and approximation error, and use
our bounds to make generalizations about properties of good approximations in
the uniformly mixing setting. The relevance of these properties is demonstrated
in applications to a minibatchingbased approximate MCMC algorithm for large
$n$ logistic regression and lowrank approximations for Gaussian processes.

We present an efficient algorithm for learning mixed membership models when
the number of variables $p$ is much larger than the number of hidden components
$k$. This algorithm reduces the computational complexity of stateoftheart
tensor methods, which require decomposing an $O\left(p^3\right)$ tensor, to
factorizing $O\left(p/k\right)$ subtensors each of size $O\left(k^3\right)$.
In addition, we address the issue of negative entries in the empirical method
of moments based estimators. We provide sufficient conditions under which our
approach has provable guarantees. Our approach obtains competitive empirical
results on both simulated and real data.

Nonlinear kernel regression models are often used in statistics and machine
learning because they are more accurate than linear models. Variable selection
for kernel regression models is a challenge partly because, unlike the linear
regression setting, there is no clear concept of an effect size for regression
coefficients. In this paper, we propose a novel framework that provides an
effect size analog of each explanatory variable for Bayesian kernel regression
models when the kernel is shiftinvariant  for example, the Gaussian kernel.
We use function analytic properties of shiftinvariant reproducing kernel
Hilbert spaces (RKHS) to define a linear vector space that: (i) captures
nonlinear structure, and (ii) can be projected onto the original explanatory
variables. The projection onto the original explanatory variables serves as an
analog of effect sizes. The specific function analytic property we use is that
shiftinvariant kernel functions can be approximated via random Fourier bases.
Based on the random Fourier expansion we propose a computationally efficient
class of Bayesian approximate kernel regression (BAKR) models for both
nonlinear regression and binary classification for which one can compute an
analog of effect sizes. We illustrate the utility of BAKR by examining two
important problems in statistical genetics: genomic selection (i.e. phenotypic
prediction) and association mapping (i.e. inference of significant variants or
loci). Stateoftheart methods for genomic selection and association mapping
are based on kernel regression and linear models, respectively. BAKR is the
first method that is competitive in both settings.

We develop a geometric framework that characterizes the synchronization
problem  the problem of consistently registering or aligning a collection of
objects. The theory we formulate characterizes the cohomological nature of
synchronization based on the classical theory of fibre bundles. We first
establish the correspondence between synchronization problems in a topological
group $G$ over a connected graph $\Gamma$ and the moduli space of flat
principal $G$bundles over $\Gamma$, and develop a discrete analogy of the
renowned theorem of classifying flat principal bundles with fix base and
structural group using the representation variety. In particular, we show that
prescribing an edge potential on a graph is equivalent to specifying an
equivalence class of flat principal bundles, of which the triviality of
holonomy dictates the synchronizability of the edge potential. We then develop
a twisted cohomology theory for associated vector bundles of the flat principal
bundle arising from an edge potential, which is a discrete version of the
twisted cohomology in differential geometry. This theory realizes the
obstruction to synchronizability as a cohomology group of the twisted de Rham
cochain complex. We then build a discrete twisted Hodge theory  a fibre
bundle analog of the discrete Hodge theory on graphs  which geometrically
realizes the graph connection Laplacian as a Hodge Laplacian of degree zero.
Motivated by our geometric framework, we study the problem of learning group
actions  partitioning a collection of objects based on the local
synchronizability of pairwise correspondence relations. A dual interpretation
is to learn finitely generated subgroups of an ambient transformation group
from noisy observed group elements. A synchronizationbased algorithm is also
provided, and we demonstrate its efficacy using simulations and real data.

We introduce a consistent estimator for the homology (an algebraic structure
representing connected components and cycles) of level sets of both density and
regression functions. Our method is based on kernel estimation. We apply this
procedure to two problems: (1) inferring the homology structure of manifolds
from noisy observations, (2) inferring the persistent homology (a multiscale
extension of homology) of either density or regression functions. We prove
consistency for both of these problems. In addition to the theoretical results,
we demonstrate these methods on simulated data for binary regression and
clustering applications.

We develop a generalized method of moments (GMM) approach for fast parameter
estimation in a new class of Dirichlet latent variable models with mixed data
types. Parameter estimation via GMM has been demonstrated to have computational
and statistical advantages over alternative methods, such as expectation
maximization, variational inference, and Markov chain Monte Carlo. The key
computational advan tage of our method (MELD) is that parameter estimation
does not require instantiation of the latent variables. Moreover, a
representational advantage of the GMM approach is that the behavior of the
model is agnostic to distributional assumptions of the observations. We derive
population moment conditions after marginalizing out the samplespecific
Dirichlet latent variables. The moment conditions only depend on component mean
parameters. We illustrate the utility of our approach on simulated data,
comparing results from MELD to alternative methods, and we show the promise of
our approach through the application of MELD to several data sets.

Latent factor models are the canonical statistical tool for exploratory
analyses of lowdimensional linear structure for an observation matrix with p
features across n samples. We develop a structured Bayesian group factor
analysis model that extends the factor model to multiple coupled observation
matrices; in the case of two observations, this reduces to a Bayesian model of
canonical correlation analysis. The main contribution of this work is to
carefully define a structured Bayesian prior that encourages both elementwise
and columnwise shrinkage and leads to desirable behavior on highdimensional
data. In particular, our model puts a structured prior on the joint factor
loading matrix, regularizing at three levels, which enables elementwise
sparsity and unsupervised recovery of latent factors corresponding to
structured variance across arbitrary subsets of the observations. In addition,
our structured prior allows for both dense and sparse latent factors so that
covariation among either all features or only a subset of features can both be
recovered. We use fast parameterexpanded expectationmaximization for
parameter estimation in this model. We validate our method on both simulated
data with substantial structure and real data, comparing against a number of
stateoftheart approaches. These results illustrate useful properties of our
model, including i) recovering sparse signal in the presence of dense effects;
ii) the ability to scale naturally to large numbers of observations; iii)
flexible observation and factorspecific regularization to recover factors
with a wide variety of sparsity levels and percentage of variance explained;
and iv) tractable inference that scales to modern genomic and document data
sizes.

We introduce a Bayesian model for inferring mixtures of subspaces of
different dimensions. The key challenge in such a mixture model is
specification of prior distributions over subspaces of different dimensions. We
address this challenge by embedding subspaces or Grassmann manifolds into a
sphere of relatively low dimension and specifying priors on the sphere. We
provide an efficient sampling algorithm for the posterior distribution of the
model parameters. We illustrate that a simple extension of our mixture of
subspaces model can be applied to topic modeling. We also prove posterior
consistency for the mixture of subspaces model. The utility of our approach is
demonstrated with applications to real and simulated data.

Long term behavior of nonlinear deterministic continuous time signals can be
studied in terms of their reconstructed attractors. Reconstructed attractors of
a continuous signal are meant to be topologically equivalent representations of
the dynamics of the unknown dynamical system which generates the signal.
Sometimes, geometry of the attractor or its complexity may give important
information on the system of interest. However, if the trajectories of the
attractor behave as if they are not coming from continuous system or there
exists many spike like structures on the path of the system trajectories, then
there is no way to characterize the shape of the attractor. In this article,
the traditional attractor reconstruction method is first used for two types of
ECG signals: Normal healthy persons (NHP) and Congestive Heart failure patients
(CHFP). As common in such a framework, the reconstructed attractors are not at
all well formed and hence it is not possible to adequately characterize their
geometrical features. Thus, we incorporate frequency domain information to the
given time signals. This is done by transforming the signals to a time
frequency domain by means of suitable Wavelet transforms (WT). The transformed
signal concerns two non homogeneous variables and is still quite difficult to
use to reconstruct some dynamics out of it. By applying a suitable mapping,
this signal is further converted into integer domain and a new type of 3D plot,
called integer lag plot, which characterizes and distinguishes the ECG signals
of NHP and CHFP, is finally obtained.

The complexity of a signal can be measured by the Recurrence period density
entropy (RPDE) from the reconstructed phase space. We have chosen a window
based RPDE method for the classification of signals, as RPDE is an average
entropic measure of the whole phase space. We have observed the changes in the
complexity in cardiac signals of normal healthy person (NHP) and congestive
heart failure patients (CHFP). The results show that the cardiac dynamics of a
healthy subject is more complex and random compare to the same for a heart
failure patient, whose dynamics is more deterministic. We have constructed a
general threshold to distinguish the border line between a healthy and a
congestive heart failure dynamics. The results may be useful for wide range for
physiological and biomedical analysis.

The scalability of statistical estimators is of increasing importance in
modern applications. One approach to implementing scalable algorithms is to
compress data into a low dimensional latent space using dimension reduction
methods. In this paper we develop an approach for dimension reduction that
exploits the assumption of low rank structure in high dimensional data to gain
both computational and statistical advantages. We adapt recent randomized
lowrank approximation algorithms to provide an efficient solution to principal
component analysis (PCA), and we use this efficient solver to improve parameter
estimation in largescale linear mixed models (LMM) for association mapping in
statistical and quantitative genomics. A key observation in this paper is that
randomization serves a dual role, improving both computational and statistical
performance by implicitly regularizing the covariance matrix estimate of the
random effect in a LMM. These statistical and computational advantages are
highlighted in our experiments on simulated data and largescale genomic
studies.

A parametrization of hypergraphs based on the geometry of points in
$\mathbf{R}^d$ is developed. Informative prior distributions on hypergraphs are
induced through this parametrization by priors on point configurations via
spatial processes. This prior specification is used to infer conditional
independence models or Markov structure of multivariate distributions.
Specifically, we can recover both the junction tree factorization as well as
the hyper Markov law. This approach offers greater control on the distribution
of graph features than Erd\"osR\'enyi random graphs, supports inference of
factorizations that cannot be retrieved by a graph alone, and leads to new
Metropolis\slash Hastings Markov chain Monte Carlo algorithms with both local
and global moves in graph space. We illustrate the utility of this
parametrization and prior specification using simulations.

We consider the asymptotic consistency of maximum likelihood parameter
estimation for dynamical systems observed with noise. Under suitable conditions
on the dynamical systems and the observations, we show that maximum likelihood
parameter estimation is consistent. Our proof involves ideas from both
information theory and dynamical systems. Furthermore, we show how some
wellstudied properties of dynamical systems imply the general statistical
properties related to maximum likelihood estimation. Finally, we exhibit
classical families of dynamical systems for which maximum likelihood estimation
is consistent. Examples include shifts of finite type with Gibbs measures and
Axiom A attractors with SRB measures.

In order to use persistence diagrams as a true statistical tool, it would be
very useful to have a good notion of mean and variance for a set of diagrams.
In 2011, Mileyko and his collaborators made the first study of the properties
of the Fr\'echet mean in $(\mathcal{D}_p,W_p)$, the space of persistence
diagrams equipped with the pth Wasserstein metric. In particular, they showed
that the Fr\'echet mean of a finite set of diagrams always exists, but is not
necessarily unique. The means of a continuouslyvarying set of diagrams do not
themselves (necessarily) vary continuously, which presents obvious problems
when trying to extend the Fr\'echet mean definition to the realm of vineyards.
We fix this problem by altering the original definition of Fr\'echet mean so
that it now becomes a probability measure on the set of persistence diagrams;
in a nutshell, the mean of a set of diagrams will be a weighted sum of atomic
measures, where each atom is itself a persistence diagram determined using a
perturbation of the input diagrams. This definition gives for each $N$ a map
$(\mathcal{D}_p)^N \to \mathbb{P}(\mathcal{D}_p)$. We show that this map is
H\"older continuous on finite diagrams and thus can be used to build a useful
statistic on timevarying persistence diagrams, better known as vineyards.

Approximation of a continuous dynamics by discrete dynamics in the form of
Poincare map is one of the fascinating mathematical tool, which can describe
the approximate behaviour of the dynamics of the dynamical system in lesser
dimension than the embedding diemnsion. The present article considers a very
rare biomedical signal like Electromyography (EMG) signal. It determines
suitable time delay and reconstruct the attractor of embedding diemnsion three.
By measuring its Lyapunov exponent, the attractor so reconstructed is found to
be chaotic. Naturally the Poincare map obtained by corresponding Poincare
section is to be chaotic too. This may be verified by calculation of Lyapunov
exponent of the map. The main objective of this article is to show that
Poincare map exists in this case as a 2D map for a suitable Poincare section
only. In fact, the article considers two Poincare sections of the attractor for
construction of the Poincare map. It is seen that one such map is chaotic but
the other one is not so, both are verified by calculation of Lyapunov exponent
of the map.

Study of continuous dynamical system through Poincare map is one of the most
popular topics in nonlinear analysis. This is done by taking intersections of
the orbit of flow by a hyperplane parallel to one of the coordinate
hyperplanes of codimension one. Naturally for a 3Dattractor, the Poincare
map gives rise to 2D points, which can describe the dynamics of the attractor
properly. In a very special case, sometimes these 2D points are considered as
their 1Dprojections to obtain a 1D map. However, this is an artificial way of
reducing the 2D map by dropping one of the variables. Sometimes it is found
that the two coordinates of the points on the Poincare section are functionally
related. This also reduces the 2D Poincare map to a 1D map. This reduction is
natural, and not artificial as mentioned above. In the present study, this
issue is being highlighted. In fact, we find out some examples, which show that
even this natural reduction of the 2D Poincare map is not always justified,
because the resultant 1D map may fail to generate the original dynamics. This
proves that to describe the dynamics of the 3D chaotic attractor, the minimum
dimension of the Poincare map must be two, in general.

The paper introduces new types of nonlinear correlations between bivariate
data sets and derives nonlinear autocorrelations on the same data set. These
autocorrelations are of different types to match signals with different types
of nonlinearities. Examples are cited in all cases to make the definitions
meaningful. Next correlogram diagrams are drawn separately in all cases; from
these diagrams proper time lags/delays are determined. These give rise to
independent coordinates of the attractors. Finally three dimensional attractors
are reconstructed in each case separately with the help of these independent
coordinates. Moreover for the purpose of making proper distinction between the
signals, the attractors so reconstructed are quantified by a new technique
called ellipsoid fit.

For the purpose of phase space reconstruction from nonlinear time series,
delay selection is one of the most vital criteria. This is normally done by
using a general measure viz., mutual information (MI). However, in that case,
the delay selection is limited to the estimation of a single delay using MI
between two variables only. The corresponding reconstructed phase space is also
not satisfactory. To overcome the situation, a highdimensional estimator of
the MI is used; it selects more than one delay between more than two variables.
The quality of the reconstructed phase space is tested by shape distortion
parameter (SD), it is found that even this multidimensional MI sometimes fails
to produce a less distorted phase space. In this paper, an alternative
nonlinear measure cross autocorrelation (CAC) is introduced. A comparative
study is made between the reconstructed phase spaces of a known three
dimensional Neuro dynamical model, Lorenz dynamical model and a three
dimensional food web model under MI for two and higher dimensions and also
under cross autocorrelation separately. It is found that the least distorted
phase space is obtained only under the notion of cross autocorrelation.

Signal analysis is one of the finest scientific techniques in communication
theory. Some quantitative and qualitative measures describe the pattern of a
music signal, vary from one to another. Same musical recital, when played by
different instrumentalists, generates different types of music patterns. The
reason behind various patterns is the psychoacoustic measures  Dynamics,
Timber, Tonality and Rhythm, varies in each time. However, the psychoacoustic
study of the music signals does not reveal any idea about the similarity
between the signals. For such cases, study of synchronization of longterm
nonlinear dynamics may provide effective results. In this context, phase
synchronization (PS) is one of the measures to show synchronization between two
nonidentical signals. In fact, it is very critical to investigate any other
kind of synchronization for experimental condition, because those are
completely non identical signals. Also, there exists equivalence between the
phases and the distances of the diagonal line in Recurrence plot (RP) of the
signals, which is quantifiable by the recurrence quantification measure
taurecurrence rate. This paper considers two nonlinear music signals based on
same raga played by two eminent sitar instrumentalists as two nonidentical
sources. The psychoacoustic study shows how the Dynamics, Timber, Tonality and
Rhythm vary for the two music signals. Then, long term analysis in the form of
phase space reconstruction is performed, which reveals the chaotic phase spaces
for both the signals. From the RP of both the phase spaces, taurecurrence rate
is calculated. Finally by the correlation of normalized taurecurrence rate of
their 3D phase spaces and the PS of the two music signals has been established.
The numerical results well support the analysis.

In this paper we introduce a statistic, the persistent homology transform
(PHT), to model surfaces in $\mathbb{R}^3$ and shapes in $\mathbb{R}^2$. This
statistic is a collection of persistence diagrams  multiscale topological
summaries used extensively in topological data analysis. We use the PHT to
represent shapes and execute operations such as computing distances between
shapes or classifying shapes. We prove the map from the space of simplicial
complexes in $\mathbb{R}^3$ into the space spanned by this statistic is
injective. This implies that the statistic is a sufficient statistic for
probability densities on the space of piecewise linear shapes. We also show
that a variant of this statistic, the Euler Characteristic Transform (ECT),
admits a simple exponential family formulation which is of use in providing
likelihood based inference for shapes and surfaces. We illustrate the utility
of this statistic on simulated and real data.

Xray crystallography is the predominant method for obtaining atomicscale
information about biological macromolecules. Despite the success of the
technique, obtaining well diffracting crystals still critically limits going
from protein to structure. In practice, the crystallization process proceeds
through knowledgeinformed empiricism. Better physicochemical understanding
remains elusive because of the large number of variables involved, hence little
guidance is available to systematically identify solution conditions that
promote crystallization. To help determine relationships between macromolecular
properties and their crystallization propensity, we have trained statistical
models on samples for 182 proteins supplied by the Northeast Structural
Genomics consortium. Gaussian processes, which capture trends beyond the reach
of linear statistical models, distinguish between two main physicochemical
mechanisms driving crystallization. One is characterized by low levels of side
chain entropy and has been extensively reported in the literature. The other
identifies specific electrostatic interactions not previously described in the
crystallization context. Because evidence for two distinct mechanisms can be
gleaned both from crystal contacts and from solution conditions leading to
successful crystallization, the model offers future avenues for optimizing
crystallization screens based on partial structural information. The
availability of crystallization data coupled with structural outcomes analyzed
through stateoftheart statistical models may thus guide macromolecular
crystallization toward a more rational basis.