
Glioblastoma multiforme (GBM) is an aggressive form of human brain cancer
that is under active study in the field of cancer biology. Its rapid
progression and the relative time cost of obtaining molecular data make other
readily available forms of data, such as images, an important resource for
actionable measures in patients. Our goal is to utilize information given by
medical images taken from GBM patients in statistical settings. To do this, we
design a novel statistic, the smooth Euler characteristic transform
(SECT), that quantifies magnetic resonance images (MRIs) of tumors. Due to its
well-defined inner product structure, the SECT can be used in a wider range of
functional and nonparametric modeling approaches than other previously proposed
topological summary statistics. Applied to a cohort of GBM patients, the SECT
is a better predictor of clinical outcomes than both
existing tumor shape quantifications and common molecular assays. Specifically,
we demonstrate that SECT features alone explain more of the variance in GBM
patient survival than gene expression, volumetric features, and morphometric
features. The main takeaways from our findings are thus twofold. First, they
suggest that images contain valuable information that can play an important
role in clinical prognosis and other medical decisions. Second, they show that
the SECT is a viable tool for the broader study of medical imaging informatics.
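
As a rough illustration of the kind of computation behind such a statistic,
the sketch below computes Euler characteristic curves of a small 2D simplicial
complex along several directions, then centers each curve by its mean and
integrates it cumulatively. The function names, the toy shape (a hollow
triangle), the trapezoid quadrature, and the choice of directions and
thresholds are all assumptions made for illustration; this is not the
pipeline applied to the MRI data.

```python
import numpy as np

def euler_curve(vertices, edges, triangles, direction, thresholds):
    """Euler characteristic of the sublevel sets {x : <x, direction> <= t}."""
    heights = vertices @ direction
    # A simplex enters the sublevel set once its highest vertex has entered.
    edge_h = heights[edges].max(axis=1)
    tri_h = heights[triangles].max(axis=1)
    return np.array([
        (heights <= t).sum() - (edge_h <= t).sum() + (tri_h <= t).sum()
        for t in thresholds
    ], dtype=float)

def cumulative_trapezoid(y, t):
    """Cumulative trapezoid-rule integral of y over the grid t, starting at 0."""
    steps = 0.5 * (y[1:] + y[:-1]) * np.diff(t)
    return np.concatenate([[0.0], np.cumsum(steps)])

def smooth_curve(chi, thresholds):
    """Subtract the mean of the curve over the interval, then integrate."""
    length = thresholds[-1] - thresholds[0]
    mean = cumulative_trapezoid(chi, thresholds)[-1] / length
    return cumulative_trapezoid(chi - mean, thresholds)

def sect(vertices, edges, triangles, directions, thresholds):
    """Stack one smoothed curve per direction into a feature matrix."""
    return np.stack([
        smooth_curve(euler_curve(vertices, edges, triangles, d, thresholds),
                     thresholds)
        for d in directions
    ])

# Toy shape (hypothetical): the boundary of a triangle, with no filled face.
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
edges = np.array([[0, 1], [1, 2], [0, 2]])
triangles = np.empty((0, 3), dtype=int)

angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
thresholds = np.linspace(-1.5, 1.5, 61)

features = sect(vertices, edges, triangles, directions, thresholds)
```

Each row of `features` is a curve sampled on a common grid, so two shapes can
be compared with an ordinary (discretized) inner product between their feature
matrices, which is the property that makes functional regression methods
applicable.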

We show that an embedding in Euclidean space based on tropical geometry
generates stable sufficient statistics for barcodes. In topological data
analysis, barcodes are multiscale summaries of algebraic topological
characteristics that capture the "shape" of data; however, in practice, they
have complex structures that make them difficult to use in statistical
settings. The sufficiency result presented in this work allows for classical
probability distributions to be assumed on the tropical geometric
representation of barcodes. This makes a variety of parametric statistical
inference methods amenable to barcodes, all while maintaining their initial
interpretations. More specifically, we show that exponential family
distributions may be assumed, and that likelihood functions for persistent
homology may be constructed. We conceptually demonstrate sufficiency and
illustrate its utility in persistent homology dimensions 0 and 1 with concrete
parametric applications to human immunodeficiency virus and avian influenza
data.
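
To make the idea of a tropical, permutation-invariant map from barcodes to
Euclidean space concrete, the sketch below evaluates one simple family of
max-plus functions on a barcode: the max-plus elementary symmetric polynomials
of the bar lengths, which reduce to sums of the k longest bars. This family is
chosen only for illustration and is not claimed to be the embedding used in
this work; the function name, the toy barcode, and the zero-padding convention
are assumptions.

```python
import numpy as np

def tropical_symmetric_features(barcode, k_max):
    """Max-plus elementary symmetric polynomials of the bar lengths.

    In max-plus arithmetic, "sum" is max and "product" is +, so the degree-k
    elementary symmetric polynomial is the largest sum over k distinct bars,
    i.e. the sum of the k longest bars.
    """
    lengths = np.array([death - birth for birth, death in barcode], dtype=float)
    top = np.sort(lengths)[::-1][:k_max]
    # Assumption: a barcode with fewer than k_max bars is padded with
    # zero-length bars, which leaves the sums unchanged.
    top = np.pad(top, (0, k_max - len(top)))
    return np.cumsum(top)

# Toy barcode (hypothetical): (birth, death) pairs in one homology dimension.
barcode = [(0.0, 3.0), (1.0, 2.0), (0.5, 2.5)]
features = tropical_symmetric_features(barcode, 3)
```

The output is a fixed-length vector that does not depend on the order in which
the bars are listed, so standard parametric models can be fit to it directly.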

The central aim of this paper is to address variable selection questions in
nonlinear and nonparametric regression. Motivated by statistical genetics,
where nonlinear interactions are of particular interest, we introduce a novel,
interpretable, and computationally efficient way to summarize the relative
importance of predictor variables. Methodologically, we develop the "RelATive
cEntrality" (RATE) measure to prioritize candidate genetic variants that are
not just marginally important, but whose associations also stem from
significant covarying relationships with other variants in the data. We
illustrate RATE through Bayesian Gaussian process regression, but the
methodological innovations apply to other nonlinear methods. It is known that
nonlinear models often exhibit greater predictive accuracy than linear models,
particularly for phenotypes generated by complex genetic architectures. With
detailed simulations and an Arabidopsis thaliana QTL mapping study, we show
that applying RATE helps explain this improved performance.

Nonlinear kernel regression models are often used in statistics and machine
learning because they can be more accurate than linear models. Variable selection
for kernel regression models is a challenge partly because, unlike the linear
regression setting, there is no clear concept of an effect size for regression
coefficients. In this paper, we propose a novel framework that provides an
effect size analog of each explanatory variable for Bayesian kernel regression
models when the kernel is shift-invariant (for example, the Gaussian kernel).
We use function analytic properties of shift-invariant reproducing kernel
Hilbert spaces (RKHS) to define a linear vector space that: (i) captures
nonlinear structure, and (ii) can be projected onto the original explanatory
variables. The projection onto the original explanatory variables serves as an
analog of effect sizes. The specific function analytic property we use is that
shift-invariant kernel functions can be approximated via random Fourier bases.
Based on the random Fourier expansion, we propose a computationally efficient
class of Bayesian approximate kernel regression (BAKR) models for both
nonlinear regression and binary classification for which one can compute an
analog of effect sizes. We illustrate the utility of BAKR by examining two
important problems in statistical genetics: genomic selection (i.e., phenotypic
prediction) and association mapping (i.e., inference of significant variants or
loci). State-of-the-art methods for genomic selection and association mapping
are based on kernel regression and linear models, respectively. BAKR is the
first method that is competitive in both settings.
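
The two ingredients described above, a random Fourier approximation of a
shift-invariant kernel and a projection back onto the explanatory variables,
can be sketched as follows. This is a minimal illustration under stated
assumptions, not the BAKR model itself: a ridge fit stands in for the Bayesian
posterior, the projection is taken to be the Moore-Penrose pseudoinverse of
the design matrix, and all function names, hyperparameters, and simulated
data are hypothetical.

```python
import numpy as np

def random_fourier_features(X, n_features, bandwidth, rng):
    """Random cosine features Z such that Z @ Z.T approximates the Gaussian
    kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth**2))."""
    p = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density.
    W = rng.normal(scale=1.0 / bandwidth, size=(p, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def gaussian_kernel(A, bandwidth):
    """Exact Gaussian kernel matrix, used only to check the approximation."""
    sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def effect_size_analog(X, y, n_features=500, bandwidth=2.0, ridge=1.0, seed=0):
    """Fit a linear model in random-feature space, then project the fitted
    function values onto the columns of X."""
    rng = np.random.default_rng(seed)
    Z = random_fourier_features(X, n_features, bandwidth, rng)
    # Ridge regression in the feature space (a stand-in for a posterior mean).
    theta = np.linalg.solve(Z.T @ Z + ridge * np.eye(n_features), Z.T @ y)
    f_hat = Z @ theta
    # Assumed projection: least-squares coefficients of f_hat on X.
    return np.linalg.pinv(X) @ f_hat

# Simulated data (hypothetical): only the first variable drives the response,
# and it does so nonlinearly.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

beta_tilde = effect_size_analog(X, y)
```

Here `beta_tilde` has one entry per explanatory variable, so it can be read
in the same way as a vector of linear regression coefficients even though the
fitted function is nonlinear.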