-
In the genomic era, the identification of gene signatures associated with
disease is of significant interest. Such signatures are often used to predict
clinical outcomes in new patients and aid clinical decision-making. However,
recent studies have shown that gene signatures are often not replicable. This
occurrence has practical implications regarding the generalizability and
clinical applicability of such signatures. To improve replicability, we
introduce a novel approach to select gene signatures from multiple datasets
whose effects are consistently non-zero and account for between-study
heterogeneity. We build our model upon some rank-based quantities, facilitating
integration over different genomic datasets. A high dimensional penalized
Generalized Linear Mixed Model (pGLMM) is used to select gene signatures and
address data heterogeneity. We compare our method to some commonly used
strategies that select gene signatures ignoring between-study heterogeneity. We
provide asymptotic results justifying the performance of our method and
demonstrate its advantage in the presence of heterogeneity through thorough
simulation studies. Lastly, we motivate our method through a case study
subtyping pancreatic cancer patients from four gene expression studies.
-
To disentangle the complex non-stationary dependence structure of
precipitation extremes over the entire contiguous U.S., we propose a flexible
local approach based on factor copula models. Our sub-asymptotic spatial
modeling framework yields non-trivial tail dependence structures, with a
weakening dependence strength as events become more extreme, a feature commonly
observed with precipitation data but not accounted for in classical asymptotic
extreme-value models. To estimate the local extremal behavior, we fit the
proposed model in small regional neighborhoods to high threshold exceedances,
under the assumption of local stationarity, which allows us to gain in
flexibility. Adopting a local censored likelihood approach, inference is made
on a fine spatial grid, and local estimation is performed by taking advantage
of distributed computing resources and the embarrassingly parallel nature of
this estimation procedure. The local model is efficiently fitted at all grid
points, and uncertainty is measured using a block bootstrap procedure. An
extensive simulation study shows that our approach can adequately capture
complex, non-stationary dependencies, while our study of U.S. winter
precipitation data reveals interesting differences in local tail structures
over space, which has important implications on regional risk assessment of
extreme precipitation events.
-
Calibration of hydrological time-series models is a challenging task since
these models give a wide spectrum of output series and calibration procedures
require significant amount of time. From a statistical standpoint, this model
parameter estimation problem simplifies to finding an inverse solution of a
computer model that generates pre-specified time-series output (i.e., realistic
output series). In this paper, we propose a modified history matching approach
for calibrating the time-series rainfall-runoff models with respect to the real
data collected from the state of Georgia, USA. We present the methodology and
illustrate the application of the algorithm by carrying a simulation study and
the two case studies. Several goodness-of-fit statistics were calculated to
assess the model performance. The results showed that the proposed history
matching algorithm led to a significant improvement, of 30% and 14% (in terms
of root mean squared error) and 26% and 118% (in terms of peak percent
threshold statistics), for the two case-studies with Matlab-Simulink and SWAT
models, respectively.
-
Current health policy calls for greater use of evidence based care delivery
services to improve patient quality and safety outcomes. Care delivery is
complex, with interacting and interdependent components that challenge
traditional statistical analytic techniques, in particular when modeling a time
series of outcomes data that might be "interrupted" by a change in a particular
method of health care delivery. Interrupted time series (ITS) is a robust
quasi-experimental design with the ability to infer the effectiveness of an
intervention that accounts for data dependency. Current standardized methods
for analyzing ITS data do not model changes in variation and correlation
following the intervention. This is a key limitation since it is plausible for
data variability and dependency to change because of the intervention.
Moreover, present methodology either assumes a pre-specified interruption time
point with an instantaneous effect or removes data for which the effect of
intervention is not fully realized. In this paper, we describe and develop a
novel `Robust-ITS' model that overcomes these omissions and limitations. The
Robust-ITS model formally performs inference on: (a) identifying the change
point; (b) differences in pre- and post-intervention correlation; (c)
differences in the outcome variance pre- and post-intervention; and (d)
differences in the mean pre- and post-intervention. We illustrate the proposed
method by analyzing patient satisfaction data from a hospital that implemented
and evaluated a new nursing care delivery model as the intervention of
interest. The Robust-ITS model is implemented in a R Shiny toolbox which is
freely available to the community.
-
OBJECTIVE. A computer program tells me that a mean value is 12.3456789012,
but how many of these digits are significant (the rest being random junk)?
Should I report: 12.3?, 12.3456?, or even 10 (if only the first digit is
significant)? There are several rules-of-thumb but, surprisingly (given that
the problem is so common in science), none seem to be evidence-based. RESULTS.
Here I show how the significance of a digit in a particular decade of a mean
depends on the standard error of the mean (SEM). I define an index, DM that can
be plotted in graphs. From these a simple evidence-based rule for the number of
significant digits ("sigdigs") is distilled: the last sigdig in the mean is in
the same decade as the first or second non-zero digit in the SEM. As example,
for mean 34.63 (SEM 25.62), with n = 17, the reported value should be 35 (SEM
26). Digits beyond these contain little or no useful information, and should
not be reported lest they damage your credibility.
-
Baseline correction plays an important role in past and current
methodological debates in ERP research (e.g. the Tanner v. Maess debate in
Journal of Neuroscience Methods), serving as a potential alternative to strong
highpass filtering. However, the very assumptions that underlie traditional
baseline also undermine it, making it statistically unnecessary and even
undesirable and reducing signal-to-noise ratio. Including the baseline interval
as a predictor in a GLM-based statistical approach allows the data to determine
how much baseline correction is needed, including both full traditional and no
baseline correction as subcases, while reducing the amount of variance in the
residual error term and thus potentially increasing statistical power.
-
This article presents a new method for estimating the amount of an artifact
class in use at a given moment in the past from a random assemblage of
archaeological finds. This method is based on the use of simulation, since an
analytical solution is computationally impractical. Estimating the number of
artifacts in use at any time $t$ is shown to follow a Poisson distribution,
which allows for credible intervals to be established using the Jeffreys prior.
This estimator works from minimal assumptions about the dating and duration of
finds, as well as the intensity of collection, and is applied to coinage from
four Roman-period sites excavated by the Roman Peasant Project (2009-2014). The
result provides for an estimation of the abundance of material according to an
interval of certainty.
-
Count data and recurrent events in clinical trials, such as the number of
lesions in magnetic resonance imaging in multiple sclerosis, the number of
relapses in multiple sclerosis, the number of hospitalizations in heart
failure, and the number of exacerbations in asthma or in chronic obstructive
pulmonary disease (COPD) are often modeled by negative binomial distributions.
In this manuscript we study planning and analyzing clinical trials with group
sequential designs for negative binomial outcomes. We propose a group
sequential testing procedure for negative binomial outcomes based on Wald
statistics using maximum likelihood estimators. The asymptotic distribution of
the proposed group sequential tests statistics are derived. The finite sample
size properties of the proposed group sequential test for negative binomial
outcomes and the methods for planning the respective clinical trials are
assessed in a simulation study. The simulation scenarios are motivated by
clinical trials in chronic heart failure and relapsing multiple sclerosis,
which cover a wide range of practically relevant settings. Our research assures
that the asymptotic normal theory of group sequential designs can be applied to
negative binomial outcomes when the hypotheses are tested using Wald statistics
and maximum likelihood estimators. We also propose two methods, one based on
Student's t-distribution and one based on resampling, to improve type I error
rate control in small samples. The statistical methods studied in this
manuscript are implemented in the R package \textit{gscounts}, which is
available for download on the Comprehensive R Archive Network (CRAN).
-
Prior information is often incorporated informally when planning a clinical
trial. Here, we present an approach on how to incorporate prior information,
such as data from historical clinical trials, into the nuisance parameter based
sample size re-estimation in a design with an internal pilot study. We focus on
trials with continuous endpoints in which the outcome variance is the nuisance
parameter. For planning and analyzing the trial frequentist methods are
considered. Moreover, the external information on the variance is summarized by
the Bayesian meta-analytic-predictive (MAP) approach. To incorporate external
information into the sample size re-estimation, we propose to update the MAP
prior based on the results of the internal pilot study and to re-estimate the
sample size using an estimator from the posterior. By means of a simulation
study, we compare the operating characteristics such as power and sample size
distribution of the proposed procedure with the traditional sample size
re-estimation approach which uses the pooled variance estimator. The simulation
study shows that, if no prior-data conflict is present, incorporating external
information into the sample size re-estimation improves the operating
characteristics compared to the traditional approach. In the case of a
prior-data conflict, that is when the variance of the ongoing clinical trial is
unequal to the prior location, the performance of the traditional sample size
re-estimation procedure is in general superior, even when the prior information
is robustified. When considering to include prior information in sample size
re-estimation, the potential gains should be balanced against the risks.
-
Goals are results of pin-point shots and it is a pivotal decision in soccer
when, how and where to shoot. The main contribution of this study is two-fold.
At first, after showing that there exists high spatial correlation in the data
of shots across games, we introduce a spatial process in the error structure to
model the probability of conversion from a shot depending on positional and
situational covariates. The model is developed using a full Bayesian framework.
Secondly, based on the proposed model, we define two new measures that can
appropriately quantify the impact of an individual in soccer, by evaluating the
positioning senses and shooting abilities of the players. As a practical
application, the method is implemented on Major League Soccer data from 2016/17
season.
-
The widespread use of generalized linear models in case-control genetic
studies has helped identify many disease-associated risk factors typically
defined as DNA variants, or single nucleotide polymorphisms (SNPs). Up to now,
most literature has focused on selecting a unique best subset of SNPs based on
some statistical perspectives. In the presence of pronounced noise, however,
multiple biological paths are often found to be equally supported by a given
dataset when dealing with complex genetic diseases. We address the ambiguity
related to SNP selection by constructing a list of models called variable
selection confidence set (VSCS), which contains the collection of all
well-supported SNP combinations at a user-specified confidence level. The VSCS
extends the familiar notion of confidence intervals in the variable selection
setting and provides the practitioner with new tools aiding the variable
selection activity beyond trusting a single model. Based on the VSCS, we
consider natural graphical and numerical statistics measuring the inclusion
importance of a SNP based on its frequency in the most parsimonious VSCS
models. This work is motivated by available case-control genetic data on
age-related macular degeneration, a widespread complex disease and leading
cause of vision loss.
-
A region under rainfall is a contiguous spatial area receiving positive
precipitation at a particular time. The probabilistic behavior of such a region
is an issue of interest in meteorological studies. A region under rainfall can
be viewed as a shape object of a special kind, where scale and rotational
invariance are not necessarily desirable attributes of a mathematical
representation. For modeling variation in objects of this type, we propose an
approximation of the boundary that can be represented as a real valued
function, and arrive at further approximation through functional principal
component analysis, after suitable adjustment for asymmetry and incompleteness
in the data. The analysis of an open access satellite data set on monsoon
precipitation over Eastern India leads to explanation of most of the variation
in shapes of the regions under rainfall through a handful of interpretable
functions that can be further approximated parametrically. The most important
aspect of shape is found to be the size followed by contraction/elongation,
mostly along two pairs of orthogonal axes. The different modes of variation are
remarkably stable across calendar years and across different thresholds for
minimum size of the region.
-
Quasi-Monte Carlo (QMC) method is a useful numerical tool for pricing and
hedging of complex financial derivatives. These problems are usually of high
dimensionality and discontinuities. The two factors may significantly
deteriorate the performance of the QMC method. This paper develops an
integrated method that overcomes the challenges of the high dimensionality and
discontinuities concurrently. For this purpose, a smoothing method is proposed
to remove the discontinuities for some typical functions arising from financial
engineering. To make the smoothing method applicable for more general
functions, a new path generation method is designed for simulating the paths of
the underlying assets such that the resulting function has the required form.
The new path generation method has an additional power to reduce the effective
dimension of the target function. Our proposed method caters for a large
variety of model specifications, including the Black-Scholes, exponential
normal inverse Gaussian L\'evy, and Heston models. Numerical experiments
dealing with these models show that in the QMC setting the proposed smoothing
method in combination with the new path generation method can lead to a
dramatic variance reduction for pricing exotic options with discontinuous
payoffs and for calculating options' Greeks. The investigation on the effective
dimension and the related characteristics explains the significant enhancement
of the combined procedure.
-
The statistics of the smallest eigenvalue of Wishart-Laguerre ensemble is
important from several perspectives. The smallest eigenvalue density is
typically expressible in terms of determinants or Pfaffians. These results are
of utmost significance in understanding the spectral behavior of
Wishart-Laguerre ensembles and, among other things, unveil the underlying
universality aspects in the asymptotic limits. However, obtaining exact and
explicit expressions by expanding determinants or Pfaffians becomes impractical
if large dimension matrices are involved. For the real matrices ($\beta=1$)
Edelman has provided an efficient recurrence scheme to work out exact and
explicit results for the smallest eigenvalue density which does not involve
determinants or matrices. Very recently, an analogous recurrence scheme has
been obtained for the complex matrices ($\beta=2$). In the present work we
extend this to $\beta$-Wishart-Laguerre ensembles for the case when exponent
$\alpha$ in the associated Laguerre weight function, $\lambda^\alpha
e^{-\beta\lambda/2}$, is a non-negative integer, while $\beta$ is positive
real. This also gives access to the smallest eigenvalue density of fixed trace
$\beta$-Wishart-Laguerre ensemble, as well as moments for both cases. Moreover,
comparison with earlier results for the smallest eigenvalue density in terms of
certain hypergeometric function of matrix argument results in an effective way
of evaluating these explicitly. Exact evaluations for large values of $n$ (the
matrix dimension) and $\alpha$ also enable us to compare with Tracy-Widom
density and large deviation results of Katzav and Castillo. We also use our
result to obtain the density of the largest of the proper delay times which are
eigenvalues of the Wigner-Smith matrix and are relevant to the problem of
quantum chaotic scattering.
-
Hypothesis testing in the linear regression model is a fundamental
statistical problem. We consider linear regression in the high-dimensional
regime where the number of parameters exceeds the number of samples ($p> n$).
In order to make informative inference, we assume that the model is
approximately sparse, that is the effect of covariates on the response can be
well approximated by conditioning on a relatively small number of covariates
whose identities are unknown. We develop a framework for testing very general
hypotheses regarding the model parameters. Our framework encompasses testing
whether the parameter lies in a convex cone, testing the signal strength, and
testing arbitrary functionals of the parameter. We show that the proposed
procedure controls the type I error, and also analyze the power of the
procedure. Our numerical experiments confirm our theoretical findings and
demonstrate that we control false positive rate (type I error) near the nominal
level, and have high power. By duality between hypotheses testing and
confidence intervals, the proposed framework can be used to obtain valid
confidence intervals for various functionals of the model parameters. For
linear functionals, the length of confidence intervals is shown to be minimax
rate optimal.
-
Data depth concept offers a variety of powerful and user friendly tools for
robust exploration and inference for multivariate socio-economic phenomena. The
offered techniques may be successfully used in cases of lack of our knowledge
on parametric models generating data due to their nonparametric nature. This
paper presents the R package DepthProc, which is available under GPL-2 licence
on CRAN and R-forge servers for Windows, Linux and OS X platform. The package
consist of among others successful implementations of several data depth
techniques involving multivariate quantile-quantile plots, multivariate scatter
estimators, local Wilcoxon tests for multivariate as well as for functional
data, robust regressions. In order to show the package capabilities, real
datasets concerning United Nations Fourth Millennium Goal and the Internet
users activity are used.
-
Glioblastoma multiforme (GBM) is an aggressive form of human brain cancer
that is under active study in the field of cancer biology. Its rapid
progression and the relative time cost of obtaining molecular data make other
readily-available forms of data, such as images, an important resource for
actionable measures in patients. Our goal is to utilize information given by
medical images taken from GBM patients in statistical settings. To do this, we
design a novel statistic---the smooth Euler characteristic transform
(SECT)---that quantifies magnetic resonance images (MRIs) of tumors. Due to its
well-defined inner product structure, the SECT can be used in a wider range of
functional and nonparametric modeling approaches than other previously proposed
topological summary statistics. When applied to a cohort of GBM patients, we
find that the SECT is a better predictor of clinical outcomes than both
existing tumor shape quantifications and common molecular assays. Specifically,
we demonstrate that SECT features alone explain more of the variance in GBM
patient survival than gene expression, volumetric features, and morphometric
features. The main takeaways from our findings are thus twofold. First, they
suggest that images contain valuable information that can play an important
role in clinical prognosis and other medical decisions. Second, they show that
the SECT is a viable tool for the broader study of medical imaging informatics.
-
We consider Bayesian inference for stochastic differential equation mixed
effects models (SDEMEMs) exemplifying tumor response to treatment and regrowth
in mice. We produce an extensive study on how a SDEMEM can be fitted using both
exact inference based on pseudo-marginal MCMC and approximate inference via
Bayesian synthetic likelihoods (BSL). We investigate a two-compartments SDEMEM,
these corresponding to the fractions of tumor cells killed by and survived to a
treatment, respectively. Case study data considers a tumor xenography study
with two treatment groups and one control, each containing 5-8 mice. Results
from the case study and from simulations indicate that the SDEMEM is able to
reproduce the observed growth patterns and that BSL is a robust tool for
inference in SDEMEMs. Finally, we compare the fit of the SDEMEM to a similar
ordinary differential equation model. Due to small sample sizes, strong prior
information is needed to identify all model parameters in the SDEMEM and it
cannot be determined which of the two models is the better in terms of
predicting tumor growth curves. In a simulation study we find that with a
sample of 17 mice per group BSL is able to identify all model parameters and
distinguish treatment groups.
-
Facing increasing domestic energy consumption from population growth and
industrialization, Saudi Arabia is aiming to reduce its reliance on fossil
fuels and to broaden its energy mix by expanding investment in renewable energy
sources, including wind energy. A preliminary task in the development of wind
energy infrastructure is the assessment of wind energy potential, a key aspect
of which is the characterization of its spatio-temporal behavior. In this study
we examine the impact of internal climate variability on seasonal wind power
density fluctuations over Saudi Arabia using 30 simulations from the Large
Ensemble Project (LENS) developed at the National Center for Atmospheric
Research. Furthermore, a spatio-temporal model for daily wind speed is proposed
with neighbor-based cross-temporal dependence, and a multivariate skew-t
distribution to capture the spatial patterns of higher order moments. The model
can be used to generate synthetic time series over the entire spatial domain
that adequately reproduce the internal variability of the LENS dataset.
-
Brain function is organized in coordinated modes of spatio-temporal activity
(functional networks) exhibiting an intrinsic baseline structure with
variations under different experimental conditions. Existing approaches for
uncovering such network structures typically do not explicitly model shared and
differential patterns across networks, thus potentially reducing the detection
power. We develop an integrative modeling approach for jointly modeling
multiple brain networks across experimental conditions. The proposed Bayesian
Joint Network Learning approach develops flexible priors on the edge
probabilities involving a common intrinsic baseline structure and differential
effects specific to individual networks. Conditional on these edge
probabilities, connection strengths are modeled under a Bayesian spike and slab
prior on the off-diagonal elements of the inverse covariance matrix. The model
is fit under a posterior computation scheme based on Markov chain Monte Carlo.
Numerical simulations illustrate that the proposed joint modeling approach has
increased power to detect true differential edges while providing adequate
control on false positives and achieving greater accuracy in the estimation of
edge strengths compared to existing methods. An application of the method to
fMRI Stroop task data provides unique insights into brain network alterations
between cognitive conditions which existing graphical modeling techniques
failed to reveal.
-
The proliferation of terrorism is a serious concern in national and
international security, as its spread is seen as an existential threat to
Western liberal democracies. Understanding and effectively modelling the spread
of terrorism provides useful insight into formulating effective responses. A
mathematical model capturing the theoretical constructs of contagion and
diffusion is constructed for explaining the spread of terrorist activity and
used to analyse data from the Global Terrorism Database from 2000--2016 for
Afghanistan, Iraq, and Israel.
-
Consider the problem of estimating the entries of an unknown mean matrix or
tensor given a single noisy realization. In the matrix case, this problem can
be addressed by decomposing the mean matrix into a component that is additive
in the rows and columns, i.e.\ the additive ANOVA decomposition of the mean
matrix, plus a matrix of elementwise effects, and assuming that the elementwise
effects may be sparse. Accordingly, the mean matrix can be estimated by solving
a penalized regression problem, applying a lasso penalty to the elementwise
effects. Although solving this penalized regression problem is straightforward,
specifying appropriate values of the penalty parameters is not. Leveraging the
posterior mode interpretation of the penalized regression problem, moment-based
empirical Bayes estimators of the penalty parameters can be defined. Estimation
of the mean matrix using these these moment-based empirical Bayes estimators
can be called LANOVA penalization, and the corresponding estimate of the mean
matrix can be called the LANOVA estimate. The empirical Bayes estimators are
shown to be consistent. Additionally, LANOVA penalization is extended to
accommodate sparsity of row and column effects and to estimate an unknown mean
tensor. The behavior of the LANOVA estimate is examined under misspecification
of the distribution of the elementwise effects, and LANOVA penalization is
applied to several datasets, including a matrix of microarray data, a three-way
tensor of fMRI data and a three-way tensor of wheat infection data.
-
Cryo-electron microscopy provides 2-D projection images of the 3-D electron
scattering intensity of many instances of the particle under study (e.g., a
virus). Both symmetry (rotational point groups) and heterogeneity are important
aspects of biological particles and both aspects can be combined by describing
the electron scattering intensity of the particle as a stochastic process with
a symmetric probability law and therefore symmetric moments. A maximum
likelihood estimator implemented by an expectation-maximization algorithm is
described which estimates the unknown statistics of the electron scattering
intensity stochastic process from images of instances of the particle. The
algorithm is demonstrated on the bacteriophage HK97 and the virus N$\omega$V.
The results are contrasted with existing algorithms which assume that each
instance of the particle has the symmetry rather than the less restrictive
assumption that the probability law has the symmetry.
-
The functional linear regression model with points of impact is a recent
augmentation of the classical functional linear model with many practically
important applications. In this work, however, we demonstrate that the existing
data-driven procedure for estimating the parameters of this regression model
can be very instable and inaccurate. The tendency to omit relevant points of
impact is a particularly problematic aspect resulting in omitted-variable
biases. We explain the theoretical reason for this problem and propose a new
sequential estimation algorithm that leads to significantly improved estimation
results. Our estimation algorithm is compared with the existing estimation
procedure using an in-depth simulation study. The applicability is demonstrated
using data from Google AdWords, today's most important platform for online
advertisements. The \textsf{R}-package \texttt{FunRegPoI} and additional
\textsf{R}-codes are provided in the online supplementary material.
-
The method of model averaging has become an important tool to deal with model
uncertainty, for example in situations where a large amount of different
theories exist, as are common in economics. Model averaging is a natural and
formal response to model uncertainty in a Bayesian framework, and most of the
paper deals with Bayesian model averaging. The important role of the prior
assumptions in these Bayesian procedures is highlighted. In addition,
frequentist model averaging methods are also discussed. Numerical methods to
implement these methods are explained, and I point the reader to some freely
available computational resources. The main focus is on uncertainty regarding
the choice of covariates in normal linear regression models, but the paper also
covers other, more challenging, settings, with particular emphasis on sampling
models commonly used in economics. Applications of model averaging in economics
are reviewed and discussed in a wide range of areas, among which growth
economics, production modelling, finance and forecasting macroeconomic
quantities.