• A Non-Gaussian Spatio-Temporal Model for Daily Wind Speeds Based on a Multivariate Skew-t Distribution(1703.04312)

Feb. 13, 2019 stat.AP
Facing increasing domestic energy consumption from population growth and industrialization, Saudi Arabia is aiming to reduce its reliance on fossil fuels and to broaden its energy mix by expanding investment in renewable energy sources, including wind energy. A preliminary task in the development of wind energy infrastructure is the assessment of wind energy potential, a key aspect of which is the characterization of its spatio-temporal behavior. In this study we examine the impact of internal climate variability on seasonal wind power density fluctuations over Saudi Arabia using 30 simulations from the Large Ensemble Project (LENS) developed at the National Center for Atmospheric Research. Furthermore, a spatio-temporal model for daily wind speed is proposed with neighbor-based cross-temporal dependence, and a multivariate skew-t distribution to capture the spatial patterns of higher order moments. The model can be used to generate synthetic time series over the entire spatial domain that adequately reproduce the internal variability of the LENS dataset.
• A Copula Model for Non-Gaussian Multivariate Spatial Data(1603.03950)

Oct. 10, 2018 stat.ME, stat.AP
We propose a new copula model for replicated multivariate spatial data. Unlike classical models that assume multivariate normality of the data, the proposed copula is based on the assumption that some factors exist that affect the joint spatial dependence of all measurements of each variable as well as the joint dependence among these variables. The model is parameterized in terms of a cross-covariance function that may be chosen from the many models proposed in the literature. In addition, there are additive factors in the model that allow tail dependence and reflection asymmetry of each variable measured at different locations and of different variables to be modeled. The proposed approach can therefore be seen as an extension of the linear model of coregionalization widely used for modeling multivariate spatial data. The likelihood of the model can be obtained in a simple form and therefore the likelihood estimation is quite fast. The model is not restricted to the set of data locations, and using the estimated copula, spatial data can be interpolated at locations where values of variables are unknown. We apply the proposed model to temperature and pressure data and compare its performance with the performance of a popular model from multivariate geostatistics.
• Likelihood Approximation With Hierarchical Matrices For Large Spatial Datasets(1709.04419)

Sept. 12, 2018 stat.CO
We use available measurements to estimate the unknown parameters (variance, smoothness parameter, and covariance length) of a covariance function by maximizing the joint Gaussian log-likelihood function. To overcome cubic complexity in the linear algebra, we approximate the discretized covariance function in the hierarchical (H-) matrix format. The H-matrix format has a log-linear computational cost and storage $O(kn \log n)$, where the rank $k$ is a small integer and $n$ is the number of locations. The H-matrix technique allows us to work with general covariance matrices in an efficient way, since H-matrices can approximate inhomogeneous covariance functions, with a fairly general mesh that is not necessarily axes-parallel, and neither the covariance matrix itself nor its inverse have to be sparse. We demonstrate our method with Monte Carlo simulations and an application to soil moisture data. The C, C++ codes and data are freely available.
• Bayesian model averaging over tree-based dependence structures for multivariate extremes(1705.10488)

July 22, 2018 stat.ME
Describing the complex dependence structure of extreme phenomena is particularly challenging. To tackle this issue we develop a novel statistical algorithm that describes extremal dependence taking advantage of the inherent hierarchical dependence structure of the max-stable nested logistic distribution and that identifies possible clusters of extreme variables using reversible jump Markov chain Monte Carlo techniques. Parsimonious representations are achieved when clusters of extreme variables are found to be completely independent. Moreover, we significantly decrease the computational complexity of full likelihood inference by deriving a recursive formula for the nested logistic model likelihood. The algorithm performance is verified through extensive simulation experiments which also compare different likelihood procedures. The new methodology is used to investigate the dependence relationships between extreme concentration of multiple pollutants in California and how these pollutants are related to extreme weather conditions. Overall, we show that our approach allows for the representation of complex extremal dependence structures and has valid applications in multivariate data analysis, such as air pollution monitoring, where it can guide policymaking.
• Full likelihood inference for max-stable data(1703.08665)

July 13, 2018 stat.ME
We show how to perform full likelihood inference for max-stable multivariate distributions or processes based on a stochastic Expectation-Maximisation algorithm, which combines statistical and computational efficiency in high-dimensions. The good performance of this methodology is demonstrated by simulation based on the popular logistic and Brown--Resnick models, and it is shown to provide dramatic computational time improvements with respect to a direct computation of the likelihood. Strategies to further reduce the computational burden are also discussed.
• ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems(1708.02835)

June 22, 2018 cs.DC
We present ExaGeoStat, a high performance framework for geospatial statistics in climate and environment modeling. In contrast to simulation based on partial differential equations derived from first-principles modeling, ExaGeoStat employs a statistical model based on the evaluation of the Gaussian log-likelihood function, which operates on a large dense covariance matrix. Generated by the parametrizable Matern covariance function, the resulting matrix is symmetric and positive definite. The computational tasks involved during the evaluation of the Gaussian log-likelihood function become daunting as the number n of geographical locations grows, as O(n2) storage and O(n3) operations are required. While many approximation methods have been devised from the side of statistical modeling to ameliorate these polynomial complexities, we are interested here in the complementary approach of evaluating the exact algebraic result by exploiting advances in solution algorithms and many-core computer architectures. Using state-of-the-art high performance dense linear algebra libraries associated with various leading edge parallel architectures (Intel KNLs, NVIDIA GPUs, and distributed-memory systems), ExaGeoStat raises the game for statistical applications from climate and environmental science. ExaGeoStat provides a reference evaluation of statistical parameters, with which to assess the validity of the various approaches based on approximation. The framework takes a first step in the merger of large-scale data analytics and extreme computing for geospatial statistical applications, to be followed by additional complexity reducing improvements from the solver side that can be implemented under the same interface. Thus, a single uncompromised statistical model can ultimately be executed in a wide variety of emerging exascale environments.
• Tile Low-Rank Approximation of Large-Scale Maximum Likelihood Estimation on Manycore Architectures(1804.09137)

April 24, 2018 cs.NA
Maximum likelihood estimation is an important statistical technique for estimating missing data, for example in climate and environmental applications, which are usually large and feature data points that are irregularly spaced. In particular, the Gaussian log-likelihood function is the \emph{de facto} model, which operates on the resulting sizable dense covariance matrix. The advent of high performance systems with advanced computing power and memory capacity have enabled full simulations only for rather small dimensional climate problems, solved at the machine precision accuracy. The challenge for high dimensional problems lies in the computation requirements of the log-likelihood function, which necessitates ${\mathcal O}(n^2)$ storage and ${\mathcal O}(n^3)$ operations, where $n$ represents the number of given spatial locations. This prohibitive computational cost may be reduced by using approximation techniques that not only enable large-scale simulations otherwise intractable but also maintain the accuracy and the fidelity of the spatial statistics model. In this paper, we extend the Exascale GeoStatistics software framework (i.e., ExaGeoStat) to support the Tile Low-Rank (TLR) approximation technique, which exploits the data sparsity of the dense covariance matrix by compressing the off-diagonal tiles up to a user-defined accuracy threshold. The underlying linear algebra operations may then be carried out on this data compression format, which may ultimately reduce the arithmetic complexity of the maximum likelihood estimation and the corresponding memory footprint. Performance results of TLR-based computations on shared and distributed-memory systems attain up to 13X and 5X speedups, respectively, compared to full accuracy simulations using synthetic and real datasets (up to 2M), while ensuring adequate prediction accuracy.
• An Outlyingness Matrix for Multivariate Functional Data Classification(1704.02568)

April 22, 2018 stat.ME, stat.ML
The classification of multivariate functional data is an important task in scientific research. Unlike point-wise data, functional data are usually classified by their shapes rather than by their scales. We define an outlyingness matrix by extending directional outlyingness, an effective measure of the shape variation of curves that combines the direction of outlyingness with conventional depth. We propose two classifiers based on directional outlyingness and the outlyingness matrix, respectively. Our classifiers provide better performance compared with existing depth-based classifiers when applied on both univariate and multivariate functional data from simulation studies. We also test our methods on two data problems: speech recognition and gesture classification, and obtain results that are consistent with the findings from the simulated data.
• Multivariate Functional Data Visualization and Outlier Detection(1703.06419)

April 22, 2018 stat.CO, stat.ME
This article proposes a new graphical tool, the magnitude-shape (MS) plot, for visualizing both the magnitude and shape outlyingness of multivariate functional data. The proposed tool builds on the recent notion of functional directional outlyingness, which measures the centrality of functional data by simultaneously considering the level and the direction of their deviation from the central region. The MS-plot intuitively presents not only levels but also directions of magnitude outlyingness on the horizontal axis or plane, and demonstrates shape outlyingness on the vertical axis. A dividing curve or surface is provided to separate non-outlying data from the outliers. Both the simulated data and the practical examples confirm that the MS-plot is superior to existing tools for visualizing centrality and detecting outliers for functional data.
• Directional Outlyingness for Multivariate Functional Data(1612.04615)

April 22, 2018 stat.ME
The direction of outlyingness is crucial to describing the centrality of multivariate functional data. Motivated by this idea, we generalize classical depth to directional outlyingness for functional data. We investigate theoretical properties of functional directional outlyingness and find that the total outlyingness can be naturally decomposed into two parts: magnitude outlyingness and shape outlyingness which represent the centrality of a curve for magnitude and shape, respectively. Using this decomposition, we provide a visualization tool for the centrality of curves. Furthermore, we design an outlier detection procedure based on functional directional outlyingness. This criterion applies to both univariate and multivariate curves and simulation studies show that it outperforms competing methods. Weather and electrocardiogram data demonstrate the practical application of our proposed framework.
• Bayesian Modeling of Air Pollution Extremes Using Nested Multivariate Max-Stable Processes(1804.04588)

March 18, 2018 stat.AP
Capturing the potentially strong dependence among the peak concentrations of multiple air pollutants across a spatial region is crucial for assessing the related public health risks. In order to investigate the multivariate spatial dependence properties of air pollution extremes, we introduce a new class of multivariate max-stable processes. Our proposed model admits a hierarchical tree-based formulation, in which the data are conditionally independent given some latent nested $\alpha$-stable random factors. The hierarchical structure facilitates Bayesian inference and offers a convenient and interpretable characterization. We fit this nested multivariate max-stable model to the maxima of air pollution concentrations and temperatures recorded at a number of sites in the Los Angeles area, showing that the proposed model succeeds in capturing their complex tail dependence structure.
• Reducing Storage of Global Wind Ensembles with Stochastic Generators(1702.01995)

Oct. 1, 2017 stat.AP
Wind has the potential to make a significant contribution to future energy resources. Locating the sources of this renewable energy on a global scale is however extremely challenging, given the difficulty to store very large data sets generated by modern computer models. We propose a statistical model that aims at reproducing the data-generating mechanism of an ensemble of runs via a Stochastic Generator (SG) of global annual wind data. We introduce an evolutionary spectrum approach with spatially varying parameters based on large-scale geographical descriptors such as altitude to better account for different regimes across the Earth's orography. We consider a multi-step conditional likelihood approach to estimate the parameters that explicitly accounts for nonstationary features while also balancing memory storage and distributed computation. We apply the proposed model to more than 18 million points of yearly global wind speed. The proposed SG requires orders of magnitude less storage for generating surrogate ensemble members from wind than does creating additional wind fields from the climate model, even if an effective lossy data compression algorithm is applied to the simulation output.
• Factor Copula Models for Replicated Spatial Data(1511.03000)

Dec. 7, 2016 stat.AP
We propose a new copula model that can be used with replicated spatial data. Unlike the multivariate normal copula, the proposed copula is based on the assumption that a common factor exists and affects the joint dependence of all measurements of the process. Moreover, the proposed copula can model tail dependence and tail asymmetry. The model is parameterized in terms of a covariance function that may be chosen from the many models proposed in the literature, such as the Matern model. For some choice of common factors, the joint copula density is given in closed form and therefore likelihood estimation is very fast. In the general case, one-dimensional numerical integration is needed to calculate the likelihood, but estimation is still reasonably fast even with large data sets. We use simulation studies to show the wide range of dependence structures that can be generated by the proposed model with different choices of common factors. We apply the proposed model to spatial temperature data and compare its performance with some popular geostatistics models.
• A Multi-Resolution Spatio-Temporal Model for Brain Activation and Connectivity in fMRI Data(1602.02435)

June 15, 2016 stat.AP
Functional Magnetic Resonance Imaging (fMRI) is a primary modality for studying brain activity. Modeling spatial dependence of imaging data at different scales is one of the main challenges of contemporary neuroimaging, and it could allow for accurate testing for significance in neural activity. The high dimensionality of this type of data (on the order of hundreds of thousands of voxels) poses serious modeling challenges and considerable computational constraints. For the sake of feasibility, standard models typically reduce dimensionality by modeling covariance among regions of interest (ROIs) -- coarser or larger spatial units -- rather than among voxels. However, ignoring spatial dependence at different scales could drastically reduce our ability to detect activation patterns in the brain and hence produce misleading results. To overcome these problems, we introduce a multi-resolution spatio-temporal model and a computationally efficient methodology to estimate cognitive control related activation and whole-brain connectivity. The proposed model allows for testing voxel-specific activation while accounting for non-stationary local spatial dependence within anatomically defined ROIs, as well as regional dependence (between-ROIs). Furthermore, the model allows for detection of interpretable connectivity patterns among ROIs using the graphical Least Absolute Shrinkage Selection Operator (LASSO). The model is used in a motor-task fMRI study to investigate brain activation and connectivity patterns aimed at identifying associations between these patterns and regaining motor functionality following a stroke.
• Multi-Level Restricted Maximum Likelihood Covariance Estimation and Kriging for Large Non-Gridded Spatial Datasets(1504.00302)

March 28, 2016 stat.CO
We develop a multi-level restricted Gaussian maximum likelihood method for estimating the covariance function parameters and computing the best unbiased predictor. Our approach produces a new set of multi-level contrasts where the deterministic parameters of the model are filtered out thus enabling the estimation of the covariance parameters to be decoupled from the deterministic component. Moreover, the multi-level covariance matrix of the contrasts exhibit fast decay that is dependent on the smoothness of the covariance function. Due to the fast decay of the multi-level covariance matrix coefficients only a small set is computed with a level dependent criterion. We demonstrate our approach on problems of up to 512,000 observations with a Matern covariance function and highly irregular placements of the observations. In addition, these problems are numerically unstable and hard to solve with traditional methods.
• Non-Stationary Dependence Structures for Spatial Extremes(1411.3174)

Feb. 19, 2016 stat.ME
Max-stable processes are natural models for spatial extremes because they provide suitable asymptotic approximations to the distribution of maxima of random fields. In the recent past, several parametric families of stationary max-stable models have been developed, and fitted to various types of data. However, a recurrent problem is the modeling of non-stationarity. In this paper, we develop non-stationary max-stable dependence structures in which covariates can be easily incorporated. Inference is performed using pairwise likelihoods, and its performance is assessed by an extensive simulation study based on a non-stationary locally isotropic extremal $t$ model. Evidence that unknown parameters are well estimated is provided, and estimation of spatial return level curves is discussed. The methodology is demonstrated with temperature maxima recorded over a complex topography. Models are shown to satisfactorily capture extremal dependence.
• Bayesian linear regression with skew-symmetric error distributions with applications to survival analysis(1601.02224)

Jan. 10, 2016 stat.AP
We study Bayesian linear regression models with skew-symmetric scale mixtures of normal error distributions. These kinds of models can be used to capture departures from the usual assumption of normality of the errors in terms of heavy tails and asymmetry. We propose a general non-informative prior structure for these regression models and show that the corresponding posterior distribution is proper under mild conditions. We extend these propriety results to cases where the response variables are censored. The latter scenario is of interest in the context of accelerated failure time models, which are relevant in survival analysis. We present a simulation study that demonstrates good frequentist properties of the posterior credible intervals associated to the proposed priors. This study also sheds some light on the trade-off between increased model flexibility and the risk of over-fitting. We illustrate the performance of the proposed models with real data. Although we focus on models with univariate response variables, we also present some extensions to the multivariate case in the Supporting Web Material.
• On nomenclature for, and the relative merits of, two formulations of skew distributions(1402.5431)

Dec. 3, 2015 stat.CO, math.ST, stat.TH
We examine some distributions used extensively within the model-based clustering literature in recent years, paying special attention to} claims that have been made about their relative efficacy. Theoretical arguments are provided as well as real data examples.
• Rejoinder of Cross-Covariance Functions for Multivariate Geostatistics''(1507.08405)

July 30, 2015 stat.ME
Rejoinder of Cross-Covariance Functions for Multivariate Geostatistics'' by Genton and Kleiber [arXiv:1507.08017].
• Cross-Covariance Functions for Multivariate Geostatistics(1507.08017)

July 29, 2015 stat.ME
Continuously indexed datasets with multiple variables have become ubiquitous in the geophysical, ecological, environmental and climate sciences, and pose substantial analysis challenges to scientists and statisticians. For many years, scientists developed models that aimed at capturing the spatial behavior for an individual process; only within the last few decades has it become commonplace to model multiple processes jointly. The key difficulty is in specifying the cross-covariance function, that is, the function responsible for the relationship between distinct variables. Indeed, these cross-covariance functions must be chosen to be consistent with marginal covariance functions in such a way that the second-order structure always yields a nonnegative definite covariance matrix. We review the main approaches to building cross-covariance models, including the linear model of coregionalization, convolution methods, the multivariate Mat\'{e}rn and nonstationary and space-time extensions of these among others. We additionally cover specialized constructions, including those designed for asymmetry, compact support and spherical domains, with a review of physics-constrained models. We illustrate select models on a bivariate regional climate model output example for temperature and pressure, along with a bivariate minimum and maximum temperature observational dataset; we compare models by likelihood value as well as via cross-validation co-kriging studies. The article closes with a discussion of unsolved problems.
• Likelihood estimators for multivariate extremes(1411.3448)

June 16, 2015 stat.ME
The main approach to inference for multivariate extremes consists in approximating the joint upper tail of the observations by a parametric family arising in the limit for extreme events. The latter may be expressed in terms of componentwise maxima, high threshold exceedances or point processes, yielding different but related asymptotic characterizations and estimators. The present paper clarifies the connections between the main likelihood estimators, and assesses their practical performance. We investigate their ability to estimate the extremal dependence structure and to predict future extremes, using exact calculations and simulation, in the case of the logistic model.
• Efficient Maximum Approximated Likelihood Inference for Tukey's g-and-h Distribution(1506.00878)

June 2, 2015 stat.ME
Tukey's $g$-and-$h$ distribution has been a powerful tool for data exploration and modeling since its introduction. However, two long standing challenges associated with this distribution family have remained unsolved until this day: how to find an optimal estimation procedure and how to make valid statistical inference on unknown parameters. To overcome these two challenges, a computationally efficient estimation procedure based on maximizing an approximated likelihood function of the Tukey's $g$-and-$h$ distribution is proposed and is shown to have the same estimation efficiency as the maximum likelihood estimator under mild conditions. The asymptotic distribution of the proposed estimator is derived and a series of approximated likelihood ratio test statistics are developed to conduct hypothesis tests involving two shape parameters of Tukey's $g$-and-$h$ distribution. Simulation examples and an analysis of air pollution data are used to demonstrate the effectiveness of the proposed estimation and testing procedures.
• Incorporating geostrophic wind information for improved space-time short-term wind speed forecasting(1412.1915)

Dec. 5, 2014 physics.ao-ph, stat.AP
Accurate short-term wind speed forecasting is needed for the rapid development and efficient operation of wind energy resources. This is, however, a very challenging problem. Although on the large scale, the wind speed is related to atmospheric pressure, temperature, and other meteorological variables, no improvement in forecasting accuracy was found by incorporating air pressure and temperature directly into an advanced space-time statistical forecasting model, the trigonometric direction diurnal (TDD) model. This paper proposes to incorporate the geostrophic wind as a new predictor in the TDD model. The geostrophic wind captures the physical relationship between wind and pressure through the observed approximate balance between the pressure gradient force and the Coriolis acceleration due to the Earth's rotation. Based on our numerical experiments with data from West Texas, our new method produces more accurate forecasts than does the TDD model using air pressure and temperature for 1- to 6-hour-ahead forecasts based on three different evaluation criteria. Furthermore, forecasting errors can be further reduced by using moving average hourly wind speeds to fit the diurnal pattern. For example, our new method obtains between 13.9% and 22.4% overall mean absolute error reduction relative to persistence in 2-hour-ahead forecasts, and between 5.3% and 8.2% reduction relative to the best previous space-time methods in this setting.
• On the asymptotic joint distribution of sample space--time covariance estimators(0803.2171)

March 14, 2008 math.ST, stat.TH
We study the asymptotic joint distribution of sample space--time covariance estimators of strictly stationary random fields. We do this without any marginal or joint distributional assumptions other than mild moment and mixing conditions. We consider several situations depending on whether the observations are regularly or irregularly spaced and whether one part or the whole domain of interest is fixed or increasing. A simulation experiment illustrates the theoretical results.
• Discussion of "Breakdown and groups" by P. L. Davies and U. Gather(math/0508499)

Aug. 25, 2005 math.ST, stat.TH
Discussion of Breakdown and groups'' by P. L. Davies and U. Gather [math.ST/0508497]