• ### The critical exponent: a novel graph invariant(1802.06976)

Feb. 20, 2018 math.CO, math.FA
A surprising result of FitzGerald and Horn (1977) shows that $A^{\circ \alpha} := (a_{ij}^\alpha)$ is positive semidefinite (p.s.d.) for every entrywise nonnegative $n \times n$ p.s.d. matrix $A = (a_{ij})$ if and only if $\alpha$ is a positive integer or $\alpha \geq n-2$. Given a graph $G$, we consider the refined problem of characterizing the set $\mathcal{H}_G$ of entrywise powers preserving positivity for matrices with a zero pattern encoded by $G$. Using algebraic and combinatorial methods, we study how the geometry of $G$ influences the set $\mathcal{H}_G$. Our treatment provides new and exciting connections between combinatorics and analysis, and leads us to introduce and compute a new graph invariant called the critical exponent.
• ### From Random Walks to Random Leaps: Generalizing Classic Markov Chains for Big Data Applications(1708.03116)

Aug. 10, 2017 math.PR
Simple random walks are a basic staple of the foundation of probability theory and form the building block of many useful and complex stochastic processes. In this paper we study a natural generalization of the random walk to a process in which the allowed step sizes take values in the set $\{\pm1,\pm2,\ldots,\pm k\}$, a process we call a random leap. The need to analyze such models arises naturally in modern-day data science and so-called "big data" applications. We provide closed-form expressions for quantities associated with first passage times and absorption events of random leaps. These expressions are formulated in terms of the roots of the characteristic polynomial of a certain recurrence relation associated with the transition probabilities. Our analysis shows that the expressions for absorption probabilities for the classical simple random walk are a special case of a universal result that is very elegant. We also consider an important variant of a random leap: the reflecting random leap. We demonstrate that the reflecting random leap exhibits more interesting behavior in regard to the existence of a stationary distribution and properties thereof. Questions relating to recurrence/transience are also addressed, as well as an application of the random leap.
• ### Preserving positivity for rank-constrained matrices(1406.0042)

June 7, 2017 math.CA, math.FA
Entrywise functions preserving the cone of positive semidefinite matrices have been studied by many authors, most notably by Schoenberg [Duke Math. J. 9, 1942] and Rudin [Duke Math. J. 26, 1959]. Following their work, it is well-known that entrywise functions preserving Loewner positivity in all dimensions are precisely the absolutely monotonic functions. However, there are strong theoretical and practical motivations to study functions preserving positivity in a fixed dimension $n$. Such characterizations for a fixed value of $n$ are difficult to obtain, and in fact are only known in the $2 \times 2$ case. In this paper, using a novel and intuitive approach, we study entrywise functions preserving positivity on distinguished submanifolds inside the cone obtained by imposing rank constraints. These rank constraints are prevalent in applications, and provide a natural way to relax the elusive original problem of preserving positivity in a fixed dimension. In our main result, we characterize entrywise functions mapping $n \times n$ positive semidefinite matrices of rank at most $l$ into positive semidefinite matrices of rank at most $k$ for $1 \leq l \leq n$ and $1 \leq k < n$. We also demonstrate how an important necessary condition for preserving positivity by Horn and Loewner [Trans. Amer. Math. Soc. 136, 1969] can be significantly generalized by adding rank constraints. Finally, our techniques allow us to obtain an elementary proof of the classical characterization of functions preserving positivity in all dimensions obtained by Schoenberg and Rudin.
• ### Scalable Bayesian shrinkage and uncertainty quantification for high-dimensional regression(1509.03697)

April 19, 2017 stat.ME
Bayesian shrinkage methods have generated a lot of recent interest as tools for high-dimensional regression and model selection. These methods naturally facilitate tractable uncertainty quantification and incorporation of prior information. This benefit has led to extensive use of the Bayesian shrinkage methods across diverse applications. A common feature of these models is that the corresponding priors on the regression coefficients can be expressed as scale mixture of normals. While the three-step Gibbs sampler used to sample from the often intractable associated posterior density has been shown to be geometrically ergodic for several of these models, it has been demonstrated recently that convergence of this sampler can still be quite slow in modern high-dimensional settings despite this apparent theoretical safeguard. We propose a new method to draw from the same posterior via a tractable two-step blocked Gibbs sampler. We demonstrate that our proposed two-step blocked sampler exhibits vastly superior convergence behavior compared to the original three-step sampler in high-dimensional regimes on both real and simulated data. We also provide a detailed theoretical underpinning to the new method in the context of the Bayesian lasso. First, we prove that the proposed two-step sampler is geometrically ergodic, and derive explicit upper bounds for the (geometric) rate of convergence. Furthermore, we demonstrate theoretically that while the original Bayesian lasso chain is not Hilbert-Schmidt, the proposed chain is trace class (and hence Hilbert-Schmidt). The trace class property implies that the corresponding Markov operator is compact, and its (countably many) eigenvalues are summable. It also facilitates a rigorous comparison of the two-step blocked chain with "sandwich" algorithms which aim to improve performance of the two-step chain by inserting an inexpensive extra step.
• ### Scalable Bayesian shrinkage and uncertainty quantification in high-dimensional regression(1703.09163)

April 14, 2017 stat.CO
Bayesian shrinkage methods have generated a lot of recent interest as tools for high-dimensional regression and model selection. These methods naturally facilitate tractable uncertainty quantification and incorporation of prior information. A common feature of these models, including the Bayesian lasso, global-local shrinkage priors, and spike-and-slab priors is that the corresponding priors on the regression coefficients can be expressed as scale mixture of normals. While the three-step Gibbs sampler used to sample from the often intractable associated posterior density has been shown to be geometrically ergodic for several of these models (Khare and Hobert, 2013; Pal and Khare, 2014), it has been demonstrated recently that convergence of this sampler can still be quite slow in modern high-dimensional settings despite this apparent theoretical safeguard. We propose a new method to draw from the same posterior via a tractable two-step blocked Gibbs sampler. We demonstrate that our proposed two-step blocked sampler exhibits vastly superior convergence behavior compared to the original three- step sampler in high-dimensional regimes on both real and simulated data. We also provide a detailed theoretical underpinning to the new method in the context of the Bayesian lasso. First, we derive explicit upper bounds for the (geometric) rate of convergence. Furthermore, we demonstrate theoretically that while the original Bayesian lasso chain is not Hilbert-Schmidt, the proposed chain is trace class (and hence Hilbert-Schmidt). The trace class property has useful theoretical and practical implications. It implies that the corresponding Markov operator is compact, and its eigenvalues are summable. It also facilitates a rigorous comparison of the two-step blocked chain with "sandwich" algorithms which aim to improve performance of the two-step chain by inserting an inexpensive extra step.
• ### Generalized Pseudolikelihood Methods for Inverse Covariance Estimation(1606.00033)

Oct. 14, 2016 stat.ME
We introduce PseudoNet, a new pseudolikelihood-based estimator of the inverse covariance matrix, that has a number of useful statistical and computational properties. We show, through detailed experiments with synthetic and also real-world finance as well as wind power data, that PseudoNet outperforms related methods in terms of estimation error and support recovery, making it well-suited for use in a downstream application, where obtaining low estimation error can be important. We also show, under regularity conditions, that PseudoNet is consistent. Our proof assumes the existence of accurate estimates of the diagonal entries of the underlying inverse covariance matrix; we additionally provide a two-step method to obtain these estimates, even in a high-dimensional setting, going beyond the proofs for related methods. Unlike other pseudolikelihood-based methods, we also show that PseudoNet does not saturate, i.e., in high dimensions, there is no hard limit on the number of nonzero entries in the PseudoNet estimate. We present a fast algorithm as well as screening rules that make computing the PseudoNet estimate over a range of tuning parameters tractable.
• ### The Hoffmann-Jorgensen inequality in metric semigroups(1610.02324)

Oct. 11, 2016 math.PR
We prove a refinement of the inequality by Hoffmann-Jorgensen that is significant for three reasons. First, our result improves on the state-of-the-art even for real-valued random variables. Second, the result unifies several versions in the Banach space literature, including those by Johnson and Schechtman [Ann. Prob. 17], Klass and Nowicki [Ann. Prob. 28], and Hitczenko and Montgomery-Smith [Ann. Prob. 29]. Finally, we show that the Hoffmann-Jorgensen inequality (including our generalized version) holds not only in Banach spaces but more generally, in the minimal mathematical framework required to state the inequality: a metric semigroup $\mathscr{G}$. This includes normed linear spaces as well as all compact, discrete, or abelian Lie groups.
• ### The Khinchin-Kahane inequality and Banach space embeddings for metric groups(1610.03037)

Oct. 10, 2016 math.PR, math.GR
We extend the Khinchin-Kahane inequality to an arbitrary abelian metric group $\mathscr{G}$. In the special case where $\mathscr{G}$ is normed, we prove a refinement which is sharp and which extends the sharp version for Banach spaces. We also provide an alternate proof for normed metric groups as a consequence of a general "transfer principle". This transfer principle has immediate applications to stochastic inequalities for $\mathscr{G}$-valued random variables. We also show how to use it to define the expectation of random variables with values in arbitrary abelian normed metric semigroups.
• ### A convex framework for high-dimensional sparse Cholesky based covariance estimation(1610.02436)

Oct. 7, 2016 stat.ME
Covariance estimation for high-dimensional datasets is a fundamental problem in modern day statistics with numerous applications. In these high dimensional datasets, the number of variables p is typically larger than the sample size n. A popular way of tackling this challenge is to induce sparsity in the covariance matrix, its inverse or a relevant transformation. In particular, methods inducing sparsity in the Cholesky pa- rameter of the inverse covariance matrix can be useful as they are guaranteed to give a positive definite estimate of the covariance matrix. Also, the estimated sparsity pattern corresponds to a Directed Acyclic Graph (DAG) model for Gaussian data. In recent years, two useful penalized likelihood methods for sparse estimation of this Cholesky parameter (with no restrictions on the sparsity pattern) have been developed. How- ever, these methods either consider a non-convex optimization problem which can lead to convergence issues and singular estimates of the covariance matrix when p > n, or achieve a convex formulation by placing a strict constraint on the conditional variance parameters. In this paper, we propose a new penalized likelihood method for sparse estimation of the inverse covariance Cholesky parameter that aims to overcome some of the shortcomings of current methods, but retains their respective strengths. We ob- tain a jointly convex formulation for our objective function, which leads to convergence guarantees, even when p > n. The approach always leads to a positive definite and symmetric estimator of the covariance matrix. We establish high-dimensional estima- tion and graph selection consistency, and also demonstrate finite sample performance on simulated/real data.
• ### Probability inequalities and tail estimates for metric semigroups(1506.02605)

Oct. 6, 2016 math.PR, math.FA, math.GR
We study probability inequalities leading to tail estimates in a general semigroup $\mathscr{G}$ with a translation-invariant metric $d_{\mathscr{G}}$. Using recent work [Ann. Prob., to appear] that extends the Hoffmann-Jorgensen inequality to all metric semigroups, we obtain tail estimates and approximate bounds for sums of independent semigroup-valued random variables, their moments, and decreasing rearrangements. In particular, we obtain the "correct" universal constants in several cases, extending results in the Banach space literature by Johnson--Schechtman--Zinn [Ann. Prob. 13], Hitczenko [Ann. Prob. 22], and Hitczenko and Montgomery-Smith [Ann. Prob. 29]. Our results also hold more generally, in the minimal mathematical framework required to state them: metric semigroups $\mathscr{G}$. This includes all compact, discrete, or abelian Lie groups.
• ### Two-stage Sampling, Prediction and Adaptive Regression via Correlation Screening (SPARCS)(1502.06189)

Oct. 2, 2016 cs.LG, stat.ML
This paper proposes a general adaptive procedure for budget-limited predictor design in high dimensions called two-stage Sampling, Prediction and Adaptive Regression via Correlation Screening (SPARCS). SPARCS can be applied to high dimensional prediction problems in experimental science, medicine, finance, and engineering, as illustrated by the following. Suppose one wishes to run a sequence of experiments to learn a sparse multivariate predictor of a dependent variable $Y$ (disease prognosis for instance) based on a $p$ dimensional set of independent variables $\mathbf X=[X_1,\ldots, X_p]^T$ (assayed biomarkers). Assume that the cost of acquiring the full set of variables $\mathbf X$ increases linearly in its dimension. SPARCS breaks the data collection into two stages in order to achieve an optimal tradeoff between sampling cost and predictor performance. In the first stage we collect a few ($n$) expensive samples $\{y_i,\mathbf x_i\}_{i=1}^n$, at the full dimension $p\gg n$ of $\mathbf X$, winnowing the number of variables down to a smaller dimension $l < p$ using a type of cross-correlation or regression coefficient screening. In the second stage we collect a larger number $(t-n)$ of cheaper samples of the $l$ variables that passed the screening of the first stage. At the second stage, a low dimensional predictor is constructed by solving the standard regression problem using all $t$ samples of the selected variables. SPARCS is an adaptive online algorithm that implements false positive control on the selected variables, is well suited to small sample sizes, and is scalable to high dimensions. We establish asymptotic bounds for the Familywise Error Rate (FWER), specify high dimensional convergence rates for support recovery, and establish optimal sample allocation rules to the first and second stages.
• ### Model-free consistency of graph partitioning(1608.03860)

Aug. 12, 2016 math.CO, math.ST, stat.TH, math.SP
In this paper, we exploit the theory of dense graph limits to provide a new framework to study the stability of graph partitioning methods, which we call structural consistency. Both stability under perturbation as well as asymptotic consistency (i.e., convergence with probability $1$ as the sample size goes to infinity under a fixed probability model) follow from our notion of structural consistency. By formulating structural consistency as a continuity result on the graphon space, we obtain robust results that are completely independent of the data generating mechanism. In particular, our results apply in settings where observations are not independent, thereby significantly generalizing the common probabilistic approach where data are assumed to be i.i.d. In order to make precise the notion of structural consistency of graph partitioning, we begin by extending the theory of graph limits to include vertex colored graphons. We then define continuous node-level statistics and prove that graph partitioning based on such statistics is consistent. Finally, we derive the structural consistency of commonly used clustering algorithms in a general model-free setting. These include clustering based on local graph statistics such as homomorphism densities, as well as the popular spectral clustering using the normalized Laplacian. We posit that proving the continuity of clustering algorithms in the graph limit topology can stand on its own as a more robust form of model-free consistency. We also believe that the mathematical framework developed in this paper goes beyond the study of clustering algorithms, and will guide the development of similar model-free frameworks to analyze other procedures in the broader mathematical sciences.
• ### Towards a sparse, scalable, and stably positive definite (inverse) covariance estimator(1502.00471)

June 27, 2016 stat.ME
High dimensional covariance estimation and graphical models is a contemporary topic in statistics and machine learning having widespread applications. An important line of research in this regard is to shrink the extreme spectrum of the covariance matrix estimators. A separate line of research in the literature has considered sparse inverse covariance estimation which in turn gives rise to graphical models. In practice, however, a sparse covariance or inverse covariance matrix which is simultaneously well-conditioned and at the same time computationally tractable is desired. There has been little research at the confluence of these three topics. In this paper we consider imposing a condition number constraint to various types of losses used in covariance and inverse covariance matrix estimation. When the loss function can be decomposed as a sum of an orthogonally invariant function of the estimate and its inner product with a function of the sample covariance matrix, we show that a solution path algorithm can be derived, involving a series of ordinary differential equations. The path algorithm is attractive because it provides the entire family of estimates for all possible values of the condition number bound, at the same computational cost of a single estimate with a fixed upper bound. An important finding is that the proximal operator for the condition number constraint, which turns out to be very useful in regularizing loss functions that are not orthogonally invariant and may yield non-positive-definite estimates, can be efficiently computed by this path algorithm. As a concrete illustration of its practical importance, we develop an operator-splitting algorithm that imposes a guarantee of well-conditioning as well as positive definiteness to recently proposed convex pseudo-likelihood based graphical model selection methods.
• ### Preserving positivity for matrices with sparsity constraints(1406.3408)

Jan. 8, 2016 math.CA, math.FA
Functions preserving Loewner positivity when applied entrywise to positive semidefinite matrices have been widely studied in the literature. Following the work of Schoenberg [Duke Math. J. 9], Rudin [Duke Math. J. 26], and others, it is well-known that functions preserving positivity for matrices of all dimensions are absolutely monotonic (i.e., analytic with nonnegative Taylor coefficients). In this paper, we study functions preserving positivity when applied entrywise to sparse matrices, with zeros encoded by a graph $G$ or a family of graphs $G_n$. Our results generalize Schoenberg and Rudin's results to a modern setting, where functions are often applied entrywise to sparse matrices in order to improve their properties (e.g. better conditioning). The only such result known in the literature is for the complete graph $K_2$. We provide the first such characterization result for a large family of non-complete graphs. Specifically, we characterize functions preserving Loewner positivity on matrices with zeros according to a tree. These functions are multiplicatively midpoint-convex and super-additive. Leveraging the underlying sparsity in matrices thus admits the use of functions which are not necessarily analytic nor absolutely monotonic. We further show that analytic functions preserving positivity on matrices with zeros according to trees can contain arbitrarily long sequences of negative coefficients, thus obviating the need for absolute monotonicity in a very strong sense. This result leads to the question of exactly when absolute monotonicity is necessary when preserving positivity for an arbitrary class of graphs. We then provide a stronger condition in terms of the numerical range of all symmetric matrices, such that functions satisfying this condition on matrices with zeros according to any family of graphs with unbounded degrees are necessarily absolutely monotonic.
• ### Critical exponents of graphs(1504.04069)

Nov. 17, 2015 math.CO, math.FA
The study of entrywise powers of matrices was originated by Loewner in the pursuit of the Bieberbach conjecture. Since the work of FitzGerald and Horn (1977), it is known that $A^{\circ \alpha} := (a_{ij}^\alpha)$ is positive semidefinite for every entrywise nonnegative $n \times n$ positive semidefinite matrix $A = (a_{ij})$ if and only if $\alpha$ is a positive integer or $\alpha \geq n-2$. This surprising result naturally extends the Schur product theorem, and demonstrates the existence of a sharp phase transition in preserving positivity. In this paper, we study when entrywise powers preserve positivity for matrices with structure of zeros encoded by graphs. To each graph is associated an invariant called its "critical exponent", beyond which every power preserves positivity. In our main result, we determine the critical exponents of all chordal/decomposable graphs, and relate them to the geometry of the underlying graphs. We then examine the critical exponent of important families of non-chordal graphs such as cycles and bipartite graphs. Surprisingly, large families of dense graphs have small critical exponents that do not depend on the number of vertices of the graphs.
• ### Extracting Common Time Trends from Concurrent Time Series: Maximum Autocorrelation Factors with Application to Tree Ring Time Series Data(1502.01073)

Oct. 17, 2015 stat.ME
Concurrent time series commonly arise in various applications, including when monitoring the environment such as in air quality measurement networks, weather stations, oceanographic buoys, or in paleo form such as lake sediments, tree rings, ice cores, or coral isotopes, with each monitoring or sampling site providing one of the time series. The goal in such applications is to extract a common time trend or signal in the observed data. Other examples where the goal is to extract a common time trend for multiple time series are in stock price time series, neurological time series, and quality control time series. For this purpose we develop properties of MAF [Maximum Autocorrelation Factors] that linearly combines time series in order to maximize the resulting SNR [signal-to-noise-ratio] where there are multiple smooth signals present in the data. Equivalence is established in a regression setting between MAF and CCA [Canonical Correlation Analysis] even though MAF does not require specific signal knowledge as opposed to CCA. We proceed to derive the theoretical properties of MAF and quantify the SNR advantages of MAF in comparison with PCA [Principal Components Analysis], a commonly used method for linearly combining time series, and compare their statistical sample properties. MAF and PCA are then applied to real and simulated data sets to illustrate MAFs efficacy.
• ### MCMC-Based Inference in the Era of Big Data: A Fundamental Analysis of the Convergence Complexity of High-Dimensional Chains(1508.00947)

Aug. 27, 2015 math.PR, math.ST, stat.TH, stat.ME
Markov chain Monte Carlo (MCMC) lies at the core of modern Bayesian methodology, much of which would be impossible without it. Thus, the convergence properties of MCMCs have received significant attention, and in particular, proving (geometric) ergodicity is of critical interest. Trust in the ability of MCMCs to sample from modern-day high-dimensional posteriors, however, has been limited by a widespread perception that these chains typically experience serious convergence problems. In this paper, we first demonstrate that contemporary methods for obtaining convergence rates have serious limitations when the dimension grows. We then propose a framework for rigorously establishing the convergence behavior of commonly used high-dimensional MCMCs. In particular, we demonstrate theoretically the precise nature and severity of the convergence problems of popular MCMCs when implemented in high dimensions, including phase transitions in the convergence rates in various $n$ and $p$ regimes, and a universality result across an entire spectrum of models. We also show that convergence problems effectively eliminate the apparent safeguard of geometric ergodicity. We then demonstrate theoretical principles by which MCMCs can be constructed and analyzed to yield bounded geometric convergence rates even as the dimension $p$ grows without bound. Additionally, we propose a diagnostic tool for establishing convergence.
• ### Integration and measures on the space of countable labelled graphs(1506.01439)

June 4, 2015 math.CO, math.CA
In this paper we develop a rigorous foundation for the study of integration and measures on the space $\mathscr{G}(V)$ of all graphs defined on a countable labelled vertex set $V$. We first study several interrelated $\sigma$-algebras and a large family of probability measures on graph space. We then focus on a "dyadic" Hamming distance function $\left\| \cdot \right\|_{\psi,2}$, which was very useful in the study of differentiation on $\mathscr{G}(V)$. The function $\left\| \cdot \right\|_{\psi,2}$ is shown to be a Haar measure-preserving bijection from the subset of infinite graphs to the circle (with the Haar/Lebesgue measure), thereby naturally identifying the two spaces. As a consequence, we establish a "change of variables" formula that enables the transfer of the Riemann-Lebesgue theory on $\mathbb{R}$ to graph space $\mathscr{G}(V)$. This also complements previous work in which a theory of Newton-Leibnitz differentiation was transferred from the real line to $\mathscr{G}(V)$ for countable $V$. Finally, we identify the Pontryagin dual of $\mathscr{G}(V)$, and characterize the positive definite functions on $\mathscr{G}(V)$.
• ### Statistical paleoclimate reconstructions via Markov random fields(1309.6702)

June 2, 2015 stat.AP
Understanding centennial scale climate variability requires data sets that are accurate, long, continuous and of broad spatial coverage. Since instrumental measurements are generally only available after 1850, temperature fields must be reconstructed using paleoclimate archives, known as proxies. Various climate field reconstructions (CFR) methods have been proposed to relate past temperature to such proxy networks. In this work, we propose a new CFR method, called GraphEM, based on Gaussian Markov random fields embedded within an EM algorithm. Gaussian Markov random fields provide a natural and flexible framework for modeling high-dimensional spatial fields. At the same time, they provide the parameter reduction necessary for obtaining precise and well-conditioned estimates of the covariance structure, even in the sample-starved setting common in paleoclimate applications. In this paper, we propose and compare the performance of different methods to estimate the graphical structure of climate fields, and demonstrate how the GraphEM algorithm can be used to reconstruct past climate variations. The performance of GraphEM is compared to the widely used CFR method RegEM with regularization via truncated total least squares, using synthetic data. Our results show that GraphEM can yield significant improvements, with uniform gains over space, and far better risk properties. We demonstrate that the spatial structure of temperature fields can be well estimated by graphs where each neighbor is only connected to a few geographically close neighbors, and that the increase in performance is directly related to recovering the underlying sparsity in the covariance of the spatial field. Our work demonstrates how significant improvements can be made in climate reconstruction methods by better modeling the covariance structure of the climate field.
• ### Foundational principles for large scale inference: Illustrations through correlation mining(1505.02475)

May 18, 2015 math.ST, stat.TH, stat.ML
When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number $n$ of acquired samples (statistical replicates) is far fewer than the number $p$ of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity however has received relatively less attention, especially in the setting when the sample size $n$ is fixed, and the dimension $p$ grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
• ### Bayesian inference for Gaussian graphical models beyond decomposable graphs(1505.00703)

May 4, 2015 stat.ME
Bayesian inference for graphical models has received much attention in the literature in recent years. It is well known that when the graph G is decomposable, Bayesian inference is significantly more tractable than in the general non-decomposable setting. Penalized likelihood inference on the other hand has made tremendous gains in the past few years in terms of scalability and tractability. Bayesian inference, however, has not had the same level of success, though a scalable Bayesian approach has its respective strengths, especially in terms of quantifying uncertainty. To address this gap, we propose a scalable and flexible novel Bayesian approach for estimation and model selection in Gaussian undirected graphical models. We first develop a class of generalized G-Wishart distributions with multiple shape parameters for an arbitrary underlying graph. This class contains the G-Wishart distribution as a special case. We then introduce the class of Generalized Bartlett (GB) graphs, and derive an efficient Gibbs sampling algorithm to obtain posterior draws from generalized G-Wishart distributions corresponding to a GB graph. The class of Generalized Bartlett graphs contains the class of decomposable graphs as a special case, but is substantially larger than the class of decomposable graphs. We proceed to derive theoretical properties of the proposed Gibbs sampler. We then demonstrate that the proposed Gibbs sampler is scalable to significantly higher dimensional problems as compared to using an accept-reject or a Metropolis-Hasting algorithm. Finally, we show the efficacy of the proposed approach on simulated and real data.
• ### High dimensional Bayesian inference for Gaussian directed acyclic graph models(1109.4371)

March 6, 2015 math.ST, stat.TH, stat.OT
In this paper, we consider Gaussian models Markov with respect to an arbitrary DAG. We first construct a family of conjugate priors for the Cholesky parametrization of the covariance matrix of such models. This family has as many shape parameters as the DAG has vertices, and naturally extends the work of Geiger and Heckerman [8]. From these distributions, we derive prior distributions for the covariance and precision parameters of the Gaussian DAG Markov models. Our works thus extends the work of Dawid and Lauritzen [5] and Letac and Massam [16] for Gaussian models Markov with respect to a decomposable graph to arbitrary DAGs. For this reason, we call our distributions DAG-Wishart distributions. An advantage of these distributions is that they possess strong hyper Markov properties and thus allow for explicit estimation of the covariance and precision parameters, regardless of the dimension of the problem. They also allow us to develop methodology for model selection and covariance estimation in the space of DAG-Markov models. We demonstrate via several numerical examples that the proposed method scales well to high-dimensions.
• ### Differential Calculus on Graphon Space(1403.3736)

Feb. 18, 2015 math.CO, math.FA
Recently, the theory of dense graph limits has received attention from multiple disciplines including graph theory, computer science, statistical physics, probability, statistics, and group theory. In this paper we initiate the study of the general structure of differentiable graphon parameters $F$. We derive consistency conditions among the higher G\^ateaux derivatives of $F$ when restricted to the subspace of edge weighted graphs $\mathcal{W}_{\bf p}$. Surprisingly, these constraints are rigid enough to imply that the multilinear functionals $\Lambda: \mathcal{W}_{\bf p}^n \to \mathbb{R}$ satisfying the constraints are determined by a finite set of constants indexed by isomorphism classes of multigraphs with $n$ edges and no isolated vertices. Using this structure theory, we explain the central role that homomorphism densities play in the analysis of graphons, by way of a new combinatorial interpretation of their derivatives. In particular, homomorphism densities serve as the monomials in a polynomial algebra that can be used to approximate differential graphon parameters as Taylor polynomials. These ideas are summarized by our main theorem, which asserts that homomorphism densities $t(H,-)$ where $H$ has at most $N$ edges form a basis for the space of smooth graphon parameters whose $(N+1)$st derivatives vanish. As a consequence of this theory, we also extend and derive new proofs of linear independence of multigraph homomorphism densities, and characterize homomorphism densities. In addition, we develop a theory of series expansions, including Taylor's theorem for graph parameters and a uniqueness principle for series. We use this theory to analyze questions raised by Lov\'asz, including studying infinite quantum algebras and the connection between right- and left-homomorphism densities.
• ### Graphical Markov models for infinitely many variables(1501.07878)

Jan. 30, 2015 math.CA, math.PR
Representing the conditional independences present in a multivariate random vector via graphs has found widespread use in applications, and such representations are popularly known as graphical models or Markov random fields. These models have many useful properties, but their fundamental attractive feature is their ability to reflect conditional independences between blocks of variables through graph separation, a consequence of the equivalence of the pairwise, local and global Markov properties demonstrated by Pearl and Paz (1985). Modern day applications often necessitate working with either an infinite collection of variables (such as in a spatial-temporal field) or approximating a large high-dimensional finite stochastic system with an infinite-dimensional system. However, it is unclear whether the conditional independences present in an infinite-dimensional random vector or stochastic process can still be represented by separation criteria in an infinite graph. In light of the advantages of using graphs as tools to represent stochastic relationships, we undertake in this paper a general study of infinite graphical models. First, we demonstrate that naive extensions of the assumptions required for the finite case results do not yield equivalence of the Markov properties in the infinite-dimensional setting, thus calling for a more in-depth analysis. To this end, we proceed to derive general conditions which do allow representing the conditional independence in an infinite-dimensional random system by means of graphs, and our results render the result of Pearl and Paz as a special case of a more general phenomenon. We conclude by demonstrating the applicability of our theory through concrete examples of infinite-dimensional graphical models.
• ### Convergence of cyclic coordinatewise l1 minimization(1404.5100)

Jan. 29, 2015 stat.CO, math.OC
We consider the general problem of minimizing an objective function which is the sum of a convex function (not strictly convex) and absolute values of a subset of variables (or equivalently the l1-norm of the variables). This problem appears exten- sively in modern statistical applications associated with high-dimensional data or "big data", and corresponds to optimizing l1-regularized likelihoods in the context of model selection. In such applications, cyclic coordinatewise minimization (CCM), where the objective function is sequentially minimized with respect to each individual coordi- nate, is often employed as it offers a computationally cheap and effective optimization method. Consequently, it is crucial to obtain theoretical guarantees of convergence for the sequence of iterates produced by the cyclic coordinatewise minimization in this setting. Moreover, as the objective corresponds to at l1-regularized likelihoods of many variables, it is important to obtain convergence of the iterates themselves, and not just the function values. Previous results in the literature only establish either, (i) that every limit point of the sequence of iterates is a stationary point of the objective function, or (ii) establish convergence under special assumptions, or (iii) establish con- vergence for a different minimization approach (which uses quadratic approximation based gradient descent followed by an inexact line search), (iv) establish convergence of only the function values of the sequence of iterates produced by random coordinatewise minimization (a variant of CCM). In this paper, a rigorous general proof of convergence for the cyclic coordinatewise minimization algorithm is provided. We demonstrate the usefulness of our general results in contemporary applications.