• The widespread use of generalized linear models in case-control genetic studies has helped identify many disease-associated risk factors typically defined as DNA variants, or single nucleotide polymorphisms (SNPs). Up to now, most literature has focused on selecting a unique best subset of SNPs based on some statistical perspectives. In the presence of pronounced noise, however, multiple biological paths are often found to be equally supported by a given dataset when dealing with complex genetic diseases. We address the ambiguity related to SNP selection by constructing a list of models called variable selection confidence set (VSCS), which contains the collection of all well-supported SNP combinations at a user-specified confidence level. The VSCS extends the familiar notion of confidence intervals in the variable selection setting and provides the practitioner with new tools aiding the variable selection activity beyond trusting a single model. Based on the VSCS, we consider natural graphical and numerical statistics measuring the inclusion importance of a SNP based on its frequency in the most parsimonious VSCS models. This work is motivated by available case-control genetic data on age-related macular degeneration, a widespread complex disease and leading cause of vision loss.
  • In this article, we introduce the concept of model confidence bounds (MCB) for variable selection in the context of nested models. Similarly to the endpoints in the familiar confidence interval for parameter estimation, the MCB identifies two nested models (upper and lower confidence bound models) containing the true model at a given level of confidence. Instead of trusting a single selected model obtained from a given model selection method, the MCB proposes a group of nested models as candidates and the MCB's width and composition enable the practitioner to assess the overall model selection uncertainty. A new graphical tool --- the model uncertainty curve (MUC) --- is introduced to visualize the variability of model selection and to compare different model selection procedures. The MCB methodology is implemented by a fast bootstrap algorithm that is shown to yield the correct asymptotic coverage under rather general conditions. Our Monte Carlo simulations and real data examples confirm the validity and illustrate the advantages of the proposed method.
  • Multicolor cell spatio-temporal image data have become important to investigate organ development and regeneration, malignant growth or immune responses by tracking different cell types both in vivo and in vitro. Statistical modeling of image data from common longitudinal cell experiments poses significant challenges due to the presence of complex spatio-temporal interactions between different cell types and difficulties related to measurement of single cell trajectories. Current analysis methods focus mainly on univariate cases, often not considering the spatio-temporal effects affecting cell growth between different cell populations. In this paper, we propose a conditional spatial autoregressive model to describe multivariate count cell data on the lattice, and develop inference tools. The proposed methodology is computationally tractable and enables researchers to estimate a complete statistical model of multicolor cell growth. Our methodology is applied to real experimental data where we investigate how interactions between cells affect their growth. We include two case studies; the first evaluates interactions between cancer cells and fibroblasts, which are normally present in the tumor microenvironment, whilst the second evaluates interactions between cloned cancer cells when grown in different combinations.
  • Recently, IBM has made available a 16-qubit quantum computer, denoted as IBM Q16. Previously, only a 5-qubit device, denoted as Q5, was available. Both IBM devices can be used to run quantum programs by means of a cloud-based platform. In this paper, we illustrate our experience with IBM Q16 in demonstrating entanglement-assisted invariance, also known as envariance, and parity learning by querying a uniform quantum example oracle. In particular, we illustrate the non-trivial strategy we have designed for compiling $n$-qubit quantum circuits ($n$ being an input parameter) on any IBM device, taking into account topological constraints.
  • The traditional activity of model selection aims at discovering a single model superior to other candidate models. In the presence of pronounced noise, however, multiple models are often found to explain the same data equally well. To resolve this model selection ambiguity, we introduce the general approach of model selection confidence sets (MSCSs) based on likelihood ratio testing. An MSCS is defined as a list of models statistically indistinguishable from the true model at a user-specified level of confidence, which extends the familiar notion of confidence intervals to the model-selection framework. Our approach guarantees asymptotically correct coverage probability of the true model when both sample size and model dimension increase. We derive conditions under which the MSCS contains all the relevant information about the true model structure. In addition, we propose natural statistics based on the MSCS to measure importance of variables in a principled way that accounts for the overall model uncertainty. When the space of feasible models is large, the MSCS is implemented by an adaptive stochastic search algorithm which samples MSCS models with high probability. The MSCS methodology is illustrated through numerical experiments on synthetic data and real data examples.
  • Growth in both size and complexity of modern data challenges the applicability of traditional likelihood-based inference. Composite likelihood (CL) methods address the difficulties related to model selection and computational intractability of the full likelihood by combining a number of low-dimensional likelihood objects into a single objective function used for inference. This paper introduces a procedure to combine partial likelihood objects from a large set of feasible candidates and simultaneously carry out parameter estimation. The new method constructs estimating equations balancing statistical efficiency and computing cost by minimizing an approximate distance from the full likelihood score subject to an L1-norm penalty representing the available computing resources. This results in truncated CL equations containing only the most informative partial likelihood score terms. An asymptotic theory within a framework where both sample size and data dimension grow is developed and finite-sample properties are illustrated through numerical examples.
  • Testing the association between a phenotype and many genetic variants from case-control data is essential in genome-wide association studies (GWAS). This is a challenging task as many such variants are correlated or non-informative. Similar challenges arise in testing the population difference between two groups of high-dimensional data with an intractable full likelihood function. Testing may be tackled by a maximum composite likelihood (MCL) approach that does not entail the full likelihood, but current MCL tests lose power because they involve non-informative or redundant sub-likelihoods. In this paper, we develop a forward search and test method that simultaneously provides powerful group difference testing and composes informative sub-likelihoods. Our method constructs a sequence of Wald-type test statistics by progressively including only informative sub-likelihoods so as to improve the test power under local sparsity alternatives. Numerical studies show that it achieves considerable improvement over the available tests as the modeling complexity grows. Our method is further validated by testing the motivating GWAS data on breast cancer with interesting results obtained.
  • Composite likelihood estimation has an important role in the analysis of multivariate data for which the full likelihood function is intractable. An important issue in composite likelihood inference is the choice of the weights associated with lower-dimensional data sub-sets, since the presence of incompatible sub-models can deteriorate the accuracy of the resulting estimator. In this paper, we introduce a new approach, called discriminative composite likelihood estimation (D-McLE), for simultaneous parameter estimation by tilting, or re-weighting, each sub-likelihood component. The data-adaptive weights maximize the composite likelihood function, subject to moving a given distance from uniform weights; then, the resulting weights can be used to rank lower-dimensional likelihoods in terms of their influence in the composite likelihood function. Our analytical findings and numerical examples support the stability of the resulting estimator compared to estimators constructed using standard composition strategies based on uniform weights. The properties of the new method are illustrated through simulated data and real spatial data on multivariate precipitation extremes.
  • The traditional maximum likelihood estimator (MLE) is often of limited use in complex high-dimensional data due to the intractability of the underlying likelihood function. Maximum composite likelihood estimation (McLE) avoids full likelihood specification by combining a number of partial likelihood objects depending on small data subsets, thus enabling inference for complex data. A fundamental difficulty in making the McLE approach practicable is the selection from numerous candidate likelihood objects for constructing the composite likelihood function. In this paper, we propose a flexible Gibbs sampling scheme for optimal selection of sub-likelihood components. The sampled composite likelihood functions are shown to converge to the one maximally informative on the unknown parameters in equilibrium, since sub-likelihood objects are chosen with probability depending on the variance of the corresponding McLE. A penalized version of our method generates sparse likelihoods with a relatively small number of components when the data complexity is intense. Our algorithms are illustrated through numerical examples on simulated data as well as real genotype SNP data from a case-control study.
  • In this paper, the maximum L$q$-likelihood estimator (ML$q$E), a new parameter estimator based on nonextensive entropy [Kibernetika 3 (1967) 30--35] is introduced. The properties of the ML$q$E are studied via asymptotic analysis and computer simulations. The behavior of the ML$q$E is characterized by the degree of distortion $q$ applied to the assumed model. When $q$ is properly chosen for small and moderate sample sizes, the ML$q$E can successfully trade bias for precision, resulting in a substantial reduction of the mean squared error. When the sample size is large and $q$ tends to 1, a necessary and sufficient condition to ensure a proper asymptotic normality and efficiency of ML$q$E is established.
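The role of the distortion parameter $q$ in the ML$q$E abstract above can be made concrete with a small sketch. It assumes the standard definition of the L$q$ function, $L_q(u) = (u^{1-q} - 1)/(1-q)$ for $q \neq 1$ and $L_q(u) = \log u$ for $q = 1$, and an exponential model with unknown rate. The model, the function names, the toy sample, and the fixed-point reweighting iteration are all illustrative assumptions, not the procedure used in the paper.

```python
import math


def lq(u, q):
    """Lq function: log(u) when q == 1, else (u**(1-q) - 1) / (1-q)."""
    if q == 1.0:
        return math.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)


def mlqe_exponential_rate(data, q, max_iter=500, tol=1e-12):
    """Hypothetical helper: solve the Lq-likelihood estimating equation
    for the rate of an exponential density f(x; rate) = rate * exp(-rate * x).

    Differentiating sum_i Lq(f(x_i; rate)) gives
        sum_i w_i * (1 / rate - x_i) = 0,   w_i = f(x_i; rate) ** (1 - q),
    so a fixed point satisfies rate = sum(w) / sum(w * x).
    Convergence of this iteration is assumed here, not guaranteed.
    """
    rate = len(data) / sum(data)  # start from the ordinary MLE
    for _ in range(max_iter):
        weights = [(rate * math.exp(-rate * x)) ** (1.0 - q) for x in data]
        new_rate = sum(weights) / sum(w * x for w, x in zip(weights, data))
        if abs(new_rate - rate) < tol:
            return new_rate
        rate = new_rate
    return rate


# Made-up toy sample, for illustration only.
sample = [0.2, 0.5, 0.9, 1.4, 3.1]
mle = len(sample) / sum(sample)
print("MLE (q = 1):", mlqe_exponential_rate(sample, q=1.0), "closed form:", mle)
print("MLqE (q = 0.9):", mlqe_exponential_rate(sample, q=0.9))
```

With $q = 1$ all weights equal one and the iteration returns the usual maximum likelihood estimate. With $q < 1$ observations with low density under the current fit are down-weighted, which is one way to read the bias-for-precision trade described in the abstract.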