• Quadratic discriminant analysis (QDA) is a standard tool for classification due to its simplicity and flexibility. Because the number of its parameters scales quadratically with the number of variables, however, QDA is not practical when the dimensionality is relatively large. To address this, we propose a novel procedure named DA-QDA for QDA in analyzing high-dimensional data. Formulated in a simple and coherent framework, DA-QDA aims to directly estimate the key quantities in the Bayes discriminant function including quadratic interactions and a linear index of the variables for classification. Under appropriate sparsity assumptions, we establish consistency results for estimating the interactions and the linear index, and further demonstrate that the misclassification rate of our procedure converges to the optimal Bayes risk, even when the dimensionality is exponentially high with respect to the sample size. An efficient algorithm based on the alternating direction method of multipliers (ADMM) is developed for finding interactions, which is much faster than its competitor in the literature. The promising performance of DA-QDA is illustrated via extensive simulation studies and the analysis of four real datasets.
  • Privacy amplification is an indispensable step in the postprocessing of continuous-variable quantum key distribution (CV-QKD), which is used to distill unconditionally secure keys from the identical corrected keys shared between two distant legitimate parties. The processing speed of privacy amplification has a significant effect on the secret key rate of a CV-QKD system. We report a high-speed parallel implementation of a length-compatible privacy amplification algorithm based on a graphics processing unit (GPU). The length-compatible algorithm is used to satisfy the security requirement of privacy amplification at different transmission distances when the finite-size effect is considered. We achieve a privacy amplification speed of over 1 Gbps at arbitrary input length, one to two orders of magnitude faster than previous demonstrations, which supports high-speed real-time CV-QKD systems and ensures the security of privacy amplification.
  • The information reconciliation protocol has a significant effect on the secret key rate and maximal transmission distance of continuous-variable quantum key distribution (CV-QKD) systems. We propose an efficient rate-adaptive reconciliation protocol suitable for practical CV-QKD systems with a time-varying quantum channel. This protocol changes the code rate of multi-edge type low-density parity-check codes by puncturing (increasing the code rate) and shortening (decreasing the code rate) to enlarge the regime of correctable signal-to-noise ratios, and thus improves the overall reconciliation efficiency compared to the original fixed-rate reconciliation protocol. We verify our rate-adaptive reconciliation protocol with three typical code rates, i.e., 0.1, 0.05 and 0.02; the reconciliation efficiency stays around 93.5%, 95.4% and 96.4% for different signal-to-noise ratios, which shows the potential of implementing high-performance CV-QKD systems using a single code rate matrix.
  • We unify slice sampling and Hamiltonian Monte Carlo (HMC) sampling, demonstrating their connection via the Hamilton-Jacobi equation from Hamiltonian mechanics. This insight enables extension of HMC and slice sampling to a broader family of samplers, called Monomial Gamma Samplers (MGS). We provide a theoretical analysis of the mixing performance of such samplers, proving that in the limit of a single parameter, the MGS draws decorrelated samples from the desired target distribution. We further show that as this parameter tends toward this limit, performance gains are achieved at a cost of increasing numerical difficulty and some practical convergence issues. Our theoretical results are validated with synthetic data and real-world applications.
  • The continuous-variable version of quantum key distribution (QKD) offers the advantages (over discrete-variable systems) of higher secret key rates in metropolitan areas as well as the use of standard telecom components that can operate at room temperature. An important step in the real-world adoption of continuous-variable QKD is the deployment of field tests over commercial fibers. Here we report two different field tests of a continuous-variable QKD system through commercial fiber networks in Xi'an and Guangzhou over distances of 30.02 km (12.48 dB) and 49.85 km (11.62 dB), respectively. We achieve secure key rates two orders of magnitude higher than previous field test demonstrations. This is achieved by developing a fully automatic control system to create stable excess noise and by applying a rate-adaptive reconciliation protocol to achieve a high reconciliation efficiency with high success probability. Our results pave the way to achieving continuous-variable QKD in a metropolitan setting.
  • The amount of data moved over dedicated and non-dedicated network links grows much faster than network capacity, yet current solutions fail to guarantee even the promised achievable transfer throughputs. In this paper, we propose a novel dynamic throughput optimization model based on mathematical modeling with offline knowledge discovery/analysis and adaptive online decision making. In the offline analysis, we mine historical transfer logs to perform knowledge discovery about the transfer characteristics. The online phase uses the knowledge discovered in the offline analysis, along with real-time investigation of the network condition, to optimize the protocol parameters. As real-time investigation is expensive and provides only partial knowledge about the current network status, our model uses historical knowledge about the network and data to reduce the real-time investigation overhead while ensuring near-optimal throughput for each transfer. Our network- and data-agnostic solution was tested over different networks and achieved up to 93% accuracy compared with the optimal achievable throughput on those networks.
  • We study the impact of the finite-size effect on the continuous-variable measurement-device-independent quantum key distribution (CV-MDI QKD) protocol, mainly considering the finite-size effect on the parameter estimation procedure. The central limit theorem and the maximum likelihood estimation theorem are used to estimate the parameters. We also analyze the relationship between the number of exchanged signals and the optimal modulation variance in the protocol. It is proved that when Charlie's position is close to Bob, the CV-MDI QKD protocol has the farthest transmission distance in the finite-size scenario. Finally, we discuss the impact of finite-size effects related to the practical detection in the CV-MDI QKD protocol. The overall results indicate that the finite-size effect has a great influence on the secret key rate of the CV-MDI QKD protocol and should not be ignored.
  • Variational inference (VI) provides fast approximations of a Bayesian posterior in part because it formulates posterior approximation as an optimization problem: to find the closest distribution to the exact posterior over some family of distributions. For practical reasons, the family of distributions in VI is usually constrained so that it does not include the exact posterior, even as a limit point. Thus, no matter how long VI is run, the resulting approximation will not approach the exact posterior. We propose to instead consider a more flexible approximating family consisting of all possible finite mixtures of a parametric base distribution (e.g., Gaussian). For efficient inference, we borrow ideas from gradient boosting to develop an algorithm we call boosting variational inference (BVI). BVI iteratively improves the current approximation by mixing it with a new component from the base distribution family and thereby yields progressively more accurate posterior approximations as more computing time is spent. Unlike a number of common VI variants including mean-field VI, BVI is able to capture multimodality, general posterior covariance, and nonstandard posterior shapes.
  • We study an optimal control problem in which both the objective function and the dynamic constraint contain an uncertain parameter. Since the distribution of this uncertain parameter is not exactly known, the objective function is taken as the worst-case expectation over a set of possible distributions of the uncertain parameter. This ambiguity set of distributions is, in turn, defined by the first two moments of the random variables involved. The optimal control is found by minimizing the worst-case expectation over all possible distributions in this set. If the distributions are discrete, the stochastic min-max optimal control problem can be converted into a conventional optimal control problem via duality, which is then approximated as a finite-dimensional optimization problem via control parametrization. We derive necessary conditions of optimality and propose an algorithm to solve the approximate optimization problem. The results for discrete probability distributions are then extended to the case of a one-dimensional continuous stochastic variable by applying the control parametrization methodology to the continuous stochastic variable, and convergence results are derived. A numerical example is presented to illustrate the potential application of the proposed model and the effectiveness of the algorithm.
  • Ordinary least squares (OLS) is the default method for fitting linear models, but is not applicable for problems with dimensionality larger than the sample size. For these problems, we advocate the use of a generalized version of OLS motivated by ridge regression, and propose two novel three-step algorithms involving least squares fitting and hard thresholding. The algorithms are methodologically simple to understand intuitively, computationally easy to implement efficiently, and theoretically appealing for choosing models consistently. Numerical exercises comparing our methods with penalization-based approaches in simulations and data analyses illustrate the great potential of the proposed algorithms.
  • Recent years have seen the exponential growth of heterogeneous multimedia data. The need for effective and accurate data retrieval from heterogeneous data sources has attracted much research interest in cross-media retrieval. Here, given a query of any media type, cross-media retrieval seeks to find relevant results of different media types from heterogeneous data sources. To facilitate large-scale cross-media retrieval, we propose a novel unsupervised cross-media hashing method. Our method incorporates local affinity and distance repulsion constraints into a matrix factorization framework. Correspondingly, the proposed method learns hash functions that generate unified hash codes from different media types, while ensuring that the intrinsic geometric structure of the data distribution is preserved. These hash codes allow the similarity between data of different media types to be evaluated directly. Experimental results on two large-scale multimedia datasets demonstrate the effectiveness of the proposed method, where we outperform the state-of-the-art methods.
  • Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space). While the majority of the literature focuses on sample space partitioning, feature space partitioning is more effective when $p\gg n$. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this paper, we solve these problems through a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to $m$ distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number $m$. Extensive numerical experiments are provided to illustrate the performance of the new framework.
  • Photon subtraction can enhance the performance of continuous-variable quantum key distribution (CV QKD). However, the enhancement effect will be reduced by the imperfections of practical devices, especially the limited efficiency of a single-photon detector. In this paper, we propose a non-Gaussian postselection method to emulate the photon subtraction used in coherent-state CV QKD protocols. The virtual photon subtraction not only avoids the complexity and imperfections of a practical photon-subtraction operation, thereby extending the secure transmission distance as the ideal case does, but can also be adjusted flexibly according to the channel parameters to optimize the performance. Furthermore, our preliminary tests on the information reconciliation suggest that in the low signal-to-noise ratio regime, the performance of reconciling the postselected non-Gaussian data is better than that of the Gaussian data, which implies the feasibility of implementing this method practically.
  • The modern scale of data has brought new challenges to Bayesian inference. In particular, conventional MCMC algorithms are computationally very expensive for large data sets. A promising approach to solve this problem is embarrassingly parallel MCMC (EP-MCMC), which first partitions the data into multiple subsets and runs independent sampling algorithms on each subset. The subset posterior draws are then aggregated via some combining rules to obtain the final approximation. Existing EP-MCMC algorithms are limited by approximation accuracy and difficulty in resampling. In this article, we propose a new EP-MCMC algorithm PART that solves these problems. The new algorithm applies random partition trees to combine the subset posterior draws, which is distribution-free, easy to resample from and can adapt to multiple scales. We provide theoretical justification and extensive experiments illustrating empirical performance.
  • Variable screening is a fast dimension reduction technique for assisting high-dimensional feature selection. As a preselection method, it selects a moderate-size subset of candidate variables for further refining via feature selection to produce the final model. The performance of variable screening depends on both computational efficiency and the ability to dramatically reduce the number of variables without discarding the important ones. When the data dimension $p$ is substantially larger than the sample size $n$, variable screening becomes crucial because (1) faster feature selection algorithms are needed and (2) conditions guaranteeing selection consistency might fail to hold. This article studies a class of linear screening methods and establishes consistency theory for this special class. In particular, we prove the restricted diagonally dominant (RDD) condition is a necessary and sufficient condition for strong screening consistency. As concrete examples, we show two screening methods $SIS$ and $HOLP$ are both strong screening consistent (subject to additional constraints) with high probability if $n > O((\rho s + \sigma/\tau)^2\log p)$ under random designs. In addition, we relate the RDD condition to the irrepresentable condition, and highlight limitations of $SIS$.
  • Variable selection is a challenging issue in statistical applications when the number of predictors $p$ far exceeds the number of observations $n$. In this ultra-high dimensional setting, the sure independence screening (SIS) procedure was introduced to significantly reduce the dimensionality by preserving the true model with overwhelming probability, before a refined second stage analysis. However, the aforementioned sure screening property strongly relies on the assumption that the important variables in the model have large marginal correlations with the response, which rarely holds in reality. To overcome this, we propose a novel and simple screening technique called the high-dimensional ordinary least-squares projection (HOLP). We show that HOLP possesses the sure screening property, gives consistent variable selection without the strong correlation assumption, and has low computational complexity. A ridge-type HOLP procedure is also discussed. Simulation studies show that HOLP performs competitively compared to many other marginal correlation based methods. An application to a mammalian eye disease dataset illustrates the attractiveness of HOLP.
  • For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but challenges arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which attempts to solve these problems. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in both sample and feature size, and has theoretical guarantees. In particular, we show model selection consistency and coefficient estimation efficiency. Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to usual competitors.
  • With the rapidly growing scale of statistical problems, subset-based communication-free parallel MCMC methods are a promising direction for large-scale Bayesian analysis. In this article, we propose a new Weierstrass sampler for parallel MCMC based on independent subsets. The new sampler approximates the full data posterior samples by combining the posterior draws from independent subset MCMC chains, and thus enjoys higher computational efficiency. We show that the approximation error for the Weierstrass sampler is bounded by some tuning parameters and provide suggestions for choosing their values. Simulation studies show that the Weierstrass sampler is very competitive compared to other methods for combining MCMC chains generated for subsets, including averaging and kernel smoothing.
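
The four steps of the message algorithm described in the list above (per-subset feature selection, a median inclusion index, per-subset coefficient estimation, and averaging) can be illustrated with a rough NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `select_features` and `message_sketch` are hypothetical, the selector is a simple marginal-covariance top-k rule standing in for "Lasso or another method", the median inclusion index is read here as a coordinate-wise majority vote, and the subsets are processed in a sequential loop rather than on separate machines.

```python
import numpy as np


def select_features(X, y, k):
    """Hypothetical stand-in selector (not Lasso): keep the k features
    whose marginal covariance with y is largest in absolute value."""
    score = np.abs(X.T @ (y - y.mean()))
    chosen = np.zeros(X.shape[1], dtype=bool)
    chosen[np.argsort(score)[-k:]] = True
    return chosen


def message_sketch(X, y, n_subsets, k):
    """Sketch of median selection followed by subset aggregation."""
    # Split the samples (rows) into contiguous blocks; this assumes the
    # rows are already in random order.
    row_blocks = np.array_split(np.arange(X.shape[0]), n_subsets)

    # Step 1: feature selection on each subset (parallelizable).
    inclusion = np.array(
        [select_features(X[r], y[r], k) for r in row_blocks]
    )

    # Step 2: coordinate-wise median of the 0/1 inclusion vectors.
    # A feature is kept only if a strict majority of subsets chose it.
    keep = np.median(inclusion.astype(float), axis=0) > 0.5

    beta = np.zeros(X.shape[1])
    if not keep.any():
        return keep, beta

    # Step 3: least-squares fit on the kept features within each subset.
    coefs = [
        np.linalg.lstsq(X[r][:, keep], y[r], rcond=None)[0]
        for r in row_blocks
    ]

    # Step 4: average the per-subset coefficient estimates.
    beta[keep] = np.mean(coefs, axis=0)
    return keep, beta


# Usage on synthetic data: three active features out of fifty.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((2000, 50))
beta_true = np.zeros(50)
beta_true[:3] = [3.0, -2.0, 2.0]
y_demo = X_demo @ beta_true + 0.5 * rng.standard_normal(2000)

keep_demo, beta_demo = message_sketch(X_demo, y_demo, n_subsets=5, k=3)
print("selected features:", np.flatnonzero(keep_demo))
print("averaged coefficients:", beta_demo[keep_demo])
```

Only the inclusion vectors and the short coefficient vectors would need to be exchanged between workers in a distributed version, which is consistent with the low-communication property the abstract emphasizes.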