• ### A Projection Based Conditional Dependence Measure with Applications to High-dimensional Undirected Graphical Models(1501.01617)

Jan. 11, 2019 math.ST, stat.TH, stat.ME, stat.AP, stat.ML
Measuring conditional dependence is an important topic in statistics with broad applications including graphical models. Under a factor model setting, a new conditional dependence measure based on projection is proposed. The corresponding conditional independence test is developed with the asymptotic null distribution unveiled where the number of factors could be high-dimensional. It is also shown that the new test has control over the asymptotic significance level and can be calculated efficiently. A generic method for building dependency graphs without Gaussian assumption using the new test is elaborated. Numerical results and real data analysis show the superiority of the new method.
• ### Refining Source Representations with Relation Networks for Neural Machine Translation(1709.03980)

May 25, 2018 cs.AI, cs.CL, cs.LG
Although neural machine translation (NMT) with the encoder-decoder framework has achieved great success in recent times, it still suffers from some drawbacks: RNNs tend to forget old information that is often useful, and the encoder only operates on words without considering the relationships between them. To solve these problems, we introduce a relation network (RN) into NMT to refine the encoding representations of the source. In our method, the RN first augments the representation of each source word with its neighbors and reasons over all possible pairwise relations between them. Then the source representations and all the relations are fed to the attention module and the decoder together, keeping the main encoder-decoder architecture unchanged. Experiments on two Chinese-to-English data sets of different scales both show that our method can significantly outperform the competitive baselines.
• ### Information-Propogation-Enhanced Neural Machine Translation by Relation Model(1709.01766)

May 25, 2018 cs.CL
Although sequence-to-sequence neural machine translation (NMT) models have achieved state-of-the-art performance in recent years, it is a widespread concern that recurrent neural network (RNN) units struggle to capture long-distance state information, which means an RNN can hardly find features with long-term dependencies as the sequence becomes longer. Similarly, convolutional neural networks (CNNs) have recently been introduced into NMT for speed; however, CNNs focus on capturing local features of the sequence. To relieve this issue, we incorporate a relation network into the standard encoder-decoder framework to enhance information propagation in the neural network, ensuring that the information of the source sentence can flow into the decoder adequately. Experiments show that the proposed framework significantly outperforms the statistical MT model and the state-of-the-art NMT model on two data sets of different scales.
• ### Partial Distance Correlation Screening for High Dimensional Time Series(1802.09116)

April 13, 2018 stat.ME
High dimensional time series datasets are becoming increasingly common in various fields such as economics, finance, meteorology, and neuroscience. Given this ubiquity of time series data, it is surprising that very few works on variable screening discuss the time series setting, and even fewer works have developed methods which utilize the unique features of time series data. This paper introduces several model free screening methods based on the partial distance correlation and developed specifically to deal with time dependent data. Methods are developed both for univariate models, such as nonlinear autoregressive models with exogenous predictors (NARX), and multivariate models such as linear or nonlinear VAR models. Sure screening properties are proved for our methods, which depend on the moment conditions, and the strength of dependence in the response and covariate processes, amongst other factors. Dependence is quantified by functional dependence measures (Wu [Proc. Natl. Acad. Sci. USA 102 (2005) 14150-14154]) and $\beta$-mixing coefficients, and the results rely on the use of Nagaev and Rosenthal type inequalities for dependent random variables. Finite sample performance of our methods is shown through extensive simulation studies, and we include an application to macroeconomic forecasting.
• ### Large-Scale Model Selection with Misspecification(1803.07418)

Model selection is crucial to high-dimensional learning and inference for contemporary big data applications in pinpointing the best set of covariates among a sequence of candidate interpretable models. Most existing work assumes implicitly that the models are correctly specified or have fixed dimensionality. Yet both features of model misspecification and high dimensionality are prevalent in practice. In this paper, we exploit the framework of model selection principles in misspecified models originated in Lv and Liu (2014) and investigate the asymptotic expansion of Bayesian principle of model selection in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates Kullback-Leibler divergence, we suggest the high-dimensional generalized Bayesian information criterion with prior probability (HGBIC_p) for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of HGBIC_p in ultra-high dimensions under some mild regularity conditions. The advantages of our new method are supported by numerical studies.
• ### Sparse Linear Discriminant Analysis under the Neyman-Pearson Paradigm(1802.02557)

Feb. 7, 2018 math.ST, stat.TH, stat.ME, stat.ML
In contrast to the classical binary classification paradigm that minimizes the overall classification error, the Neyman-Pearson (NP) paradigm seeks classifiers with a minimal type II error while having a constrained type I error under a user-specified level, addressing asymmetric type I/II error priorities. In this work, we present NP-sLDA, a new binary NP classifier that explicitly takes into account feature dependency under high-dimensional NP settings. This method adapts the popular sparse linear discriminant analysis (sLDA, Mai et al. (2012)) to the NP paradigm. We borrow the threshold determination method from the umbrella algorithm in Tong et al. (2017). On the theoretical front, we formulate a new conditional margin assumption and a new conditional detection condition to accommodate unbounded feature support, and show that NP-sLDA satisfies the NP oracle inequalities, which are natural NP paradigm counterparts of the oracle inequalities in classical classification. Numerical results show that NP-sLDA is a valuable addition to existing NP classifiers. We also suggest a general data-adaptive sample splitting scheme that, in many scenarios, improves the classification performance upon the default half-half class $0$ split used in Tong et al. (2017), and this new splitting scheme has been incorporated into a new version of the R package nproc.
• ### The restricted consistency property of leave-$n_v$-out cross-validation for high-dimensional variable selection(1308.5390)

Jan. 16, 2018 stat.ME
Cross-validation (CV) methods are popular for selecting the tuning parameter in the high-dimensional variable selection problem. We show that the misalignment of CV is one possible reason for its over-selection behavior. To fix this issue, we propose a version of leave-$n_v$-out cross-validation (CV($n_v$)) for selecting the optimal model among the restricted candidate model set for high-dimensional generalized linear models. By using the same candidate model sequence and a proper order of construction sample size $n_c$ in each CV split, CV($n_v$) avoids the potential hurdles in developing theoretical properties. CV($n_v$) is shown to enjoy the restricted model selection consistency property under mild conditions. Extensive simulations and real data analysis support the theoretical results and demonstrate the performance of CV($n_v$) in terms of both model selection and prediction.
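
The splitting scheme behind leave-$n_v$-out cross-validation can be sketched as follows. This is only a toy illustration, not the paper's procedure: the candidate models (`fit_mean`, `fit_line`), the squared-error loss, the number of random splits, and the function name `cv_nv_select` are all assumptions made for the example, whereas the paper works with a restricted candidate sequence for high-dimensional generalized linear models. The key point shown is that every split fits each candidate on $n_c$ construction points and scores it on the remaining $n_v = n - n_c$ validation points, using the same candidate list in every split.

```python
import random


def fit_mean(xs, ys):
    """Toy candidate 0: intercept-only model."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Toy candidate 1: simple least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def cv_nv_select(xs, ys, fitters, n_c, n_splits=20, seed=0):
    """Return the index of the candidate with the smallest average
    validation error over random splits with n_c construction points
    and n_v = n - n_c validation points."""
    rng = random.Random(seed)
    n = len(xs)
    totals = [0.0] * len(fitters)
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        train, valid = idx[:n_c], idx[n_c:]
        for m, fit in enumerate(fitters):
            predict = fit([xs[i] for i in train], [ys[i] for i in train])
            sq_err = sum((ys[i] - predict(xs[i])) ** 2 for i in valid)
            totals[m] += sq_err / len(valid)
    return min(range(len(fitters)), key=totals.__getitem__)


# Hypothetical noiseless data generated from a line.
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 for x in xs]
best = cv_nv_select(xs, ys, [fit_mean, fit_line], n_c=10)
```

Here `n_c=10` leaves $n_v = 30$ points for validation in each split; how $n_c$ should scale with $n$ is the subject of the paper and is not encoded in this sketch.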
• ### A note on estimation in a simple probit model under dependency(1712.09694)

We consider a probit model without covariates, in which the latent Gaussian variables have a compound symmetry covariance structure with a single parameter characterizing the common correlation. We study the parameter estimation problem under such one-parameter probit models. As a surprise, we demonstrate that the likelihood function does not yield consistent estimates for the correlation. We then formally prove the parameter's nonestimability by deriving a non-vanishing minimax lower bound. This counter-intuitive phenomenon provides an interesting insight that one-bit information of the latent Gaussian variables is not sufficient to consistently recover their correlation. On the other hand, we further show that trinary data generated from the Gaussian variables can consistently estimate the correlation with parametric convergence rate. Hence we reveal a phase transition phenomenon regarding the discretization of latent Gaussian variables while preserving the estimability of the correlation.
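
A heuristic for why binary data may fail to pin down the correlation can be written out using the one-factor representation of equicorrelated Gaussians. The setup below is a sketch under assumptions not stated in the abstract: a nonnegative common correlation $\rho \in [0,1)$, unit variances, and a threshold at zero.

$$
Z_i = \sqrt{\rho}\,W + \sqrt{1-\rho}\,\varepsilon_i, \qquad Y_i = \mathbf{1}\{Z_i > 0\}, \qquad W, \varepsilon_1, \varepsilon_2, \ldots \overset{\text{iid}}{\sim} N(0,1),
$$

so that $\mathrm{Var}(Z_i) = 1$ and $\mathrm{Cov}(Z_i, Z_j) = \rho$ for $i \neq j$. Conditional on $W$, the $Y_i$ are independent Bernoulli variables with success probability $\Phi\big(\sqrt{\rho/(1-\rho)}\,W\big)$, and therefore

$$
\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i \;\longrightarrow\; \Phi\!\left(\sqrt{\frac{\rho}{1-\rho}}\;W\right) \quad \text{almost surely as } n \to \infty.
$$

The limit is random: it depends on a single unobserved draw of $W$, so additional binary observations sharpen knowledge of that one value without providing further draws of $W$. This is intuition only; the formal statement in the abstract is the non-vanishing minimax lower bound.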
• ### Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristics (NP-ROC)(1608.03109)

Sept. 27, 2017 stat.ME
In many binary classification applications such as disease diagnosis and spam detection, practitioners often face great needs to control type I errors (i.e., the conditional probability of misclassifying a class 0 observation as class 1) so that it remains below a desired threshold. To address this need, the Neyman-Pearson (NP) classification paradigm is a natural choice; it minimizes type II error (i.e., the conditional probability of misclassifying a class 1 observation as class 0) while enforcing an upper bound, $\alpha$, on the type I error. Although the NP paradigm has a century-long history in hypothesis testing, it has not been well recognized and implemented in classification schemes. Common practices that directly limit the empirical type I error to no more than $\alpha$ do not satisfy the type I error control objective because the resulting classifiers are still likely to have type I errors much larger than $\alpha$. As a result, the NP paradigm has not been properly implemented for many classification scenarios in practice. In this work, we develop the first umbrella algorithm that implements the NP paradigm for all scoring-type classification methods, including popular methods such as logistic regression, support vector machines and random forests. Powered by this umbrella algorithm, we propose a novel graphical tool for NP classification methods: NP receiver operating characteristic (NP-ROC) bands, motivated by the popular receiver operating characteristic (ROC) curves. NP-ROC bands will help choose $\alpha$ in a data adaptive way and compare different NP classifiers. We demonstrate the use and properties of the NP umbrella algorithm and NP-ROC bands, available in the R package nproc, through simulation and real data case studies.
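
The threshold rule at the core of an umbrella-style algorithm can be sketched as follows. The abstract does not spell out the rule, so the details here are assumptions for illustration: the threshold is taken as the $k$-th smallest classification score among $n$ held-out class 0 observations, scores are treated as continuous and i.i.d., and `delta` is a hypothetical user-chosen tolerance on the probability that the type I error exceeds $\alpha$. Under those assumptions the chance of exceeding $\alpha$ is bounded by a binomial tail, and the smallest rank meeting the tolerance is chosen. The function names are made up for this example and are not the `nproc` API.

```python
from math import comb


def violation_bound(k, n, alpha):
    """Upper bound on P(type I error > alpha) when the threshold is the
    k-th smallest of n i.i.d. class 0 scores (continuous scores assumed)."""
    return sum(
        comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
        for j in range(k, n + 1)
    )


def np_threshold_rank(n, alpha, delta):
    """Smallest rank k whose violation bound is at most delta,
    or None when n is too small for any rank to qualify."""
    for k in range(1, n + 1):
        if violation_bound(k, n, alpha) <= delta:
            return k
    return None


rank_59 = np_threshold_rank(59, alpha=0.05, delta=0.05)
rank_58 = np_threshold_rank(58, alpha=0.05, delta=0.05)
```

With these inputs, 59 held-out class 0 observations are just enough (the largest score becomes the threshold), while 58 are not, which illustrates why simply capping the empirical type I error at $\alpha$ gives no such guarantee.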
• ### Regularization after retention in ultrahigh dimensional linear regression models(1311.5625)

Aug. 10, 2017 stat.ME
In the ultrahigh dimensional setting, independence screening has been proved, both theoretically and empirically, to be a useful variable selection framework with low computation cost. In this work, we propose a two-step framework that uses marginal information from a different perspective than independence screening. In particular, we retain significant variables rather than screening out irrelevant ones. The new method is shown to be model selection consistent in the ultrahigh dimensional linear regression model. To improve the finite sample performance, we then introduce a three-step version and characterize its asymptotic behavior. Simulations and real data analysis show advantages of our method over independence screening and its iterative variants in certain regimes.
• ### Memory-augmented Neural Machine Translation(1708.02005)

Aug. 7, 2017 cs.CL
Neural machine translation (NMT) has achieved notable success in recent times; however, it is also widely recognized that this approach has limitations in handling infrequent words and word pairs. This paper presents a novel memory-augmented NMT (M-NMT) architecture, which stores knowledge about how words (usually infrequently encountered ones) should be translated in a memory and then utilizes it to assist the neural model. We use this memory mechanism to combine the knowledge learned from a conventional statistical machine translation system and the rules learned by an NMT system, and also propose a solution for out-of-vocabulary (OOV) words based on this framework. Our experiments on two Chinese-English translation tasks demonstrated that the M-NMT architecture outperformed the NMT baseline by $9.0$ and $2.7$ BLEU points on the two tasks, respectively. Additionally, we found this architecture resulted in a much more effective OOV treatment compared to competitive methods.
• ### Magnetic quantum phase transition in Cr-doped Bi2(SexTe1-x)3 driven by the Stark effect(1706.03506)

The interplay between magnetism and topology, as exemplified in magnetic skyrmion systems, has emerged as a rich playground for finding novel quantum phenomena and applications in future information technology. Magnetic topological insulators (TI) have attracted much recent attention, especially after the experimental realization of the quantum anomalous Hall effect. Future applications of magnetic TI hinge on the accurate manipulation of magnetism and topology by external perturbations, preferably with a gate electric field. In this work, we investigate the magnetotransport properties of Cr-doped Bi2(SexTe1-x)3 TI across the topological quantum critical point (QCP). We find that the external gate voltage has a negligible effect on the magnetic order for samples far away from the topological QCP. But for the sample near the QCP, we observe a ferromagnetic (FM) to paramagnetic (PM) phase transition driven by the gate electric field. Theoretical calculations show that a perpendicular electric field causes a shift of electronic energy levels due to the Stark effect, which induces a topological quantum phase transition and consequently a magnetic phase transition. The in situ electrical control of the topological and magnetic properties of TI sheds important new light on future topological electronic or spintronic device applications.
• ### Collaborative Learning for Language and Speaker Recognition(1609.08442)

May 23, 2017 cs.CL, cs.SD
This paper presents a unified model that performs language and speaker recognition simultaneously. The model is based on a multi-task recurrent neural network where the output of one task is fed as the input of the other, leading to a collaborative learning framework in which the two tasks improve by borrowing information from each other. Our experiments demonstrated that the multi-task model outperforms the task-specific models on both tasks.
• ### Flexible and Creative Chinese Poetry Generation Using Neural Memory(1705.03773)

May 10, 2017 cs.AI, cs.CL
It has been shown that Chinese poems can be successfully generated by sequence-to-sequence neural models, particularly with the attention mechanism. A potential problem of this approach, however, is that neural models can only learn abstract rules, while poem generation is a highly creative process that involves not only rules but also innovations for which pure statistical models are not appropriate in principle. This work proposes a memory-augmented neural model for Chinese poem generation, where the neural model and the augmented memory work together to balance the requirements of linguistic accordance and aesthetic innovation, leading to innovative generations that are still rule-compliant. In addition, it is found that the memory mechanism provides interesting flexibility that can be used to generate poems with different styles.
• ### Memory Visualization for Gated Recurrent Neural Networks in Speech Recognition(1609.08789)

Feb. 27, 2017 cs.NE, cs.CL, cs.LG
Recurrent neural networks (RNNs) have shown clear superiority in sequence modeling, particularly the ones with gated units, such as long short-term memory (LSTM) and gated recurrent unit (GRU). However, the dynamic properties behind the remarkable performance remain unclear in many applications, e.g., automatic speech recognition (ASR). This paper employs visualization techniques to study the behavior of LSTM and GRU when performing speech recognition tasks. Our experiments show some interesting patterns in the gated memory, and some of them have inspired simple yet effective modifications to the network structure. We report two such modifications: (1) lazy cell update in LSTM, and (2) shortcut connections for residual learning. Both modifications lead to more comprehensible and powerful networks.
• ### Community detection with nodal information(1610.09735)

Dec. 10, 2016 math.ST, stat.TH, stat.ME
Community detection is one of the fundamental problems in the study of network data. Most existing community detection approaches only consider edge information as inputs, and the output could be suboptimal when nodal information is available. In such cases, it is desirable to leverage nodal information for the improvement of community detection accuracy. Towards this goal, we propose a flexible network model incorporating nodal information, and develop likelihood-based inference methods. For the proposed methods, we establish favorable asymptotic properties as well as efficient algorithms for computation. Numerical experiments show the effectiveness of our methods in utilizing nodal information across a variety of simulated and real network data sets.
• ### Tactics and Tallies: Inferring Voter Preferences in the 2016 U.S. Presidential Primaries Using Sparse Learning(1611.03168)

Nov. 10, 2016 cs.SI
In this paper, we propose a web-centered framework to infer voter preferences for the 2016 U.S. presidential primaries. Using Twitter data collected from Sept. 2015 to March 2016, we first uncover the tweeting tactics of the candidates and then exploit the variations in the number of 'likes' to infer voters' preference. With sparse learning, we are able to reveal neutral topics as well as positive and negative ones. Methodologically, we are able to achieve a higher predictive power with sparse learning. Substantively, we show that for Hillary Clinton the (only) positive issue area is women's rights. We demonstrate that Hillary Clinton's tactic of linking herself to President Obama resonates well with her supporters but the same is not true for Bernie Sanders. In addition, we show that Donald Trump is a major topic for all the other candidates, and that the women's rights issue is equally emphasized in Sanders' campaign as in Clinton's.
• ### Gender Politics in the 2016 U.S. Presidential Election: A Computer Vision Approach(1611.02806)

Nov. 9, 2016 cs.SI
Gender is playing an important role in the 2016 U.S. presidential election, especially with Hillary Clinton becoming the first female presidential nominee and Donald Trump being frequently accused of sexism. In this paper, we introduce computer vision to the study of gender politics and present an image-driven method that can measure the effects of gender in an accurate and timely manner. We first collect all the profile images of the candidates' Twitter followers. Then we train a convolutional neural network using images that contain gender labels. Lastly, we classify all the follower and unfollower images. Through two case studies, one on the 'woman card' controversy and one on Sanders followers, we demonstrate how gender is informing the 2016 presidential election. Our framework of analysis can be readily generalized to other case studies and elections.
• ### Do They All Look the Same? Deciphering Chinese, Japanese and Koreans by Fine-Grained Deep Learning(1610.01854)

Oct. 23, 2016 cs.CV
We study to what extent Chinese, Japanese and Korean faces can be classified and which facial attributes offer the most important cues. First, we propose a novel way of obtaining large numbers of facial images with nationality labels. Then we train state-of-the-art neural networks with these labeled images. We are able to achieve an accuracy of 75.03% in the classification task, with chance being 33.33% and human accuracy 38.89%. Further, we train multiple facial attribute classifiers to identify the most distinctive features for each group. We find that Chinese, Japanese and Koreans do exhibit substantial differences in certain attributes, such as bangs, smiling, and bushy eyebrows. Along the way, we uncover several gender-related cross-country patterns as well. Our work, which complements existing APIs such as Microsoft Cognitive Services and Face++, could find potential applications in tourism, e-commerce, social media marketing, criminal justice and even counter-terrorism.
• ### Model Selection for High Dimensional Quadratic Regression via Regularization(1501.00049)

July 14, 2016 math.ST, stat.TH, stat.ME
Quadratic regression (QR) models naturally extend linear models by considering interaction effects between the covariates. To conduct model selection in QR, it is important to maintain the hierarchical model structure between main effects and interaction effects. Existing regularization methods generally achieve this goal by solving complex optimization problems, which usually demand high computational cost and hence are not feasible for high dimensional data. This paper focuses on scalable regularization methods for model selection in high dimensional QR. We first consider two-stage regularization methods and establish theoretical properties of the two-stage LASSO. Then, a new regularization method, called Regularization Algorithm under Marginality Principle (RAMP), is proposed to compute a hierarchy-preserving regularization solution path efficiently. Both methods are further extended to solve generalized QR models. Numerical results are also shown to demonstrate the performance of the methods.
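
One way to picture a hierarchy-preserving two-stage approach is that the second stage only considers interaction terms built from main effects kept in the first stage. The sketch below assumes a strict version of that rule (both parent main effects must be selected, and squared terms count as interactions), which is an assumption of this example and not necessarily the exact rule used in the paper. The function name `interaction_candidates` is made up for illustration, and the first-stage selection itself (e.g. a LASSO fit) is taken as given.

```python
from itertools import combinations_with_replacement


def interaction_candidates(selected_main):
    """Pairs (j, k) with j <= k whose parent main effects were both
    selected in stage one; (j, j) stands for the squared term."""
    ordered = sorted(set(selected_main))
    return list(combinations_with_replacement(ordered, 2))


# Hypothetical stage-one result: main effects 0, 3 and 7 were selected.
pairs = interaction_candidates([3, 0, 7])
```

With $s$ selected main effects, stage two searches over $s(s+1)/2$ interaction terms instead of all $p(p+1)/2$, which is where the computational savings of such a two-stage scheme would come from.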
• ### Will Sanders Supporters Jump Ship for Trump? Fine-grained Analysis of Twitter Followers(1605.09473)

May 31, 2016 cs.SI
In this paper, we study the likelihood of Bernie Sanders supporters voting for Donald Trump instead of Hillary Clinton. Building from a unique time-series dataset of the three candidates' Twitter followers, which we make public here, we first study the proportion of Sanders followers who simultaneously follow Trump (but not Clinton) and how this evolves over time. Then we train a convolutional neural network to classify the gender of Sanders followers, and study whether men are more likely to jump ship for Trump than women. Our study shows that between March and May an increasing proportion of Sanders followers are following Trump (but not Clinton). The proportion of Sanders followers who follow Clinton but not Trump has actually decreased. Equally important, our study suggests that the jumping ship behavior will be affected by gender and that men are more likely to switch to Trump than women.
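
The follower proportions described above reduce to set operations on follower ID lists. The sketch below uses made-up integer IDs and a hypothetical helper name, `share_following`; it is not taken from the paper's code or dataset.

```python
def share_following(base, also, excluding):
    """Fraction of `base` followers who also follow `also`
    but do not follow `excluding`."""
    if not base:
        raise ValueError("base follower set is empty")
    return len((base & also) - excluding) / len(base)


# Hypothetical follower ID sets for a single snapshot in time.
sanders = {1, 2, 3, 4}
trump = {2, 3, 9}
clinton = {3, 4}

trump_not_clinton = share_following(sanders, trump, clinton)
clinton_not_trump = share_following(sanders, clinton, trump)
```

Repeating the computation on each dated snapshot would give the time series of proportions that the abstract tracks between March and May.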
• ### Experimental Observation of the Quantum Anomalous Hall Effect in a Magnetic Topological Insulator(1605.08829)

May 28, 2016 cond-mat.mes-hall
The quantized version of the anomalous Hall effect has been predicted to occur in magnetic topological insulators, but the experimental realization has been challenging. Here, we report the observation of the quantum anomalous Hall (QAH) effect in thin films of Cr-doped (Bi,Sb)2Te3, a magnetic topological insulator. At zero magnetic field, the gate-tuned anomalous Hall resistance reaches the predicted quantized value of $h/e^2$, accompanied by a considerable drop of the longitudinal resistance. Under a strong magnetic field, the longitudinal resistance vanishes whereas the Hall resistance remains at the quantized value. The realization of the QAH effect may lead to the development of low-power-consumption electronics.
• ### Pricing the Woman Card: Gender Politics between Hillary Clinton and Donald Trump(1605.05401)

May 18, 2016 cs.SI
In this paper, we propose a data-driven method to measure the impact of the 'woman card' exchange between Hillary Clinton and Donald Trump. Building from a unique dataset of the two candidates' Twitter followers, we first examine the transition dynamics of the two candidates' Twitter followers one week before the exchange and one week after. Then we train a convolutional neural network to classify the gender of the followers and unfollowers, and study how women in particular are reacting to the 'woman card' exchange. Our study suggests that the 'woman card' comment has made women more likely to follow Hillary Clinton, less likely to unfollow her and that it has apparently not affected the gender composition of Trump followers.
• ### When Do Luxury Cars Hit the Road? Findings by A Big Data Approach(1605.02827)

May 11, 2016 cs.CV, cs.CY
In this paper, we focus on studying the appearing time of different kinds of cars on the road. This information will enable us to infer the lifestyle of the car owners. The results can further be used to guide marketing towards car owners. Conventionally, this kind of study is carried out by sending out questionnaires, which is limited in scale and diversity. To solve this problem, we propose a fully automatic method to carry out this study. Our study is based on publicly available surveillance camera data. To make the results reliable, we only use the high resolution cameras (i.e., resolution greater than $1280 \times 720$). Images from the public cameras are downloaded every minute. After obtaining 50,000 images, we apply Faster R-CNN (region-based convolutional neural network) to detect the cars in the downloaded images, and a fine-tuned VGG16 model is used to recognize the car makes. Based on the recognition results, we present a data-driven analysis of the relationship between car makes and their appearing times, with implications for lifestyles.
• ### Post Selection Shrinkage Estimation for High Dimensional Data Analysis(1603.07277)

March 23, 2016 stat.ME
In high-dimensional data settings where $p\gg n$, many penalized regularization approaches have been studied for simultaneous variable selection and estimation. However, in the presence of covariates with weak effects, many existing variable selection methods, including Lasso and its generations, cannot distinguish covariates with weak contributions from those with none. Thus, prediction based only on a subset model of selected covariates can be inefficient. In this paper, we propose a post selection shrinkage estimation strategy to improve the prediction performance of a selected subset model. Such a post selection shrinkage estimator (PSE) is data-adaptive and constructed by shrinking a post selection weighted ridge estimator in the direction of a selected candidate subset. Under an asymptotic distributional quadratic risk criterion, its prediction performance is explored analytically. We show that the proposed post selection PSE performs better than the post selection weighted ridge estimator. More importantly, it significantly improves the prediction performance of any candidate subset model selected from most existing Lasso-type variable selection methods. The relative performance of the post selection PSE is demonstrated by both simulation studies and real data analysis.