• ### Robust Frequent Directions with Application in Online Learning(1705.05067)

Feb. 24, 2019 cs.LG
The frequent directions (FD) technique is a deterministic approach for online sketching that has many applications in machine learning. The conventional FD is a heuristic procedure that often outputs rank deficient matrices. To overcome the rank deficiency problem, we propose a new sketching strategy called robust frequent directions (RFD) by introducing a regularization term. RFD can be derived from an optimization problem. It updates the sketch matrix and the regularization term adaptively and jointly. RFD reduces the approximation error of FD without increasing the computational cost. We also apply RFD to online learning and propose an effective hyperparameter-free online Newton algorithm. We derive a regret bound for our online Newton algorithm based on RFD, which guarantees the robustness of the algorithm. The experimental studies demonstrate that the proposed method outperforms state-of-the-art second order online learning algorithms.
• ### Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression(1604.08098)

Sept. 13, 2018 stat.CO, cs.LG, stat.ML
A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be handled by available computational resources. In this paper, we propose a general subsampling scheme for large-scale multi-class logistic regression and examine the variance of the resulting estimator. We show that asymptotically, the proposed method always achieves a smaller variance than that of the uniform random sampling. Moreover, when the classes are conditionally imbalanced, significant improvement over uniform sampling can be achieved. Empirical performance of the proposed method is compared to other methods on both simulated and real-world datasets, and these results match and confirm our theoretical analysis.
• ### Sparse Generalized Eigenvalue Problem: Optimal Statistical Rates via Truncated Rayleigh Flow(1604.08697)

Aug. 31, 2018 stat.ML
Sparse generalized eigenvalue problem (GEP) plays a pivotal role in a large family of high-dimensional statistical models, including sparse Fisher's discriminant analysis, canonical correlation analysis, and sufficient dimension reduction. Sparse GEP involves solving a non-convex optimization problem. Most existing methods and theory in the context of specific statistical models that are special cases of the sparse GEP require restrictive structural assumptions on the input matrices. In this paper, we propose a two-stage computational framework to solve the sparse GEP. At the first stage, we solve a convex relaxation of the sparse GEP. Taking the solution as an initial value, we then exploit a nonconvex optimization perspective and propose the truncated Rayleigh flow method (Rifle) to estimate the leading generalized eigenvector. We show that Rifle converges linearly to a solution with the optimal statistical rate of convergence for many statistical models. Theoretically, our method significantly improves upon the existing literature by eliminating structural assumptions on the input matrices for both stages. To achieve this, our analysis involves two key ingredients: (i) a new analysis of the gradient based method on nonconvex objective functions, and (ii) a fine-grained characterization of the evolution of sparsity patterns along the solution path. Thorough numerical studies are provided to validate the theoretical results.
• ### End-to-end Active Object Tracking via Reinforcement Learning(1705.10561)

June 1, 2018 cs.CV
We study active object tracking, where a tracker takes as input the visual observation (i.e., frame sequence) and produces the camera control signal (e.g., move forward, turn left, etc.). Conventional methods tackle the tracking and the camera control separately, which is challenging to tune jointly. It also incurs many human efforts for labeling and many expensive trial-and-errors in realworld. To address these issues, we propose, in this paper, an end-to-end solution via deep reinforcement learning, where a ConvNet-LSTM function approximator is adopted for the direct frame-toaction prediction. We further propose an environment augmentation technique and a customized reward function, which are crucial for a successful training. The tracker trained in simulators (ViZDoom, Unreal Engine) shows good generalization in the case of unseen object moving path, unseen object appearance, unseen background, and distracting object. It can restore tracking when occasionally losing the target. With the experiments over the VOT dataset, we also find that the tracking ability, obtained solely from simulators, can potentially transfer to real-world scenarios.
• ### Direct visualization of the nematic superconductivity in CuxBi2Se3(1804.09122)

CuxBi2Se3 hosts both topological surface states and bulk superconductivity. It has been identified recently as a topological superconductor (TSC) with an extraordinary nematic, i. e. C2-symmetric, superconducting state and odd-parity pairing. Here, using scanning tunneling microscopy (STM), we directly examine the response of the superconductivity of CuxBi2Se3 to magnetic field. Under out- of-plane (c-axis) fields, we discover elongated magnetic vortices hosting zero-bias conductance peaks consistent with the Majorana bound states expected in a TSC. Under in-plane (ab-plane) fields, the average superconducting gap exhibits two-fold symmetry with field orientation, the long C2 symmetry axes are pinned to the dihedral mirror planes under B//=0.5 T but slightly rotate under B//=1.0 T. Moreover, the intrinsic C2-symmetric gap distribution in momentum space is semi-quantitatively determined for the first time. Our data paint a microscopic picture of the nematic superconductivity in CuxBi2Se3 and pose strong constraints on theory.
• ### Decentralization Meets Quantization(1803.06443)

April 20, 2018 cs.DC, cs.LG, stat.ML, cs.SY
Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: {\em quantization} for low bandwidth networks, and {\em decentralization} for high latency networks. In this paper, we explore a natural question: {\em can the combination of both decentralization and quantization lead to a system that is robust to both bandwidth and latency?} Although the system implication of such combination is trivial, the underlying theoretical principle and algorithm design is challenging: simply quantizing data sent in a decentralized training algorithm would accumulate the error. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call {\em extrapolation compression} and {\em difference compression}. We analyze both algorithms and prove both converge at the rate of $O(1/\sqrt{nT})$ where $n$ is the number of workers and $T$ is the number of iterations, matching the {\rc convergence} rate for full precision, centralized training. We evaluate our algorithms on training deep learning models, and find that our proposed algorithm outperforms the best of merely decentralized and merely quantized algorithm significantly for networks with {\em both} high latency and low bandwidth.
• ### Incorporating Pseudo-Parallel Data for Quantifiable Sequence Editing(1804.07007)

April 19, 2018 cs.CL
In the task of quantifiable sequence editing (QuaSE), a model needs to edit an input sentence to generate an output that satisfies a given outcome, which is a numerical value measuring a certain property of the output. For example, for review sentences, the outcome could be review ratings; for advertisement, the outcome could be click-through rate. We propose a framework which performs QuaSE by incorporating pseudo-parallel data. Our framework can capture the content similarity and the outcome differences by exploiting pseudo-parallel sentence pairs, which enables a better disentanglement of the latent factors that are relevant to the outcome and thus provides a solid basis to generate output satisfying the desired outcome. The dual reconstruction structure further enhances the capability of generating expected output by exploiting the coupling of latent factors of pseudo-parallel sentences. We prepare a dataset of Yelp review sentences with the ratings as outcome. Experimental results show that our framework can outperform state-of-the-art methods under both sentiment polarity accuracy and target value errors.
• ### Fine-grained Video Attractiveness Prediction Using Multimodal Deep Learning on a Large Real-world Dataset(1804.01373)

April 7, 2018 cs.CV
Nowadays, billions of videos are online ready to be viewed and shared. Among an enormous volume of videos, some popular ones are widely viewed by online users while the majority attract little attention. Furthermore, within each video, different segments may attract significantly different numbers of views. This phenomenon leads to a challenging yet important problem, namely fine-grained video attractiveness prediction. However, one major obstacle for such a challenging problem is that no suitable benchmark dataset currently exists. To this end, we construct the first fine-grained video attractiveness dataset, which is collected from one of the most popular video websites in the world. In total, the constructed FVAD consists of 1,019 drama episodes with 780.6 hours covering different categories and a wide variety of video contents. Apart from the large amount of videos, hundreds of millions of user behaviors during watching videos are also included, such as "view counts", "fast-forward", "fast-rewind", and so on, where "view counts" reflects the video attractiveness while other engagements capture the interactions between the viewers and videos. First, we demonstrate that video attractiveness and different engagements present different relationships. Second, FVAD provides us an opportunity to study the fine-grained video attractiveness prediction problem. We design different sequential models to perform video attractiveness prediction by relying solely on video contents. The sequential models exploit the multimodal relationships between visual and audio components of the video contents at different levels. Experimental results demonstrate the effectiveness of our proposed sequential models with different visual and audio representations, the necessity of incorporating the two modalities, and the complementary behaviors of the sequential prediction models at different levels.
• ### Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective(1803.10910)

March 29, 2018 cs.CV
The success of current deep saliency detection methods heavily depends on the availability of large-scale supervision in the form of per-pixel labeling. Such supervision, while labor-intensive and not always possible, tends to hinder the generalization ability of the learned models. By contrast, traditional handcrafted features based unsupervised saliency detection methods, even though have been surpassed by the deep supervised methods, are generally dataset-independent and could be applied in the wild. This raises a natural question that "Is it possible to learn saliency maps without using labeled data while improving the generalization ability?". To this end, we present a novel perspective to unsupervised saliency detection through learning from multiple noisy labeling generated by "weak" and "noisy" unsupervised handcrafted saliency methods. Our end-to-end deep learning framework for unsupervised saliency detection consists of a latent saliency prediction module and a noise modeling module that work collaboratively and are optimized jointly. Explicit noise modeling enables us to deal with noisy saliency maps in a probabilistic way. Extensive experimental results on various benchmarking datasets show that our model not only outperforms all the unsupervised saliency methods with a large margin but also achieves comparable performance with the recent state-of-the-art supervised deep saliency methods.
• ### Tensor graph convolutional neural network(1803.10071)

March 27, 2018 cs.CV
In this paper, we propose a novel tensor graph convolutional neural network (TGCNN) to conduct convolution on factorizable graphs, for which here two types of problems are focused, one is sequential dynamic graphs and the other is cross-attribute graphs. Especially, we propose a graph preserving layer to memorize salient nodes of those factorized subgraphs, i.e. cross graph convolution and graph pooling. For cross graph convolution, a parameterized Kronecker sum operation is proposed to generate a conjunctive adjacency matrix characterizing the relationship between every pair of nodes across two subgraphs. Taking this operation, then general graph convolution may be efficiently performed followed by the composition of small matrices, which thus reduces high memory and computational burden. Encapsuling sequence graphs into a recursive learning, the dynamics of graphs can be efficiently encoded as well as the spatial layout of graphs. To validate the proposed TGCNN, experiments are conducted on skeleton action datasets as well as matrix completion dataset. The experiment results demonstrate that our method can achieve more competitive performance with the state-of-the-art methods.
• ### An Analysis on Caching Placement for Millimeter/Micro Wave Hybrid Networks(1802.10326)

March 26, 2018 eess.SP
In this paper, we consider a hybrid millimeter wave (mmWave) and micro wave ($\mu$Wave) network from the perspective of \emph{wireless caching} and study the optimal probabilistic content/file caching placement at desirable base stations (BSs) using a stochastic geometric framework. Considering the average success probability (ASP) of file delivery as the performance metric, we derive expressions for the association probability of the typical user to the mmWave and $\mu$Wave networks. Accordingly, we provide an upper bound for the ASP of file delivery and formulate the content caching placement scheme as an optimization problem with respect to caching probabilities, that jointly optimizes the ASP of file delivery considering both content placement and delivery phases. In particular, we consider the caching placement strategy under both noise-limited and interference-limited environments. We numerically evaluate the performance of the proposed caching schemes under essential factors, such as blockages in the mmWave network, cluster radius, BS density, and path loss and compare it with uniform caching placement, caching $M$ most popular contents, and random caching placement. Numerical results demonstrate the superiority of the proposed caching scheme over others, albeit certain trade-offs.
• ### Fibered varieties over curves with low slope and sharp bounds in dimension three(1803.03884)

March 21, 2018 math.AG
In this paper, we first construct varieties of any dimension $n>2$ fibered over curves with low slopes. These examples violate the conjectural slope inequality of Barja and Stoppino [BS14b]. Led by their conjecture, we focus on finding the lowest possible slope when $n=3$. Based on a characteristic $p > 0$ method, we prove that the sharp lower bound of the slope of fibered $3$-folds over curves is $4/3$, and it occurs only when the general fiber is a $(1, 2)$-surface. Otherwise, the sharp lower bound is $2$. We also obtain a Cornalba-Harris-Xiao type slope inequality for families of surfaces of general type over curves, and it is sharper than all previously known results. As an application of the slope bound, we deduce a sharp Noether-Severi type inequality that $K_X^3 \ge 2\chi(X, \omega_X)$ for an irregular minimal $3$-fold $X$ of general type not having a $(1,2)$-surface Albanese fibration. It answers a question in [Zha15] and thus completes the full Severi type inequality for irregular $3$-folds of general type.
• ### Evidence of nematic order and nodal superconducting gap along [110] direction in RbFe2As2(1803.07304)

Unconventional superconductivity often intertwines with various forms of order, such as the "nematic" order which breaks the rotational symmetry of the lattice. Investigation of these ordered phases sheds crucial light on the superconductivity itself. Here we report a low-temperature scanning tunneling microscopy (STM) study on RbFe2As2, a heavily hole-doped Fe-based superconductor (FeSC). We observe significant symmetry breaking in its electronic structure and magnetic vortex which differentiates the ({\pi}, {\pi}) and ({\pi}, -{\pi}) directions. It is thus a novel nematic state, distinct from the nematicity of undoped/lightly-doped FeSCs which breaks the ({\pi}, 0) / (0, {\pi}) equivalence. Moreover, we observe a clear "V" -shaped superconducting gap which can be well fitted with a nodal gap function. The gap is found to be suppressed on surface Rb vacancies and at step edges, and particularly strong at the [110]-oriented edges, which is possibly due to a dx2-y2 like pairing component with nodes along diagonal directions. Our results highlight the intimate connection between nematicity and superconducting pairing, and promote a universal understanding of them.
• ### NDDR-CNN: Layer-wise Feature Fusing in Multi-Task CNN by Neural Discriminative Dimensionality Reduction(1801.08297)

March 13, 2018 cs.CV, cs.LG
State-of-the-art Convolutional Neural Network (CNN) benefits much from multi-task learning (MTL), which learns multiple related tasks simultaneously to obtain shared or mutually related representations for different tasks. The most widely used MTL CNN structure is based on an empirical or heuristic split on a specific layer (e.g., the last convolutional layer) to minimize multiple task-specific losses. However, this heuristic sharing/splitting strategy may be harmful to the final performance of one or multiple tasks. In this paper, we propose a novel CNN structure for MTL, which enables automatic feature fusing at every layer. Specifically, we first concatenate features from different tasks according to their channel dimension, and then formulate the feature fusing problem as discriminative dimensionality reduction. We show that this discriminative dimensionality reduction can be fulfilled by 1x1 Convolution, Batch Normalization, and Weight Decay in one CNN, which we refer to as Neural Discriminative Dimensionality Reduction (NDDR). We perform detailed ablation analysis for different configurations in training the proposed NDDR-CNN network. The experiments carried out on different network structures and different task sets demonstrate the promising performance and desirable generalizability of our proposed method.
• ### Relative Clifford inequality for varieties fibered by curves(1706.06523)

March 7, 2018 math.AG
We prove a sharp relative Clifford inequality for relatively special divisors on varieties fibered by curves. It generalizes the classical Clifford inequality about a single curve to a family of curves. It yields a geographical inequality for varieties of general type and Albanese-fibered by curves, extending the work of Horikawa, Persson, and Xiao in dimension two to arbitrary dimensions. We also apply it to deduce a slope inequality for some arbitrary dimensional families of curves. It sheds light on the existence of a most general Cornalba-Harris-Xiao type inequality for families of curves. The whole proof is built on a new tree-like filtration for nef divisors on varieties fibered by curves. One key ingredient of the proof is to estimate the sum of all admissible products of nef thresholds with respect to this filtration.
• ### Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents(1802.08757)

Feb. 27, 2018 cs.AI, cs.MA, math.OC, cs.LG, stat.ML
We consider the problem of \emph{fully decentralized} multi-agent reinforcement learning (MARL), where the agents are located at the nodes of a time-varying communication network. Specifically, we assume that the reward functions of the agents might correspond to different tasks, and are only known to the corresponding agent. Moreover, each agent makes individual decisions based on both the information observed locally and the messages received from its neighbors over the network. Within this setting, the collective goal of the agents is to maximize the globally averaged return over the network through exchanging information with their neighbors. To this end, we propose two decentralized actor-critic algorithms with function approximation, which are applicable to large-scale MARL problems where both the number of states and the number of agents are massively large. Under the decentralized structure, the actor step is performed individually by each agent with no need to infer the policies of others. For the critic step, we propose a consensus update via communication over the network. Our algorithms are fully incremental and can be implemented in an online fashion. Convergence analyses of the algorithms are provided when the value functions are approximated within the class of linear functions. Extensive simulation results with both linear and nonlinear function approximations are presented to validate the proposed algorithms. Our work appears to be the first study of fully decentralized MARL algorithms for networked agents with function approximation, with provable convergence guarantees.
• ### Bayesian Model Averaging with Exponentiated Least Square Loss(1408.1234)

Feb. 27, 2018 math.ST, stat.TH
The model averaging problem is to average multiple models to achieve a prediction accuracy not much worse than that of the best single model in terms of mean squared error. It is known that if the models are misspecified, model averaging is superior to model selection. Specifically, let $n$ be the sample size, then the worst case regret of the former decays at a rate of $O(1/n)$ while the worst case regret of the latter decays at a rate of $O(1/\sqrt{n})$. The recently proposed $Q$-aggregation algorithm \citep{DaiRigZhang12} solves the model averaging problem with the optimal regret of $O(1/n)$ both in expectation and in deviation; however it suffers from two limitations: (1) for continuous dictionary, the proposed greedy algorithm for solving $Q$-aggregation is not applicable; (2) the formulation of $Q$-aggregation appears ad hoc without clear intuition. This paper examines a different approach to model averaging by considering a Bayes estimator for deviation optimal model averaging by using exponentiated least squares loss. We establish a primal-dual relationship of this estimator and that of $Q$-aggregation and propose new greedy procedures that satisfactorily resolve the above mentioned limitations of $Q$-aggregation.
• ### On Quadratic Convergence of DC Proximal Newton Algorithm for Nonconvex Sparse Learning in High Dimensions(1706.06066)

Feb. 15, 2018 math.OC, cs.LG, stat.ML
We propose a DC proximal Newton algorithm for solving nonconvex regularized sparse learning problems in high dimensions. Our proposed algorithm integrates the proximal Newton algorithm with multi-stage convex relaxation based on the difference of convex (DC) programming, and enjoys both strong computational and statistical guarantees. Specifically, by leveraging a sophisticated characterization of sparse modeling structures/assumptions (i.e., local restricted strong convexity and Hessian smoothness), we prove that within each stage of convex relaxation, our proposed algorithm achieves (local) quadratic convergence, and eventually obtains a sparse approximate local optimum with optimal statistical properties after only a few convex relaxations. Numerical experiments are provided to support our theory.

Jan. 19, 2018 cs.LG, stat.ML
Generative adversarial networks (GAN) have become popular for generating data that mimic observations by learning a suitable variable transformation from a random variable. However, empirically, GAN is known to suffer from instability. Also, the theory provided based on the minimax optimization formulation of GAN cannot explain the widely-used practical procedure that uses the so-called logd trick. This paper provides a different theoretical foundation for generative adversarial methods which does not rely on the minimax formulation. We show that with a strong discriminator, it is possible to learn a good variable transformation via functional gradient learning, which updates the functional definition of a generator model, instead of updating only the model parameters as in GAN. The theory guarantees that the learned generator improves the KL-divergence between the probability distributions of real data and generated data after each functional gradient step, until the KL-divergence converges to zero. This new point of view leads to enhanced stable procedures for training generative models that can utilize arbitrary learning algorithms. It also gives a new theoretical insight into the original GAN procedure both with and without the logd trick. Empirical results are shown on image generation to illustrate the effectiveness of our new method.
• ### Translating Pro-Drop Languages with Reconstruction Models(1801.03257)

Jan. 10, 2018 cs.CL
Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations. To date, very little attention has been paid to the dropped pronoun (DP) problem within neural machine translation (NMT). In this work, we propose a novel reconstruction-based approach to alleviating DP translation problems for NMT models. Firstly, DPs within all source sentences are automatically annotated with parallel information extracted from the bilingual training corpus. Next, the annotated source sentence is reconstructed from hidden representations in the NMT model. With auxiliary training objectives, in terms of reconstruction scores, the parameters associated with the NMT model are guided to produce enhanced hidden representations that are encouraged as much as possible to embed annotated DP information. Experimental results on both Chinese-English and Japanese-English dialogue translation tasks show that the proposed approach significantly and consistently improves translation performance over a strong NMT baseline, which is directly built on the training data annotated with DPs.
• ### Insight-HXMT observations of the first binary neutron star merger GW170817(1710.06065)

Oct. 17, 2017 astro-ph.HE
Finding the electromagnetic (EM) counterpart of binary compact star merger, especially the binary neutron star (BNS) merger, is critically important for gravitational wave (GW) astronomy, cosmology and fundamental physics. On Aug. 17, 2017, Advanced LIGO and \textit{Fermi}/GBM independently triggered the first BNS merger, GW170817, and its high energy EM counterpart, GRB 170817A, respectively, resulting in a global observation campaign covering gamma-ray, X-ray, UV, optical, IR, radio as well as neutrinos. The High Energy X-ray telescope (HE) onboard \textit{Insight}-HXMT (Hard X-ray Modulation Telescope) is the unique high-energy gamma-ray telescope that monitored the entire GW localization area and especially the optical counterpart (SSS17a/AT2017gfo) with very large collection area ($\sim$1000 cm$^2$) and microsecond time resolution in 0.2-5 MeV. In addition, \textit{Insight}-HXMT quickly implemented a Target of Opportunity (ToO) observation to scan the GW localization area for potential X-ray emission from the GW source. Although it did not detect any significant high energy (0.2-5 MeV) radiation from GW170817, its observation helped to confirm the unexpected weak and soft nature of GRB 170817A. Meanwhile, \textit{Insight}-HXMT/HE provides one of the most stringent constraints (~10$^{-7}$ to 10$^{-6}$ erg/cm$^2$/s) for both GRB170817A and any other possible precursor or extended emissions in 0.2-5 MeV, which help us to better understand the properties of EM radiation from this BNS merger. Therefore the observation of \textit{Insight}-HXMT constitutes an important chapter in the full context of multi-wavelength and multi-messenger observation of this historical GW event.
• ### Second-Order Methods with Cubic Regularization Under Inexact Information(1710.05782)

Oct. 16, 2017 math.OC
In this paper, we generalize (accelerated) Newton's method with cubic regularization under inexact second-order information for (strongly) convex optimization problems. Under mild assumptions, we provide global rate of convergence of these methods and show the explicit dependence of the rate of convergence on the problem parameters. While the complexity bounds of our presented algorithms are theoretically worse than those of their exact counterparts, they are at least as good as those of the optimal first-order methods. Our numerical experiments also show that using inexact Hessians can significantly speed up the algorithms in practice.
• ### Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations(1706.07001)

Oct. 11, 2017 math.OC, cs.LG, stat.ML
We present novel minibatch stochastic optimization methods for empirical risk minimization problems, the methods efficiently leverage variance reduced first-order and sub-sampled higher-order information to accelerate the convergence speed. For quadratic objectives, we prove improved iteration complexity over state-of-the-art under reasonable assumptions. We also provide empirical evidence of the advantages of our method compared to existing approaches in the literature.
• ### Near-Optimal Stochastic Approximation for Online Principal Component Estimation(1603.05305)

Oct. 5, 2017 math.OC, stat.ML
Principal component analysis (PCA) has been a prominent tool for high-dimensional data analysis. Online algorithms that estimate the principal component by processing streaming data are of tremendous practical and theoretical interests. Despite its rich applications, theoretical convergence analysis remains largely open. In this paper, we cast online PCA into a stochastic nonconvex optimization problem, and we analyze the online PCA algorithm as a stochastic approximation iteration. The stochastic approximation iteration processes data points incrementally and maintains a running estimate of the principal component. We prove for the first time a nearly optimal finite-sample error bound for the online PCA algorithm. Under the subgaussian assumption, we show that the finite-sample error bound closely matches the minimax information lower bound.
• ### Deep Subspace Clustering Networks(1709.02508)

Sept. 8, 2017 cs.CV
We present a novel deep neural network architecture for unsupervised subspace clustering. This architecture is built upon deep auto-encoders, which non-linearly map the input data into a latent space. Our key idea is to introduce a novel self-expressive layer between the encoder and the decoder to mimic the "self-expressiveness" property that has proven effective in traditional subspace clustering. Being differentiable, our new self-expressive layer provides a simple but effective way to learn pairwise affinities between all data points through a standard back-propagation procedure. Being nonlinear, our neural-network based method is able to cluster data points having complex (often nonlinear) structures. We further propose pre-training and fine-tuning strategies that let us effectively learn the parameters of our subspace clustering networks. Our experiments show that the proposed method significantly outperforms the state-of-the-art unsupervised subspace clustering methods.