• In this paper, we propose an improved quantitative evaluation framework for Generative Adversarial Networks (GANs) on generating domain-specific images, where we improve conventional evaluation methods on two levels: the feature representation and the evaluation metric. Unlike most existing evaluation frameworks which transfer the representation of ImageNet inception model to map images onto the feature space, our framework uses a specialized encoder to acquire fine-grained domain-specific representation. Moreover, for datasets with multiple classes, we propose Class-Aware Frechet Distance (CAFD), which employs a Gaussian mixture model on the feature space to better fit the multi-manifold feature distribution. Experiments and analysis on both the feature level and the image level were conducted to demonstrate improvements of our proposed framework over the recently proposed state-of-the-art FID method. To our best knowledge, we are the first to provide counter examples where FID gives inconsistent results with human judgments. It is shown in the experiments that our framework is able to overcome the shortness of FID and improves robustness. Code will be made available.
  • In this paper, we present a deep extension of Sparse Subspace Clustering, termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit sphere distribution assumption for the learned deep features, DSSC can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity given by neural networks. One of the appealing advantages brought by DSSC is: when original real-world data do not meet the class-specific linear subspace distribution assumption, DSSC can employ neural networks to make the assumption valid with its hierarchical nonlinear transformations. To the best of our knowledge, this is among the first deep learning based subspace clustering methods. Extensive experiments are conducted on four real-world datasets to show the proposed DSSC is significantly superior to 12 existing methods for subspace clustering.
  • Semantic parsing of large-scale 3D point clouds is an important research topic in computer vision and remote sensing fields. Most existing approaches utilize hand-crafted features for each modality independently and combine them in a heuristic manner. They often fail to consider the consistency and complementary information among features adequately, which makes them difficult to capture high-level semantic structures. The features learned by most of the current deep learning methods can obtain high-quality image classification results. However, these methods are hard to be applied to recognize 3D point clouds due to unorganized distribution and various point density of data. In this paper, we propose a 3DCNN-DQN-RNN method which fuses the 3D convolutional neural network (CNN), Deep Q-Network (DQN) and Residual recurrent neural network (RNN) for an efficient semantic parsing of large-scale 3D point clouds. In our method, an eye window under control of the 3D CNN and DQN can localize and segment the points of the object class efficiently. The 3D CNN and Residual RNN further extract robust and discriminative features of the points in the eye window, and thus greatly enhance the parsing accuracy of large-scale point clouds. Our method provides an automatic process that maps the raw data to the classification results. It also integrates object localization, segmentation and classification into one framework. Experimental results demonstrate that the proposed method outperforms the state-of-the-art point cloud classification methods.
  • In this paper, we propose a new correlated and individual multi-modal deep learning (CIMDL) method for RGB-D object recognition. Unlike most conventional RGB-D object recognition methods which extract features from the RGB and depth channels individually, our CIMDL jointly learns feature representations from raw RGB-D data with a pair of deep neural networks, so that the sharable and modal-specific information can be simultaneously exploited. Specifically, we construct a pair of deep convolutional neural networks (CNNs) for the RGB and depth data, and concatenate them at the top layer of the network with a loss function which learns a new feature space where both correlated part and the individual part of the RGB-D information are well modelled. The parameters of the whole networks are updated by using the back-propagation criterion. Experimental results on two widely used RGB-D object image benchmark datasets clearly show that our method outperforms state-of-the-arts.
  • In this paper, we address two challenging problems in unsupervised subspace learning: 1) how to automatically identify the feature dimension of the learned subspace (i.e., automatic subspace learning), and 2) how to learn the underlying subspace in the presence of Gaussian noise (i.e., robust subspace learning). We show that these two problems can be simultaneously solved by proposing a new method (called principal coefficients embedding, PCE). For a given data set $\mathbf{D}\in \mathds{R}^{m\times n}$, PCE recovers a clean data set $\mathbf{D}_{0}\in \mathds{R}^{m\times n}$ from $\mathbf{D}$ and simultaneously learns a global reconstruction relation $\mathbf{C}\in \mathbf{R}^{n\times n}$ of $\mathbf{D}_{0}$. By preserving $\mathbf{C}$ into an $m^{\prime}$-dimensional space, the proposed method obtains a projection matrix that can capture the latent manifold structure of $\mathbf{D}_{0}$, where $m^{\prime}\ll m$ is automatically determined by the rank of $\mathbf{C}$ with theoretical guarantees. PCE has three advantages: 1) it can automatically determine the feature dimension even though data are sampled from a union of multiple linear subspaces in presence of the Gaussian noise, 2) Although the objective function of PCE only considers the Gaussian noise, experimental results show that it is robust to the non-Gaussian noise (\textit{e.g.}, random pixel corruption) and real disguises, 3) Our method has a closed-form solution and can be calculated very fast. Extensive experimental results show the superiority of PCE on a range of databases with respect to the classification accuracy, robustness and efficiency.
  • Matching pedestrians across multiple camera views known as human re-identification (re-identification) is a challenging problem in visual surveillance. In the existing works concentrating on feature extraction, representations are formed locally and independent of other regions. We present a novel siamese Long Short-Term Memory (LSTM) architecture that can process image regions sequentially and enhance the discriminative capability of local feature representation by leveraging contextual information. The feedback connections and internal gating mechanism of the LSTM cells enable our model to memorize the spatial dependencies and selectively propagate relevant contextual information through the network. We demonstrate improved performance compared to the baseline algorithm with no LSTM units and promising results compared to state-of-the-art methods on Market-1501, CUHK03 and VIPeR datasets. Visualization of the internal mechanism of LSTM cells shows meaningful patterns can be learned by our method.
  • This paper proposes a joint multi-task learning algorithm to better predict attributes in images using deep convolutional neural networks (CNN). We consider learning binary semantic attributes through a multi-task CNN model, where each CNN will predict one binary attribute. The multi-task learning allows CNN models to simultaneously share visual knowledge among different attribute categories. Each CNN will generate attribute-specific feature representations, and then we apply multi-task learning on the features to predict their attributes. In our multi-task framework, we propose a method to decompose the overall model's parameters into a latent task matrix and combination matrix. Furthermore, under-sampled classifiers can leverage shared statistics from other classifiers to improve their performance. Natural grouping of attributes is applied such that attributes in the same group are encouraged to share more knowledge. Meanwhile, attributes in different groups will generally compete with each other, and consequently share less knowledge. We show the effectiveness of our method on two popular attribute datasets.
  • Person re-identification aims at matching pedestrians observed from non-overlapping camera views. Feature descriptor and metric learning are two significant problems in person re-identification. A discriminative metric learning method should be capable of exploiting complex nonlinear transformations due to the large variations in feature space. In this paper, we propose a nonlinear local metric learning (NLML) method to improve the state-of-the-art performance of person re-identification on public datasets. Motivated by the fact that local metric learning has been introduced to handle the data which varies locally and deep neural network has presented outstanding capability in exploiting the nonlinearity of samples, we utilize the merits of both local metric learning and deep neural network to learn multiple sets of nonlinear transformations. By enforcing a margin between the distances of positive pedestrian image pairs and distances of negative pairs in the transformed feature subspace, discriminative information can be effectively exploited in the developed neural networks. Our experiments show that the proposed NLML method achieves the state-of-the-art results on the widely used VIPeR, GRID, and CUHK 01 datasets.
  • Matching people across multiple camera views known as person re-identification, is a challenging problem due to the change in visual appearance caused by varying lighting conditions. The perceived color of the subject appears to be different with respect to illumination. Previous works use color as it is or address these challenges by designing color spaces focusing on a specific cue. In this paper, we propose a data driven approach for learning color patterns from pixels sampled from images across two camera views. The intuition behind this work is that, even though pixel values of same color would be different across views, they should be encoded with the same values. We model color feature generation as a learning problem by jointly learning a linear transformation and a dictionary to encode pixel values. We also analyze different photometric invariant color spaces. Using color as the only cue, we compare our approach with all the photometric invariant color spaces and show superior performance over all of them. Combining with other learned low-level and high-level features, we obtain promising results in ViPER, Person Re-ID 2011 and CAVIAR4REID datasets.
  • In this work, we propose a very simple deep learning network for image classification which comprises only the very basic data processing components: cascaded principal component analysis (PCA), binary hashing, and block-wise histograms. In the proposed architecture, PCA is employed to learn multistage filter banks. It is followed by simple binary hashing and block histograms for indexing and pooling. This architecture is thus named as a PCA network (PCANet) and can be designed and learned extremely easily and efficiently. For comparison and better understanding, we also introduce and study two simple variations to the PCANet, namely the RandNet and LDANet. They share the same topology of PCANet but their cascaded filters are either selected randomly or learned from LDA. We have tested these basic networks extensively on many benchmark visual datasets for different tasks, such as LFW for face verification, MultiPIE, Extended Yale B, AR, FERET datasets for face recognition, as well as MNIST for hand-written digits recognition. Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state of the art features, either prefixed, highly hand-crafted or carefully learned (by DNNs). Even more surprisingly, it sets new records for many classification tasks in Extended Yale B, AR, FERET datasets, and MNIST variations. Additional experiments on other public datasets also demonstrate the potential of the PCANet serving as a simple but highly competitive baseline for texture classification and object recognition.
  • We present an improved Locality Preserving Projections (LPP) method, named Gloablity-Locality Preserving Projections (GLPP), to preserve both the global and local geometric structures of data. In our approach, an additional constraint of the geometry of classes is imposed to the objective function of conventional LPP for respecting some more global manifold structures. Moreover, we formulate a two-dimensional extension of GLPP (2D-GLPP) as an example to show how to extend GLPP with some other statistical techniques. We apply our works to face recognition on four popular face databases, namely ORL, Yale, FERET and LFW-A databases, and extensive experimental results demonstrate that the considered global manifold information can significantly improve the performance of LPP and the proposed face recognition methods outperform the state-of-the-arts.