• We propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and arbitrary numbers of photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms. In particular, we devise a learning-based pipeline of algorithms capable of automatically generating and rendering a potentially infinite variety of indoor scenes by using a stochastic grammar, represented as an attributed Spatial And-Or Graph, in conjunction with state-of-the-art physics-based rendering. Our pipeline is capable of synthesizing scene layouts with high diversity, and it is configurable inasmuch as it enables the precise customization and control of important attributes of the generated scenes. It renders photorealistic RGB images of the generated scenes while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity, and material information (detailed to object parts), as well as environments (e.g., illuminations and camera viewpoints). We demonstrate the value of our synthesized dataset, by improving performance in certain machine-learning-based scene understanding tasks--depth and surface normal prediction, semantic segmentation, reconstruction, etc.--and by providing benchmarks for and diagnostics of trained models by modifying object attributes and scene properties in a controllable manner.
  • This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between human and environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal stochastic grammar defined on sub-activities, and spatial graphs representing sub-activities that consist of human actions, objects, and their affordances. Future sub-activities are predicted using the temporal grammar and Earley parsing algorithm. The corresponding action, object, and affordance labels are then inferred accordingly. Extensive experiments are conducted to show the effectiveness of our model on both semantic event parsing and future activity prediction.
  • Person re-identification aims at matching pedestrians observed from non-overlapping camera views. Feature descriptor and metric learning are two significant problems in person re-identification. A discriminative metric learning method should be capable of exploiting complex nonlinear transformations due to the large variations in feature space. In this paper, we propose a nonlinear local metric learning (NLML) method to improve the state-of-the-art performance of person re-identification on public datasets. Motivated by the fact that local metric learning has been introduced to handle the data which varies locally and deep neural network has presented outstanding capability in exploiting the nonlinearity of samples, we utilize the merits of both local metric learning and deep neural network to learn multiple sets of nonlinear transformations. By enforcing a margin between the distances of positive pedestrian image pairs and distances of negative pairs in the transformed feature subspace, discriminative information can be effectively exploited in the developed neural networks. Our experiments show that the proposed NLML method achieves the state-of-the-art results on the widely used VIPeR, GRID, and CUHK 01 datasets.
  • Conventional single image based localization methods usually fail to localize a querying image when there exist large variations between the querying image and the pre-built scene. To address this, we propose an image-set querying based localization approach. When the localization by a single image fails to work, the system will ask the user to capture more auxiliary images. First, a local 3D model is established for the querying image set. Then, the pose of the querying image set is estimated by solving a nonlinear optimization problem, which aims to match the local 3D model against the pre-built scene. Experiments have shown the effectiveness and feasibility of the proposed approach.