• A problem not well understood in video hyperlinking is what qualifies a fragment as an anchor or a target. Ideally, anchors provide good starting points for navigation, and targets supplement anchors with additional details without distracting users with irrelevant, false or redundant information. The problem is non-trivial because of the intertwined relationship between data characteristics and user expectations. Imagine that in a large dataset, clusters of fragments are spread over the feature space. Each cluster can be characterized by its size (implying popularity) and its structure (implying complexity). A principled way of hyperlinking is to pick the centers of clusters as anchors and, from there, reach out to targets within or outside the clusters while taking neighborhood complexity into account. The question is which fragments should be selected as anchors or targets so as to reflect the rich content of a dataset while minimizing the risk of a frustrating user experience. This paper provides some insights into this question from the perspective of hubness and local intrinsic dimensionality, two statistical properties for assessing the popularity and complexity of the data space. Based on these properties, two novel algorithms are proposed for low-risk automatic selection of anchors and targets.
  • In recent years, both online retail and video hosting services have been growing exponentially. In this paper, we explore a new cross-domain task, Video2Shop, which aims to match clothes appearing in videos to the exact same items in online shops. A novel deep neural network, called AsymNet, is proposed to address this problem. For the image side, well-established methods are used to detect clothing patches of arbitrary sizes and extract their features. For the video side, deep visual features are extracted from detected object regions in each frame and fed into a Long Short-Term Memory (LSTM) framework for sequence modeling, which captures the temporal dynamics in videos. To perform exact matching between videos and online shopping images, the LSTM hidden states, which represent the video, and the image features, which represent static object images, are jointly modeled by a similarity network with a reconfigurable deep tree structure. Moreover, an approximate training method is proposed to improve training efficiency. Extensive experiments on a large cross-domain dataset demonstrate the effectiveness and efficiency of the proposed AsymNet, which outperforms state-of-the-art methods.
  • This paper addresses a challenging problem: how to generate multi-view clothing images from only a single-view input. To generate realistic-looking images with views different from the input, we propose a new image generation model termed VariGANs, which combines the strengths of variational inference and Generative Adversarial Networks (GANs). VariGANs generates the target image in a coarse-to-fine manner rather than in a single pass, which suffers from severe artifacts. It first performs variational inference to model the global appearance of the object (e.g., shape and color) and produces a coarse image with a different view. Conditioned on the generated low-resolution image, it then performs adversarial learning to fill in details and generate images whose details are consistent with the input. Extensive experiments on two clothing datasets, MVC and DeepFashion, demonstrate that novel-view images generated by our model are more plausible than those generated by existing approaches, with more consistent global appearance and richer, sharper details.
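The video hyperlinking abstract relies on hubness and local intrinsic dimensionality (LID), both of which have standard nearest-neighbour estimators. The sketch below is not the paper's anchor/target algorithm, which the abstract does not specify. It computes the k-occurrence count N_k(x), commonly used to measure hubness, and the maximum-likelihood LID estimate LID(x) = -1 / mean_i log(r_i / r_k) over the k nearest-neighbour distances. The Euclidean distance, the values of k, and the helper `rank_anchor_candidates` (which prefers popular, low-complexity points) are all assumptions for illustration.

```python
import numpy as np

def knn(X, k):
    """Indices and distances of the k nearest neighbours of each row, excluding the row itself."""
    # Brute-force pairwise Euclidean distances: O(n^2) memory, fine for a small sketch.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per row, closest first
    dist = np.take_along_axis(d, idx, axis=1)
    return idx, dist

def hubness(X, k):
    """k-occurrence N_k(x): how many points list x among their k nearest neighbours."""
    idx, _ = knn(X, k)
    return np.bincount(idx.ravel(), minlength=len(X))

def lid_mle(X, k):
    """Maximum-likelihood LID estimate from each point's k nearest-neighbour distances."""
    _, dist = knn(X, k)
    r_k = dist[:, -1:]                          # distance to the k-th neighbour
    return -1.0 / np.mean(np.log(dist / r_k), axis=1)

def rank_anchor_candidates(X, k=10):
    """Hypothetical heuristic: order points by high hubness first, then low LID."""
    h = hubness(X, k)
    lid = lid_mle(X, k)
    # np.lexsort uses the last key as the primary key: -h (descending hubness), then lid.
    return np.lexsort((lid, -h))
```

Points with large N_k are "popular" in the sense of appearing in many neighbourhoods, while a low LID suggests a locally simple neighbourhood; how the paper actually combines the two is not stated in the abstract.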
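For the Video2Shop abstract, the video branch encodes per-frame region features with an LSTM and compares the result with shop-image features. A minimal sketch of that flow, under loud assumptions: a single-layer LSTM with random weights, shop-image features already projected to the LSTM hidden size, and plain cosine similarity standing in for AsymNet's reconfigurable tree-structured similarity network, whose details the abstract does not give. The names `lstm_encode` and `cosine_scores` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_encode(frames, W, U, b):
    """Run a single-layer LSTM over per-frame features and return the final hidden state.

    frames: (T, D), one feature vector per frame (e.g. from a detected clothing region).
    W: (4H, D), U: (4H, H), b: (4H,), stacked as [input, forget, output, candidate].
    """
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in frames:
        z = W @ x + U @ h + b
        i = sigmoid(z[:H])              # input gate
        f = sigmoid(z[H:2 * H])         # forget gate
        o = sigmoid(z[2 * H:3 * H])     # output gate
        g = np.tanh(z[3 * H:])          # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def cosine_scores(video_vec, shop_feats):
    """Cosine similarity between one video embedding and each row of shop_feats."""
    v = video_vec / np.linalg.norm(video_vec)
    s = shop_feats / np.linalg.norm(shop_feats, axis=1, keepdims=True)
    return s @ v

# Usage with random, untrained weights (illustration only).
rng = np.random.default_rng(0)
T, D, H = 8, 16, 12
frames = rng.normal(size=(T, D))
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
video_vec = lstm_encode(frames, W, U, b)
shop = rng.normal(size=(5, H))
scores = cosine_scores(video_vec, shop)
```

A learned similarity network can model interactions that a fixed cosine score cannot; the cosine here only shows where such a network would plug in.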
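The VariGANs abstract describes its coarse stage as variational inference. VAE-style variational inference typically relies on the reparameterization trick and a KL penalty toward a Gaussian prior. The sketch below shows only these two generic pieces, assuming a diagonal Gaussian posterior and a standard normal prior; it is not VariGANs' exact objective, which the abstract does not give, and the adversarial refinement stage is not shown. Function names are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); z stays differentiable in mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over the last axis."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
```

In a coarse-to-fine design, a decoder would map such a latent sample (together with the conditioning input) to the low-resolution image, and the KL term would regularize the latent code during training.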