• Locating actions in long untrimmed videos has been a challenging problem in video content analysis. The performances of existing action localization approaches remain unsatisfactory in precisely determining the beginning and the end of an action. Imitating the human perception procedure with observations and refinements, we propose a novel three-phase action localization framework. Our framework is embedded with an Actionness Network to generate initial proposals through frame-wise similarity grouping, and then a Refinement Network to conduct boundary adjustment on these proposals. Finally, the refined proposals are sent to a Localization Network for further fine-grained location regression. The whole process can be deemed as multi-stage refinement using a novel non-local pyramid feature under various temporal granularities. We evaluate our framework on THUMOS14 benchmark and obtain a significant improvement over the state-of-the-arts approaches. Specifically, the performance gain is remarkable under precise localization with high IoU thresholds. Our proposed framework achieves mAP@IoU=0.5 of 34.2%.
  • As wireless networks evolve towards high mobility and providing better support for connected vehicles, a number of new challenges arise due to the resulting high dynamics in vehicular environments and thus motive rethinking of traditional wireless design methodologies. Future intelligent vehicles, which are at the heart of high mobility networks, are increasingly equipped with multiple advanced onboard sensors and keep generating large volumes of data. Machine learning, as an effective approach to artificial intelligence, can provide a rich set of tools to exploit such data for the benefit of the networks. In this article, we first identify the distinctive characteristics of high mobility vehicular networks and motivate the use of machine learning to address the resulting challenges. After a brief introduction of the major concepts of machine learning, we discuss its applications to learn the dynamics of vehicular networks and make informed decisions to optimize network performance. In particular, we discuss in greater detail the application of reinforcement learning in managing network resources as an alternative to the prevalent optimization approach. Finally, some open issues worth further investigation are highlighted.
  • This paper introduces a novel rotation-based framework for arbitrary-oriented text detection in natural scene images. We present the Rotation Region Proposal Networks (RRPN), which are designed to generate inclined proposals with text orientation angle information. The angle information is then adapted for bounding box regression to make the proposals more accurately fit into the text region in terms of the orientation. The Rotation Region-of-Interest (RRoI) pooling layer is proposed to project arbitrary-oriented proposals to a feature map for a text region classifier. The whole framework is built upon a region-proposal-based architecture, which ensures the computational efficiency of the arbitrary-oriented text detection compared with previous text detection systems. We conduct experiments using the rotation-based framework on three real-world scene text detection datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
  • The emerging vehicular networks are expected to make everyday vehicular operation safer, greener, and more efficient, and pave the path to autonomous driving in the advent of the fifth generation (5G) cellular system. Machine learning, as a major branch of artificial intelligence, has been recently applied to wireless networks to provide a data-driven approach to solve traditionally challenging problems. In this article, we review recent advances in applying machine learning in vehicular networks and attempt to bring more attention to this emerging area. After a brief overview of the major concept of machine learning, we present some application examples of machine learning in solving problems arising in vehicular networks. We finally discuss and highlight several open issues that warrant further research.
  • Scene classification is a fundamental problem to understand the high-resolution remote sensing imagery. Recently, convolutional neural network (ConvNet) has achieved remarkable performance in different tasks, and significant efforts have been made to develop various representations for satellite image scene classification. In this paper, we present a novel representation based on a deeper ConvNet with context aggregation. The proposed two-pathway ResNet (ResNet-TP) architecture adopts the ResNet [1] as backbone, and the two pathways allow the network to model both local details and regional context. The ResNet-TP based representation is generated by global average pooling on the last convolutional layers from both pathways. Experiments on two scene classification datasets, UCM Land Use and NWPU-RESISC45, show that the proposed mechanism achieves promising improvements over state-of-the-art methods.
  • This article presents our initial results in deep learning for channel estimation and signal detection in orthogonal frequency-division multiplexing (OFDM). OFDM has been widely adopted in wireless broadband communications to combat frequency-selective fading in wireless channels. In this article, we take advantage of deep learning in handling wireless OFDM channels in an end-to-end approach. Different from existing OFDM receivers that first estimate CSI explicitly and then detect/recover the transmitted symbols with the estimated CSI, our deep learning based approach estimates CSI implicitly and recovers the transmitted symbols directly. To address channel distortion, a deep learning model is first trained offline using the data generated from the simulation based on the channel statistics and then used for recovering the online transmitted data directly. From our simulation results, the deep learning based approach has the ability to address channel distortions and detect the transmitted symbols with performance comparable to minimum mean-square error (MMSE) estimator. Furthermore, the deep learning based approach is more robust than conventional methods when fewer training pilots are used, the cyclic prefix (CP) is omitted, and nonlinear clipping noise is presented. In summary, deep learning is a promising tool for channel estimation and signal detection in wireless communications with complicated channel distortions and interferences.
  • Factorization Machines (FMs) are a supervised learning approach that enhances the linear regression model by incorporating the second-order feature interactions. Despite effectiveness, FM can be hindered by its modelling of all feature interactions with the same weight, as not all feature interactions are equally useful and predictive. For example, the interactions with useless features may even introduce noises and adversely degrade the performance. In this work, we improve FM by discriminating the importance of different feature interactions. We propose a novel model named Attentional Factorization Machine (AFM), which learns the importance of each feature interaction from data via a neural attention network. Extensive experiments on two real-world datasets demonstrate the effectiveness of AFM. Empirically, it is shown on regression task AFM betters FM with a $8.6\%$ relative improvement, and consistently outperforms the state-of-the-art deep learning methods Wide&Deep and DeepCross with a much simpler structure and fewer model parameters. Our implementation of AFM is publicly available at: https://github.com/hexiangnan/attentional_factorization_machine
  • We perform fast vehicle detection from traffic surveillance cameras. A novel deep learning framework, namely Evolving Boxes, is developed that proposes and refines the object boxes under different feature representations. Specifically, our framework is embedded with a light-weight proposal network to generate initial anchor boxes as well as to early discard unlikely regions; a fine-turning network produces detailed features for these candidate boxes. We show intriguingly that by applying different feature fusion techniques, the initial boxes can be refined for both localization and recognition. We evaluate our network on the recent DETRAC benchmark and obtain a significant improvement over the state-of-the-art Faster RCNN by 9.5% mAP. Further, our network achieves 9-13 FPS detection speed on a moderate commercial GPU.
  • Social instant messaging services are emerging as a transformative form with which people connect, communicate with friends in their daily life - they catalyze the formation of social groups, and they bring people stronger sense of community and connection. However, research community still knows little about the formation and evolution of groups in the context of social messaging - their lifecycles, the change in their underlying structures over time, and the diffusion processes by which they develop new members. In this paper, we analyze the daily usage logs from WeChat group messaging platform - the largest standalone messaging communication service in China - with the goal of understanding the processes by which social messaging groups come together, grow new members, and evolve over time. Specifically, we discover a strong dichotomy among groups in terms of their lifecycle, and develop a separability model by taking into account a broad range of group-level features, showing that long-term and short-term groups are inherently distinct. We also found that the lifecycle of messaging groups is largely dependent on their social roles and functions in users' daily social experiences and specific purposes. Given the strong separability between the long-term and short-term groups, we further address the problem concerning the early prediction of successful communities. In addition to modeling the growth and evolution from group-level perspective, we investigate the individual-level attributes of group members and study the diffusion process by which groups gain new members. By considering members' historical engagement behavior as well as the local social network structure that they embedded in, we develop a membership cascade model and demonstrate the effectiveness by achieving AUC of 95.31% in predicting inviter, and an AUC of 98.66% in predicting invitee.
  • This paper studies deep network architectures to address the problem of video classification. A multi-stream framework is proposed to fully utilize the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion and audio clues respectively. Long Short Term Memory networks are then adopted to explore long-term temporal dynamics. With the outputs of the individual streams, we propose a simple and effective fusion method to generate the final predictions, where the optimal fusion weights are learned adaptively for each class, and the learning process is regularized by automatically estimated class relationships. Our contributions are two-fold. First, the proposed multi-stream framework is able to exploit multimodal features that are more comprehensive than those previously attempted. Second, we demonstrate that the adaptive fusion method using the class relationship as a regularizer outperforms traditional alternatives that estimate the weights in a "free" fashion. Our framework produces significantly better results than the state of the arts on two popular benchmarks, 92.2\% on UCF-101 (without using audio) and 84.9\% on Columbia Consumer Videos.
  • Videos contain very rich semantic information. Traditional hand-crafted features are known to be inadequate in analyzing complex video semantics. Inspired by the huge success of the deep learning methods in analyzing image, audio and text data, significant efforts are recently being devoted to the design of deep nets for video analytics. Among the many practical needs, classifying videos (or video clips) based on their major semantic categories (e.g., "skiing") is useful in many applications. In this paper, we conduct an in-depth study to investigate important implementation options that may affect the performance of deep nets on video classification. Our evaluations are conducted on top of a recent two-stream convolutional neural network (CNN) pipeline, which uses both static frames and motion optical flows, and has demonstrated competitive performance against the state-of-the-art methods. In order to gain insights and to arrive at a practical guideline, many important options are studied, including network architectures, model fusion, learning parameters and the final prediction methods. Based on the evaluations, very competitive results are attained on two popular video classification benchmarks. We hope that the discussions and conclusions from this work can help researchers in related fields to quickly set up a good basis for further investigations along this very promising direction.
  • Classifying videos according to content semantics is an important problem with a wide range of applications. In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos. Specifically, the spatial and the short-term motion features are extracted separately by two Convolutional Neural Networks (CNN). These two types of CNN-based features are then combined in a regularized feature fusion network for classification, which is able to learn and utilize feature relationships for improved performance. In addition, Long Short Term Memory (LSTM) networks are applied on top of the two features to further model longer-term temporal clues. The main contribution of this work is the hybrid learning framework that can model several important aspects of the video data. We also show that (1) combining the spatial and the short-term motion features in the regularized fusion network is better than direct classification and fusion using the CNN with a softmax layer, and (2) the sequence-based LSTM is highly complementary to the traditional classification strategy without considering the temporal frame orders. Extensive experiments are conducted on two popular and challenging benchmarks, the UCF-101 Human Actions and the Columbia Consumer Videos (CCV). On both benchmarks, our framework achieves to-date the best reported performance: $91.3\%$ on the UCF-101 and $83.5\%$ on the CCV.