• Adaptive optimization algorithms, such as Adam and RMSprop, have shown better optimization performance than stochastic gradient descent (SGD) in some scenarios. However, recent studies show that they often lead to worse generalization performance than SGD, especially for training deep neural networks (DNNs). In this work, we identify the reasons that Adam generalizes worse than SGD, and develop a variant of Adam to eliminate the generalization gap. The proposed method, normalized direction-preserving Adam (ND-Adam), enables more precise control of the direction and step size for updating weight vectors, leading to significantly improved generalization performance. Following a similar rationale, we further improve the generalization performance in classification tasks by regularizing the softmax logits. By bridging the gap between SGD and Adam, we also hope to shed light on why certain optimization algorithms generalize better than others.
  • Human motion prediction aims at generating future frames of human motion based on an observed sequence of skeletons. Recent methods employ the latest hidden states of a recurrent neural network (RNN) to encode the historical skeletons, which can only address short-term prediction. In this work, we propose motion context modeling, which summarizes the historical human motion with respect to the current prediction. A modified highway unit (MHU) is proposed for efficiently eliminating motionless joints and estimating the next pose given the motion context. Furthermore, we enhance the motion dynamics by minimizing a Gram matrix loss for long-term motion prediction. Experimental results show that the proposed model forecasts future human movements well and outperforms related state-of-the-art approaches. Moreover, specifying the motion context with activity labels enables our model to perform human motion transfer.
  • Nowadays, billions of videos are online ready to be viewed and shared. Among this enormous volume of videos, some popular ones are widely viewed by online users while the majority attract little attention. Furthermore, within each video, different segments may attract significantly different numbers of views. This phenomenon leads to a challenging yet important problem, namely fine-grained video attractiveness prediction. However, one major obstacle for such a challenging problem is that no suitable benchmark dataset currently exists. To this end, we construct the first fine-grained video attractiveness dataset (FVAD), which is collected from one of the most popular video websites in the world. In total, the constructed FVAD consists of 1,019 drama episodes with 780.6 hours covering different categories and a wide variety of video contents. Apart from the large number of videos, hundreds of millions of user behaviors recorded while watching videos are also included, such as "view counts", "fast-forward", "fast-rewind", and so on, where "view counts" reflects the video attractiveness while other engagements capture the interactions between the viewers and videos. First, we demonstrate that video attractiveness and different engagements present different relationships. Second, FVAD provides us with an opportunity to study the fine-grained video attractiveness prediction problem. We design different sequential models to perform video attractiveness prediction by relying solely on video contents. The sequential models exploit the multimodal relationships between visual and audio components of the video contents at different levels. Experimental results demonstrate the effectiveness of our proposed sequential models with different visual and audio representations, the necessity of incorporating the two modalities, and the complementary behaviors of the sequential prediction models at different levels.
  • Recently, caption generation with an encoder-decoder framework has been extensively studied and applied in different domains, such as image captioning, code captioning, and so on. In this paper, we propose a novel architecture, namely Auto-Reconstructor Network (ARNet), which, coupled with the conventional encoder-decoder framework, works in an end-to-end fashion to generate captions. ARNet aims at reconstructing the previous hidden state with the present one, besides behaving as the input-dependent transition operator. Therefore, ARNet encourages the current hidden state to embed more information from the previous one, which can help regularize the transition dynamics of recurrent neural networks (RNNs). Extensive experimental results show that our proposed ARNet boosts the performance over the existing encoder-decoder models on both image captioning and source code captioning tasks. Additionally, ARNet remarkably reduces the discrepancy between training and inference processes for caption generation. Furthermore, the performance on permuted sequential MNIST demonstrates that ARNet can effectively regularize RNNs, especially in modeling long-term dependencies. Our code is available at: https://github.com/chenxinpeng/ARNet
  • Recently, much progress has been made in image captioning, and the encoder-decoder framework has achieved outstanding performance on this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called the guiding network. The guiding network models the attribute properties of input images, and its output is leveraged to compose the input of the decoder at each time step. The guiding network can be plugged into the current encoder-decoder framework and trained in an end-to-end manner. Hence, the guiding vector can be adaptively learned according to the signal from the decoder, allowing it to embed information from both image and language. Additionally, discriminative supervision can be employed to further improve the quality of guidance. The advantages of our proposed approach are verified by experiments carried out on the MS COCO dataset.
  • Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges on this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in the previous works, resulting in the same captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to balance the contributions from the current event and its surrounding contexts dynamically. We empirically show that our attentively fused event representation is superior to the proposal hidden states or video contents alone. By coupling proposal and captioning modules into one unified framework, our model outperforms the state of the art on the ActivityNet Captions dataset with a relative gain of over 100% (METEOR score increases from 4.82 to 9.65).
  • In this paper, we propose an efficient algorithm to directly restore a clear image from a hazy input. The proposed algorithm hinges on an end-to-end trainable neural network that consists of an encoder and a decoder. The encoder is exploited to capture the context of the derived input images, while the decoder is employed to estimate the contribution of each input to the final dehazed result using the learned representations attributed to the encoder. The constructed network adopts a novel fusion-based strategy which derives three inputs from an original hazy image by applying White Balance (WB), Contrast Enhancing (CE), and Gamma Correction (GC). We compute pixel-wise confidence maps based on the appearance differences between these different inputs to blend the information of the derived inputs and preserve the regions with pleasant visibility. The final dehazed image is yielded by gating the important features of the derived inputs. To train the network, we introduce a multi-scale approach such that the halo artifacts can be avoided. Extensive experimental results on both synthetic and real-world images demonstrate that the proposed algorithm performs favorably against the state-of-the-art algorithms.
  • In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost the encoder-decoder models and lead to significant gains in video caption accuracy.
  • Taking a photo outside, can we predict the immediate future, e.g., how would the cloud move in the sky? We address this problem by presenting a generative adversarial network (GAN) based two-stage approach to generating realistic time-lapse videos of high resolution. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos of realistic contents for each frame. The second stage refines the generated video from the first stage by forcing it to be closer to real videos with regard to motion dynamics. To further encourage vivid motion in the final generated video, the Gram matrix is employed to model the motion more precisely. We build a large-scale time-lapse dataset, and test our approach on this new dataset. Using our model, we are able to generate realistic videos of up to $128\times 128$ resolution for 32 frames. Quantitative and qualitative experimental results demonstrate the superiority of our model over the state-of-the-art models.
  • Camera shake or target movement often leads to undesired blur effects in videos captured by a hand-held camera. Despite significant efforts being devoted to video-deblur research, two major challenges remain: 1) how to model the spatio-temporal characteristics across both the spatial domain (i.e., image plane) and temporal domain (i.e., neighboring frames), and 2) how to restore sharp image details w.r.t. the conventionally adopted metric of pixel-wise error. In this paper, to address the first challenge, we propose a DeBLuRring Network (DBLRNet) for spatial-temporal learning by applying a modified 3D convolution to both spatial and temporal domains. Our DBLRNet is able to jointly capture spatial and temporal information encoded in neighboring frames, which directly contributes to improved video deblur performance. To tackle the second challenge, we use the developed DBLRNet as a generator in the GAN (generative adversarial network) architecture, and employ a content loss in addition to an adversarial loss for efficient adversarial training. The developed network, which we name DeBLuRring Generative Adversarial Network (DBLRGAN), is tested on two standard benchmarks and achieves state-of-the-art performance.
  • Neural style transfer is an emerging technique which is able to endow daily-life images with attractive artistic styles. Previous work has succeeded in applying convolutional neural networks (CNNs) to style transfer for monocular images or videos. However, style transfer for stereoscopic images is still a missing piece. Unlike in processing a monocular image, the two views of a stylized stereoscopic pair are required to be consistent to provide the observer with a comfortable visual experience. In this paper, we propose a dual path network for view-consistent style transfer on stereoscopic images. While each view of the stereoscopic pair is processed in an individual path, a novel feature aggregation strategy is proposed to effectively share information between the two paths. Besides a traditional perceptual loss used for controlling style transfer quality in each view, a multi-layer view loss is proposed to force the network to coordinate the learning of both paths to generate view-consistent stylized results. Extensive experiments show that, compared with previous methods, the proposed model can generate stylized stereoscopic images which achieve the best view consistency.
  • Cloud containers represent a new, light-weight alternative to virtual machines in cloud computing. A user job may be described by a container graph that specifies the resource profile of each container and container dependence relations. This work is the first in the cloud computing literature that designs efficient market mechanisms for container based cloud jobs. Our design targets simultaneously incentive compatibility, computational efficiency, and economic efficiency. It further adapts the idea of batch online optimization into the paradigm of mechanism design, leveraging agile creation of cloud containers and exploiting delay tolerance of elastic cloud jobs. The new and classic techniques we employ include: (i) compact exponential optimization for expressing and handling non-traditional constraints that arise from container dependence and job deadlines; (ii) the primal-dual schema for designing efficient approximation algorithms for social welfare maximization; and (iii) posted price mechanisms for batch decision making and truthful payment design. Theoretical analysis and trace-driven empirical evaluation verify the efficacy of our container auction algorithms.
  • Manifold learning algorithms for dimensionality reduction have recently attracted growing interest, and various linear and nonlinear, global and local algorithms have been proposed. The key step of a manifold learning algorithm is the selection of the neighboring region. However, to our knowledge, few works in the literature propose a generally accepted algorithm for selecting the neighboring region well. In this paper, we propose an adaptive neighboring selection algorithm and apply it successfully to the LLE and ISOMAP algorithms in our tests. The algorithm finds the optimal K nearest neighbors of the data points on the manifold. Its theoretical basis is the approximated curvature of the manifold at each data point: based on Riemannian geometry, the Jacobian matrix is a suitable mathematical tool for estimating this curvature. We verify the proposed algorithm by embedding the Swiss roll from $\mathbb{R}^3$ to $\mathbb{R}^2$ with LLE and ISOMAP. The simulation results show that the adaptive neighboring selection algorithm is feasible and finds the optimal value of K, yielding a relatively small residual variance and better visualization of the results. In a quantitative analysis, the embedding quality measured by residual variance improves by 45.45% after applying the proposed algorithm to LLE.
  • With recent developments in mobile computing devices and the ubiquitous deployment of access points (APs) in wireless local area networks (WLANs), WLAN-based indoor localization systems (WILSs) are attracting mounting attention and becoming increasingly prevalent, because they require no additional infrastructure. Because the localization approaches used in satellite-based global positioning systems are difficult to apply in indoor environments, fingerprint-based localization algorithms (FLAs) predominate among received signal strength (RSS) based schemes. However, the performance of FLAs depends closely on the number of APs and the number of reference points (RPs) in a WILS, especially when APs and RPs are deployed redundantly. Two critical problems arise, the curse of dimensionality (CoD) and asymmetric matching (AM), caused by the increasing number of APs and by APs breaking down during the online stage. In this paper, a semi-supervised RSS dimensionality reduction algorithm is proposed to solve both problems at the same time, and its theoretical basis is analyzed in detail. Another significant contribution of this paper is combining the fingerprint-based algorithm with the CM-SDE algorithm to improve indoor localization accuracy.
  • A rechargeable lithium metal battery (LMB), which uses metallic lithium at the anode, is among the most promising technologies for next generation electrochemical energy storage devices due to its high energy density, particularly when Li is paired with energetic conversion cathodes such as sulfur, oxygen/air, and carbon dioxide. Practical LMBs in any of these designs remain elusive due to multiple stubborn problems, including parasitic reactions of Li metal with liquid electrolytes, unstable/dendritic electrodeposition at the anode during cell recharge, and chemical reaction of dissolved cathode conversion products with the Li anode. The solid electrolyte interface (SEI) formed between lithium metal and liquid electrolytes plays a critical role in all of these processes. We report on the chemistry and interfacial properties of artificial SEI films created by in-situ reaction of a strong Lewis Acid AlI3 additive, Li metal, and aprotic liquid electrolytes. We find that these SEI films impart exceptional interfacial stability to a Li metal anode. We further show that the improvements come from at least three processes: (i) in-situ formation of Li-Al alloy, (ii) formation of a LiI salt layer on Li, and (iii) creation of a stable polymer thin film on the lithium metal anode.
  • The Langmuir-Blodgett (LB) technique is a powerful, widely used method for preparing coatings of amphiphilic molecules at air/water interfaces with thickness control down to a single molecule. Here we report two new LB techniques designed to create ordered, multifunctional nanoparticle films on any non-reactive support. The methods utilize Marangoni stresses produced by surfactants at a fluid/solid/gas interface and self-assembly of nanoparticles to facilitate rapid creation of dense monolayers of multi-wall carbon nanotubes (MWCNT), metal-oxide nanoparticles, polymers, and combinations of these materials in a layer-by-layer configuration. Using the polyolefin separator in a lithium sulfur (Li-S) electrochemical cell as an example, we illustrate how the method can be used to create structured membranes for regulating mass and charge transport. We show that a layered MWCNT/SiO2/MWCNT nanomaterial created in a clip-like configuration, with gravimetric areal coverage of ~130 mg cm⁻² and a thickness of ~3 µm, efficiently adsorbs dissolved lithium polysulfide (LiPS) species and efficiently reutilizes them to improve Li-S battery performance.
  • In this paper, we propose to employ a convolutional neural network (CNN) for image question answering (QA). Our proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations, but also their inter-modal interactions to produce the answer. More specifically, our model consists of three CNNs: one image CNN to encode the image content, one sentence CNN to compose the words of the question, and one multimodal convolution layer to learn their joint representation for the classification in the space of candidate answer words. We demonstrate the efficacy of our proposed model on the DAQUAR and COCO-QA datasets, two benchmark datasets for image QA, where it significantly outperforms the state of the art.
  • In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content, and one matching CNN learning the joint representation of image and sentence. The matching CNN composes words into different semantic fragments and learns the inter-modal relations between the image and the composed fragments at different levels, thus fully exploiting the matching relations between image and sentence. Experimental results on benchmark databases of bidirectional image and sentence retrieval demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching. Specifically, our proposed m-CNNs for bidirectional image and sentence retrieval on the Flickr30K and Microsoft COCO databases achieve state-of-the-art performance.
  • Randomness and regularities in Finance are usually treated in probabilistic terms. In this paper, we develop a completely different approach, using a non-probabilistic framework based on the algorithmic information theory initially developed by Kolmogorov (1965). We present some elements of this theory and show why it is particularly relevant to Finance, and potentially to other sub-fields of Economics as well. We develop a generic method to estimate the Kolmogorov complexity of numeric series. This approach is based on an iterative "regularity erasing procedure" implemented so that lossless compression algorithms can be used on financial data. Examples are provided with both simulated and real-world financial time series. The contributions of this article are twofold. The first is methodological: we show that some structural regularities, invisible to classical statistical tests, can be detected by this algorithmic method. The second consists of illustrations on the daily Dow-Jones Index, which suggest that, beyond several well-known regularities, hidden structure may remain to be identified in this index.
  • As two-dimensional fluid shells, lipid bilayer membranes resist bending and stretching but are unable to sustain shear stresses. This property gives membranes the ability to undergo dramatic shape changes. In this paper, a finite element model is developed to study static equilibrium mechanics of membranes. In particular, a viscous regularization method is proposed to stabilize tangential mesh deformations and improve the convergence rate of nonlinear solvers. The Augmented Lagrangian method is used to enforce global constraints on area and volume during membrane deformations. As a validation of the method, equilibrium shapes for a shape-phase diagram of lipid bilayer vesicles are calculated. These numerical techniques are also shown to be useful for simulations of three-dimensional large-deformation problems: the formation of tethers (long tube-like extensions) and Ginzburg-Landau phase separation of a two-lipid-component vesicle. To deal with the large mesh distortions of the two-phase model, a modification of the viscous regularization is explored to achieve r-adaptive mesh optimization.
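
The idea of estimating the Kolmogorov complexity of a numeric series with a lossless compressor, mentioned in the abstract on algorithmic information theory in Finance above, can be illustrated with a minimal sketch. This is not the authors' iterative "regularity erasing procedure"; it only assumes that the series is quantized to bytes and that the compressed size from Python's standard `zlib` module serves as a rough upper bound on complexity. The function name `compression_ratio` and the parameter `levels` are hypothetical, chosen for illustration.

```python
import random
import zlib


def compression_ratio(values, levels=256):
    """Quantize a numeric series to bytes and return compressed size / raw size.

    A ratio near 1 means the compressor found little structure; a small ratio
    means the series contains regularities the compressor could exploit.
    `levels` must not exceed 256 so that each symbol fits in one byte.
    """
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant series
    raw = bytes(min(levels - 1, int((v - lo) / span * levels)) for v in values)
    return len(zlib.compress(raw, 9)) / len(raw)


# Assumed toy data: uniform pseudo-random noise versus a strictly periodic series.
rng = random.Random(0)
noise = [rng.random() for _ in range(4096)]
periodic = [float(i % 16) for i in range(4096)]

ratio_noise = compression_ratio(noise)
ratio_periodic = compression_ratio(periodic)
print(f"noise: {ratio_noise:.3f}  periodic: {ratio_periodic:.3f}")
```

The periodic series compresses to a small fraction of its raw size, while the noise barely compresses at all. A general-purpose compressor only bounds the true Kolmogorov complexity from above, so a ratio near 1 does not prove that a series is free of structure.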