• ### A Survey on QoE-oriented Wireless Resources Scheduling(1705.07839)

March 6, 2019 cs.NI, cs.MM
Future wireless systems are expected to provide a wide range of services to more and more users. Advanced scheduling strategies thus arise not only to perform efficient radio resource management, but also to provide fairness among the users. On the other hand, the users' perceived quality, i.e., Quality of Experience (QoE), is becoming one of the main drivers within the schedulers design. In this context, this paper starts by providing a comprehension of what is QoE and an overview of the evolution of wireless scheduling techniques. Afterwards, a survey on the most recent QoE-based scheduling strategies for wireless systems is presented, highlighting the application/service of the different approaches reported in the literature, as well as the parameters that were taken into account for QoE optimization. Therefore, this paper aims at helping readers interested in learning the basic concepts of QoE-oriented wireless resources scheduling, as well as getting in touch with its current research frontier.
• ### Towards Automatic Construction of Diverse, High-quality Image Dataset(1708.06495)

March 1, 2019 cs.CV, cs.MM
The availability of labeled image datasets has been shown critical for high-level image understanding, which continuously drives the progress of feature designing and models developing. However, constructing labeled image datasets is laborious and monotonous. To eliminate manual annotation, in this work, we propose a novel image dataset construction framework by employing multiple textual queries. We aim at collecting diverse and accurate images for given queries from the Web. Specifically, we formulate noisy textual queries removing and noisy images filtering as a multi-view and multi-instance learning problem separately. Our proposed approach not only improves the accuracy but also enhances the diversity of the selected images. To verify the effectiveness of our proposed approach, we construct an image dataset with 100 categories. The experiments show significant performance gains by using the generated data of our approach on several tasks, such as image classification, cross-dataset generalization, and object detection. The proposed method also consistently outperforms existing weakly supervised and web-supervised approaches.
• ### Narrative Smoothing: Dynamic Conversational Network for the Analysis of TV Series Plots(1602.07811)

Dec. 29, 2018 cs.SI, cs.MM
Modern popular TV series often develop complex storylines spanning several seasons, but are usually watched in quite a discontinuous way. As a result, the viewer generally needs a comprehensive summary of the previous season plot before the new one starts. The generation of such summaries requires first to identify and characterize the dynamics of the series subplots. One way of doing so is to study the underlying social network of interactions between the characters involved in the narrative. The standard tools used in the Social Networks Analysis field to extract such a network rely on an integration of time, either over the whole considered period, or as a sequence of several time-slices. However, they turn out to be inappropriate in the case of TV series, due to the fact the scenes showed onscreen alternatively focus on parallel storylines, and do not necessarily respect a traditional chronology. This makes existing extraction methods inefficient to describe the dynamics of relationships between characters, or to get a relevant instantaneous view of the current social state in the plot. This is especially true for characters shown as interacting with each other at some previous point in the plot but temporarily neglected by the narrative. In this article, we introduce narrative smoothing, a novel, still exploratory, network extraction method. It smooths the relationship dynamics based on the plot properties, aiming at solving some of the limitations present in the standard approaches. In order to assess our method, we apply it to a new corpus of 3 popular TV series, and compare it to both standard approaches. Our results are promising, showing narrative smoothing leads to more relevant observations when it comes to the characterization of the protagonists and their relationships. It could be used as a basis for further modeling the intertwined storylines constituting TV series plots.
• ### SSGAN: Secure Steganography Based on Generative Adversarial Networks(1707.01613)

Nov. 24, 2018 cs.CV, cs.MM
In this paper, a novel strategy of Secure Steganograpy based on Generative Adversarial Networks is proposed to generate suitable and secure covers for steganography. The proposed architecture has one generative network, and two discriminative networks. The generative network mainly evaluates the visual quality of the generated images for steganography, and the discriminative networks are utilized to assess their suitableness for information hiding. Different from the existing work which adopts Deep Convolutional Generative Adversarial Networks, we utilize another form of generative adversarial networks. By using this new form of generative adversarial networks, significant improvements are made on the convergence speed, the training stability and the image quality. Furthermore, a sophisticated steganalysis network is reconstructed for the discriminative network, and the network can better evaluate the performance of the generated images. Numerous experiments are conducted on the publicly available datasets to demonstrate the effectiveness and robustness of the proposed method.
• ### Hidden Two-Stream Convolutional Networks for Action Recognition(1704.00389)

Oct. 30, 2018 cs.CV, cs.LG, cs.MM
Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architecture that implicitly captures motion information between adjacent frames. We name our approach hidden two-stream CNNs because it only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Our end-to-end approach is 10x faster than its two-stage baseline. Experimental results on four challenging action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show that our approach significantly outperforms the previous best real-time approaches.
• ### Steerable Discrete Cosine Transform(1610.09152)

Oct. 23, 2018 cs.IT, math.IT, math.OC, cs.MM
In image compression, classical block-based separable transforms tend to be inefficient when image blocks contain arbitrarily shaped discontinuities. For this reason, transforms incorporating directional information are an appealing alternative. In this paper, we propose a new approach to this problem, namely a discrete cosine transform (DCT) that can be steered in any chosen direction. Such transform, called steerable DCT (SDCT), allows to rotate in a flexible way pairs of basis vectors, and enables precise matching of directionality in each image block, achieving improved coding efficiency. The optimal rotation angles for SDCT can be represented as solution of a suitable rate-distortion (RD) problem. We propose iterative methods to search such solution, and we develop a fully fledged image encoder to practically compare our techniques with other competing transforms. Analytical and numerical results prove that SDCT outperforms both DCT and state-of-the-art directional transforms.
• ### Adaptive Blind Image Watermarking Using Fuzzy Inference System Based on Human Visual Perception(1709.06536)

Oct. 17, 2018 cs.MM
Development of digital content has increased the necessity of copyright protection by means of watermarking. Imperceptibility and robustness are two important features of watermarking algorithms. The goal of watermarking methods is to satisfy the tradeoff between these two contradicting characteristics. Recently watermarking methods in transform domains have displayed favorable results. In this paper, we present an adaptive blind watermarking method which has high transparency in areas that are important to human visual system. We propose a fuzzy system for adaptive control of the embedding strength factor. Features such as saliency, intensity, and edge-concentration, are used as fuzzy attributes. Redundant embedding in discrete cosine transform (DCT) of wavelet domain has increased the robustness of our method. Experimental results show the efficiency of the proposed method and better results are obtained as compared to comparable methods with same size of watermark logo.
• ### DNA Steganalysis Using Deep Recurrent Neural Networks(1704.08443)

Oct. 5, 2018 cs.LG, cs.MM
Recent advances in next-generation sequencing technologies have facilitated the use of deoxyribonucleic acid (DNA) as a novel covert channels in steganography. There are various methods that exist in other domains to detect hidden messages in conventional covert channels. However, they have not been applied to DNA steganography. The current most common detection approaches, namely frequency analysis-based methods, often overlook important signals when directly applied to DNA steganography because those methods depend on the distribution of the number of sequence characters. To address this limitation, we propose a general sequence learning-based DNA steganalysis framework. The proposed approach learns the intrinsic distribution of coding and non-coding sequences and detects hidden messages by exploiting distribution variations after hiding these messages. Using deep recurrent neural networks (RNNs), our framework identifies the distribution variations by using the classification score to predict whether a sequence is to be a coding or non-coding sequence. We compare our proposed method to various existing methods and biological sequence analysis methods implemented on top of our framework. According to our experimental results, our approach delivers a robust detection performance compared to other tools.
• ### Causally Regularized Learning with Agnostic Data Selection Bias(1708.06656)

Aug. 19, 2018 cs.CV, stat.ML, cs.MM
Most of previous machine learning algorithms are proposed based on the i.i.d. hypothesis. However, this ideal assumption is often violated in real applications, where selection bias may arise between training and testing process. Moreover, in many scenarios, the testing data is not even available during the training process, which makes the traditional methods like transfer learning infeasible due to their need on prior of test distribution. Therefore, how to address the agnostic selection bias for robust model learning is of paramount importance for both academic research and real applications. In this paper, under the assumption that causal relationships among variables are robust across domains, we incorporate causal technique into predictive modeling and propose a novel Causally Regularized Logistic Regression (CRLR) algorithm by jointly optimize global confounder balancing and weighted logistic regression. Global confounder balancing helps to identify causal features, whose causal effect on outcome are stable across domains, then performing logistic regression on those causal features constructs a robust predictive model against the agnostic bias. To validate the effectiveness of our CRLR algorithm, we conduct comprehensive experiments on both synthetic and real world datasets. Experimental results clearly demonstrate that our CRLR algorithm outperforms the state-of-the-art methods, and the interpretability of our method can be fully depicted by the feature visualization.
• ### Creating A Multi-track Classical Musical Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications(1612.08727)

Aug. 8, 2018 cs.MM, cs.SD
We introduce a dataset for facilitating audio-visual analysis of music performances. The dataset comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, we provide the musical score in MIDI format, the audio recordings of the individual tracks, the audio and video recording of the assembled mixture, and ground-truth annotation files including frame-level and note-level transcriptions. We describe our methodology for the creation of the dataset, particularly highlighting our approaches for addressing the challenges involved in maintaining synchronization and expressiveness. We demonstrate the high quality of synchronization achieved with our proposed approach by comparing the dataset with existing widely-used music audio datasets. We anticipate that the dataset will be useful for the development and evaluation of existing music information retrieval (MIR) tasks, as well as for novel multi-modal tasks. We benchmark two existing MIR tasks (multi-pitch analysis and score-informed source separation) on the dataset and compare with other existing music audio datasets. Additionally, we consider two novel multi-modal MIR tasks (visually informed multi-pitch analysis and polyphonic vibrato analysis) enabled by the dataset and provide evaluation measures and baseline systems for future comparisons (from our recent work). Finally, we propose several emerging research directions that the dataset enables.
• ### Enhancing Quality for HEVC Compressed Videos(1709.06734)

July 6, 2018 cs.MM
The latest High Efficiency Video Coding (HEVC) standard has been increasingly applied to generate video streams over the Internet. However, HEVC compressed videos may incur severe quality degradation, particularly at low bit-rates. Thus, it is necessary to enhance the visual quality of HEVC videos at the decoder side. To this end, this paper proposes a Quality Enhancement Convolutional Neural Network (QE-CNN) method that does not require any modification of the encoder to achieve quality enhancement for HEVC. In particular, our QE-CNN method learns QE-CNN-I and QE-CNN-P models to reduce the distortion of HEVC I and P frames, respectively. The proposed method differs from the existing CNN-based quality enhancement approaches, which only handle intra-coding distortion and are thus not suitable for P frames. Our experimental results validate that our QE-CNN method is effective in enhancing quality for both I and P frames of HEVC videos. To apply our QE-CNN method in time-constrained scenarios, we further propose a Time-constrained Quality Enhancement Optimization (TQEO) scheme. Our TQEO scheme controls the computational time of QE-CNN to meet a target, meanwhile maximizing the quality enhancement. Next, the experimental results demonstrate the effectiveness of our TQEO scheme from the aspects of time control accuracy and quality enhancement under different time constraints. Finally, we design a prototype to implement our TQEO scheme in a real-time scenario.
• ### Multi-layer architecture for efficient steganalysis of Undermp3cover in multi-encoder scenario(1710.01230)

July 3, 2018 cs.CR, cs.MM
Mp3 is a very popular audio format and hence it can be a good host for carrying hidden messages. Therefore, different steganography methods have been proposed for mp3 hosts. But, current literature has only focused on steganalysis of mp3stego. In this paper we mention some of the limitations of mp3stego and argue that UnderMp3Cover (Ump3c) does not have those limitations. Ump3c makes subtle changes only to the global gain of bitstream and keeps the rest of bitstream intact. Therefore, its detection is much harder than mp3stego. To address this, joint distributions between global gain and other fields of mp3 bit stream are used. The changes are detected by measuring the mutual information from those joint distributions. Furthermore, we show that different mp3 encoders have dissimilar performances. Consequently, a novel multi-layer architecture for steganalysis of Ump3c is proposed. In this manner, the first layer detects the encoder and the second layer performs the steganalysis job. One of advantages of this architecture is that feature extraction and feature selection can be optimized for each encoder separately. We show this multi-layer architecture outperforms the conventional single-layer methods. Comparing results of the proposed method with other works shows an improvement of 20.4% in the accuracy of steganalysis.
• ### Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text(1707.01606)

June 29, 2018 cs.AI, cs.CV, cs.LG, cs.MM
Real world multimedia data is often composed of multiple modalities such as an image or a video with associated text (e.g. captions, user comments, etc.) and metadata. Such multimodal data packages are prone to manipulations, where a subset of these modalities can be altered to misrepresent or repurpose data packages, with possible malicious intent. It is, therefore, important to develop methods to assess or verify the integrity of these multimedia packages. Using computer vision and natural language processing methods to directly compare the image (or video) and the associated caption to verify the integrity of a media package is only possible for a limited set of objects and scenes. In this paper, we present a novel deep learning-based approach for assessing the semantic integrity of multimedia packages containing images and captions, using a reference set of multimedia packages. We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset. We present the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media packages from Flickr, which we make available to the research community. We use both the newly created dataset as well as Flickr30K and MS COCO datasets to quantitatively evaluate our proposed approach. The reference dataset does not contain unmanipulated versions of tampered query packages. Our method is able to achieve F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO, respectively, for detecting semantically incoherent media packages.
• ### Video Streaming in Distributed Erasure-coded Storage Systems: Stall Duration Analysis(1703.08348)

June 26, 2018 cs.IT, math.IT, cs.DC, cs.NI, cs.MM
The demand for global video has been burgeoning across industries. With the expansion and improvement of video-streaming services, cloud-based video is evolving into a necessary feature of any successful business for reaching internal and external audiences. This paper considers video streaming over distributed systems where the video segments are encoded using an erasure code for better reliability thus being the first work to our best knowledge that considers video streaming over erasure-coded distributed cloud systems. The download time of each coded chunk of each video segment is characterized and ordered statistics over the choice of the erasure-coded chunks is used to obtain the playback time of different video segments. Using the playback times, bounds on the moment generating function on the stall duration is used to bound the mean stall duration. Moment generating function based bounds on the ordered statistics are also used to bound the stall duration tail probability which determines the probability that the stall time is greater than a pre-defined number. These two metrics, mean stall duration and the stall duration tail probability, are important quality of experience (QoE) measures for the end users. Based on these metrics, we formulate an optimization problem to jointly minimize the convex combination of both the QoE metrics averaged over all requests over the placement and access of the video content. The non-convex problem is solved using an efficient iterative algorithm. Numerical results show significant improvement in QoE metrics for cloud-based video as compared to the considered baselines.
• ### ISTA-Net: Interpretable Optimization-Inspired Deep Network for Image Compressive Sensing(1706.07929)

June 18, 2018 cs.CV, cs.MM
With the aim of developing a fast yet accurate algorithm for compressive sensing (CS) reconstruction of natural images, we combine in this paper the merits of two existing categories of CS methods: the structure insights of traditional optimization-based methods and the speed of recent network-based ones. Specifically, we propose a novel structured deep network, dubbed ISTA-Net, which is inspired by the Iterative Shrinkage-Thresholding Algorithm (ISTA) for optimizing a general $\ell_1$ norm CS reconstruction model. To cast ISTA into deep network form, we develop an effective strategy to solve the proximal mapping associated with the sparsity-inducing regularizer using nonlinear transforms. All the parameters in ISTA-Net (\eg nonlinear transforms, shrinkage thresholds, step sizes, etc.) are learned end-to-end, rather than being hand-crafted. Moreover, considering that the residuals of natural images are more compressible, an enhanced version of ISTA-Net in the residual domain, dubbed {ISTA-Net}$^+$, is derived to further improve CS reconstruction. Extensive CS experiments demonstrate that the proposed ISTA-Nets outperform existing state-of-the-art optimization-based and network-based CS methods by large margins, while maintaining fast computational speed. Our source codes are available: \textsl{http://jianzhang.tech/projects/ISTA-Net}.
• ### Weighted Unsupervised Learning for 3D Object Detection(1602.05920)

June 4, 2018 cs.CV, cs.LG, cs.RO, cs.MM, cs.GR
This paper introduces a novel weighted unsupervised learning for object detection using an RGB-D camera. This technique is feasible for detecting the moving objects in the noisy environments that are captured by an RGB-D camera. The main contribution of this paper is a real-time algorithm for detecting each object using weighted clustering as a separate cluster. In a preprocessing step, the algorithm calculates the pose 3D position X, Y, Z and RGB color of each data point and then it calculates each data point's normal vector using the point's neighbor. After preprocessing, our algorithm calculates k-weights for each data point; each weight indicates membership. Resulting in clustered objects of the scene.
• ### Digital Cardan Grille: A Modern Approach for Information Hiding(1803.09219)

May 11, 2018 cs.CR, cs.MM
In this paper, a new framework for construction of Cardan grille for information hiding is proposed. Based on the semantic image inpainting technique, the stego image are driven by secret messages directly. A mask called Digital Cardan Grille (DCG) for determining the hidden location is introduced to hide the message. The message is written to the corrupted region that needs to be filled in the corrupted image in advance. Then the corrupted image with secret message is feeded into a Generative Adversarial Network (GAN) for semantic completion. The adversarial game not only reconstruct the corrupted image , but also generate a stego image which contains the logic rationality of image content. The experimental results verify the feasibility of the proposed method.
• ### Enhancing HEVC Compressed Videos with a Partition-masked Convolutional Neural Network(1805.03894)

May 10, 2018 cs.MM
In this paper, we propose a partition-masked Convolution Neural Network (CNN) to achieve compressed-video enhancement for the state-of-the-art coding standard, High Efficiency Video Coding (HECV). More precisely, our method utilizes the partition information produced by the encoder to guide the quality enhancement process. In contrast to existing CNN-based approaches, which only take the decoded frame as the input to the CNN, the proposed approach considers the coding unit (CU) size information and combines it with the distorted decoded frame such that the degradation introduced by HEVC is reduced more efficiently. Experimental results show that our approach leads to over 9.76% BD-rate saving on benchmark sequences, which achieves the state-of-the-art performance.
• ### CloudAR: A Cloud-based Framework for Mobile Augmented Reality(1805.03060)

May 8, 2018 cs.MM
Computation capabilities of recent mobile devices enable natural feature processing for Augmented Reality (AR). However, mobile AR applications are still faced with scalability and performance challenges. In this paper, we propose CloudAR, a mobile AR framework utilizing the advantages of cloud and edge computing through recognition task offloading. We explore the design space of cloud-based AR exhaustively and optimize the offloading pipeline to minimize the time and energy consumption. We design an innovative tracking system for mobile devices which provides lightweight tracking in 6 degree of freedom (6DoF) and hides the offloading latency from users' perception. We also design a multi-object image retrieval pipeline that executes fast and accurate image recognition tasks on servers. In our evaluations, the mobile AR application built with the CloudAR framework runs at 30 frames per second (FPS) on average with precise tracking of only 1~2 pixel errors and image recognition of at least 97% accuracy. Our results also show that CloudAR outperforms one of the leading commercial AR framework in several performance metrics.
• ### Optimization of Occlusion-Inducing Depth Pixels in 3-D Video Coding(1805.03105)

May 8, 2018 cs.MM
The optimization of occlusion-inducing depth pixels in depth map coding has received little attention in the literature, since their associated texture pixels are occluded in the synthesized view and their effect on the synthesized view is considered negligible. However, the occlusion-inducing depth pixels still need to consume the bits to be transmitted, and will induce geometry distortion that inherently exists in the synthesized view. In this paper, we propose an efficient depth map coding scheme specifically for the occlusion-inducing depth pixels by using allowable depth distortions. Firstly, we formulate a problem of minimizing the overall geometry distortion in the occlusion subject to the bit rate constraint, for which the depth distortion is properly adjusted within the set of allowable depth distortions that introduce the same disparity error as the initial depth distortion. Then, we propose a dynamic programming solution to find the optimal depth distortion vector for the occlusion. The proposed algorithm can improve the coding efficiency without alteration of the occlusion order. Simulation results confirm the performance improvement compared to other existing algorithms.
• ### Low Rank Tensor Completion for Multiway Visual Data(1805.03967)

May 8, 2018 cs.NA, cs.MM
Tensor completion recovers missing entries of multiway data. Teh missing of entries could often be caused during teh data acquisition and transformation. In dis paper, we provide an overview of recent development in low rank tensor completion for estimating teh missing components of visual data, e. g. , color images and videos. First, we categorize these methods into two groups based on teh different optimization models. One optimizes factors of tensor decompositions wif predefined tensor rank. Teh other iteratively updates teh estimated tensor via minimizing teh tensor rank. Besides, we summarize teh corresponding algorithms to solve those optimization problems in details. Numerical experiments are given to demonstrate teh performance comparison when different methods are applied to color image and video processing.
• ### Multimodal Machine Translation with Reinforcement Learning(1805.02356)

May 7, 2018 cs.AI, cs.MA, cs.CL, cs.IR, cs.MM
Multimodal machine translation is one of the applications that integrates computer vision and language processing. It is a unique task given that in the field of machine translation, many state-of-the-arts algorithms still only employ textual information. In this work, we explore the effectiveness of reinforcement learning in multimodal machine translation. We present a novel algorithm based on the Advantage Actor-Critic (A2C) algorithm that specifically cater to the multimodal machine translation task of the EMNLP 2018 Third Conference on Machine Translation (WMT18). We experiment our proposed algorithm on the Multi30K multilingual English-German image description dataset and the Flickr30K image entity dataset. Our model takes two channels of inputs, image and text, uses translation evaluation metrics as training rewards, and achieves better results than supervised learning MLE baseline models. Furthermore, we discuss the prospects and limitations of using reinforcement learning for machine translation. Our experiment results suggest a promising reinforcement learning solution to the general task of multimodal sequence to sequence learning.
• ### Competitive Video Retrieval with vitrivr at the Video Browser Showdown 2018 - Final Notes(1805.02371)

May 7, 2018 cs.MM
This paper presents an after-the-fact summary of the participation of the vitrivr system to the 2018 Video Browser Showdown. A particular focus is on additions made since the original publication and the systems performance during the competition.
• ### QARC: Video Quality Aware Rate Control for Real-Time Video Streaming based on Deep Reinforcement Learning(1805.02482)

May 7, 2018 cs.MM
Real-time video streaming is now one of the main applications in all network environments. Due to the fluctuation of throughput under various network conditions, how to choose a proper bitrate adaptively has become an upcoming and interestingly issue. To tackle this problem, most adaptive bitrate control methods have been proposed to provide high video bitrates instead of video qualities. Nevertheless, we notice that there exists a trade-off between sending bitrate and video quality, which motivates us to focus on how to get a balance between them. In this paper, we propose QARC (video Quality Awareness Rate Control), a rate control algorithm that aims to have a higher perceptual video quality with possibly lower sending rate and transmission latency. Starting from scratch, QARC uses deep reinforcement learning(DRL) algorithm to train a neural network to select future bitrates based on previously observed network status and past video frames. To overcome the "state explosion problem", we design a neural network to predict future perceptual video quality as a vector for taking the place of the raw picture in the DRL's inputs. We evaluate QARC over a trace-driven emulation, outperforming existing approach with improvements in average video quality of 18\% - 25\% and decreases in average latency with 23% -45%. Meanwhile, Comparing QARC with offline optimal high bitrate method on various network conditions also yields a solid result.
• ### Learning Optical Flow via Dilated Networks and Occlusion Reasoning(1805.02733)

May 7, 2018 cs.CV, cs.MM
Despite the significant progress that has been made on estimating optical flow recently, most estimation methods, including classical and deep learning approaches, still have difficulty with multi-scale estimation, real-time computation, and/or occlusion reasoning. In this paper, we introduce dilated convolution and occlusion reasoning into unsupervised optical flow estimation to address these issues. The dilated convolution allows our network to avoid upsampling via deconvolution and the resulting gridding artifacts. Dilated convolution also results in a smaller memory footprint which speeds up interference. The occlusion reasoning prevents our network from learning incorrect deformations due to occluded image regions during training. Our proposed method outperforms state-of-the-art unsupervised approaches on the KITTI benchmark. We also demonstrate its generalization capability by applying it to action recognition in video.