• ### meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting(1706.06197)

March 11, 2019 cs.AI, cs.CV, cs.CL, cs.LG
We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we can update only 1-4% of the weights at each back propagation pass. This does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The code is available at https://github.com/lancopku/meProp
• ### Towards Easier and Faster Sequence Labeling for Natural Language Processing: A Search-based Probabilistic Online Learning Framework (SAPO)(1503.08381)

Nov. 19, 2018 cs.AI, cs.LG
There are two major approaches for sequence labeling. One is the probabilistic gradient-based methods such as conditional random fields (CRF) and neural networks (e.g., RNN), which have high accuracy but drawbacks: slow training, and no support of search-based optimization (which is important in many cases). The other is the search-based learning methods such as structured perceptron and margin infused relaxed algorithm (MIRA), which have fast training but also drawbacks: low accuracy, no probabilistic information, and non-convergence in real-world tasks. We propose a novel and "easy" solution, a search-based probabilistic online learning method, to address most of those issues. The method is "easy", because the optimization algorithm at the training stage is as simple as the decoding algorithm at the test stage. This method searches the output candidates, derives probabilities, and conducts efficient online learning. We show that this method with fast training and theoretical guarantee of convergence, which is easy to implement, can support search-based optimization and obtain top accuracy. Experiments on well-known tasks show that our method has better accuracy than CRF and BiLSTM\footnote{The SAPO code is released at \url{https://github.com/lancopku/SAPO}.}.
• ### Regularizing Output Distribution of Abstractive Chinese Social Media Text Summarization for Improved Semantic Consistency(1805.04033)

May 10, 2018 cs.CL
Abstractive text summarization is a highly difficult problem, and the sequence-to-sequence model has shown success in improving the performance on the task. However, the generated summaries are often inconsistent with the source content in semantics. In such cases, when generating summaries, the model selects semantically unrelated words with respect to the source content as the most probable output. The problem can be attributed to heuristically constructed training data, where summaries can be unrelated to the source content, thus containing semantically unrelated words and spurious word correspondence. In this paper, we propose a regularization approach for the sequence-to-sequence model and make use of what the model has learned to regularize the learning objective to alleviate the effect of the problem. In addition, we propose a practical human evaluation method to address the problem that the existing automatic evaluation method does not evaluate the semantic consistency with the source content properly. Experimental results demonstrate the effectiveness of the proposed approach, which outperforms almost all the existing models. Especially, the proposed approach improves the semantic consistency by 4\% in terms of human evaluation.
• ### A Hierarchical End-to-End Model for Jointly Improving Text Summarization and Sentiment Classification(1805.01089)

May 3, 2018 cs.CL, cs.LG
Text summarization and sentiment classification both aim to capture the main ideas of the text but at different levels. Text summarization is to describe the text within a few sentences, while sentiment classification can be regarded as a special type of summarization which "summarizes" the text into a even more abstract fashion, i.e., a sentiment class. Based on this idea, we propose a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as the further "summarization" of the text summarization output. Hence, the sentiment classification layer is put upon the text summarization layer, and a hierarchical structure is derived. Experimental results on Amazon online reviews datasets show that our model achieves better performance than the strong baseline systems on both abstractive summarization and sentiment classification.
• ### Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation(1803.01465)

March 30, 2018 cs.CL, cs.LG
Most recent approaches use the sequence-to-sequence model for paraphrase generation. The existing sequence-to-sequence model tends to memorize the words and the patterns in the training dataset instead of learning the meaning of the words. Therefore, the generated sentences are often grammatically correct but semantically improper. In this work, we introduce a novel model based on the encoder-decoder framework, called Word Embedding Attention Network (WEAN). Our proposed model generates the words by querying distributed word representations (i.e. neural word embeddings), hoping to capturing the meaning of the according words. Following previous work, we evaluate our model on two paraphrase-oriented tasks, namely text simplification and short text abstractive summarization. Experimental results show that our model outperforms the sequence-to-sequence baseline by the BLEU score of 6.3 and 5.5 on two English text simplification datasets, and the ROUGE-2 F1 score of 5.7 on a Chinese summarization dataset. Moreover, our model achieves state-of-the-art performances on these three benchmark datasets.
• ### Structure Regularized Neural Network for Entity Relation Classification for Chinese Literature Text(1803.05662)

March 15, 2018 cs.CL
Relation classification is an important semantic processing task in the field of natural language processing. In this paper, we propose the task of relation classification for Chinese literature text. A new dataset of Chinese literature text is constructed to facilitate the study in this task. We present a novel model, named Structure Regularized Bidirectional Recurrent Convolutional Neural Network (SR-BRCNN), to identify the relation between entities. The proposed model learns relation representations along the shortest dependency path (SDP) extracted from the structure regularized dependency tree, which has the benefits of reducing the complexity of the whole model. Experimental results show that the proposed method significantly improves the F1 score by 10.3, and outperforms the state-of-the-art approaches on Chinese literature text.
• ### DP-GAN: Diversity-Promoting Generative Adversarial Network for Generating Informative and Diversified Text(1802.01345)

Feb. 6, 2018 cs.CL
Existing text generation methods tend to produce repeated and "boring" expressions. To tackle this problem, we propose a new text generation model, called Diversity-Promoting Generative Adversarial Network (DP-GAN). The proposed model assigns low reward for repeated text and high reward for "novel" text, encouraging the generator to produce diverse and informative text. Moreover, we propose a novel language-model based discriminator, which can better distinguish novel text from repeated text without the saturation problem compared with existing classifier-based discriminators. The experimental results on review generation and dialogue generation tasks demonstrate that our method can generate substantially more diverse and informative text than existing baseline methods. The code is available at https://github.com/lancopku/DPGAN
• ### Hybrid Oracle: Making Use of Ambiguity in Transition-based Chinese Dependency Parsing(1711.10163)

Feb. 6, 2018 cs.CL
In the training of transition-based dependency parsers, an oracle is used to predict a transition sequence for a sentence and its gold tree. However, the transition system may exhibit ambiguity, that is, there can be multiple correct transition sequences that form the gold tree. We propose to make use of the property in the training of neural dependency parsers, and present the Hybrid Oracle. The new oracle gives all the correct transitions for a parsing state, which are used in the cross entropy loss function to provide better supervisory signal. It is also used to generate different transition sequences for a sentence to better explore the training data and improve the generalization ability of the parser. Evaluations show that the parsers trained using the hybrid oracle outperform the parsers using the traditional oracle in Chinese dependency parsing. We provide analysis from a linguistic view. The code is available at https://github.com/lancopku/nndep .
• ### Building an Ellipsis-aware Chinese Dependency Treebank for Web Text(1801.06613)

Jan. 23, 2018 cs.CL
Web 2.0 has brought with it numerous user-produced data revealing one's thoughts, experiences, and knowledge, which are a great source for many tasks, such as information extraction, and knowledge base construction. However, the colloquial nature of the texts poses new challenges for current natural language processing techniques, which are more adapt to the formal form of the language. Ellipsis is a common linguistic phenomenon that some words are left out as they are understood from the context, especially in oral utterance, hindering the improvement of dependency parsing, which is of great importance for tasks relied on the meaning of the sentence. In order to promote research in this area, we are releasing a Chinese dependency treebank of 319 weibos, containing 572 sentences with omissions restored and contexts reserved.
• ### Minimal Effort Back Propagation for Convolutional Neural Networks(1709.05804)

Sept. 18, 2017 cs.NE, cs.LG, stat.ML
As traditional neural network consumes a significant amount of computing resources during back propagation, \citet{Sun2017mePropSB} propose a simple yet effective technique to alleviate this problem. In this technique, only a small subset of the full gradients are computed to update the model parameters. In this paper we extend this technique into the Convolutional Neural Network(CNN) to reduce calculation in back propagation, and the surprising results verify its validity in CNN: only 5\% of the gradients are passed back but the model still achieves the same effect as the traditional CNN, or even better. We also show that the top-$k$ selection of gradients leads to a sparse calculation in back propagation, which may bring significant computational benefits for high computational complexity of convolution operation in CNN.