Arxiv Sanity Preserver

Web Annotation as a First Class Object (1310.6555)

Paolo Ciccarese, Stian Soiland-Reyes, Tim Clark

March 20, 2019 cs.IR, cs.DL

Scholars have made handwritten notes and comments in books and manuscripts for centuries. Today's blogs and news sites typically invite users to express their opinions on the published content; URLs allow web resources to be shared with accompanying annotations and comments using third-party services like Twitter or Facebook. These contributions have until recently been constrained within specific services, making them second-class citizens of the Web. Web Annotations are now emerging as fully independent Linked Data in their own right, no longer restricted to plain textual comments in application silos. Annotations can now range from bookmarks and comments, to fine-grained annotations of a selection of, for example, a section of a frame within a video stream. Technologies and standards now exist to create, publish, syndicate, mash-up and consume, finely targeted, semantically rich digital annotations on practically any content, as first-class Web citizens. This development is being driven by the need for collaboration and annotation reuse amongst domain researchers, computer scientists, scientific publishers, and scholarly content databases.
To Index or Not to Index: Optimizing Exact Maximum Inner Product Search (1706.01449)

Firas Abuzaid, Geet Sethi, Peter Bailis, Matei Zaharia

March 15, 2019 cs.PF, cs.DB, cs.DS, cs.IR

Exact Maximum Inner Product Search (MIPS) is an important task that is widely pertinent to recommender systems and high-dimensional similarity search. The brute-force approach to solving exact MIPS is computationally expensive, thus spurring recent development of novel indexes and pruning techniques for this task. In this paper, we show that a hardware-efficient brute-force approach, blocked matrix multiply (BMM), can outperform the state-of-the-art MIPS solvers by over an order of magnitude, for some -- but not all -- inputs. In this paper, we also present a novel MIPS solution, MAXIMUS, that takes advantage of hardware efficiency and pruning of the search space. Like BMM, MAXIMUS is faster than other solvers by up to an order of magnitude, but again only for some inputs. Since no single solution offers the best runtime performance for all inputs, we introduce a new data-dependent optimizer, OPTIMUS, that selects online with minimal overhead the best MIPS solver for a given input. Together, OPTIMUS and MAXIMUS outperform state-of-the-art MIPS solvers by 3.2 $\times$ on average, and up to 10.9 $\times$ , on widely studied MIPS datasets.
A User's Guide to CARSKit (1511.03780)

Yong Zheng

Feb. 18, 2019 cs.IR

Context-aware recommender systems extend traditional recommenders by adapting their suggestions to users' contextual situations. CARSKit is a Java-based open-source library specifically designed for the context-aware recommendation, where the state-of-the-art context-aware recommendation algorithms have been implemented. This report provides the basic user's guide to CARSKit, including how to prepare the data set, how to configure the experimental settings, and how to evaluate the algorithms, as well as interpreting the outputs. The instructions in this guide are applicable for CARSKit v0.3.5 and above.
Spectral Ranking (0912.0238)

Sebastiano Vigna

Feb. 8, 2019 physics.soc-ph, cs.SI, cs.IR

We sketch the history of spectral ranking, a general umbrella name for techniques that apply the theory of linear maps (in particular, eigenvalues and eigenvectors) to matrices that do not represent geometric transformations, but rather some kind of relationship between entities. Albeit recently made famous by the ample press coverage of Google's PageRank algorithm, spectral ranking was devised more than a century ago, and has been studied in tournament ranking, psychology, social sciences, bibliometrics, economy and choice theory. We describe the contribution given by previous scholars in precise and modern mathematical terms: along the way, we show how to express in a general way damped rankings, such as Katz's index, as dominant eigenvectors of perturbed matrices, and then use results on the Drazin inverse to go back to the dominant eigenvectors by a limit process. The result suggests a regularized definition of spectral ranking that yields for a general matrix a unique vector depending on a boundary condition.
Local Search Methods for Fast Near Neighbor Search (1705.10351)

Eric S. Tellez, Guillermo Ruiz, Edgar Chavez, Mario Graff

Jan. 17, 2019 cs.DS, cs.IR

Near neighbor search is a powerful abstraction for data access; it allows searching for objects similar to a query. Search indexes are data structures designed to accelerate computing-intensive data processing, like those routinely found in clustering and classification tasks. However, for intrinsically high-dimensional data, competitive indexes tend to have either impractical index construction times or memory usage. A recent turn around in the literature has been introduced with the use of the approximate proximity graph (APG): a connected graph with a greedy search algorithm with restarts, needing sublinear time to solve queries. The APG computes an approximation of the result set using a small memory footprint, i.e., proportional to the underlying graph's degree. The degree along with the number of search repeats determine the speed and accuracy of the algorithm. This manuscript introduces three new algorithms based on local-search metaheuristics for the search graph. Two of these algorithms are direct improvements of the original one, yet we reduce the number of free parameters of the algorithm; the third one is an entirely new method that improves both the search speed and the accuracy of the result in most of our benchmarks. We also provide a broad experimental study to characterize our search structures and prove our claims; we also report an extensive performance comparison with the current alternatives.
Extractive Summarization using Deep Learning (1708.04439)

Sukriti Verma, Vagisha Nidhi

Jan. 9, 2019 cs.CL, cs.LG, cs.IR

This paper proposes a text summarization approach for factual reports using a deep learning model. This approach consists of three phases: feature extraction, feature enhancement, and summary generation, which work together to assimilate core information and generate a coherent, understandable summary. We are exploring various features to improve the set of sentences selected for the summary, and are using a Restricted Boltzmann Machine to enhance and abstract those features to improve resultant accuracy without losing any important information. The sentences are scored based on those enhanced features and an extractive summary is constructed. Experimentation carried out on several articles demonstrates the effectiveness of the proposed approach. Source code available at: https://github.com/vagisha-nidhi/TextSummarizer
Combining Privileged Information to Improve Context-Aware Recommender Systems (1511.02290)

Camila V. Sundermann, Marcos A. Domingues, Ricardo M. Marcacini, Solange O. Rezende

Jan. 4, 2019 cs.AI, cs.IR

A recommender system is an information filtering technology which can be used to predict preference ratings of items (products, services, movies, etc) and/or to output a ranking of items that are likely to be of interest to the user. Context-aware recommender systems (CARS) learn and predict the tastes and preferences of users by incorporating available contextual information in the recommendation process. One of the major challenges in context-aware recommender systems research is the lack of automatic methods to obtain contextual information for these systems. Considering this scenario, in this paper, we propose to use contextual information from topic hierarchies of the items (web pages) to improve the performance of context-aware recommender systems. The topic hierarchies are constructed by an extension of the LUPI-based Incremental Hierarchical Clustering method that considers three types of information: traditional bag-of-words (technical information), and the combination of named entities (privileged information I) with domain terms (privileged information II). We evaluated the contextual information in four context-aware recommender systems. Different weights were assigned to each type of information. The empirical results demonstrated that topic hierarchies with the combination of the two kinds of privileged information can provide better recommendations.
Narrative Smoothing: Dynamic Conversational Network for the Analysis of TV Series Plots (1602.07811)

Xavier Bost, Georges Linarès

Jan. 30, 2020 cs.SI, cs.IR, cs.MM

Modern popular TV series often develop complex storylines spanning several seasons, but are usually watched in quite a discontinuous way. As a result, the viewer generally needs a comprehensive summary of the previous season plot before the new one starts. The generation of such summaries requires first to identify and characterize the dynamics of the series subplots. One way of doing so is to study the underlying social network of interactions between the characters involved in the narrative. The standard tools used in the Social Networks Analysis field to extract such a network rely on an integration of time, either over the whole considered period, or as a sequence of several time-slices. However, they turn out to be inappropriate in the case of TV series, due to the fact the scenes showed onscreen alternatively focus on parallel storylines, and do not necessarily respect a traditional chronology. This makes existing extraction methods inefficient to describe the dynamics of relationships between characters, or to get a relevant instantaneous view of the current social state in the plot. This is especially true for characters shown as interacting with each other at some previous point in the plot but temporarily neglected by the narrative. In this article, we introduce narrative smoothing, a novel, still exploratory, network extraction method. It smooths the relationship dynamics based on the plot properties, aiming at solving some of the limitations present in the standard approaches. In order to assess our method, we apply it to a new corpus of 3 popular TV series, and compare it to both standard approaches. Our results are promising, showing narrative smoothing leads to more relevant observations when it comes to the characterization of the protagonists and their relationships. It could be used as a basis for further modeling the intertwined storylines constituting TV series plots.
Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features (1703.02507)

Matteo Pagliardini, Prakhar Gupta, Martin Jaggi

Dec. 28, 2018 cs.AI, cs.CL, cs.IR

The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question if similar methods could be derived to improve embeddings (i.e. semantic representations) of word sequences as well. We present a simple but efficient unsupervised objective to train distributed representations of sentences. Our method outperforms the state-of-the-art unsupervised models on most benchmark tasks, highlighting the robustness of the produced general-purpose sentence embeddings.
Patent Retrieval: A Literature Review (1701.00324)

Walid Shalaby, Wlodek Zadrozny

Dec. 20, 2018 cs.IR

With the ever increasing number of filed patent applications every year, the need for effective and efficient systems for managing such tremendous amounts of data becomes inevitably important. Patent Retrieval (PR) is considered the pillar of almost all patent analysis tasks. PR is a subfield of Information Retrieval (IR) which is concerned with developing techniques and methods that effectively and efficiently retrieve relevant patent documents in response to a given search request. In this paper we present a comprehensive review on PR methods and approaches. It is clear that, recent successes and maturity in IR applications such as Web search cannot be transferred directly to PR without deliberate domain adaptation and customization. Furthermore, state-of-the-art performance in automatic PR is still around average in terms of recall. These observations motivate the need for interactive search tools which provide cognitive assistance to patent professionals with minimal effort. These tools must also be developed in hand with patent professionals considering their practices and expectations. We additionally touch on related tasks to PR such as patent valuation, litigation, licensing, and highlight potential opportunities and open directions for computational scientists in these domains.
Neural Architecture for Question Answering Using a Knowledge Graph and Web Corpus (1706.00973)

Uma Sawant, Saurabh Garg, Soumen Chakrabarti, Ganesh Ramakrishnan

Dec. 6, 2018 cs.IR

In Web search, entity-seeking queries often trigger a special Question Answering (QA) system. It may use a parser to interpret the question to a structured query, execute that on a knowledge graph (KG), and return direct entity responses. QA systems based on precise parsing tend to be brittle: minor syntax variations may dramatically change the response. Moreover, KG coverage is patchy. At the other extreme, a large corpus may provide broader coverage, but in an unstructured, unreliable form. We present AQQUCN, a QA system that gracefully combines KG and corpus evidence. AQQUCN accepts a broad spectrum of query syntax, between well-formed questions to short `telegraphic' keyword sequences. In the face of inherent query ambiguities, AQQUCN aggregates signals from KGs and large corpora to directly rank KG entities, rather than commit to one semantic interpretation of the query. AQQUCN models the ideal interpretation as an unobservable or latent variable. Interpretations and candidate entity responses are scored as pairs, by combining signals from multiple convolutional networks that operate collectively on the query, KG and corpus. On four public query workloads, amounting to over 8,000 queries with diverse query syntax, we see 5--16% absolute improvement in mean average precision (MAP), compared to the entity ranking performance of recent systems. Our system is also competitive at entity set retrieval, almost doubling F1 scores for challenging short queries.
Integrating Reviews into Personalized Ranking for Cold Start Recommendation (1701.08888)

Guang-Neng Hu, Xin-Yu Dai

Dec. 5, 2018 cs.AI, cs.CL, cs.IR

Item recommendation task predicts a personalized ranking over a set of items for each individual user. One paradigm is the rating-based methods that concentrate on explicit feedbacks and hence face the difficulties in collecting them. Meanwhile, the ranking-based methods are presented with rated items and then rank the rated above the unrated. This paradigm takes advantage of widely available implicit feedback. It, however, usually ignores a kind of important information: item reviews. Item reviews not only justify the preferences of users, but also help alleviate the cold-start problem that fails the collaborative filtering. In this paper, we propose two novel and simple models to integrate item reviews into Bayesian personalized ranking. In each model, we make use of text features extracted from item reviews using word embeddings. On top of text features we uncover the review dimensions that explain the variation in users' feedback and these review factors represent a prior preference of users. Experiments on six real-world data sets show the benefits of leveraging item reviews on ranking prediction. We also conduct analyses to understand the proposed models.
Personalized Video Recommendation Using Rich Contents from Videos (1612.06935)

Xingzhong Du, Hongzhi Yin, Ling Chen, Yang Wang, Yi Yang, Xiaofang Zhou

Dec. 5, 2018 cs.LG, cs.IR

Video recommendation has become an essential way of helping people explore the massive videos and discover the ones that may be of interest to them. In the existing video recommender systems, the models make the recommendations based on the user-video interactions and single specific content features. When the specific content features are unavailable, the performance of the existing models will seriously deteriorate. Inspired by the fact that rich contents (e.g., text, audio, motion, and so on) exist in videos, in this paper, we explore how to use these rich contents to overcome the limitations caused by the unavailability of the specific ones. Specifically, we propose a novel general framework that incorporates arbitrary single content feature with user-video interactions, named as collaborative embedding regression (CER) model, to make effective video recommendation in both in-matrix and out-of-matrix scenarios. Our extensive experiments on two real-world large-scale datasets show that CER beats the existing recommender models with any single content feature and is more time efficient. In addition, we propose a priority-based late fusion (PRI) method to gain the benefit brought by the integrating the multiple content features. The corresponding experiment shows that PRI brings real performance improvement to the baseline and outperforms the existing fusion methods.
MV-RNN: A Multi-View Recurrent Neural Network for Sequential Recommendation (1611.06668)

Qiang Cui, Shu Wu, Qiang Liu, Wen Zhong, Liang Wang

Nov. 20, 2018 cs.IR

Sequential recommendation is a fundamental task for network applications, and it usually suffers from the item cold start problem due to the insufficiency of user feedbacks. There are currently three kinds of popular approaches which are respectively based on matrix factorization (MF) of collaborative filtering, Markov chain (MC), and recurrent neural network (RNN). Although widely used, they have some limitations. MF based methods could not capture dynamic user's interest. The strong Markov assumption greatly limits the performance of MC based methods. RNN based methods are still in the early stage of incorporating additional information. Based on these basic models, many methods with additional information only validate incorporating one modality in a separate way. In this work, to make the sequential recommendation and deal with the item cold start problem, we propose a Multi-View Recurrent Neural Network (MV-RNN}) model. Given the latent feature, MV-RNN can alleviate the item cold start problem by incorporating visual and textual information. First, At the input of MV-RNN, three different combinations of multi-view features are studied, like concatenation, fusion by addition and fusion by reconstructing the original multi-modal data. MV-RNN applies the recurrent structure to dynamically capture the user's interest. Second, we design a separate structure and a united structure on the hidden state of MV-RNN to explore a more effective way to handle multi-view features. Experiments on two real-world datasets show that MV-RNN can effectively generate the personalized ranking list, tackle the missing modalities problem and significantly alleviate the item cold start problem.
Morphology-based Entity and Relational Entity Extraction Framework for Arabic (1709.05700)

Amin Jaber, Fadi A. Zaraket

Nov. 18, 2018 cs.CL, cs.IR

Rule-based techniques to extract relational entities from documents allow users to specify desired entities with natural language questions, finite state automata, regular expressions and structured query language. They require linguistic and programming expertise and lack support for Arabic morphological analysis. We present a morphology-based entity and relational entity extraction framework for Arabic (MERF). MERF requires basic knowledge of linguistic features and regular expressions, and provides the ability to interactively specify Arabic morphological and synonymity features, tag types associated with regular expressions, and relations and code actions defined over matches of subexpressions. MERF constructs entities and relational entities from matches of the specifications. We evaluated MERF with several case studies. The results show that MERF requires shorter development time and effort compared to existing application specific techniques and produces reasonably accurate results within a reasonable overhead in run time.
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset (1611.09268)

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang

Oct. 31, 2018 cs.CL, cs.IR

We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.
A Survey of Multi-View Representation Learning (1610.01206)

Yingming Li, Ming Yang, Zhongfei Zhang

Oct. 24, 2018 cs.CV, cs.LG, cs.IR

Recently, multi-view representation learning has become a rapidly growing direction in machine learning and data mining areas. This paper introduces two categories for multi-view representation learning: multi-view representation alignment and multi-view representation fusion. Consequently, we first review the representative methods and theories of multi-view representation learning based on the perspective of alignment, such as correlation-based alignment. Representative examples are canonical correlation analysis (CCA) and its several extensions. Then from the perspective of representation fusion we investigate the advancement of multi-view representation learning that ranges from generative methods including multi-modal topic learning, multi-view sparse coding, and multi-view latent space Markov networks, to neural network-based methods including multi-modal autoencoders, multi-view convolutional neural networks, and multi-modal recurrent neural networks. Further, we also investigate several important applications of multi-view representation learning. Overall, this survey aims to provide an insightful overview of theoretical foundation and state-of-the-art developments in the field of multi-view representation learning and to help researchers find the most appropriate tools for particular applications.
Diffusion-like recommendation with enhanced similarity of objects (1511.03518)

Ya-Hui An, Qiang Dong, Chong-Jing Sun, Da-Cheng Nie, Yan Fu

Oct. 11, 2018 cs.SI, cs.IR

In last decades, diversity and accuracy have been regarded as two important measures in evaluating a recommendation model. However, a clear concern is that a model focusing excessively on one measure will put the other one at risk, thus it is not easy to greatly improve diversity and accuracy simultaneously. In this paper, we propose to enhance the Resource-Allocation (RA) similarity in resource transfer equations of diffusion-like models, by giving a tunable exponent to the RA similarity, and traversing the value of the exponent to achieve the optimal recommendation results. In this way, we can increase the recommendation scores (allocated resource) of many unpopular objects. Experiments on three benchmark data sets, MovieLens, Netflix, and RateYourMusic show that the modified models can yield remarkable performance improvement compared with the original ones.
An Information-theoretic Approach to Machine-oriented Music Summarization (1612.02350)

Francisco Raposo, David Martins de Matos, Ricardo Ribeiro

Sept. 21, 2018 cs.LG, cs.IR, cs.SD

Music summarization allows for higher efficiency in processing, storage, and sharing of datasets. Machine-oriented approaches, being agnostic to human consumption, optimize these aspects even further. Such summaries have already been successfully validated in some MIR tasks. We now generalize previous conclusions by evaluating the impact of generic summarization of music from a probabilistic perspective. We estimate Gaussian distributions for original and summarized songs and compute their relative entropy, in order to measure information loss incurred by summarization. Our results suggest that relative entropy is a good predictor of summarization performance in the context of tasks relying on a bag-of-features model. Based on this observation, we further propose a straightforward yet expressive summarizer, which minimizes relative entropy with respect to the original song, that objectively outperforms previous methods and is better suited to avoid potential copyright issues.
Basic Filters for Convolutional Neural Networks Applied to Music: Training or Design? (1709.02291)

Monika Doerfler, Thomas Grill, Roswitha Bammer, Arthur Flexer

Sept. 19, 2018 cs.LG, cs.IR

When convolutional neural networks are used to tackle learning problems based on music or, more generally, time series data, raw one-dimensional data are commonly pre-processed to obtain spectrogram or mel-spectrogram coefficients, which are then used as input to the actual neural network. In this contribution, we investigate, both theoretically and experimentally, the influence of this pre-processing step on the network's performance and pose the question, whether replacing it by applying adaptive or learned filters directly to the raw data, can improve learning success. The theoretical results show that approximately reproducing mel-spectrogram coefficients by applying adaptive filters and subsequent time-averaging is in principle possible. We also conducted extensive experimental work on the task of singing voice detection in music. The results of these experiments show that for classification based on Convolutional Neural Networks the features obtained from adaptive filter banks followed by time-averaging perform better than the canonical Fourier-transform-based mel-spectrogram coefficients. Alternative adaptive approaches with center frequencies or time-averaging lengths learned from training data perform equally well.
Deep Learning based Recommender System: A Survey and New Perspectives (1707.07435)

Shuai Zhang, Lina Yao, Aixin Sun, Yi Tay

July 10, 2019 cs.IR

With the ever-growing volume of online information, recommender systems have been an effective strategy to overcome such information overload. The utility of recommender systems cannot be overstated, given its widespread adoption in many web applications, along with its potential impact to ameliorate many problems related to over-choice. In recent years, deep learning has garnered considerable interest in many research fields such as computer vision and natural language processing, owing not only to stellar performance but also the attractive property of learning feature representations from scratch. The influence of deep learning is also pervasive, recently demonstrating its effectiveness when applied to information retrieval and recommender systems research. Evidently, the field of deep learning in recommender system is flourishing. This article aims to provide a comprehensive review of recent research efforts on deep learning based recommender systems. More concretely, we provide and devise a taxonomy of deep learning based recommendation models, along with providing a comprehensive summary of the state-of-the-art. Finally, we expand on current trends and provide new perspectives pertaining to this new exciting development of the field.
Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge (1611.10277)

Ryan J. Gallagher, Kyle Reing, David Kale, Greg Ver Steeg

Sept. 3, 2018 cs.IT, math.IT, cs.CL, cs.IR, stat.ML

While generative models such as Latent Dirichlet Allocation (LDA) have proven fruitful in topic modeling, they often require detailed assumptions and careful specification of hyperparameters. Such model complexity issues only compound when trying to generalize generative models to incorporate human input. We introduce Correlation Explanation (CorEx), an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework. This framework naturally generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions. In particular, word-level domain knowledge can be flexibly incorporated within CorEx through anchor words, allowing topic separability and representation to be promoted with minimal human intervention. Across a variety of datasets, metrics, and experiments, we demonstrate that CorEx produces topics that are comparable in quality to those produced by unsupervised and semi-supervised variants of LDA.
BLENDER: Enabling Local Search with a Hybrid Differential Privacy Model (1705.00831)

Brendan Avent, Aleksandra Korolova, David Zeber, Torgeir Hovden, Benjamin Livshits

Nov. 21, 2019 cs.CR, cs.LG, cs.CY, cs.IR

We propose a hybrid model of differential privacy that considers a combination of regular and opt-in users who desire the differential privacy guarantees of the local privacy model and the trusted curator model, respectively. We demonstrate that within this model, it is possible to design a new type of blended algorithm for the task of privately computing the head of a search log. This blended approach provides significant improvements in the utility of obtained data compared to related work while providing users with their desired privacy guarantees. Specifically, on two large search click data sets, comprising 1.75 and 16 GB respectively, our approach attains NDCG values exceeding 95% across a range of privacy budget values.
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis (1707.01780)

Jose Camacho-Collados, Mohammad Taher Pilehvar

Aug. 23, 2018 cs.CL, cs.IR

Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with potential impact in its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually-overlooked step in the pipeline, particularly when comparing different models. Finally, our evaluation provides insights into the best preprocessing practices for training word embeddings.
Neural Vector Spaces for Unsupervised Information Retrieval (1708.02702)

Christophe Van Gysel, Maarten de Rijke, Evangelos Kanoulas

Aug. 18, 2018 cs.CL, cs.IR

We propose the Neural Vector Space Model (NVSM), a method that learns representations of documents in an unsupervised manner for news article retrieval. In the NVSM paradigm, we learn low-dimensional representations of words and documents from scratch using gradient descent and rank documents according to their similarity with query representations that are composed from word representations. We show that NVSM performs better at document ranking than existing latent semantic vector space methods. The addition of NVSM to a mixture of lexical language models and a state-of-the-art baseline vector space model yields a statistically significant increase in retrieval effectiveness. Consequently, NVSM adds a complementary relevance signal. Next to semantic matching, we find that NVSM performs well in cases where lexical matching is needed. NVSM learns a notion of term specificity directly from the document collection without feature engineering. We also show that NVSM learns regularities related to Luhn significance. Finally, we give advice on how to deploy NVSM in situations where model selection (e.g., cross-validation) is infeasible. We find that an unsupervised ensemble of multiple models trained with different hyperparameter values performs better than a single cross-validated model. Therefore, NVSM can safely be used for ranking documents without supervised relevance judgments.

Web Annotation as a First Class Object (1310.6555)

To Index or Not to Index: Optimizing Exact Maximum Inner Product Search (1706.01449)

A User's Guide to CARSKit (1511.03780)

Spectral Ranking (0912.0238)

Local Search Methods for Fast Near Neighbor Search (1705.10351)

Extractive Summarization using Deep Learning (1708.04439)

Combining Privileged Information to Improve Context-Aware Recommender Systems (1511.02290)

Narrative Smoothing: Dynamic Conversational Network for the Analysis of TV Series Plots (1602.07811)

Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features (1703.02507)

Patent Retrieval: A Literature Review (1701.00324)

Neural Architecture for Question Answering Using a Knowledge Graph and Web Corpus (1706.00973)

Integrating Reviews into Personalized Ranking for Cold Start Recommendation (1701.08888)

Personalized Video Recommendation Using Rich Contents from Videos (1612.06935)

MV-RNN: A Multi-View Recurrent Neural Network for Sequential Recommendation (1611.06668)

Morphology-based Entity and Relational Entity Extraction Framework for Arabic (1709.05700)

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset (1611.09268)

A Survey of Multi-View Representation Learning (1610.01206)

Diffusion-like recommendation with enhanced similarity of objects (1511.03518)

An Information-theoretic Approach to Machine-oriented Music Summarization (1612.02350)

Basic Filters for Convolutional Neural Networks Applied to Music: Training or Design? (1709.02291)

Deep Learning based Recommender System: A Survey and New Perspectives (1707.07435)

Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge (1611.10277)

BLENDER: Enabling Local Search with a Hybrid Differential Privacy Model (1705.00831)

On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis (1707.01780)

Neural Vector Spaces for Unsupervised Information Retrieval (1708.02702)