• Scholars have made handwritten notes and comments in books and manuscripts for centuries. Today's blogs and news sites typically invite users to express their opinions on the published content; URLs allow web resources to be shared with accompanying annotations and comments using third-party services like Twitter or Facebook. These contributions have until recently been constrained within specific services, making them second-class citizens of the Web. Web Annotations are now emerging as fully independent Linked Data in their own right, no longer restricted to plain textual comments in application silos. Annotations can now range from bookmarks and comments to fine-grained annotations of a selection of, for example, a section of a frame within a video stream. Technologies and standards now exist to create, publish, syndicate, mash up and consume finely targeted, semantically rich digital annotations on practically any content, as first-class Web citizens. This development is being driven by the need for collaboration and annotation reuse amongst domain researchers, computer scientists, scientific publishers, and scholarly content databases.
  • This paper presents a data mining approach to the Short Title Catalogue Flanders (http://www.stcv.be/), which aims to record all books printed in Flanders up to 1801 (24,850 editions as of 31/08/2018). More specifically, it aims to analyse the Early Modern practice of 'quiring' gatherings in handpress book production.
  • A cross-disciplinary examination of the user behaviours involved in seeking and evaluating data is surprisingly absent from the research data discussion. This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data. Two analytical frameworks rooted in information retrieval and science and technology studies are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.
  • The Shannon-Weaver model of linear information transmission is extended with two loops potentially generating redundancies: (i) meaning is provided locally to the information from the perspective of hindsight, and (ii) meanings can be codified differently and then refer to other horizons of meaning. Thus, three layers are distinguished: variations in the communications, historical organization at each moment of time, and evolutionary self-organization of the codes of communication over time. Furthermore, the codes of communication can functionally be different and then the system is both horizontally and vertically differentiated. All these subdynamics operate in parallel and necessarily generate uncertainty. However, meaningful information can be considered as the specific selection of a signal from the noise; the codes of communication are social constructs that can generate redundancy by giving different meanings to the same information. Reflexively, one can translate among codes in more elaborate discourses. The second (instantiating) layer can be operationalized in terms of semantic maps using the vector space model; the third in terms of mutual redundancy among the latent dimensions of the vector space. Using Blaise Cronin's œuvre, the different operations of the three layers are demonstrated empirically.
  • Context: A Multivocal Literature Review (MLR) is a form of a Systematic Literature Review (SLR) which includes the grey literature (e.g., blog posts and white papers) in addition to the published (formal) literature (e.g., journal and conference papers). MLRs are useful for both researchers and practitioners since they provide summaries of both the state of the art and the state of practice in a given area. Objective: There are several guidelines to conduct SLR studies in SE. However, several phases of MLRs differ from those of traditional SLRs, for instance with respect to the search process and source quality assessment. Therefore, SLR guidelines are only partially useful for conducting MLR studies. Our goal in this paper is to present guidelines on how to conduct MLR studies in SE. Method: To develop the MLR guidelines, we benefit from three inputs: (1) existing SLR guidelines in SE, (2) a literature survey of MLR guidelines and experience papers in other fields, and (3) our own experiences in conducting several MLRs in SE. All derived guidelines are discussed in the context of three example MLRs as running examples (two from SE and one MLR from the medical sciences). Results: The resulting guidelines cover all phases of conducting and reporting MLRs in SE, from the planning phase, through conducting the review, to the final reporting of the review. In particular, we believe that incorporating and adopting a vast set of recommendations from MLR guidelines and experience papers in other fields has enabled us to propose a set of guidelines with solid foundations. Conclusion: Having been developed on the basis of three types of solid experience and evidence, the provided MLR guidelines support researchers in conducting new MLRs effectively and efficiently in any area of SE.
  • Binarization plays a key role in automatic information retrieval from document images. This process is usually performed in the first stages of document analysis systems, and serves as a basis for subsequent steps. Hence it has to be robust in order to allow the full analysis workflow to be successful. Several methods for document image binarization have been proposed so far, most of which are based on hand-crafted image processing strategies. Recently, Convolutional Neural Networks have shown remarkable performance in many disparate tasks related to computer vision. In this paper we discuss the use of convolutional auto-encoders devoted to learning an end-to-end map from an input image to its selectional output, in which activations indicate the likelihood of pixels to be either foreground or background. Once the model is trained, documents can be binarized by passing them through it and applying a threshold. This approach has proven to outperform existing binarization strategies in a number of document domains.
  • The increase in the number of researchers coupled with the ease of publishing and distribution of scientific papers (due to technological advancements) has resulted in a dramatic increase in astronomy literature. This has likely led to the predicament that the body of the literature is too large for traditional human consumption and that related and crucial knowledge is not discovered by researchers. In addition to the increased production of astronomical literature, recent decades have also brought several advancements in computational linguistics. In particular, machine-aided processing of the literature might make it possible to convert this stream of papers into a coherent knowledge set. In this paper, we present the application of computational linguistics techniques to astronomy literature. In particular, we developed a tool that finds similar articles purely based on the text content of an input paper. We find that our technique performs robustly in comparison with other tools that recommend articles given a reference paper (known as recommender systems). Our novel tool shows the power of combining computational linguistics with astronomy literature and suggests that additional research in this endeavor will likely produce even better tools that will help researchers cope with the vast amounts of knowledge being produced.
  • As more scholarly content is born digital or converted to a digital format, digital libraries are becoming increasingly vital to researchers seeking to leverage scholarly big data for scientific discovery. Although scholarly products are available in abundance, especially in environments created by the advent of social networking services, little is known about international scholarly information needs, information-seeking behavior, or information use. The purpose of this paper is to address these gaps via an in-depth analysis of the information needs and information-seeking behavior of researchers, both students and faculty, at two universities, one in the U.S. and the other in Qatar. Based on this analysis, the study identifies and describes new behavior patterns on the part of researchers as they engage in the information-seeking process. The analysis reveals that the use of academic social networks has notable effects on various scholarly activities. Further, this study identifies differences between students and faculty members in regard to their use of academic social networks, and it identifies differences between researchers according to discipline. Although the researchers who participated in the present study represent a range of disciplinary and cultural backgrounds, the study reports a number of similarities in terms of the researchers' scholarly activities.
  • The latest developments in digital technologies have provided large data sets that can increasingly easily be accessed and used. These data sets often contain indirect localisation information, such as historical addresses. Historical geocoding is the process of transforming the indirect localisation information to direct localisation that can be placed on a map, which enables spatial analysis and cross-referencing. Many efficient geocoders exist for current addresses, but they do not deal with the temporal aspect and are based on a strict hierarchy (..., city, street, house number) that is hard or impossible to use with historical data. Indeed, historical data are full of uncertainties (temporal aspect, semantic aspect, spatial precision, confidence in historical source, ...) that cannot be resolved, as there is no way to go back in time to check. We propose an open source, open data, extensible solution for geocoding that is based on the building of gazetteers composed of geohistorical objects extracted from historical topographical maps. Once the gazetteers are available, geocoding a historical address is a matter of finding the geohistorical object in the gazetteers that is the best match to the historical address. The matching criteria are customisable and include several dimensions (fuzzy semantic, fuzzy temporal, scale, spatial precision ...). As the goal is to facilitate historical work, we also propose web-based user interfaces that help geocode addresses (one address or batch mode) and display the results over current or historical topographical maps, so that they can be checked and collaboratively edited. The system is tested on the city of Paris for the 19th and 20th centuries, shows a high return rate, and is fast enough to be used interactively.
  • The organization and evolution of science has recently itself become an object of quantitative scientific investigation, thanks to the wealth of information that can be extracted from scientific documents, such as citations between papers and co-authorship between researchers. However, only a few studies have focused on the concepts that characterize full documents and that can be extracted and analyzed, revealing the deeper organization of scientific knowledge. Unfortunately, several concepts can be so common across documents that they hinder the emergence of the underlying topical structure of the document corpus, because they give rise to a large amount of spurious and trivial relations among documents. To identify and remove common concepts, we introduce a method to gauge their relevance according to an objective information-theoretic measure related to the statistics of their occurrence across the document corpus. After progressively removing concepts that, according to this metric, can be considered as generic, we find that the topic organization displays a correspondingly more refined structure.
  • This paper reconstructs the Freebase data dumps to understand the underlying ontology behind Google's semantic search feature. The Freebase knowledge base was a major Semantic Web and linked data technology that was acquired by Google in 2010 to support the Google Knowledge Graph, the backend for Google search results that include structured answers to queries instead of a series of links to external resources. After its shutdown in 2016, Freebase is contained in a data dump of 1.9 billion Resource Description Framework (RDF) triples. A recomposition of the Freebase ontology will be analyzed in relation to concepts and insights from the literature on classification by Bowker and Star. This paper will explore how the Freebase ontology is shaped by many of the forces that also shape classification systems through a deep dive into the ontology and a small correlational study. These findings will provide a glimpse into the proprietary blackbox Knowledge Graph and what is meant by Google's mission to "organize the world's information and make it universally accessible and useful".
  • As one of the richest sources of encyclopedic information on the Web, Wikipedia generates an enormous amount of traffic. In this paper, we study large-scale article access data of the English Wikipedia in order to compare articles with respect to the two main paradigms of information seeking, i.e., search by formulating a query, and navigation by following hyperlinks. To this end, we propose and employ two main metrics to characterize articles, namely (i) searchshare (the relative amount of views an article received by search) and (ii) resistance (the ability of an article to relay traffic to other Wikipedia articles). We demonstrate how articles in distinct topical categories differ substantially in terms of these properties. For example, architecture-related articles are often accessed through search and are simultaneously a "dead end" for traffic, whereas historical articles about military events are mainly navigated. We further link traffic differences to varying network, content, and editing activity features. Lastly, we measure the impact of the article properties by modeling access behavior on articles with a gradient boosting approach. The results of this paper constitute a step towards understanding human information seeking behavior on the Web.
  • In this paper, we propose a novel framework which uses user interests inferred from activities (a.k.a., activity interests) in multiple social collaborative platforms to predict users' platform activities. Included in the framework are two prediction approaches: (i) direct platform activity prediction, which predicts a user's activities in a platform using his or her activity interests from the same platform (e.g., predict if a user answers a given Stack Overflow question using the user's interests inferred from his or her prior answer and favorite activities in Stack Overflow), and (ii) cross-platform activity prediction, which predicts a user's activities in a platform using his or her activity interests from another platform (e.g., predict if a user answers a given Stack Overflow question using the user's interests inferred from his or her fork and watch activities in GitHub). To evaluate our proposed method, we conduct prediction experiments on two widely used social collaborative platforms in the software development community: GitHub and Stack Overflow. Our experiments show that combining both direct and cross-platform activity prediction approaches yields the best accuracies for predicting user activities in GitHub (AUC=0.75) and Stack Overflow (AUC=0.89).
  • Recently, a vast number of scientific publications have been produced in cities in emerging countries. It has long been observed that the publication output of Beijing has exceeded that of any other city in the world, including such leading centres of science as Boston, New York, London, Paris, and Tokyo. Researchers have suggested that, instead of focusing on cities' total publication output, the quality of the output in terms of the number of highly cited papers should be examined. However, in the period from 2014 to 2016, Beijing produced as many highly cited papers as Boston, London, or New York. In this paper, I propose another method to measure cities' publishing performance; I focus on cities' publishing efficiency (i.e., the ratio of highly cited articles to all articles produced in that city). First, I rank 554 cities based on their publishing efficiency, then I reveal some general factors influencing cities' publishing efficiency. The general factors examined in this paper are as follows: the linguistic environment, cities' economic development level, the location of excellent organisations, cities' international collaboration patterns, and the productivity of scientific disciplines.
  • A recent independent study produced a ranking system that ranked Astronomy and Computing (ASCOM) much higher than most of the older journals, highlighting its niche prominence. We investigate the notable ascendancy in the reputation of ASCOM by proposing a novel model based on differential equations. The model is a consequence of knowledge discovery from big data methods, namely L1-SVD. We propose a growth model by accounting for the behavior of parameters that contribute to the growth of a field. It is worthwhile to spend some time analyzing the cause and control variables behind the rapid rise in the reputation of a journal in a niche area. We intend to identify and probe the parameters responsible for its growing influence. Delay differential equations are used to model the change of influence on a journal's status by exploiting the effects of historical data. The manuscript justifies the use of implicit control variables and accordingly models those that demonstrate certain behavior in the journal influence.
  • In science and beyond, numbers are omnipresent when it comes to justifying different kinds of judgments. Which scientific author, hiring committee-member, or advisory board panelist has not been confronted with page-long "publication manuals", "assessment reports", "evaluation guidelines", calling for p-values, citation rates, h-indices, or other statistics in order to motivate judgments about the "quality" of findings, applicants, or institutions? Yet, many of those relying on and calling for statistics do not even seem to understand what information those numbers can actually convey, and what not. Focusing on the uninformed usage of bibliometrics as a worrisome outgrowth of the increasing quantification of science and society, we place the abuse of numbers into larger historical contexts and trends. These are characterized by a technology-driven bureaucratization of science, obsessions with control and accountability, and mistrust in human intuitive judgment. The ongoing digital revolution reinforces those trends. We call for bringing sanity back into scientific judgment exercises. Despite all number crunching, many judgments - be it about scientific output, scientists, or research institutions - will neither be unambiguous, uncontroversial, or testable by external standards, nor can they be otherwise validated or objectified. Under uncertainty, good human judgment remains, for the better, indispensable, but it can be aided, so we conclude, by a toolbox of simple judgment tools, called heuristics. In the best position to use those heuristics are research evaluators (1) who have expertise in the to-be-evaluated area of research, (2) who have profound knowledge in bibliometrics, and (3) who are statistically literate.
  • In Codice Ratio is a research project to study tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system to support the transcription of medieval manuscripts. The goal is to provide paleographers with a tool to reduce their efforts in transcribing large volumes, such as those stored in the VSA, producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and language models to compose word transcriptions. Our approach requires minimal training effort, making the transcription process more scalable, as the production of training sets requires only a few pages and can be easily crowdsourced. We have conducted experiments on manuscripts from the Vatican Registers, an unreleased corpus containing the correspondence of the popes. With training data produced by 120 high school students, our system has been able to produce good transcriptions that can be used by paleographers as a solid basis to speed up the transcription process at a large scale.
  • Trends are analysed in the annual number of documents published by Russian institutions and indexed in Scopus and Web of Science, giving special attention to the time period starting in the year 2013 in which the Project 5-100 was launched by the Russian Government. Numbers are broken down by document type, publication language, type of source, research discipline, country and source. It is concluded that Russian publication counts strongly depend upon the database used, and upon changes in database coverage, and that one should be cautious when using indicators derived from WoS, and especially from Scopus, as tools in the measurement of research performance and international orientation of the Russian science system.
  • Open data and open-source software may be part of the solution to science's "reproducibility crisis", but they are insufficient to guarantee reproducibility. Requiring minimal end-user expertise, encapsulator creates a "time capsule" with reproducible code in a self-contained computational environment. encapsulator provides end-users with a fully-featured desktop environment for reproducible research.
  • Thelwall (2017a, 2017b) proposed a new family of field- and time-normalized indicators, which is intended for sparse data. These indicators are based on units of analysis (e.g., institutions) rather than on the paper level. They compare the proportion of mentioned papers (e.g., on Twitter) of a unit with the proportion of mentioned papers in the corresponding fields and publication years (the expected values). We propose a new indicator (Mantel-Haenszel quotient, MHq) for the indicator family. The MHq goes back to the MH analysis. This analysis is an established method, which can be used to pool the data from several 2x2 cross tables based on different subgroups. We investigate (using citations and assessments by peers, i.e., F1000Prime recommendations) whether the indicator family (including the MHq) can distinguish between quality levels defined by the assessments of peers. Thus, we test the convergent validity. We find that the MHq is able to distinguish between quality levels (in most cases) while other indicators of the family are not. Since our study confirms the MHq as a convergently valid indicator, we apply the MHq to four different Twitter groups as defined by the company Altmetric (e.g., science communicators). Our results show that there is a weak relationship between all four Twitter groups and scientific quality, much weaker than between citations and scientific quality. Therefore, our results discourage the use of Twitter counts in research evaluation.
  • We conducted a large-scale analysis of around 10,000 scientific articles, from the period 2007-2016, to study the bibliometric or formal aspects influencing citations. A transversal analysis was conducted disaggregating the articles into more than one hundred scientific areas and two groups, one experimental and one control, each with a random sample of around five thousand documents. The experimental group comprised a random sample of the top 1% most cited articles in each field and year of publication (highly cited articles), and the control group a random sample of the remaining articles in the Journal Citation Reports (science and social science citation indexes in the Web of Science database). As the main result, highly cited articles differ from non-highly cited articles in most of the bibliometric aspects considered. There are significant differences, below the 0.01 level, between the groups of articles in many variables and areas. The highly cited articles are published in journals of higher impact factor (33 percentile points above) and have 25% higher co-authorship. The highly cited articles are also longer in terms of number of pages (10% higher) and bibliographical references (35% more). Finally, highly cited articles have slightly shorter titles (3% lower) but, contrastingly, longer abstracts (10% higher).
  • The main objective of this paper is to empirically test whether the identification of highly-cited documents through Google Scholar is feasible and reliable. To this end, we carried out a longitudinal analysis (1950 to 2013), running a generic query (filtered only by year of publication) to minimise the effects of academic search engine optimisation. This gave us a final sample of 64,000 documents (1,000 per year). The strong correlation between a document's citations and its position in the search results (r = -0.67) led us to conclude that Google Scholar is able to identify highly-cited papers effectively. This, combined with Google Scholar's unique coverage (no restrictions on document type and source), makes the academic search engine an invaluable tool for bibliometric research relating to the identification of the most influential scientific documents. We find evidence, however, that Google Scholar ranks those documents whose language (or geographical web domain) matches the user's interface language higher than could be expected based on citations. Nonetheless, this language effect and other factors related to Google Scholar's operation, i.e. the proper identification of versions and the date of publication, only have an incidental impact. They do not compromise the ability of Google Scholar to identify the highly-cited papers.
  • This article describes a procedure to generate a snapshot of the structure of a specific scientific community and their outputs based on the information available in Google Scholar Citations (GSC). We call this method MADAP (Multifaceted Analysis of Disciplines through Academic Profiles). The international community of researchers working in Bibliometrics, Scientometrics, Informetrics, Webometrics, and Altmetrics was selected as a case study. The records of the top 1,000 most cited documents by these authors according to GSC were manually processed to fill any missing information and deduplicate fields like the journal titles and book publishers. The results suggest that it is feasible to use GSC and the MADAP method to produce an accurate depiction of the community of researchers working in Bibliometrics (both specialists and occasional researchers) and their publication habits (main publication venues such as journals and book publishers). Additionally, the wide document coverage of Google Scholar (especially books and book chapters) enables more comprehensive analyses of the documents published in a specific discipline than were previously possible with other citation indexes, finally shedding light on what until now had been a blind spot in most citation analyses.
  • Understanding how a scientist develops new scientific collaborations or how their papers receive new citations is a major challenge in scientometrics. The proposed approach simultaneously examines the growth processes of the co-authorship and citation networks by analyzing the evolution of the rich-get-richer and the fit-get-richer phenomena. In particular, the preferential attachment function and author fitnesses, which govern the two phenomena, are estimated non-parametrically in each network. The approach is applied to the co-authorship and citation networks of the flagship journal of the strategic management scientific community, namely the Strategic Management Journal. The results suggest that the abovementioned phenomena have been consistently governing both temporal networks. The average of the attachment exponents in the co-authorship network is 0.30 while it is 0.29 in the citation network. This suggests that the rich-get-richer phenomenon has been weak in both networks. The right tails of the distributions of author fitness in both networks are heavy, which implies that the intrinsic scientific quality of each author has been playing a crucial role in getting new citations and new co-authorships. Since the total competitiveness in each temporal network is found to be rising with time, it is getting harder to receive a new citation or to develop a new collaboration. Analyzing the average competency, it was found that, on average, while the veterans tend to be more competent at developing new collaborations, the newcomers are likely better at acquiring new citations. Furthermore, the author fitness in both networks has been consistent with the history of the strategic management scientific community. This suggests that coupling node fitnesses throughout different networks might be a promising new direction in simultaneously analyzing multiple networks.
  • We have developed an application that takes a "MEDLINE" output from the PubMed database and allows the user to cluster all non-trivial words of the abstracts of the PubMed output. The number of clusters to use can be selected by the user. A specific cluster may be selected, and the PMIDs and dates for all publications in the selected cluster are displayed underneath. See figure 2, where cluster 12 is selected. The application also has an "Abstracts" tab, where the abstracts for the selected cluster can be perused. Here, it is also possible to download an HTML file containing the PMID, date, title, and abstract for each publication in the selected cluster. A third tab is called "Titles", where all the titles for the selected cluster are displayed. Via a "Use Cluster" button, the selected cluster can itself be clustered. A "Back" button allows the user to return to any previous state. Finally, it is also possible to exclude documents whose abstracts contain certain words (see figure 3). The application will allow researchers to enter general search terms in the PubMed search engine, then use the application to search for publications of special interest within those search terms.
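
The workflow in the last abstract above (read PubMed "MEDLINE" output, drop trivial words, exclude documents by word, cluster the rest into a user-chosen number of groups) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the application's actual implementation: the abstract does not name its clustering algorithm, so a plain k-means-style loop over word counts with cosine similarity is swapped in; the function names, the tiny stop-word list, and the sample records are all hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real tool would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "with", "on"}

def parse_medline(text):
    """Parse MEDLINE-tagged text (e.g. 'PMID- 1', 'AB  - ...') into dicts."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():                       # blank line ends a record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line.startswith("      ") and tag:    # continuation of a field
            current[tag] += " " + line.strip()
        elif len(line) > 5 and line[4] == "-":     # new 'TAG - value' field
            tag, value = line[:4].strip(), line[6:].strip()
            current[tag] = current[tag] + "; " + value if tag in current else value
    if current:
        records.append(current)
    return records

def tokens(abstract):
    return re.findall(r"[a-z]+", abstract.lower())

def content_words(abstract):
    """Non-trivial words: no stop words, no very short tokens (an assumption)."""
    return [w for w in tokens(abstract) if w not in STOP_WORDS and len(w) > 2]

def exclude(records, banned):
    """Drop records whose abstract contains any of the banned words."""
    banned = {b.lower() for b in banned}
    return [r for r in records if not banned & set(tokens(r.get("AB", "")))]

def cosine(a, b):
    dot = sum(n * b.get(w, 0) for w, n in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(records, k, iterations=10):
    """k-means-style clustering; seeded with the first k documents (assumption)."""
    docs = [Counter(content_words(r.get("AB", ""))) for r in records]
    centroids = [dict(d) for d in docs[:k]]
    labels = [0] * len(docs)
    for _ in range(iterations):
        labels = [max(range(len(centroids)), key=lambda c: cosine(d, centroids[c]))
                  for d in docs]
        for c in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:                            # recompute centroid as the mean
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = {w: n / len(members) for w, n in total.items()}
    return labels

# Made-up sample records, for illustration only.
SAMPLE = "\n".join([
    "PMID- 1",
    "AB  - Tumor growth in cancer cells.",
    "",
    "PMID- 2",
    "AB  - Cardiac output and heart failure.",
    "",
    "PMID- 3",
    "AB  - Cancer cells and tumor suppression",
    "      in mice.",
    "",
    "PMID- 4",
    "AB  - Heart failure reduces cardiac function.",
])

if __name__ == "__main__":
    recs = parse_medline(SAMPLE)
    for rec, label in zip(recs, cluster(recs, k=2)):
        print(label, rec["PMID"])
```

Re-clustering a selected cluster, as the "Use Cluster" button is described as doing, would amount to calling `cluster` again on only the records carrying that label. Seeding with the first `k` documents keeps the sketch deterministic; a practical tool would more likely use random or k-means++ style initialisation.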