-
Scholars have made handwritten notes and comments in books and manuscripts
for centuries. Today's blogs and news sites typically invite users to express
their opinions on the published content; URLs allow web resources to be shared
with accompanying annotations and comments using third-party services like
Twitter or Facebook. These contributions have until recently been constrained
within specific services, making them second-class citizens of the Web.
Web Annotations are now emerging as fully independent Linked Data in their
own right, no longer restricted to plain textual comments in application silos.
Annotations can now range from bookmarks and comments, to fine-grained
annotations of a selection of, for example, a section of a frame within a video
stream. Technologies and standards now exist to create, publish, syndicate,
mash up, and consume finely targeted, semantically rich digital annotations on
practically any content, as first-class Web citizens. This development is being
driven by the need for collaboration and annotation reuse amongst domain
researchers, computer scientists, scientific publishers, and scholarly content
databases.
-
This paper presents a data mining approach to the Short Title Catalogue
Flanders (http://www.stcv.be/), which aims to record all books printed in
Flanders up to 1801 (24,850 editions as of 31/08/2018). More specifically, the
paper aims to analyse the Early Modern practice of 'quiring' gatherings in
handpress book production.
-
A cross-disciplinary examination of the user behaviours involved in seeking
and evaluating data is surprisingly absent from the research data discussion.
This review explores the data retrieval literature to identify commonalities in
how users search for and evaluate observational research data. Two analytical
frameworks rooted in information retrieval and science and technology studies are
used to identify key similarities in practices as a first step toward
developing a model describing data retrieval.
-
The Shannon-Weaver model of linear information transmission is extended with
two loops potentially generating redundancies: (i) meaning is provided locally
to the information from the perspective of hindsight, and (ii) meanings can be
codified differently and then refer to other horizons of meaning. Thus, three
layers are distinguished: variations in the communications, historical
organization at each moment of time, and evolutionary self-organization of the
codes of communication over time. Furthermore, the codes of communication can
functionally be different and then the system is both horizontally and
vertically differentiated. All these subdynamics operate in parallel and
necessarily generate uncertainty. However, meaningful information can be
considered as the specific selection of a signal from the noise; the codes of
communication are social constructs that can generate redundancy by giving
different meanings to the same information. Reflexively, one can translate
among codes in more elaborate discourses. The second (instantiating) layer can
be operationalized in terms of semantic maps using the vector space model; the
third in terms of mutual redundancy among the latent dimensions of the vector
space. Using Blaise Cronin's œuvre, the different operations of the three
layers are demonstrated empirically.
-
Context: A Multivocal Literature Review (MLR) is a form of a Systematic
Literature Review (SLR) which includes the grey literature (e.g., blog posts
and white papers) in addition to the published (formal) literature (e.g.,
journal and conference papers). MLRs are useful for both researchers and
practitioners since they provide summaries of both the state of the art and
the state of practice in a given area. Objective: There are several guidelines
for conducting SLR studies in SE. However, several phases of MLRs differ
from those of traditional SLRs, for instance with respect to the search process
and source quality assessment. Therefore, SLR guidelines are only partially
useful for conducting MLR studies. Our goal in this paper is to present
guidelines on how to conduct MLR studies in SE. Method: To develop the MLR
guidelines, we benefit from three inputs: (1) existing SLR guidelines in SE,
(2) a literature survey of MLR guidelines and experience papers in other
fields, and (3) our own experiences in conducting several MLRs in SE. All
derived guidelines are discussed in the context of three example MLRs used as
running examples (two from SE and one from the medical sciences). Results:
The resulting guidelines cover all phases of conducting and reporting MLRs in
SE, from the planning phase, through conducting the review, to the final
reporting of the review. In particular, we believe that incorporating and
adopting a vast set of recommendations from MLR guidelines and experience
papers in other fields has enabled us to propose a set of guidelines with solid
foundations. Conclusion: Having been developed on the basis of three types of
solid experience and evidence, the provided MLR guidelines help researchers
conduct new MLRs in any area of SE effectively and efficiently.
-
Binarization plays a key role in automatic information retrieval from
document images. This process is usually performed in the first stages of
document analysis systems, and serves as a basis for subsequent steps. Hence
it has to be robust in order to allow the full analysis workflow to be
successful. Several methods for document image binarization have been proposed
so far, most of which are based on hand-crafted image processing strategies.
Recently, Convolutional Neural Networks have shown remarkable performance in
many disparate tasks related to computer vision. In this paper we discuss the
use of convolutional auto-encoders devoted to learning an end-to-end map from
an input image to its selectional output, in which activations indicate the
likelihood of pixels to be either foreground or background. Once the model is
trained, documents can be binarized by passing them through the model and
applying a threshold. This approach has proven to outperform existing
binarization strategies in a number of document domains.
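
The final thresholding step can be illustrated with a minimal sketch. It
assumes the trained model has already produced a per-pixel activation map with
values in [0, 1]; the function name `binarize`, the example values, and the
default threshold of 0.5 are illustrative assumptions, not details taken from
the paper.

```python
from typing import List


def binarize(activations: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn per-pixel foreground likelihoods into a binary image (1 = foreground)."""
    return [[1 if value >= threshold else 0 for value in row] for row in activations]


# Hypothetical 2x3 activation map standing in for the output of a trained model.
activation_map = [
    [0.91, 0.12, 0.55],
    [0.03, 0.78, 0.49],
]

binary_image = binarize(activation_map, threshold=0.5)
```

In practice the threshold would be tuned on validation data for the document
domain at hand rather than fixed in advance.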
-
The increase in the number of researchers coupled with the ease of publishing
and distribution of scientific papers (due to technological advancements) has
resulted in a dramatic increase in astronomy literature. This has likely led to
the predicament that the body of the literature is too large for traditional
human consumption and that related and crucial knowledge is not discovered by
researchers. In addition to the increased production of astronomical
literature, recent decades have also brought several advancements in
computational linguistics. In particular, the machine-aided processing of
literature dissemination might make it possible to convert this stream of
papers into a coherent knowledge set. In this paper, we present the application
of computational linguistics techniques to astronomy literature. In particular,
we developed a tool that finds articles similar to an input paper based purely
on text content. We find that our technique performs robustly in
comparison with other tools that recommend articles given a reference paper
(known as recommender systems). Our novel tool shows the great power of
combining computational linguistics with astronomy literature and suggests that
additional research in this endeavor will likely produce even better tools that
will help researchers cope with the vast amounts of knowledge being produced.
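
As an illustration of content-based article similarity, the sketch below ranks
documents by the cosine similarity of their TF-IDF vectors. The abstract does
not state which text representation the tool uses, so TF-IDF is an assumption
made here for illustration only; the function names and the three toy documents
are likewise hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed inverse document frequency, so no weight is ever zero.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query_index: int, docs: List[str]) -> int:
    """Return the index of the document most similar to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


# Hypothetical toy corpus; real input would be full article text.
docs = [
    "Spectral analysis of galaxy clusters and dark matter halos",
    "Dark matter halos in simulated galaxy clusters",
    "Exoplanet transit photometry with small telescopes",
]
best_match = most_similar(0, docs)
```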
-
As more scholarly content is born digital or converted to a digital format,
digital libraries are becoming increasingly vital to researchers seeking to
leverage scholarly big data for scientific discovery. Although scholarly
products are available in abundance, especially in environments created by the
advent of social networking services, little is known about international
scholarly information needs, information-seeking behavior, or information use.
The purpose of this paper is to address these gaps via an in-depth analysis of
the information needs and information-seeking behavior of researchers, both
students and faculty, at two universities, one in the U.S. and the other in
Qatar. Based on this analysis, the study identifies and describes new behavior
patterns on the part of researchers as they engage in the information-seeking
process. The analysis reveals that the use of academic social networks has
notable effects on various scholarly activities. Further, this study identifies
differences between students and faculty members in regard to their use of
academic social networks, and it identifies differences between researchers
according to discipline. Although the researchers who participated in the
present study represent a range of disciplinary and cultural backgrounds, the
study reports a number of similarities in terms of the researchers' scholarly
activities.
-
The latest developments in digital technology have provided large data sets that can
increasingly easily be accessed and used. These data sets often contain
indirect localisation information, such as historical addresses. Historical
geocoding is the process of transforming the indirect localisation information
to direct localisation that can be placed on a map, which enables spatial
analysis and cross-referencing. Many efficient geocoders exist for current
addresses, but they do not deal with the temporal aspect and are based on a
strict hierarchy (..., city, street, house number) that is hard or impossible
to use with historical data. Indeed, historical data are full of uncertainties
(temporal aspect, semantic aspect, spatial precision, confidence in historical
source, ...) that cannot be resolved, as there is no way to go back in time to
check. We propose an open source, open data, extensible solution for geocoding
that is based on the building of gazetteers composed of geohistorical objects
extracted from historical topographical maps. Once the gazetteers are
available, geocoding an historical address is a matter of finding the
geohistorical object in the gazetteers that is the best match to the historical
address. The matching criteria are customisable and include several dimensions
(fuzzy semantic, fuzzy temporal, scale, spatial precision ...). As the goal is
to facilitate historical work, we also propose web-based user interfaces that
help geocode addresses (singly or in batch mode) and display the results over
current or historical topographical maps, so that they can be checked and
collaboratively edited. The system is tested on the city of Paris for the 19th
and 20th centuries, shows a high return rate, and is fast enough to be used
interactively.
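
The matching step can be sketched as a weighted score over a fuzzy name match
and a fuzzy temporal match. Everything in this sketch is an assumption made for
illustration: the `GeohistoricalObject` fields, the weights, the 20-year
tolerance, the use of Python's `difflib` for string similarity, and the
gazetteer entries with their placeholder coordinates. It is not the system's
actual implementation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class GeohistoricalObject:
    name: str         # address label as found in a historical map
    valid_from: int   # first year the object is assumed to be valid
    valid_to: int     # last year the object is assumed to be valid
    x: float          # placeholder coordinates, not real positions
    y: float


def name_similarity(a: str, b: str) -> float:
    """Fuzzy semantic match in [0, 1] based on character overlap."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def temporal_score(year: int, obj: GeohistoricalObject, tolerance: int = 20) -> float:
    """1.0 inside the validity interval, decaying linearly to 0 outside it."""
    if obj.valid_from <= year <= obj.valid_to:
        return 1.0
    gap = min(abs(year - obj.valid_from), abs(year - obj.valid_to))
    return max(0.0, 1.0 - gap / tolerance)


def geocode(address: str, year: int, gazetteer: List[GeohistoricalObject],
            w_name: float = 0.7, w_time: float = 0.3) -> GeohistoricalObject:
    """Return the gazetteer entry with the best combined score."""
    return max(
        gazetteer,
        key=lambda obj: w_name * name_similarity(address, obj.name)
        + w_time * temporal_score(year, obj),
    )


# Hypothetical gazetteer entries.
gazetteer = [
    GeohistoricalObject("12 rue de la Paix", 1850, 1900, 1.0, 1.0),
    GeohistoricalObject("12 rue de la Paix", 1901, 1950, 2.0, 2.0),
    GeohistoricalObject("12 rue de la Pompe", 1901, 1950, 3.0, 3.0),
]
best = geocode("12 r. de la Paix", 1920, gazetteer)
```

Keeping each dimension as a separate scoring function makes it straightforward
to add further criteria, such as scale or spatial precision, as extra weighted
terms.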
-
The organization and evolution of science have recently themselves become an
object of scientific quantitative investigation, thanks to the wealth of
information that can be extracted from scientific documents, such as citations
between papers and co-authorship between researchers. However, only a few studies
have focused on the concepts that characterize full documents and that can be
extracted and analyzed, revealing the deeper organization of scientific
knowledge. Unfortunately, several concepts can be so common across documents
that they hinder the emergence of the underlying topical structure of the
document corpus, because they give rise to a large amount of spurious and
trivial relations among documents. To identify and remove common concepts, we
introduce a method to gauge their relevance according to an objective
information-theoretic measure related to the statistics of their occurrence
across the document corpus. After progressively removing concepts that,
according to this metric, can be considered as generic, we find that the topic
organization displays a correspondingly more refined structure.
-
This paper reconstructs the Freebase data dumps to understand the underlying
ontology behind Google's semantic search feature. The Freebase knowledge base
was a major Semantic Web and linked data technology that was acquired by Google
in 2010 to support the Google Knowledge Graph, the backend for Google search
results that include structured answers to queries instead of a series of links
to external resources. After its shutdown in 2016, Freebase is contained in a
data dump of 1.9 billion Resource Description Framework (RDF) triples. A
recomposition of the Freebase ontology will be analyzed in relation to concepts
and insights from the literature on classification by Bowker and Star. This
paper will explore how the Freebase ontology is shaped by many of the forces
that also shape classification systems through a deep dive into the ontology
and a small correlational study. These findings will provide a glimpse into the
proprietary blackbox Knowledge Graph and what is meant by Google's mission to
"organize the world's information and make it universally accessible and
useful".
-
As one of the richest sources of encyclopedic information on the Web,
Wikipedia generates an enormous amount of traffic. In this paper, we study
large-scale article access data of the English Wikipedia in order to compare
articles with respect to the two main paradigms of information seeking, i.e.,
search by formulating a query, and navigation by following hyperlinks. To this
end, we propose and employ two main metrics, namely (i) searchshare -- the
relative amount of views an article received by search --, and (ii) resistance
-- the ability of an article to relay traffic to other Wikipedia articles -- to
characterize articles. We demonstrate how articles in distinct topical
categories differ substantially in terms of these properties. For example,
architecture-related articles are often accessed through search and are
simultaneously a "dead end" for traffic, whereas historical articles about
military events are mainly navigated. We further link traffic differences to
varying network, content, and editing activity features. Lastly, we measure the
impact of the article properties by modeling access behavior on articles with a
gradient boosting approach. The results of this paper constitute a step towards
understanding human information seeking behavior on the Web.
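
The searchshare metric, as described above, is the fraction of an article's
views that arrive via search. A minimal sketch follows; the function signature
and the view counts are hypothetical, and the handling of articles with no
views is an assumption rather than something the abstract specifies.

```python
def searchshare(views_from_search: int, total_views: int) -> float:
    """Fraction of an article's views that were received through search."""
    if total_views == 0:
        # Assumption: an article without any views gets a searchshare of 0.
        return 0.0
    return views_from_search / total_views


# Hypothetical counts for a single article.
share = searchshare(views_from_search=750, total_views=1000)
```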
-
In this paper, we propose a novel framework which uses user interests
inferred from activities (a.k.a., activity interests) in multiple social
collaborative platforms to predict users' platform activities. Included in the
framework are two prediction approaches: (i) direct platform activity
prediction, which predicts a user's activities in a platform using his or her
activity interests from the same platform (e.g., predict if a user answers a
given Stack Overflow question using the user's interests inferred from his or
her prior answer and favorite activities in Stack Overflow), and (ii)
cross-platform activity prediction, which predicts a user's activities in a
platform using his or her activity interests from another platform (e.g.,
predict if a user answers a given Stack Overflow question using the user's
interests inferred from his or her fork and watch activities in GitHub). To
evaluate our proposed method, we conduct prediction experiments on two widely
used social collaborative platforms in the software development community:
GitHub and Stack Overflow. Our experiments show that combining both direct and
cross-platform activity prediction approaches yields the best accuracies for
predicting user activities in GitHub (AUC=0.75) and Stack Overflow (AUC=0.89).
-
Recently, a vast number of scientific publications have been produced in
cities in emerging countries. It has long been observed that the publication
output of Beijing has exceeded that of any other city in the world, including
such leading centres of science as Boston, New York, London, Paris, and Tokyo.
Researchers have suggested that, instead of focusing on cities' total
publication output, the quality of the output in terms of the number of highly
cited papers should be examined. However, in the period from 2014 to 2016,
Beijing produced as many highly cited papers as Boston, London, or New York. In
this paper, I propose another method to measure cities' publishing performance;
I focus on cities' publishing efficiency (i.e., the ratio of highly cited
articles to all articles produced in that city). First, I rank 554 cities based
on their publishing efficiency, then I reveal some general factors influencing
cities' publishing efficiency. The general factors examined in this paper are
as follows: the linguistic environment, cities' economic development level, the
location of excellent organisations, cities' international collaboration
patterns, and the productivity of scientific disciplines.
-
A recent independent study resulted in a ranking system which ranked
Astronomy and Computing (ASCOM) much higher than most of the older journals,
highlighting its niche prominence. We investigate the notable ascendancy in
reputation of ASCOM by proposing a novel differential-equation-based model.
The model is a consequence of knowledge discovery from big data methods,
namely L1-SVD. We propose a growth model by accounting for the behavior of
parameters that contribute to the growth of a field. It is worthwhile to spend
some time analyzing the cause and control variables behind the rapid rise in the
reputation of a journal in a niche area. We intend to identify and probe the
parameters responsible for its growing influence. Delay differential equations
are used to model the change of influence on a journal's status by exploiting
the effects of historical data. The manuscript justifies the use of implicit
control variables and accordingly models those that demonstrate certain
behavior in the journal influence.
-
In science and beyond, numbers are omnipresent when it comes to justifying
different kinds of judgments. Which scientific author, hiring committee member,
or advisory board panelist has not been confronted with page-long "publication
manuals", "assessment reports", "evaluation guidelines", calling for p-values,
citation rates, h-indices, or other statistics in order to motivate judgments
about the "quality" of findings, applicants, or institutions? Yet, many of
those relying on and calling for statistics do not even seem to understand what
information those numbers can actually convey, and what not. Focusing on the
uninformed usage of bibliometrics as a worrisome outgrowth of the increasing
quantification of science and society, we place the abuse of numbers into
larger historical contexts and trends. These are characterized by a
technology-driven bureaucratization of science, obsessions with control and
accountability, and mistrust in human intuitive judgment. The ongoing digital
revolution intensifies those trends. We call for bringing sanity back into
scientific judgment exercises. Despite all number crunching, many judgments -
be it about scientific output, scientists, or research institutions - will
neither be unambiguous, uncontroversial, or testable by external standards, nor
can they be otherwise validated or objectified. Under uncertainty, good human
judgment remains, for the better, indispensable, but it can be aided, so we
conclude, by a toolbox of simple judgment tools, called heuristics. In the best
position to use those heuristics are research evaluators (1) who have expertise
in the to-be-evaluated area of research, (2) who have profound knowledge in
bibliometrics, and (3) who are statistically literate.
-
In Codice Ratio is a research project to study tools and techniques for
analyzing the contents of historical documents conserved in the Vatican Secret
Archives (VSA). In this paper, we present our efforts to develop a system to
support the transcription of medieval manuscripts. The goal is to provide
paleographers with a tool to reduce their efforts in transcribing large
volumes, such as those stored in the VSA, producing good transcriptions for
significant portions of the manuscripts. We propose an original approach based
on character segmentation. Our solution is able to deal with the dirty
segmentation that inevitably occurs in handwritten documents. We use a
convolutional neural network to recognize characters and language models to
compose word transcriptions. Our approach requires minimal training efforts,
making the transcription process more scalable as the production of training
sets requires a few pages and can be easily crowdsourced. We have conducted
experiments on manuscripts from the Vatican Registers, an unreleased corpus
containing the correspondence of the popes. With training data produced by 120
high school students, our system has been able to produce good transcriptions
that can be used by paleographers as a solid basis to speed up the transcription
process at a large scale.
-
Trends are analysed in the annual number of documents published by Russian
institutions and indexed in Scopus and Web of Science, giving special attention
to the time period starting in the year 2013, in which Project 5-100 was
launched by the Russian Government. Numbers are broken down by document type,
publication language, type of source, research discipline, country and source.
It is concluded that Russian publication counts strongly depend upon the
database used, and upon changes in database coverage, and that one should be
cautious when using indicators derived from WoS, and especially from Scopus, as
tools in the measurement of research performance and international orientation
of the Russian science system.
-
Open data and open-source software may be part of the solution to science's
"reproducibility crisis", but they are insufficient to guarantee
reproducibility. Requiring minimal end-user expertise, encapsulator creates a
"time capsule" with reproducible code in a self-contained computational
environment. encapsulator provides end-users with a fully-featured desktop
environment for reproducible research.
-
Thelwall (2017a, 2017b) proposed a new family of field- and time-normalized
indicators, which is intended for sparse data. These indicators are based on
units of analysis (e.g., institutions) rather than on the paper level. They
compare the proportion of mentioned papers (e.g., on Twitter) of a unit with
the proportion of mentioned papers in the corresponding fields and publication
years (the expected values). We propose a new indicator (Mantel-Haenszel
quotient, MHq) for the indicator family. The MHq goes back to the MH analysis.
This analysis is an established method, which can be used to pool the data from
several 2x2 cross tables based on different subgroups. We investigate (using
citations and assessments by peers, i.e., F1000Prime recommendations) whether
the indicator family (including the MHq) can distinguish between quality levels
defined by the assessments of peers. Thus, we test the convergent validity. We
find that the MHq is able to distinguish between quality levels (in most cases)
while other indicators of the family are not. Since our study confirms the MHq
as a convergently valid indicator, we apply the MHq to four different Twitter
groups as defined by the company Altmetric (e.g., science communicators). Our
results show that there is a weak relationship between all four Twitter groups
and scientific quality, much weaker than between citations and scientific
quality. Therefore, our results discourage the use of Twitter counts in
research evaluation.
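
The pooling idea behind the MH analysis can be illustrated with the classical
Mantel-Haenszel pooled odds ratio over several 2x2 tables. The abstract does
not spell out how the MHq itself is derived from the MH analysis, so the sketch
below shows only the standard estimator. The cell layout (a and b for the
mentioned and not-mentioned papers of the unit, c and d for the same in the
reference set, with one table per field and publication year) and the example
counts are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Table = Tuple[int, int, int, int]  # (a, b, c, d) cell counts of one 2x2 table


def mantel_haenszel_odds_ratio(tables: Iterable[Table]) -> float:
    """Pool stratum-specific 2x2 tables into one Mantel-Haenszel odds ratio."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these tables")
    return numerator / denominator


# Hypothetical strata, e.g. two field / publication-year combinations.
strata = [
    (10, 90, 5, 95),
    (20, 80, 10, 90),
]
pooled = mantel_haenszel_odds_ratio(strata)
```

A pooled value above 1 would indicate that, across strata, the unit's papers
are mentioned more often than expected from the reference sets.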
-
We conducted a large-scale analysis of around 10,000 scientific articles,
from the period 2007-2016, to study the bibliometric or formal aspects
influencing citations. A transversal analysis was conducted disaggregating the
articles into more than one hundred scientific areas and two groups, one
experimental and one control, each with a random sample of around five thousand
documents. The experimental group comprised a random sample of the top 1% most
cited articles in each field and year of publication (highly cited articles),
and the control group a random sample of the remaining articles in the Journal
Citation Reports (science and social science citation indexes in the Web of
Science database). As the main result, highly cited articles differ from
non-highly cited articles in most of the bibliometric aspects considered. There
are significant differences, below the 0.01 level, between the groups of
articles in many variables and areas. The highly cited articles are published
in journals of higher impact factor (33 percentile points above) and have 25%
higher co-authorship. The highly cited articles are also longer in terms of
number of pages (10% higher) and bibliographical references (35% more).
Finally, highly cited articles have slightly shorter titles (3% lower) but,
contrastingly, longer abstracts (10% higher).
-
The main objective of this paper is to empirically test whether the
identification of highly-cited documents through Google Scholar is feasible and
reliable. To this end, we carried out a longitudinal analysis (1950 to 2013),
running a generic query (filtered only by year of publication) to minimise the
effects of academic search engine optimisation. This gave us a final sample of
64,000 documents (1,000 per year). The strong correlation between a document's
citations and its position in the search results (r = -0.67) led us to conclude
that Google Scholar is able to identify highly-cited papers effectively. This,
combined with Google Scholar's unique coverage (no restrictions on document
type and source), makes the academic search engine an invaluable tool for
bibliometric research relating to the identification of the most influential
scientific documents. We find evidence, however, that Google Scholar ranks
those documents whose language (or geographical web domain) matches with the
user's interface language higher than could be expected based on citations.
Nonetheless, this language effect and other factors related to Google
Scholar's operation, i.e. the proper identification of versions and the date of
publication, only have an incidental impact. They do not compromise the ability
of Google Scholar to identify the highly-cited papers.
-
This article describes a procedure to generate a snapshot of the structure of
a specific scientific community and their outputs based on the information
available in Google Scholar Citations (GSC). We call this method MADAP
(Multifaceted Analysis of Disciplines through Academic Profiles). The
international community of researchers working in Bibliometrics,
Scientometrics, Informetrics, Webometrics, and Altmetrics was selected as a
case study. The records of the top 1,000 most cited documents by these authors
according to GSC were manually processed to fill any missing information and
deduplicate fields like the journal titles and book publishers. The results
suggest that it is feasible to use GSC and the MADAP method to produce an
accurate depiction of the community of researchers working in Bibliometrics
(both specialists and occasional researchers) and their publication habits
(main publication venues such as journals and book publishers). Additionally,
the wide document coverage of Google Scholar (especially books and book
chapters) enables more comprehensive analyses of the documents published in a
specific discipline than were previously possible with other citation indexes,
finally shedding light on what until now had been a blind spot in most citation
analyses.
-
Understanding how a scientist develops new scientific collaborations or how
their papers receive new citations is a major challenge in scientometrics. The
approach being proposed simultaneously examines the growth processes of the
co-authorship and citation networks by analyzing the evolutions of the rich get
richer and the fit get richer phenomena. In particular, the preferential
attachment function and author fitnesses, which govern the two phenomena, are
estimated non-parametrically in each network. The approach is applied to the
co-authorship and citation networks of the flagship journal of the strategic
management scientific community, namely the Strategic Management Journal. The
results suggest that the abovementioned phenomena have been consistently
governing both temporal networks. The average of the attachment exponents in
the co-authorship network is 0.30 while it is 0.29 in the citation network.
This suggests that the rich get richer phenomenon has been weak in both
networks. The right tails of the distributions of author fitness in both
networks are heavy, which implies that the intrinsic scientific quality of each
author has been playing a crucial role in getting new citations and new
co-authorships. Since the total competitiveness in each temporal network is
found to be rising with time, it is getting harder to receive a new citation
or to develop a new collaboration. Analyzing the average competency, it was
found that on average, while the veterans tend to be more competent at
developing new collaborations, the newcomers are likely better at acquiring new
citations. Furthermore, the author fitness in both networks has been consistent
with the history of the strategic management scientific community. This
suggests that coupling node fitnesses throughout different networks might be a
promising new direction in simultaneously analyzing multiple networks.
-
We have developed an application that takes a "MEDLINE" output from the
PubMed database and allows the user to cluster all non-trivial words of the
abstracts of the PubMed output. The number of clusters to use can be selected
by the user.
A specific cluster may be selected, and the PMIDs and dates for all
publications in the selected cluster are displayed underneath. See figure 2,
where cluster 12 is selected.
The application also has an "Abstracts" tab, where the abstracts for the
selected cluster can be perused. Here, it is also possible to download an HTML
file containing the PMID, date, title, and abstract for each publication in the
selected cluster.
A third tab is called "Titles", where all the titles for the selected cluster
are displayed.
Via a "Use Cluster" button, the selected cluster can itself be clustered. A
"Back" button allows the user to return to any previous state.
Finally, it is also possible to exclude documents whose abstracts contain
certain words (see figure 3).
The application will allow researchers to enter general search terms in the
PubMed search engine, then use the application to search for publications of
special interest within those search terms.