• Even democracies endowed with the most active free press struggle to maintain the diversity of news coverage. Consolidation and market forces may cause only a few dominant players to control the news cycle. Editorial policies may be biased by corporate ownership relations, narrowing news coverage and focus. To an increasing degree, this problem also applies to social media news distribution, since it is subject to the same socio-economic drivers. To study the effects of consolidation and ownership on news diversity, we model the diversity of Chilean coverage on the basis of ownership records and social media data. We create similarity networks of news outlets on the basis of their ownership and the topics they cover. We then examine the relationships between the topology of ownership networks and content similarity to characterize how ownership affects news coverage. A network analysis reveals that Chilean media is highly concentrated both in terms of ownership and in terms of topics covered. Our method can be used to determine which groups of outlets and ownership exert the greatest influence on news coverage.
  • It is a long-standing question whether human sexual and reproductive cycles are affected predominantly by biology or culture. The literature is mixed with respect to whether biological or cultural factors best explain the reproduction cycle phenomenon, with biological explanations dominating the argument. The biological hypothesis proposes that human reproductive cycles are an adaptation to the seasonal cycles caused by hemisphere positioning, while the cultural hypothesis proposes that conception dates vary mostly due to cultural factors, such as vacation schedule or religious holidays. However, for many countries, common records used to investigate these hypotheses are incomplete or unavailable, biasing existing analysis towards primarily Christian countries in the Northern Hemisphere. Here we show that interest in sex peaks sharply online during major cultural and religious celebrations, regardless of hemisphere location. This online interest, when shifted by nine months, corresponds to documented human birth cycles, even after adjusting for numerous factors such as language, season, and amount of free time due to holidays. We further show that mood, measured independently on Twitter, contains distinct collective emotions associated with those cultural celebrations, and these collective moods correlate with sex search volume outside of these holidays as well. Our results provide converging evidence that the cyclic sexual and reproductive behavior of human populations is mostly driven by culture and that this interest in sex is associated with specific emotions, characteristic of, but not limited to, major cultural and religious celebrations.
  • Citations are commonly held to represent scientific impact. To date, however, there is no empirical evidence in support of this postulate that is central to research assessment exercises and Science of Science studies. Here, we report on the first empirical verification of the degree to which citation numbers represent scientific impact as it is actually perceived by experts in their respective field. We run a large-scale survey of about 2000 corresponding authors who performed a pairwise impact assessment task across more than 20000 scientific articles. Results of the survey show that citation data and perceived impact do not align well, unless one properly accounts for strong psychological biases that affect the opinions of experts with respect to their own papers vs. those of others. First, researchers tend to largely prefer their own publications to the most cited papers in their field of research. Second, there is only a mild positive correlation between the number of citations of top-cited papers in given research areas and expert preference in pairwise comparisons. This also applies to pairs of papers with several orders of magnitude differences in their total number of accumulated citations. However, when researchers were asked to choose among pairs of their own papers, thus eliminating the bias favouring one's own papers over those of others, they did systematically prefer the most cited article. We conclude that, when scientists have full information and are making unbiased choices, expert opinion on impact is congruent with citation numbers.
  • Human history has been marked by social instability and conflict, often driven by the irreconcilability of opposing sets of beliefs, ideologies, and religious dogmas. The dynamics of belief systems has been studied mainly from two distinct perspectives, namely how cognitive biases lead to individual belief rigidity and how social influence leads to social conformity. Here we propose a unifying framework that connects cognitive and social forces together in order to study the dynamics of societal belief evolution. Each individual is endowed with a network of interacting beliefs that evolves through interaction with other individuals in a social network. The adoption of beliefs is affected by both internal coherence and social conformity. Our framework explains how social instabilities can arise in otherwise homogeneous populations, how small numbers of zealots with highly coherent beliefs can overturn societal consensus, and how belief rigidity protects fringe groups and cults against invasion from mainstream beliefs, allowing them to persist and even thrive in larger societies. Our results suggest that strong consensus may be insufficient to guarantee social stability, that the cognitive coherence of belief-systems is vital in determining their ability to spread, and that coherent belief-systems may pose a serious problem for resolving social polarization, due to their ability to prevent consensus even under high levels of social exposure. We therefore argue that the inclusion of cognitive factors into a social model is crucial in providing a more complete picture of collective human dynamics.
  • Most individuals in social networks experience a so-called Friendship Paradox: they are less popular than their friends on average. This effect may explain recent findings that widespread social network media use leads to reduced happiness. However, the relation between popularity and happiness is poorly understood. A Friendship Paradox does not necessarily imply a Happiness Paradox where most individuals are less happy than their friends. Here we report the first direct observation of a significant Happiness Paradox in a large-scale online social network of $39,110$ Twitter users. Our results reveal that popular individuals are indeed happier and that a majority of individuals experience a significant Happiness Paradox. The magnitude of the latter effect is shaped by complex interactions between individual popularity, happiness, and the fact that users cluster assortatively by level of happiness. Our results indicate that the topology of online social networks and the distribution of happiness in some populations can cause widespread psycho-social effects that affect the well-being of billions of individuals.
  • Traditional fact checking by expert journalists cannot keep up with the enormous volume of information that is now generated online. Computational fact checking may significantly enhance our ability to evaluate the veracity of dubious information. Here we show that the complexities of human fact checking can be approximated quite well by finding the shortest path between concept nodes under properly defined semantic proximity metrics on knowledge graphs. Framed as a network problem, this approach is feasible with efficient computational techniques. We evaluate this approach by examining tens of thousands of claims related to history, entertainment, geography, and biographical information using a public knowledge graph extracted from Wikipedia. Statements independently known to be true consistently receive higher support via our method than do false ones. These findings represent a significant step toward scalable computational fact-checking methods that may one day mitigate the spread of harmful misinformation.
  • Economies are instances of complex socio-technical systems that are shaped by the interactions of large numbers of individuals. The individual behavior and decision-making of consumer agents is determined by complex psychological dynamics that include their own assessment of present and future economic conditions as well as those of others, potentially leading to feedback loops that affect the macroscopic state of the economic system. We propose that the large-scale interactions of a nation's citizens with its online resources can reveal the complex dynamics of their collective psychology, including their assessment of future system states. Here we introduce a behavioral index of Chinese Consumer Confidence (C3I) that computationally relates large-scale online search behavior recorded by Google Trends data to the macroscopic variable of consumer confidence. Our results indicate that such computational indices may reveal the components and complex dynamics of consumer psychology as a collective socio-economic phenomenon, potentially leading to improved and more refined economic forecasting.
  • Public agencies like the U.S. National Science Foundation (NSF) and the National Institutes of Health (NIH) award tens of billions of dollars in annual science funding. How can this money be distributed as efficiently as possible to best promote scientific innovation and productivity? The present system relies primarily on peer review of project proposals. In 2010 alone, NSF convened more than 15,000 scientists to review 55,542 proposals. Although considered the scientific gold standard, peer review requires significant overhead costs, and may be subject to biases, inconsistencies, and oversights. We investigate a class of funding models in which all participants receive an equal portion of yearly funding, but are then required to anonymously donate a fraction of their funding to peers. The funding thus flows from one participant to the next, each acting as if he or she were a funding agency themselves. Here we show through a simulation conducted over large-scale citation data (37M articles, 770M citations) that such a distributed system for science may yield funding patterns similar to existing NIH and NSF distributions, but may do so at much lower overhead while exhibiting a range of other desirable features. Self-correcting mechanisms in scientific peer evaluation can yield an efficient and fair distribution of funding. The proposed model can be applied in many situations in which top-down or bottom-up allocation of public resources is either impractical or undesirable, e.g. public investments, distribution chains, and shared resource management.
  • We analyze the online response to the preprint publication of a cohort of 4,606 scientific articles submitted to the preprint database arXiv.org between October 2010 and May 2011. We study three forms of responses to these preprints: downloads on the arXiv.org site, mentions on the social media site Twitter, and early citations in the scholarly record. We perform two analyses. First, we analyze the delay and time span of article downloads and Twitter mentions following submission, to understand the temporal configuration of these reactions and whether one precedes or follows the other. Second, we run regression and correlation tests to investigate the relationship between Twitter mentions, arXiv downloads and article citations. We find that Twitter mentions and arXiv downloads of scholarly articles follow two distinct temporal patterns of activity, with Twitter mentions having shorter delays and narrower time spans than arXiv downloads. We also find that the volume of Twitter mentions is statistically correlated with arXiv downloads and early citations just months after the publication of a preprint, with a possible bias that favors highly mentioned articles.
  • Users frequently express their information needs by means of short and general queries that are difficult for ranking algorithms to interpret correctly. However, users' social contexts can offer important additional information about their information needs which can be leveraged by ranking algorithms to provide augmented, personalized results. Existing methods mostly rely on users' individual behavioral data such as clickstream and log data, but as a result suffer from data sparsity and privacy issues. Here, we propose a Community Tweets Voting Model (CTVM) to re-rank Google and Yahoo news search results on the basis of open, large-scale Twitter community data. Experimental results show that CTVM outperforms baseline rankings from Google and Yahoo for certain online communities. We propose an application scenario of CTVM and provide an agenda for further research.
  • Financial market prediction on the basis of online sentiment tracking has drawn a lot of attention recently. However, most results in this emerging domain rely on a unique, particular combination of data sets and sentiment tracking tools. This makes it difficult to disambiguate measurement and instrument effects from factors that are actually involved in the apparent relation between online sentiment and market values. In this paper, we survey a range of online data sets (Twitter feeds, news headlines, and volumes of Google search queries) and sentiment tracking methods (Twitter Investor Sentiment, Negative News Sentiment and Tweet & Google Search volumes of financial terms), and compare their value for financial prediction of market indices such as the Dow Jones Industrial Average, trading volumes, and market volatility (VIX), as well as gold prices. We also compare the predictive power of traditional investor sentiment survey data, i.e. Investor Intelligence and Daily Sentiment Index, against those of the mentioned set of online sentiment indicators. Our results show that traditional surveys of Investor Intelligence are lagging indicators of the financial markets. However, weekly Google Insight Search volumes on financial search queries do have predictive value. An indicator of Twitter Investor Sentiment and the frequency of occurrence of financial terms on Twitter in the previous 1-2 days are also found to be highly statistically significant predictors of daily market log return. Survey sentiment indicators are, however, found not to be statistically significant predictors of financial market values, once we control for all other mood indicators as well as the VIX.
  • Social networks tend to disproportionately favor connections between individuals with either similar or dissimilar characteristics. This propensity, referred to as assortative mixing or homophily, is expressed as the correlation between attribute values of nearest neighbour vertices in a graph. Recent results indicate that beyond demographic features such as age, sex and race, even psychological states such as "loneliness" can be assortative in a social network. In spite of the increasing societal importance of online social networks it is unknown whether assortative mixing of psychological states takes place in situations where social ties are mediated solely by online networking services in the absence of physical contact. Here, we show that general happiness or Subjective Well-Being (SWB) of Twitter users, as measured from a 6-month record of their individual tweets, is indeed assortative across the Twitter social network. To our knowledge this is the first result that shows assortative mixing in online networks at the level of SWB. Our results imply that online social networks may be equally subject to the social mechanisms that cause assortative mixing in real social networks and that such assortative mixing takes place at the level of SWB. Given the increasing prevalence of online social networks, their propensity to connect users with similar levels of SWB may be an important instrument in better understanding how both positive and negative sentiments spread through online social ties. Future research may focus on how event-specific mood states can propagate and influence user behavior in "real life".
  • Scholarly usage data provides unique opportunities to address the known shortcomings of citation analysis. However, the collection, processing and analysis of usage data remain an area of active research. This article provides a review of the state-of-the-art in usage-based informetrics, i.e. the use of usage data to study the scholarly process.
  • Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e., can societies experience mood states that affect their collective decision making? By extension, is the public mood correlated with, or even predictive of, economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated with the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public's response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Absolute Percentage Error by more than 6%.
  • Microblogging is a form of online communication by which users broadcast brief text updates, also known as tweets, to the public or a selected circle of contacts. A variegated mosaic of microblogging uses has emerged since the launch of Twitter in 2006: daily chatter, conversation, information sharing, and news commentary, among others. Regardless of their content and intended use, tweets often convey pertinent information about their author's mood status. As such, tweets can be regarded as temporally-authentic microscopic instantiations of public mood state. In this article, we perform a sentiment analysis of all public tweets broadcast by Twitter users between August 1 and December 20, 2008. For every day in the timeline, we extract six dimensions of mood (tension, depression, anger, vigor, fatigue, confusion) using an extended version of the Profile of Mood States (POMS), a well-established psychometric instrument. We compare our results to fluctuations recorded by stock market and crude oil price indices and major events in media and popular culture, such as the U.S. Presidential Election of November 4, 2008 and Thanksgiving Day. We find that events in the social, political, cultural and economic sphere do have a significant, immediate and highly specific effect on the various dimensions of public mood. We speculate that large scale analyses of mood can provide a solid platform to model collective emotive trends in terms of their predictive value with regard to existing social as well as economic indicators.
  • The impact of scientific publications has traditionally been expressed in terms of citation counts. However, scientific activity has moved online over the past decade. To better capture scientific impact in the digital era, a variety of new impact measures has been proposed on the basis of social network analysis and usage log data. Here we investigate how these new measures relate to each other, and how accurately and completely they express scientific impact. We performed a principal component analysis of the rankings produced by 39 existing and proposed measures of scholarly impact that were calculated on the basis of both citation and usage log data. Our results indicate that the notion of scientific impact is a multi-dimensional construct that cannot be adequately measured by any single indicator, although some measures are more suitable than others. The commonly used citation Impact Factor is not positioned at the core of this construct, but at its periphery, and should thus be used with caution.
  • In spite of its tremendous value, metadata is generally sparse and incomplete, thereby hampering the effectiveness of digital information services. Many of the existing mechanisms for the automated creation of metadata rely primarily on content analysis which can be costly and inefficient. The automatic metadata generation system proposed in this article leverages resource relationships generated from existing metadata as a medium for propagation from metadata-rich to metadata-poor resources. Because of its independence from content analysis, it can be applied to a wide variety of resource media types and is shown to be computationally inexpensive. The proposed method operates through two distinct phases. Occurrence and co-occurrence algorithms first generate an associative network of repository resources leveraging existing repository metadata. Second, using the associative network as a substrate, metadata associated with metadata-rich resources is propagated to metadata-poor resources by means of a discrete-form spreading activation algorithm. This article discusses the general framework for building associative networks, an algorithm for disseminating metadata through such networks, and the results of an experiment and validation of the proposed method using a standard bibliographic dataset.
  • The peer-review process is the most widely accepted certification mechanism for officially accepting the written results of researchers within the scientific community. An essential component of peer-review is the identification of competent referees to review a submitted manuscript. This article presents an algorithm to automatically determine the most appropriate reviewers for a manuscript by way of a co-authorship network data structure and a relative-rank particle-swarm algorithm. This approach is novel in that it is not limited to a pre-selected set of referees, is computationally efficient, requires no human intervention, and, in some instances, can automatically identify conflict of interest situations. A useful application of this algorithm would be in open commentary peer-review systems because it provides a weighting for each referee with respect to their expertise in the domain of a manuscript. The algorithm is validated using referee bid data from the 2005 Joint Conference on Digital Libraries.
  • Scholarly usage data holds the potential to be used as a tool to study the dynamics of scholarship in real time, and to form the basis for the definition of novel metrics of scholarly impact. However, the formal groundwork to reliably and validly exploit usage data is lacking, and the exact nature, meaning and applicability of usage-based metrics are poorly understood. The MESUR project funded by the Andrew W. Mellon Foundation constitutes a systematic effort to define, validate and cross-validate a range of usage-based metrics of scholarly impact. MESUR has collected nearly 1 billion usage events as well as all associated bibliographic and citation data from significant publishers, aggregators and institutional consortia to construct a large-scale usage data reference set. This paper describes some major challenges related to aggregating and processing usage data, and discusses preliminary results obtained from analyzing the MESUR reference data set. The results confirm the intrinsic value of scholarly usage data, and support the feasibility of reliable and valid usage-based metrics of scholarly impact.
  • Large scale surveys of public mood are costly and often impractical to perform. However, the web is awash with material indicative of public mood such as blogs, emails, and web queries. Inexpensive content analysis on such extensive corpora can be used to assess public mood fluctuations. The work presented here is concerned with the analysis of the public mood towards the future. Using an extension of the Profile of Mood States questionnaire, we have extracted mood indicators from 10,741 emails submitted in 2006 to futureme.org, a web service that allows its users to send themselves emails to be delivered at a later date. Our results indicate long-term optimism toward the future, but medium-term apprehension and confusion.
  • Many systems can be described in terms of networks of discrete elements and their various relationships to one another. A semantic network, or multi-relational network, is a directed labeled graph consisting of a heterogeneous set of entities connected by a heterogeneous set of relationships. Semantic networks serve as a promising general-purpose modeling substrate for complex systems. Various standardized formats and tools are now available to support practical, large-scale semantic network models. First, the Resource Description Framework (RDF) offers a standardized semantic network data model that can be further formalized by ontology modeling languages such as RDF Schema (RDFS) and the Web Ontology Language (OWL). Second, the recent introduction of highly performant triple-stores (i.e. semantic network databases) allows semantic network models on the order of $10^9$ edges to be efficiently stored and manipulated. RDF and its related technologies are currently used extensively in the domains of computer science, digital library science, and the biological sciences. This article will provide an introduction to RDF/RDFS/OWL and an examination of its suitability to model discrete element complex systems.
  • Semantic network research has seen a resurgence from its early history in the cognitive sciences with the inception of the Semantic Web initiative. The Semantic Web effort has brought forth an array of technologies that support the encoding, storage, and querying of the semantic network data structure at the world stage. Currently, the popular conception of the Semantic Web is that of a data modeling medium where real and conceptual entities are related in semantically meaningful ways. However, new models have emerged that explicitly encode procedural information within the semantic network substrate. With these new technologies, the Semantic Web has evolved from a data modeling medium to a computational medium. This article provides a classification of existing computational modeling efforts and the requirements of supporting technologies that will aid in the further growth of this burgeoning domain.
  • There exist ample demonstrations that indicators of scholarly impact analogous to the citation-based ISI Impact Factor can be derived from usage data. However, in contrast to the ISI IF, which is based on citation data generated by the global community of scholarly authors, so far usage can only be practically recorded at a local level, leading to community-specific assessments of scholarly impact that are difficult to generalize to the global scholarly community. We define a journal Usage Impact Factor which mimics the definition of Thomson Scientific's ISI Impact Factor. Usage Impact Factor rankings are calculated on the basis of a large-scale usage data set recorded for the California State University system from 2003 to 2005. The resulting journal rankings are then compared to Thomson Scientific's ISI Impact Factor which is used as a baseline indicator of general impact. Our results indicate that impact as derived from California State University usage reflects the particular scientific and demographic characteristics of its communities.
  • The peer-review process, in its present form, has been repeatedly criticized. Of the many critiques ranging from publication delays to referee bias, this paper will focus specifically on the issue of how submitted manuscripts are distributed to qualified referees. Unqualified referees, without the proper knowledge of a manuscript's domain, may reject a perfectly valid study or, potentially more damaging, unknowingly accept a faulty or fraudulent result. In this paper, referee competence is analyzed with respect to referee bid data collected from the 2005 Joint Conference on Digital Libraries (JCDL). The analysis of the referee bid behavior provides a validation of the intuition that referees are bidding on conference submissions with regard to the subject domain of the submission. Unfortunately, this relationship is not strong and therefore suggests that there exist other factors beyond subject domain that may be influencing referees to bid for particular submissions.
  • Although recording of usage data is common in scholarly information services, its exploitation for the creation of value-added services remains limited due to concerns regarding, among others, user privacy, data validity, and the lack of accepted standards for the representation, sharing and aggregation of usage data. This paper presents a technical, standards-based architecture for sharing usage information, which we have designed and implemented. In this architecture, OpenURL-compliant linking servers aggregate usage information of a specific user community as it navigates the distributed information environment that it has access to. This usage information is made OAI-PMH harvestable so that usage information exposed by many linking servers can be aggregated to facilitate the creation of value-added services with a reach beyond that of a single community or a single information service. This paper also discusses issues that were encountered when implementing the proposed approach, and it presents preliminary results obtained from analyzing a usage data set containing about 3,500,000 requests aggregated by a federation of linking servers at the California State University system over a 20-month period.
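
The Friendship Paradox and Happiness Paradox described in the Twitter happiness abstract above can be illustrated with a minimal sketch. The toy graph, the happiness scores, and the function names `build_adjacency` and `paradox_fraction` are all hypothetical and chosen only for illustration; this is not the authors' data or method. The sketch counts the fraction of individuals whose own value (degree or happiness) is below the mean value of their friends.

```python
from statistics import mean

# Hypothetical toy friendship graph (undirected); not data from the study.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]

# Hypothetical happiness scores in [0, 1], invented for illustration only.
happiness = {"a": 0.9, "b": 0.5, "c": 0.6, "d": 0.4, "e": 0.3}


def build_adjacency(edge_list):
    """Map every node to the set of its neighbours."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def paradox_fraction(adj, value):
    """Fraction of nodes whose value is below the mean value of their neighbours."""
    below = [
        node
        for node, nbrs in adj.items()
        if nbrs and value[node] < mean(value[m] for m in nbrs)
    ]
    return len(below) / len(adj)


adj = build_adjacency(edges)
degree = {node: len(nbrs) for node, nbrs in adj.items()}

# Friendship Paradox: compare own degree with friends' mean degree.
friendship_fraction = paradox_fraction(adj, degree)
# Happiness Paradox: compare own happiness with friends' mean happiness.
happiness_fraction = paradox_fraction(adj, happiness)

print(friendship_fraction, happiness_fraction)
```

In this toy graph three of five nodes are less popular than their friends on average, and four of five are less happy. The two fractions differ, which echoes the abstract's point that a Friendship Paradox does not by itself determine the size of a Happiness Paradox.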
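
The assortativity abstract above describes assortative mixing as the correlation between attribute values of neighbouring vertices. A minimal sketch of that idea follows, assuming an undirected graph in which each edge is counted once in each direction and a plain Pearson correlation over the resulting value pairs. The function name `attribute_assortativity`, the example graphs, and the scores are hypothetical and are not taken from the study.

```python
from math import sqrt


def attribute_assortativity(edges, value):
    """Pearson correlation of an attribute across the two ends of each edge.

    Each undirected edge (u, v) contributes the pairs (u, v) and (v, u),
    so the measure is symmetric in the two endpoints.
    """
    xs, ys = [], []
    for u, v in edges:
        xs += [value[u], value[v]]
        ys += [value[v], value[u]]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical well-being scores: two "happy" and two "unhappy" users.
swb = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 0.0}

# Ties only between users with equal scores: perfectly assortative.
similar_ties = [("a", "b"), ("c", "d")]
# Ties only between users with different scores: perfectly disassortative.
dissimilar_ties = [("a", "c"), ("b", "d")]

r_assortative = attribute_assortativity(similar_ties, swb)
r_disassortative = attribute_assortativity(dissimilar_ties, swb)
print(r_assortative, r_disassortative)
```

Values near 1 indicate that connected users have similar scores, values near -1 that they have dissimilar scores, and values near 0 that there is no linear relation. The function is undefined when all attribute values are identical, since the variance is then zero.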
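
The Usage Impact Factor abstract above states that the indicator mimics the definition of the ISI Impact Factor. As a hedged sketch only, the standard two-year Impact Factor of journal $j$ in year $y$ and a usage analogue could be written as follows, where $C_y$, $U_y$, and $N$ are notation assumed here rather than taken from the paper: $C_y(j, t)$ is the number of citations received in year $y$ by articles that $j$ published in year $t$, $U_y(j, t)$ is the corresponding number of recorded usage events, and $N_t(j)$ is the number of articles $j$ published in year $t$.

```latex
\mathrm{IF}_y(j)  = \frac{C_y(j, y-1) + C_y(j, y-2)}{N_{y-1}(j) + N_{y-2}(j)},
\qquad
\mathrm{UIF}_y(j) = \frac{U_y(j, y-1) + U_y(j, y-2)}{N_{y-1}(j) + N_{y-2}(j)}
```

Whether the paper uses exactly this two-year window for usage is an assumption; the abstract only says that the definition mirrors that of the citation-based indicator.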