-
We consider the problem of graph matchability in non-identically distributed
networks. In a general class of edge-independent networks, we demonstrate that
graph matchability can be lost with high probability when matching the networks
directly. We further demonstrate that under mild model assumptions,
matchability is almost perfectly recovered by centering the networks using
Universal Singular Value Thresholding before matching. These theoretical
results are then demonstrated in both real and synthetic simulation settings.
We also recover analogous core-matchability results in a very general core-junk
network model, wherein some vertices do not correspond between the graph pair.
-
Analyzing massive complex networks yields promising insights about our
everyday lives. Building scalable algorithms to do so is a challenging task
that requires a careful analysis and an extensive evaluation. However,
engineering such algorithms is often hindered by the scarcity of
publicly~available~datasets.
Network generators serve as a tool to alleviate this problem by providing
synthetic instances with controllable parameters. However, many network
generators fail to provide instances on a massive scale due to their sequential
nature or resource constraints. Additionally, truly scalable network generators
are few and often limited in their realism.
In this work, we present novel generators for a variety of network models
that are frequently used as benchmarks. By making use of pseudorandomization
and divide-and-conquer schemes, our generators follow a communication-free
paradigm. The resulting generators are thus embarrassingly parallel and have a
near optimal scaling behavior. This allows us to generate instances of up to
$2^{43}$ vertices and $2^{47}$ edges in less than 22 minutes on 32768 cores.
Therefore, our generators allow new graph families to be used on an
unprecedented scale.
-
This paper introduces a theoretical framework for the analysis and control of
the stochastic susceptible-infected-removed (SIR) spreading process over a
network of heterogeneous agents. In our analysis, we analyze the exact
networked Markov process describing the SIR model, without resorting to
mean-field approximations, and introduce a convex optimization framework to
find an efficient allocation of resources to contain the expected number of
accumulated infections over time. Numerical simulations are presented to
illustrate the effectiveness of the obtained results.
-
Crowd employment is a new form of short-term and flexible employment which
has emerged during the past decade. In order to understand this new form of
employment, it is crucial to illuminate the underlying motivations of the
workforce involved in it. This paper introduces the Multidimensional
Crowdworker Motivation Scale (MCMS), a scale for measuring the motivation of
crowdworkers on micro-task platforms. The MCMS is theoretically grounded in
self-determination theory and tailored specifically to the context of paid
crowdsourced micro-labor. The scale measures the motivation of crowdworkers
along six motivational dimensions, ranging from amotivation to intrinsic
motivation. We validated the MCMS on data collected in ten countries and three
income groups. Factor analyses demonstrated that the MCMS's six dimensions
showed good model fit, validity, and reliability. Furthermore, our measurement
invariance tests showed that motivations measured with the MCMS are comparable
across countries and income groups, and we present a first cross-country
comparison of crowdworker motivations. This work constitutes an important first
step towards understanding the motivations of the international crowd
workforce.
-
We study the problem of inferring network topology from information cascades,
in which the amount of time taken for information to diffuse across an edge in
the network follows an unknown distribution. Unlike previous studies, which
assume knowledge of these distributions, we only require that diffusion along
different edges in the network be independent together with limited moment
information (e.g., the means). We introduce the concept of a separating vertex
set for a graph, which is a set of vertices in which for any two given distinct
vertices of the graph, there exists a vertex whose distance to them are
different. We show that a necessary condition for reconstructing a tree
perfectly using distance information between pairs of vertices is given by the
size of an observed separating vertex set. We then propose an algorithm to
recover the tree structure using infection times, whose differences have means
corresponding to the distance between two vertices. To improve the accuracy of
our algorithm, we propose the concept of redundant vertices, which allows us to
perform averaging to better estimate the distance between two vertices. Though
the theory is developed mainly for tree networks, we demonstrate how the
algorithm can be extended heuristically to general graphs. Simulations using
synthetic and real networks, and experiments using real-world data suggest that
our proposed algorithm performs better than some current state-of-the-art
network reconstruction methods.
-
The classic Alternating Direction Method of Multipliers (ADMM) is a popular
framework to solve linear-equality constrained problems. In this paper, we
extend the ADMM naturally to nonlinear equality-constrained problems, called
neADMM. The difficulty of neADMM is to solve nonconvex subproblems. We provide
globally optimal solutions to them in two important applications. Experiments
on synthetic and real-world datasets demonstrate excellent performance and
scalability of our proposed neADMM over existing state-of-the-start methods.
-
Although viral spreading processes taking place in networks are often
analyzed using Markovian models in which both the transmission and the recovery
times follow exponential distributions, empirical studies show that, in many
real scenarios, the distribution of these times are not necessarily
exponential. To overcome this limitation, we first introduce a generalized
susceptible-infected-susceptible (SIS) spreading model that allows transmission
and recovery times to follow phase-type distributions. In this context, we
derive a lower bound on the exponential decay rate towards the infection-free
equilibrium of the spreading model without relying on mean-field
approximations. Based on our results, we illustrate how the particular shape of
the transmission/recovery distribution influences the exponential rate of
convergence towards the equilibrium.
-
In many domain applications, a continuous timeline of human locations is
critical; for example for understanding possible locations where a disease may
spread, or the flow of traffic. While data sources such as GPS trackers or Call
Data Records are temporally-rich, they are expensive, often not publicly
available or garnered only in select locations, restricting their wide use.
Conversely, geo-located social media data are publicly and freely available,
but present challenges especially for full timeline inference due to their
sparse nature. We propose a stochastic framework, Intermediate Location
Computing (ILC) which uses prior knowledge about human mobility patterns to
predict every missing location from an individual's social media timeline. We
compare ILC with a state-of-the-art RNN baseline as well as methods that are
optimized for next-location prediction only. For three major cities, ILC
predicts the top 1 location for all missing locations in a timeline, at 1 and
2-hour resolution, with up to 77.2% accuracy (up to 6% better accuracy than all
compared methods). Specifically, ILC also outperforms the RNN in settings of
low data; both cases of very small number of users (under 50), as well as
settings with more users, but with sparser timelines. In general, the RNN model
needs a higher number of users to achieve the same performance as ILC. Overall,
this work illustrates the tradeoff between prior knowledge of heuristics and
more data, for an important societal problem of filling in entire timelines
using freely available, but sparse social media data.
-
As a powerful representation paradigm for networked and multi-typed data, the
heterogeneous information network (HIN) is ubiquitous. Meanwhile, defining
proper relevance measures has always been a fundamental problem and of great
pragmatic importance for network mining tasks. Inspired by our probabilistic
interpretation of existing path-based relevance measures, we propose to study
HIN relevance from a probabilistic perspective. We also identify, from
real-world data, and propose to model cross-meta-path synergy, which is a
characteristic important for defining path-based HIN relevance and has not been
modeled by existing methods. A generative model is established to derive a
novel path-based relevance measure, which is data-driven and tailored for each
HIN. We develop an inference algorithm to find the maximum a posteriori (MAP)
estimate of the model parameters, which entails non-trivial tricks. Experiments
on two real-world datasets demonstrate the effectiveness of the proposed model
and relevance measure.
-
Collective, especially group-based, managerial decision making is crucial in
organizations. Using an evolutionary theoretic approach to collective decision
making, agent-based simulations were conducted to investigate how human
collective decision making would be affected by the agents' diversity in
problem understanding and/or behavior in discussion, as well as by their social
network structure. Simulation results indicated that groups with consistent
problem understanding tended to produce higher utility values of ideas and
displayed better decision convergence, but only if there was no group-level
bias in collective problem understanding. Simulation results also indicated the
importance of balance between selection-oriented (i.e., exploitative) and
variation-oriented (i.e., explorative) behaviors in discussion to achieve
quality final decisions. Expanding the group size and introducing non-trivial
social network structure generally improved the quality of ideas at the cost of
decision convergence. Simulations with different social network topologies
revealed collective decision making on small-world networks with high local
clustering tended to achieve highest decision quality more often than on random
or scale-free networks. Implications of this evolutionary theory and simulation
approach for future managerial research on collective, group, and multi-level
decision making are discussed.
-
Web-based services often run randomized experiments to improve their
products. A popular way to run these experiments is to use geographical regions
as units of experimentation, since this does not require tracking of individual
users or browser cookies. Since users may issue queries from multiple
geographical locations, geo-regions cannot be considered independent and
interference may be present in the experiment. In this paper, we study this
problem, and first present GeoCUTS, a novel algorithm that forms geographical
clusters to minimize interference while preserving balance in cluster size. We
use a random sample of anonymized traffic from Google Search to form a graph
representing user movements, then construct a geographically coherent
clustering of the graph. Our main technical contribution is a statistical
framework to measure the effectiveness of clusterings. Furthermore, we perform
empirical evaluations showing that the performance of GeoCUTS is comparable to
hand-crafted geo-regions with respect to both novel and existing metrics.
-
Modern society depends on the flow of information over online social
networks, and users of popular platforms generate significant behavioral data
about themselves and their social ties. However, it remains unclear what
fundamental limits exist when using these data to predict the activities and
interests of individuals, and to what accuracy such predictions can be made
using an individual's social ties. Here we show that 95% of the potential
predictive accuracy for an individual is achievable using their social ties
only, without requiring that individual's data. We use information theoretic
tools to estimate the predictive information within the writings of Twitter
users, providing an upper bound on the available predictive information that
holds for any predictive or machine learning methods. As few as 8-9 of an
individual's contacts are sufficient to obtain predictability comparable to
that of the individual alone. Distinct temporal and social effects are visible
by measuring information flow along social ties, allowing us to better study
the dynamics of online activity. Our results have distinct privacy
implications: information is so strongly embedded in a social network that in
principle one can profile an individual from their available social ties even
when the individual forgoes the platform completely.
-
We sketch the history of spectral ranking, a general umbrella name for
techniques that apply the theory of linear maps (in particular, eigenvalues and
eigenvectors) to matrices that do not represent geometric transformations, but
rather some kind of relationship between entities. Albeit recently made famous
by the ample press coverage of Google's PageRank algorithm, spectral ranking
was devised more than a century ago, and has been studied in tournament
ranking, psychology, social sciences, bibliometrics, economy and choice theory.
We describe the contribution given by previous scholars in precise and modern
mathematical terms: along the way, we show how to express in a general way
damped rankings, such as Katz's index, as dominant eigenvectors of perturbed
matrices, and then use results on the Drazin inverse to go back to the dominant
eigenvectors by a limit process. The result suggests a regularized definition
of spectral ranking that yields for a general matrix a unique vector depending
on a boundary condition.
-
Centrality is widely recognized as one of the most critical measures to
provide insight in the structure and function of complex networks. While
various centrality measures have been proposed for single-layer networks, a
general framework for studying centrality in multilayer networks (i.e.,
multicentrality) is still lacking. In this study, a tensor-based framework is
introduced to study eigenvector multicentrality, which enables the
quantification of the impact of interlayer influence on multicentrality,
providing a systematic way to describe how multicentrality propagates across
different layers. This framework can leverage prior knowledge about the
interplay among layers to better characterize multicentrality for varying
scenarios. Two interesting cases are presented to illustrate how to model
multilayer influence by choosing appropriate functions of interlayer influence
and design algorithms to calculate eigenvector multicentrality. This framework
is applied to analyze several empirical multilayer networks, and the results
corroborate that it can quantify the influence among layers and multicentrality
of nodes effectively.
-
In this paper, we study the Cost-aware Target Viral Marketing (CTVM) problem,
a generalization of Influence Maximization (IM). CTVM asks for the most
cost-effective users to influence the most relevant users. In contrast to the
vast literature, we attempt to offer exact solutions. As the problem is
NP-hard, thus, exact solutions are intractable, we propose TipTop, a
$(1-\epsilon)$-optimal solution for arbitrary $\epsilon>0$ that scales to very
large networks such as Twitter. At the heart of TipTop lies an innovative
technique that reduces the number of samples as much as possible. This allows
us to exactly solve CTVM on a much smaller space of generated samples using
Integer Programming. Furthermore, TipTop lends a tool for researchers to
benchmark their solutions against the optimal one in large-scale networks,
which is currently not available.
-
We study an independent best-response dynamics on network games in which the
nodes (players) decide to revise their strategies independently with some
probability. We provide several bounds on the convergence time to an
equilibrium as a function of this probability, the degree of the network, and
the potential of the underlying games. These dynamics are somewhat more
suitable for distributed environments than the classical better- and
best-response dynamics where players revise their strategies "sequentially",
i.e., no two players revise their strategies simultaneously.
-
Various models have been recently proposed to reflect and predict different
properties of complex networks. However, the community structure, which is one
of the most important properties, is not well studied and modeled. In this
paper, we suggest a principle called "preferential placement", which allows to
model a realistic clustering structure. We provide an extensive empirical
analysis of the obtained structure as well as some theoretical results.
-
Recent progress in applying complex network theory to problems in quantum
information has resulted in a beneficial crossover. Complex network methods
have successfully been applied to transport and entanglement models while
information physics is setting the stage for a theory of complex systems with
quantum information-inspired methods. Novel quantum induced effects have been
predicted in random graphs---where edges represent entangled links---and
quantum computer algorithms have been proposed to offer enhancement for several
network problems. Here we review the results at the cutting edge, pinpointing
the similarities and the differences found at the intersection of these two
fields.
-
Communities are an important feature of social networks. In fact, it seems
that communities are necessary for a social network to be efficient. However,
there exist very few formal studies of the actual role of communities in social
networks, how they emerge, and how they are structured. The goal of this paper
is to propose a mathematical model to study communities in social networks. For
this, we consider a particular case of a social network, namely information
networks. We assume that there is a population of agents who are interested in
obtaining content. Agents differ in the type of content they are interested in.
The goal of agents is to form communities in order to maximize their utility
for obtaining and producing content. We use this model to characterize the
structure of communities that emerge in this setting. While the proposed model
is very simple, the obtained results suggest that it indeed is able to capture
key properties of information communities.
-
We investigate a scalable $M$-channel critically sampled filter bank for
graph signals, where each of the $M$ filters is supported on a different
subband of the graph Laplacian spectrum. For analysis, the graph signal is
filtered on each subband and downsampled on a corresponding set of vertices.
However, the classical synthesis filters are replaced with interpolation
operators. For small graphs, we use a full eigendecomposition of the graph
Laplacian to partition the graph vertices such that the $m^{th}$ set comprises
a uniqueness set for signals supported on the $m^{th}$ subband. The resulting
transform is critically sampled, the dictionary atoms are orthogonal to those
supported on different bands, and graph signals are perfectly reconstructable
from their analysis coefficients. We also investigate fast versions of the
proposed transform that scale efficiently for large, sparse graphs. Issues that
arise in this context include designing the filter bank to be more amenable to
polynomial approximation, estimating the number of samples required for each
band, performing non-uniform random sampling for the filtered signals on each
band, and using efficient reconstruction methods. We empirically explore the
joint vertex-frequency localization of the dictionary atoms, the sparsity of
the analysis coefficients for different classes of signals, the reconstruction
error resulting from the numerical approximations, and the ability of the
proposed transform to compress piecewise-smooth graph signals. The proposed
filter bank also yields a fast, approximate graph Fourier transform with a
coarse resolution in the spectral domain.
-
Positioning data offer a remarkable source of information to analyze crowds
urban dynamics. However, discovering urban activity patterns from the emergent
behavior of crowds involves complex system modeling. An alternative approach is
to adopt computational techniques belonging to the emergent paradigm, which
enables self-organization of data and allows adaptive analysis. Specifically,
our approach is based on stigmergy. By using stigmergy each sample position is
associated with a digital pheromone deposit, which progressively evaporates and
aggregates with other deposits according to their spatiotemporal proximity.
Based on this principle, we exploit positioning data to identify high density
areas (hotspots) and characterize their activity over time. This
characterization allows the comparison of dynamics occurring in different days,
providing a similarity measure exploitable by clustering techniques. Thus, we
cluster days according to their activity behavior, discovering unexpected urban
activity patterns. As a case study, we analyze taxi traces in New York City
during 2015.
-
We consider the problem of influence maximization in fixed networks for
contagion models in an adversarial setting. The goal is to select an optimal
set of nodes to seed the influence process, such that the number of influenced
nodes at the conclusion of the campaign is as large as possible. We formulate
the problem as a repeated game between a player and adversary, where the
adversary specifies the edges along which the contagion may spread, and the
player chooses sets of nodes to influence in an online fashion. We establish
upper and lower bounds on the minimax pseudo-regret in both undirected and
directed networks.
-
Accounting for undecided and uncertain voters is a challenging issue for
predicting election results from public opinion polls. Undecided voters typify
the uncertainty of swing voters in polls but are often ignored or allocated to
each candidate in a simple, deterministic manner. Historically this may have
been adequate because the undecided were comparatively small enough to assume
that they do not affect the relative proportions of the decided voters.
However, in the presence of high numbers of undecided voters, these static
rules may in fact bias election predictions from election poll authors and
meta-poll analysts. In this paper, we examine the effect of undecided voters in
the 2016 US presidential election to the previous three presidential elections.
We show there were a relatively high number of undecided voters over the
campaign and on election day, and that the allocation of undecided voters in
this election was not consistent with two-party proportional (or even)
allocations. We find evidence that static allocation regimes are inadequate for
election prediction models and that probabilistic allocations may be superior.
We also estimate the bias attributable to polling agencies, often referred to
as "house effects".
-
Social media are massive marketplaces where ideas and news compete for our
attention. Previous studies have shown that quality is not a necessary
condition for online virality and that knowledge about peer choices can distort
the relationship between quality and popularity. However, these results do not
explain the viral spread of low-quality information, such as the digital
misinformation that threatens our democracy. We investigate quality
discrimination in a stylized model of online social network, where individual
agents prefer quality information, but have behavioral limitations in managing
a heavy flow of information. We measure the relationship between the quality of
an idea and its likelihood to become prevalent at the system level. We find
that both information overload and limited attention contribute to a
degradation in the market's discriminative power. A good tradeoff between
discriminative power and diversity of information is possible according to the
model. However, calibration with empirical data characterizing information load
and finite attention in real social media reveals a weak correlation between
quality and popularity of information. In these realistic conditions, the model
predicts that high-quality information has little advantage over low-quality
information.
-
Segregation is the separation of social groups in the physical or in the
online world. Segregation discovery consists of finding contexts of
segregation. In the modern digital society, discovering segregation is
challenging, due to the large amount and the variety of social data. We present
a tool in support of segregation discovery from relational and graph data. The
SCube system builds on attributed graph clustering and frequent itemset mining.
It offers to the analyst a multi-dimensional segregation data cube for
exploratory data analysis. The demonstration first guides the audience through
the relevant social science concepts. Then, it focuses on scenarios around case
studies of gender occupational segregation. Two real and large datasets about
the boards of directors of Italian and Estonian companies will be explored in
search of segregation contexts. The architecture of the SCube system and its
computational efficiency challenges and solutions are discussed.