• We consider the problem of graph matchability in non-identically distributed networks. In a general class of edge-independent networks, we demonstrate that graph matchability can be lost with high probability when matching the networks directly. We further demonstrate that under mild model assumptions, matchability is almost perfectly recovered by centering the networks using Universal Singular Value Thresholding before matching. These theoretical results are then demonstrated in both real and synthetic simulation settings. We also recover analogous core-matchability results in a very general core-junk network model, wherein some vertices do not correspond between the graph pair.
  • Analyzing massive complex networks yields promising insights about our everyday lives. Building scalable algorithms to do so is a challenging task that requires a careful analysis and an extensive evaluation. However, engineering such algorithms is often hindered by the scarcity of publicly~available~datasets. Network generators serve as a tool to alleviate this problem by providing synthetic instances with controllable parameters. However, many network generators fail to provide instances on a massive scale due to their sequential nature or resource constraints. Additionally, truly scalable network generators are few and often limited in their realism. In this work, we present novel generators for a variety of network models that are frequently used as benchmarks. By making use of pseudorandomization and divide-and-conquer schemes, our generators follow a communication-free paradigm. The resulting generators are thus embarrassingly parallel and have a near optimal scaling behavior. This allows us to generate instances of up to $2^{43}$ vertices and $2^{47}$ edges in less than 22 minutes on 32768 cores. Therefore, our generators allow new graph families to be used on an unprecedented scale.
  • This paper introduces a theoretical framework for the analysis and control of the stochastic susceptible-infected-removed (SIR) spreading process over a network of heterogeneous agents. In our analysis, we analyze the exact networked Markov process describing the SIR model, without resorting to mean-field approximations, and introduce a convex optimization framework to find an efficient allocation of resources to contain the expected number of accumulated infections over time. Numerical simulations are presented to illustrate the effectiveness of the obtained results.
  • Crowd employment is a new form of short-term and flexible employment which has emerged during the past decade. In order to understand this new form of employment, it is crucial to illuminate the underlying motivations of the workforce involved in it. This paper introduces the Multidimensional Crowdworker Motivation Scale (MCMS), a scale for measuring the motivation of crowdworkers on micro-task platforms. The MCMS is theoretically grounded in self-determination theory and tailored specifically to the context of paid crowdsourced micro-labor. The scale measures the motivation of crowdworkers along six motivational dimensions, ranging from amotivation to intrinsic motivation. We validated the MCMS on data collected in ten countries and three income groups. Factor analyses demonstrated that the MCMS's six dimensions showed good model fit, validity, and reliability. Furthermore, our measurement invariance tests showed that motivations measured with the MCMS are comparable across countries and income groups, and we present a first cross-country comparison of crowdworker motivations. This work constitutes an important first step towards understanding the motivations of the international crowd workforce.
  • We study the problem of inferring network topology from information cascades, in which the amount of time taken for information to diffuse across an edge in the network follows an unknown distribution. Unlike previous studies, which assume knowledge of these distributions, we only require that diffusion along different edges in the network be independent together with limited moment information (e.g., the means). We introduce the concept of a separating vertex set for a graph, which is a set of vertices in which for any two given distinct vertices of the graph, there exists a vertex whose distance to them are different. We show that a necessary condition for reconstructing a tree perfectly using distance information between pairs of vertices is given by the size of an observed separating vertex set. We then propose an algorithm to recover the tree structure using infection times, whose differences have means corresponding to the distance between two vertices. To improve the accuracy of our algorithm, we propose the concept of redundant vertices, which allows us to perform averaging to better estimate the distance between two vertices. Though the theory is developed mainly for tree networks, we demonstrate how the algorithm can be extended heuristically to general graphs. Simulations using synthetic and real networks, and experiments using real-world data suggest that our proposed algorithm performs better than some current state-of-the-art network reconstruction methods.
  • The classic Alternating Direction Method of Multipliers (ADMM) is a popular framework to solve linear-equality constrained problems. In this paper, we extend the ADMM naturally to nonlinear equality-constrained problems, called neADMM. The difficulty of neADMM is to solve nonconvex subproblems. We provide globally optimal solutions to them in two important applications. Experiments on synthetic and real-world datasets demonstrate excellent performance and scalability of our proposed neADMM over existing state-of-the-start methods.
  • Although viral spreading processes taking place in networks are often analyzed using Markovian models in which both the transmission and the recovery times follow exponential distributions, empirical studies show that, in many real scenarios, the distribution of these times are not necessarily exponential. To overcome this limitation, we first introduce a generalized susceptible-infected-susceptible (SIS) spreading model that allows transmission and recovery times to follow phase-type distributions. In this context, we derive a lower bound on the exponential decay rate towards the infection-free equilibrium of the spreading model without relying on mean-field approximations. Based on our results, we illustrate how the particular shape of the transmission/recovery distribution influences the exponential rate of convergence towards the equilibrium.
  • In many domain applications, a continuous timeline of human locations is critical; for example for understanding possible locations where a disease may spread, or the flow of traffic. While data sources such as GPS trackers or Call Data Records are temporally-rich, they are expensive, often not publicly available or garnered only in select locations, restricting their wide use. Conversely, geo-located social media data are publicly and freely available, but present challenges especially for full timeline inference due to their sparse nature. We propose a stochastic framework, Intermediate Location Computing (ILC) which uses prior knowledge about human mobility patterns to predict every missing location from an individual's social media timeline. We compare ILC with a state-of-the-art RNN baseline as well as methods that are optimized for next-location prediction only. For three major cities, ILC predicts the top 1 location for all missing locations in a timeline, at 1 and 2-hour resolution, with up to 77.2% accuracy (up to 6% better accuracy than all compared methods). Specifically, ILC also outperforms the RNN in settings of low data; both cases of very small number of users (under 50), as well as settings with more users, but with sparser timelines. In general, the RNN model needs a higher number of users to achieve the same performance as ILC. Overall, this work illustrates the tradeoff between prior knowledge of heuristics and more data, for an important societal problem of filling in entire timelines using freely available, but sparse social media data.
  • As a powerful representation paradigm for networked and multi-typed data, the heterogeneous information network (HIN) is ubiquitous. Meanwhile, defining proper relevance measures has always been a fundamental problem and of great pragmatic importance for network mining tasks. Inspired by our probabilistic interpretation of existing path-based relevance measures, we propose to study HIN relevance from a probabilistic perspective. We also identify, from real-world data, and propose to model cross-meta-path synergy, which is a characteristic important for defining path-based HIN relevance and has not been modeled by existing methods. A generative model is established to derive a novel path-based relevance measure, which is data-driven and tailored for each HIN. We develop an inference algorithm to find the maximum a posteriori (MAP) estimate of the model parameters, which entails non-trivial tricks. Experiments on two real-world datasets demonstrate the effectiveness of the proposed model and relevance measure.
  • Collective, especially group-based, managerial decision making is crucial in organizations. Using an evolutionary theoretic approach to collective decision making, agent-based simulations were conducted to investigate how human collective decision making would be affected by the agents' diversity in problem understanding and/or behavior in discussion, as well as by their social network structure. Simulation results indicated that groups with consistent problem understanding tended to produce higher utility values of ideas and displayed better decision convergence, but only if there was no group-level bias in collective problem understanding. Simulation results also indicated the importance of balance between selection-oriented (i.e., exploitative) and variation-oriented (i.e., explorative) behaviors in discussion to achieve quality final decisions. Expanding the group size and introducing non-trivial social network structure generally improved the quality of ideas at the cost of decision convergence. Simulations with different social network topologies revealed collective decision making on small-world networks with high local clustering tended to achieve highest decision quality more often than on random or scale-free networks. Implications of this evolutionary theory and simulation approach for future managerial research on collective, group, and multi-level decision making are discussed.
  • Web-based services often run randomized experiments to improve their products. A popular way to run these experiments is to use geographical regions as units of experimentation, since this does not require tracking of individual users or browser cookies. Since users may issue queries from multiple geographical locations, geo-regions cannot be considered independent and interference may be present in the experiment. In this paper, we study this problem, and first present GeoCUTS, a novel algorithm that forms geographical clusters to minimize interference while preserving balance in cluster size. We use a random sample of anonymized traffic from Google Search to form a graph representing user movements, then construct a geographically coherent clustering of the graph. Our main technical contribution is a statistical framework to measure the effectiveness of clusterings. Furthermore, we perform empirical evaluations showing that the performance of GeoCUTS is comparable to hand-crafted geo-regions with respect to both novel and existing metrics.
  • Modern society depends on the flow of information over online social networks, and users of popular platforms generate significant behavioral data about themselves and their social ties. However, it remains unclear what fundamental limits exist when using these data to predict the activities and interests of individuals, and to what accuracy such predictions can be made using an individual's social ties. Here we show that 95% of the potential predictive accuracy for an individual is achievable using their social ties only, without requiring that individual's data. We use information theoretic tools to estimate the predictive information within the writings of Twitter users, providing an upper bound on the available predictive information that holds for any predictive or machine learning methods. As few as 8-9 of an individual's contacts are sufficient to obtain predictability comparable to that of the individual alone. Distinct temporal and social effects are visible by measuring information flow along social ties, allowing us to better study the dynamics of online activity. Our results have distinct privacy implications: information is so strongly embedded in a social network that in principle one can profile an individual from their available social ties even when the individual forgoes the platform completely.
  • We sketch the history of spectral ranking, a general umbrella name for techniques that apply the theory of linear maps (in particular, eigenvalues and eigenvectors) to matrices that do not represent geometric transformations, but rather some kind of relationship between entities. Albeit recently made famous by the ample press coverage of Google's PageRank algorithm, spectral ranking was devised more than a century ago, and has been studied in tournament ranking, psychology, social sciences, bibliometrics, economy and choice theory. We describe the contribution given by previous scholars in precise and modern mathematical terms: along the way, we show how to express in a general way damped rankings, such as Katz's index, as dominant eigenvectors of perturbed matrices, and then use results on the Drazin inverse to go back to the dominant eigenvectors by a limit process. The result suggests a regularized definition of spectral ranking that yields for a general matrix a unique vector depending on a boundary condition.
  • Centrality is widely recognized as one of the most critical measures to provide insight in the structure and function of complex networks. While various centrality measures have been proposed for single-layer networks, a general framework for studying centrality in multilayer networks (i.e., multicentrality) is still lacking. In this study, a tensor-based framework is introduced to study eigenvector multicentrality, which enables the quantification of the impact of interlayer influence on multicentrality, providing a systematic way to describe how multicentrality propagates across different layers. This framework can leverage prior knowledge about the interplay among layers to better characterize multicentrality for varying scenarios. Two interesting cases are presented to illustrate how to model multilayer influence by choosing appropriate functions of interlayer influence and design algorithms to calculate eigenvector multicentrality. This framework is applied to analyze several empirical multilayer networks, and the results corroborate that it can quantify the influence among layers and multicentrality of nodes effectively.
  • In this paper, we study the Cost-aware Target Viral Marketing (CTVM) problem, a generalization of Influence Maximization (IM). CTVM asks for the most cost-effective users to influence the most relevant users. In contrast to the vast literature, we attempt to offer exact solutions. As the problem is NP-hard, thus, exact solutions are intractable, we propose TipTop, a $(1-\epsilon)$-optimal solution for arbitrary $\epsilon>0$ that scales to very large networks such as Twitter. At the heart of TipTop lies an innovative technique that reduces the number of samples as much as possible. This allows us to exactly solve CTVM on a much smaller space of generated samples using Integer Programming. Furthermore, TipTop lends a tool for researchers to benchmark their solutions against the optimal one in large-scale networks, which is currently not available.
  • We study an independent best-response dynamics on network games in which the nodes (players) decide to revise their strategies independently with some probability. We provide several bounds on the convergence time to an equilibrium as a function of this probability, the degree of the network, and the potential of the underlying games. These dynamics are somewhat more suitable for distributed environments than the classical better- and best-response dynamics where players revise their strategies "sequentially", i.e., no two players revise their strategies simultaneously.
  • Various models have been recently proposed to reflect and predict different properties of complex networks. However, the community structure, which is one of the most important properties, is not well studied and modeled. In this paper, we suggest a principle called "preferential placement", which allows to model a realistic clustering structure. We provide an extensive empirical analysis of the obtained structure as well as some theoretical results.
  • Recent progress in applying complex network theory to problems in quantum information has resulted in a beneficial crossover. Complex network methods have successfully been applied to transport and entanglement models while information physics is setting the stage for a theory of complex systems with quantum information-inspired methods. Novel quantum induced effects have been predicted in random graphs---where edges represent entangled links---and quantum computer algorithms have been proposed to offer enhancement for several network problems. Here we review the results at the cutting edge, pinpointing the similarities and the differences found at the intersection of these two fields.
  • Communities are an important feature of social networks. In fact, it seems that communities are necessary for a social network to be efficient. However, there exist very few formal studies of the actual role of communities in social networks, how they emerge, and how they are structured. The goal of this paper is to propose a mathematical model to study communities in social networks. For this, we consider a particular case of a social network, namely information networks. We assume that there is a population of agents who are interested in obtaining content. Agents differ in the type of content they are interested in. The goal of agents is to form communities in order to maximize their utility for obtaining and producing content. We use this model to characterize the structure of communities that emerge in this setting. While the proposed model is very simple, the obtained results suggest that it indeed is able to capture key properties of information communities.
  • We investigate a scalable $M$-channel critically sampled filter bank for graph signals, where each of the $M$ filters is supported on a different subband of the graph Laplacian spectrum. For analysis, the graph signal is filtered on each subband and downsampled on a corresponding set of vertices. However, the classical synthesis filters are replaced with interpolation operators. For small graphs, we use a full eigendecomposition of the graph Laplacian to partition the graph vertices such that the $m^{th}$ set comprises a uniqueness set for signals supported on the $m^{th}$ subband. The resulting transform is critically sampled, the dictionary atoms are orthogonal to those supported on different bands, and graph signals are perfectly reconstructable from their analysis coefficients. We also investigate fast versions of the proposed transform that scale efficiently for large, sparse graphs. Issues that arise in this context include designing the filter bank to be more amenable to polynomial approximation, estimating the number of samples required for each band, performing non-uniform random sampling for the filtered signals on each band, and using efficient reconstruction methods. We empirically explore the joint vertex-frequency localization of the dictionary atoms, the sparsity of the analysis coefficients for different classes of signals, the reconstruction error resulting from the numerical approximations, and the ability of the proposed transform to compress piecewise-smooth graph signals. The proposed filter bank also yields a fast, approximate graph Fourier transform with a coarse resolution in the spectral domain.
  • Positioning data offer a remarkable source of information to analyze crowds urban dynamics. However, discovering urban activity patterns from the emergent behavior of crowds involves complex system modeling. An alternative approach is to adopt computational techniques belonging to the emergent paradigm, which enables self-organization of data and allows adaptive analysis. Specifically, our approach is based on stigmergy. By using stigmergy each sample position is associated with a digital pheromone deposit, which progressively evaporates and aggregates with other deposits according to their spatiotemporal proximity. Based on this principle, we exploit positioning data to identify high density areas (hotspots) and characterize their activity over time. This characterization allows the comparison of dynamics occurring in different days, providing a similarity measure exploitable by clustering techniques. Thus, we cluster days according to their activity behavior, discovering unexpected urban activity patterns. As a case study, we analyze taxi traces in New York City during 2015.
  • We consider the problem of influence maximization in fixed networks for contagion models in an adversarial setting. The goal is to select an optimal set of nodes to seed the influence process, such that the number of influenced nodes at the conclusion of the campaign is as large as possible. We formulate the problem as a repeated game between a player and adversary, where the adversary specifies the edges along which the contagion may spread, and the player chooses sets of nodes to influence in an online fashion. We establish upper and lower bounds on the minimax pseudo-regret in both undirected and directed networks.
  • Accounting for undecided and uncertain voters is a challenging issue for predicting election results from public opinion polls. Undecided voters typify the uncertainty of swing voters in polls but are often ignored or allocated to each candidate in a simple, deterministic manner. Historically this may have been adequate because the undecided were comparatively small enough to assume that they do not affect the relative proportions of the decided voters. However, in the presence of high numbers of undecided voters, these static rules may in fact bias election predictions from election poll authors and meta-poll analysts. In this paper, we examine the effect of undecided voters in the 2016 US presidential election to the previous three presidential elections. We show there were a relatively high number of undecided voters over the campaign and on election day, and that the allocation of undecided voters in this election was not consistent with two-party proportional (or even) allocations. We find evidence that static allocation regimes are inadequate for election prediction models and that probabilistic allocations may be superior. We also estimate the bias attributable to polling agencies, often referred to as "house effects".
  • Social media are massive marketplaces where ideas and news compete for our attention. Previous studies have shown that quality is not a necessary condition for online virality and that knowledge about peer choices can distort the relationship between quality and popularity. However, these results do not explain the viral spread of low-quality information, such as the digital misinformation that threatens our democracy. We investigate quality discrimination in a stylized model of online social network, where individual agents prefer quality information, but have behavioral limitations in managing a heavy flow of information. We measure the relationship between the quality of an idea and its likelihood to become prevalent at the system level. We find that both information overload and limited attention contribute to a degradation in the market's discriminative power. A good tradeoff between discriminative power and diversity of information is possible according to the model. However, calibration with empirical data characterizing information load and finite attention in real social media reveals a weak correlation between quality and popularity of information. In these realistic conditions, the model predicts that high-quality information has little advantage over low-quality information.
  • Segregation is the separation of social groups in the physical or in the online world. Segregation discovery consists of finding contexts of segregation. In the modern digital society, discovering segregation is challenging, due to the large amount and the variety of social data. We present a tool in support of segregation discovery from relational and graph data. The SCube system builds on attributed graph clustering and frequent itemset mining. It offers to the analyst a multi-dimensional segregation data cube for exploratory data analysis. The demonstration first guides the audience through the relevant social science concepts. Then, it focuses on scenarios around case studies of gender occupational segregation. Two real and large datasets about the boards of directors of Italian and Estonian companies will be explored in search of segregation contexts. The architecture of the SCube system and its computational efficiency challenges and solutions are discussed.