• Estimating distributions of node characteristics (labels) such as number of connections or citizenship of users in a social network via edge and node sampling is a vital part of the study of complex networks. Due to its low cost, sampling via a random walk (RW) has been proposed as an attractive solution to this task. Most RW methods assume either that the network is undirected or that walkers can traverse edges regardless of their direction. Some RW methods have been designed for directed networks where edges coming into a node are not directly observable. In this work, we propose Directed Unbiased Frontier Sampling (DUFS), a sampling method based on a large number of coordinated walkers, each starting from a node chosen uniformly at random. It is applicable to directed networks with invisible incoming edges because it constructs, in real-time, an undirected graph consistent with the walkers trajectories, and due to the use of random jumps which prevent walkers from being trapped. DUFS generalizes previous RW methods and is suited for undirected networks and to directed networks regardless of in-edges visibility. We also propose an improved estimator of node label distributions that combines information from the initial walker locations with subsequent RW observations. We evaluate DUFS, compare it to other RW methods, investigate the impact of its parameters on estimation accuracy and provide practical guidelines for choosing them. In estimating out-degree distributions, DUFS yields significantly better estimates of the head of the distribution than other methods, while matching or exceeding estimation accuracy of the tail. Last, we show that DUFS outperforms uniform node sampling when estimating distributions of node labels of the top 10% largest degree nodes, even when sampling a node uniformly has the same cost as RW steps.
  • We investigate the rest-frame UV morphologies of a large sample of Lyman-a emitters (LAEs) from z~2 to z~6, selected in a uniform way with 16 different narrow- and medium-bands over the full COSMOS field. We use 3045 LAEs with HST coverage in a stacking analysis and find that they have M_UV~-20, below M*_UV at these redshifts. We also focus our analysis on a subsample of 780 individual galaxies with i_AB<25 for which GALFIT converges for 429 of them. The individual median size (re~1 kpc), ellipticities (slightly elongated with (b/a)~0.45), S\'ersic index (disk-like with n<2) and light concentration (comparable to that of disk or irregular galaxies, with C~2.7) of LAEs show mild evolution from z~2 to z~6. LAEs with the highest rest-frame equivalent widths (EW) are the smallest/most compact (re~0.8 kpc, compared to re~1.5 kpc for the lower EW LAEs). When stacking our samples in bins of fixed Lya luminosity and Lya EW we find evidence for redshift evolution in n and C, but not in galaxy sizes. The evolution seems to be stronger for LAEs with 25<EW<100 {\AA}. When compared to other SFGs, LAEs are found to be smaller at all redshifts. The difference between the two populations changes with redshift, from a factor of ~1 at z>5 to SFGs being a factor of ~2-4 larger than LAEs for z<2. This means that at the highest redshifts, where typical sizes approach those of LAEs, the fraction of galaxies showing Lya in emission (and with a high Lya escape fraction) should be much higher, consistent with observations.
  • We present and explore deep narrow- and medium-band data obtained with the Subaru and the Isaac Newton telescopes in the ~2 deg$^2$ COSMOS field. We use these data as an extremely wide, low-resolution (R~20-80) IFU survey to slice through the COSMOS field and obtain a large sample of ~4000 Lyman-$\alpha$ (Lya) emitters from z~2 to z~6 in 16 redshift slices (SC4K). We present new Lya luminosity functions (LFs) covering a co-moving volume of ~10$^8$Mpc$^3$. SC4K extensively complements ultra-deep surveys, jointly covering over 4 dex in Lya luminosity and revealing a global (2.5<z<6) synergy LF with $\alpha=-1.93\pm0.12$, $\log_{10}\Phi^*=-3.45^{+0.22}_{-0.29}$ Mpc$^{-3}$ and $\log_{10}L^*=42.93^{+0.15}_{-0.11}$ erg/s. The Schechter component of the Lya LF reveals a factor ~5 rise in $L^*$ and a ~7x decline in $\Phi^*$ from z~2 to z~6. The data reveal an extra power-law (or Schechter) component above L~10$^{43.3}$ erg/s at z~2.2-3.5 and we show that it is partially driven by X-ray and radio AGN, as their Lya LF resembles the excess. The power-law component vanishes and/or is below our detection limits above z>3.5, likely linked with the evolution of the AGN population. The Lya luminosity density rises by a factor ~2 from z~2 to z~3 but is then found to be roughly constant (~$1.1\times10^{40}$ erg s$^{-1}$ Mpc$^{-3}$) to z~6, despite the ~0.7 dex drop in UV luminosity density. The Lya/UV luminosity density ratio rises from $4\pm1$% to $30\pm6$% from z~2.2 to z~6. Our results imply a rise of a factor of ~2 in the global ionisation efficiency ($\xi_{\rm ion}$) and a factor ~$4\pm1$ in the Lya escape fraction from z~2 to z~6, hinting for evolution in both the typical burstiness/stellar populations and even more so in the typical ISM conditions allowing Ly$\alpha$ photons to escape.
  • A major branch of anomaly detection methods rely on dynamic networks: raw sequential data is first converted to a series of networks, then critical change points are identified in the evolving network structure. However, existing approaches use the first-order network (FON) to represent the underlying raw data, which may lose important higher-order sequence patterns, making higher-order anomalies undetectable in subsequent analysis. By replacing FON with higher-order network (HONs), we show that existing anomaly detection algorithms can better capture higher-order anomalies that may otherwise be ignored. We show that the existing HON construction algorithm cannot be used for the anomaly detection task due to the extra parameters and poor scalability; we introduce a parameter-free algorithm that constructs HON in big data sets. Using a large-scale synthetic data set with 11 billion web clickstreams, we demonstrate how the proposed method can capture variable orders of anomalies. Using a real-world taxi trajectory data, we show how the proposed method amplifies higher-order anomaly signals. Finally, we provide complexity analysis and benchmarking to show how one can incorporating higher-order dependencies with a small overhead.
  • We propose a Las Vegas transformation of Markov Chain Monte Carlo (MCMC) estimators of Restricted Boltzmann Machines (RBMs). We denote our approach Markov Chain Las Vegas (MCLV). MCLV gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has maximum number of Markov chain steps K (referred as MCLV-K). We present a MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondence and differences between LVS-K and Contrastive Divergence (CD-K), with LVS-K significantly outperforming CD-K training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.
  • Applications in various domains rely on processing graph streams, e.g., communication logs of a cloud-troubleshooting system, road-network traffic updates, and interactions on a social network. A labeled-graph stream refers to a sequence of streamed edges that form a labeled graph. Label-aware applications need to filter the graph stream before performing a graph operation. Due to the large volume and high velocity of these streams, it is often more practical to incrementally build a lossy-compressed version of the graph, and use this lossy version to approximately evaluate graph queries. Challenges arise when the queries are unknown in advance but are associated with filtering predicates based on edge labels. Surprisingly common, and especially challenging, are labeled-graph streams that have highly skewed label distributions that might also vary over time. This paper introduces Self-Balanced Graph Sketch (SBG-Sketch, for short), a graphical sketch for summarizing and querying labeled-graph streams that can cope with all these challenges. SBG-Sketch maintains synopsis for both the edge attributes (e.g., edge weight) as well as the topology of the streamed graph. SBG-Sketch allows efficient processing of graph-traversal queries, e.g., reachability queries. Experimental results over a variety of real graph streams show SBG-Sketch to reduce the estimation errors of state-of-the-art methods by up to 99%.
  • Research in statistical relational learning has produced a number of methods for learning relational models from large-scale network data. While these methods have been successfully applied in various domains, they have been developed under the unrealistic assumption of full data access. In practice, however, the data are often collected by crawling the network, due to proprietary access, limited resources, and privacy concerns. Recently, we showed that the parameter estimates for relational Bayes classifiers computed from network samples collected by existing network crawlers can be quite inaccurate, and developed a crawl-aware estimation method for such models (Yang, Ribeiro, and Neville, 2017). In this work, we extend the methodology to learning relational logistic regression models via stochastic gradient descent from partial network crawls, and show that the proposed method yields accurate parameter estimates and confidence intervals.
  • Cumulative advantage (CA) refers to the notion that accumulated resources foster the accumulation of further resources in competitions, a phenomenon that has been empirically observed in various contexts. The oldest and arguably simplest mathematical model that embodies this general principle is the P\'olya urn process, which finds applications in a myriad of problems. The original model captures the dynamics of competitions between two equally fit agents under linear CA effects, which can be readily generalized to incorporate different fitnesses and nonlinear CA effects. We study two statistics of competitions under the generalized model, namely duration (i.e., time of the last tie) and intensity (i.e., number of ties). We give rigorous mathematical characterizations of the tail distributions of both duration and intensity under the various regimes for fitness and nonlinearity, which reveal very interesting behaviors. For example, fitness superiority induces much shorter competitions in the sublinear regime while much longer competitions in the superlinear regime. Our findings can shed light on the application of P\'olya urn processes in more general contexts where fitness and nonlinearity may be present.
  • Active search (AS) on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights under a query budget. However, in most networks, nodes, topology and edge weights are all initially unknown. We introduce selective harvesting, a variant of AS where the next node to be queried must be chosen among the neighbors of the current queried node set; the available training data for deciding which node to query is restricted to the subgraph induced by the queried set (and their node attributes) and their neighbors (without any node or edge attributes). Therefore, selective harvesting is a sequential decision problem, where we must decide which node to query at each step. A classifier trained in this scenario suffers from a tunnel vision effect: without recourse to independent sampling, the urge to query promising nodes forces classifiers to gather increasingly biased training data, which we show significantly hurts the performance of AS methods and standard classifiers. We find that it is possible to collect a much larger set of targets by using multiple classifiers, not by combining their predictions as an ensemble, but switching between classifiers used at each step, as a way to ease the tunnel vision effect. We discover that switching classifiers collects more targets by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried next. This highlights an exploration, exploitation, and diversification trade-off in our problem that goes beyond the exploration and exploitation duality found in classic sequential decision problems. From these observations we propose D3TS, a method based on multi-armed bandits for non-stationary stochastic processes that enforces classifier diversity, matching or exceeding the performance of competing methods on seven real network datasets in our evaluation.
  • The goal of cluster analysis in survival data is to identify clusters that are decidedly associated with the survival outcome. Previous research has explored this problem primarily in the medical domain with relatively small datasets, but the need for such a clustering methodology could arise in other domains with large datasets, such as social networks. Concretely, we wish to identify different survival classes in a social network by clustering the users based on their lifespan in the network. In this paper, we propose a decision tree based algorithm that uses a global normalization of $p$-values to identify clusters with significantly different survival distributions. We evaluate the clusters from our model with the help of a simple survival prediction task and show that our model outperforms other competing methods.
  • Dynamic networks are increasingly being usedd to model real world datasets. A challenging task in their analysis is to detect and characterize clusters. It is useful for analyzing real-world data such as detecting evolving communities in networks. We propose a temporal clustering framework based on a set of network generative models to address this problem. We use PARAFAC decomposition to learn network models from datasets.We then use $K$-means for clustering, the Silhouette criterion to determine the number of clusters, and a similarity score to order the clusters and retain the significant ones. In order to address the time-dependent aspect of these clusters, we propose a segmentation algorithm to detect their formations, dissolutions and lifetimes. Synthetic networks with ground truth and real-world datasets are used to test our method against state-of-the-art, and the results show that our method has better performance in clustering and lifetime detection than previous methods.
  • Which song will Smith listen to next? Which restaurant will Alice go to tomorrow? Which product will John click next? These applications have in common the prediction of user trajectories that are in a constant state of flux over a hidden network (e.g. website links, geographic location). What users are doing now may be unrelated to what they will be doing in an hour from now. Mindful of these challenges we propose TribeFlow, a method designed to cope with the complex challenges of learning personalized predictive models of non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow is a general method that can perform next product recommendation, next song recommendation, next location prediction, and general arbitrary-length user trajectory prediction without domain-specific knowledge. TribeFlow is more accurate and up to 413x faster than top competitors.
  • Online social networks (OSN) contain extensive amount of information about the underlying society that is yet to be explored. One of the most feasible technique to fetch information from OSN, crawling through Application Programming Interface (API) requests, poses serious concerns over the the guarantees of the estimates. In this work, we focus on making reliable statistical inference with limited API crawls. Based on regenerative properties of the random walks, we propose an unbiased estimator for the aggregated sum of functions over edges and proved the connection between variance of the estimator and spectral gap. In order to facilitate Bayesian inference on the true value of the estimator, we derive the approximate posterior distribution of the estimate. Later the proposed ideas are validated with numerical experiments on inference problems in real-world networks.
  • The role of skill (fitness) and luck (randomness) as driving forces on the dynamics of resource accumulation in a myriad of systems have long puzzled scientists. Fueled by undisputed inequalities that emerge from actual competitions, there is a pressing need for better understanding the effects of skill and luck in resource accumulation. When such competitions are driven by externalities such as cumulative advantage (CA), the rich-get-richer effect, little is known with respect to fundamental properties such as their duration and intensity. In this work we provide a mathematical understanding of how CA exacerbates the role of luck in detriment of skill in simple and well-studied competition models. We show, for instance, that if two agents are competing for resources that arrive sequentially at each time unit, an early stroke of luck can place the less skilled in the lead for an extremely long period of time, a phenomenon we call "struggle of the fittest". In the absence of CA, the more skilled quickly prevails despite any early stroke of luck that the less skilled may have. We prove that duration of a simple skill and luck competition model exhibit power law tails when CA is present, regardless of skill difference, which is in sharp contrast to exponential tails when CA is absent. Our findings have important implications to competitions not only in complex social systems but also in contexts that leverage such models.
  • How does a new startup drive the popularity of competing websites into oblivion like Facebook famously did to MySpace? This question is of great interest to academics, technologists, and financial investors alike. In this work we exploit the singular way in which Facebook wiped out the popularity of MySpace, Hi5, Friendster, and Multiply to guide the design of a new popularity competition model. Our model provides new insights into what Nobel Laureate Herbert A. Simon called the "marketplace of attention," which we recast as the attention-activity marketplace. Our model design is further substantiated by user-level activity of 250,000 MySpace users obtained between 2004 and 2009. The resulting model not only accurately fits the observed Daily Active Users (DAU) of Facebook and its competitors but also predicts their fate four years into the future.
  • How many listens will an artist receive on a online radio? How about plays on a YouTube video? How many of these visits are new or returning users? Modeling and mining popularity dynamics of social activity has important implications for researchers, content creators and providers. We here investigate the effect of revisits (successive visits from a single user) on content popularity. Using four datasets of social activity, with up to tens of millions media objects (e.g., YouTube videos, Twitter hashtags or LastFM artists), we show the effect of revisits in the popularity evolution of such objects. Secondly, we propose the Phoenix-R model which captures the popularity dynamics of individual objects. Phoenix-R has the desired properties of being: (1) parsimonious, being based on the minimum description length principle, and achieving lower root mean squared error than state-of-the-art baselines; (2) applicable, the model is effective for predicting future popularity values of objects.
  • Preferential attachment (PA) models of network structure are widely used due to their explanatory power and conceptual simplicity. PA models are able to account for the scale-free degree distributions observed in many real-world large networks through the remarkably simple mechanism of sequentially introducing nodes that attach preferentially to high-degree nodes. The ability to efficiently generate instances from PA models is a key asset in understanding both the models themselves and the real networks that they represent. Surprisingly, little attention has been paid to the problem of efficient instance generation. In this paper, we show that the complexity of generating network instances from a PA model depends on the preference function of the model, provide efficient data structures that work under any preference function, and present empirical results from an implementation based on these data structures. We demonstrate that, by indexing growing networks with a simple augmented heap, we can implement a network generator which scales many orders of magnitude beyond existing capabilities ($10^6$ -- $10^8$ nodes). We show the utility of an efficient and general PA network generator by investigating the consequences of varying the preference functions of an existing model. We also provide "quicknet", a freely-available open-source implementation of the methods described in this work.
  • Exploring statistics of locally connected subgraph patterns (also known as network motifs) has helped researchers better understand the structure and function of biological and online social networks (OSNs). Nowadays the massive size of some critical networks -- often stored in already overloaded relational databases -- effectively limits the rate at which nodes and edges can be explored, making it a challenge to accurately discover subgraph statistics. In this work, we propose sampling methods to accurately estimate subgraph statistics from as few queried nodes as possible. We present sampling algorithms that efficiently and accurately estimate subgraph properties of massive networks. Our algorithms require no pre-computation or complete network topology information. At the same time, we provide theoretical guarantees of convergence. We perform experiments using widely known data sets, and show that for the same accuracy, our algorithms require an order of magnitude less queries (samples) than the current state-of-the-art algorithms.
  • Recommendation systems for online dating have recently attracted much attention from the research community. In this paper we proposed a two-side matching framework for online dating recommendations and design an LDA model to learn the user preferences from the observed user messaging behavior and user profile features. Experimental results using data from a large online dating website shows that two-sided matching improves significantly the rate of successful matches by as much as 45%. Finally, using simulated matchings we show that the the LDA model can correctly capture user preferences.
  • Algorithms for identifying the infection states of nodes in a network are crucial for understanding and containing infections. Often, however, only a relatively small set of nodes have a known infection state. Moreover, the length of time that each node has been infected is also unknown. This missing data -- infection state of most nodes and infection time of the unobserved infected nodes -- poses a challenge to the study of real-world cascades. In this work, we develop techniques to identify the latent infected nodes in the presence of missing infection time-and-state data. Based on the likely epidemic paths predicted by the simple susceptible-infected epidemic model, we propose a measure (Infection Betweenness) for uncovering these unknown infection states. Our experimental results using machine learning algorithms show that Infection Betweenness is the most effective feature for identifying latent infected nodes.
  • Driven by outstanding success stories of Internet startups such as Facebook and The Huffington Post, recent studies have thoroughly described their growth. These highly visible online success stories, however, overshadow an untold number of similar ventures that fail. The study of website popularity is ultimately incomplete without general mechanisms that can describe both successes and failures. In this work we present six years of the daily number of users (DAU) of twenty-two membership-based websites - encompassing online social networks, grassroots movements, online forums, and membership-only Internet stores - well balanced between successes and failures. We then propose a combination of reaction-diffusion-decay processes whose resulting equations seem not only to describe well the observed DAU time series but also provide means to roughly predict their evolution. This model allows an approximate automatic DAU-based classification of websites into self-sustainable v.s. unsustainable and whether the startup growth is mostly driven by marketing & media campaigns or word-of-mouth adoptions.
  • Online dating sites have become popular platforms for people to look for potential romantic partners. It is important to understand users' dating preferences in order to make better recommendations on potential dates. The message sending and replying actions of a user are strong indicators for what he/she is looking for in a potential date and reflect the user's actual dating preferences. We study how users' online dating behaviors correlate with various user attributes using a large real-world dateset from a major online dating site in China. Many of our results on user messaging behavior align with notions in social and evolutionary psychology: males tend to look for younger females while females put more emphasis on the socioeconomic status (e.g., income, education level) of a potential date. In addition, we observe that the geographic distance between two users and the photo count of users play an important role in their dating behaviors. Our results show that it is important to differentiate between users' true preferences and random selection. Some user behaviors in choosing attributes in a potential date may largely be a result of random selection. We also find that both males and females are more likely to reply to users whose attributes come closest to the stated preferences of the receivers, and there is significant discrepancy between a user's stated dating preference and his/her actual online dating behavior. These results can provide valuable guidelines to the design of a recommendation engine for potential dates.
  • Characterizing large online social networks (OSNs) through node querying is a challenging task. OSNs often impose severe constraints on the query rate, hence limiting the sample size to a small fraction of the total network. Various ad-hoc subgraph sampling methods have been proposed, but many of them give biased estimates and no theoretical basis on the accuracy. In this work, we focus on developing sampling methods for OSNs where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or utilize Web API available on most major OSNs for graph sampling. We show that our sampling method has provable convergence guarantees on being an unbiased estimator, and it is more accurate than current state-of-the-art methods. We characterize metrics such as node label density estimation and edge label density estimation, two of the most fundamental network characteristics from which other network characteristics can be derived. We evaluate our methods on-the-fly over several live networks using their native APIs. Our simulation studies over a variety of offline datasets show that by including neighborhood information, our method drastically (4-fold) reduces the number of samples required to achieve the same estimation accuracy of state-of-the-art methods.
  • Time-varying networks describe a wide array of systems whose constituents and interactions evolve over time. They are defined by an ordered stream of interactions between nodes, yet they are often represented in terms of a sequence of static networks, each aggregating all edges and nodes present in a time interval of size \Delta t. In this work we quantify the impact of an arbitrary \Delta t on the description of a dynamical process taking place upon a time-varying network. We focus on the elementary random walk, and put forth a simple mathematical framework that well describes the behavior observed on real datasets. The analytical description of the bias introduced by time integrating techniques represents a step forward in the correct characterization of dynamical processes on time-varying graphs.
  • With the main goal of finding bulgeless galaxies harbouring super massive black holes and showing, at most, just residual star formation activity, we have selected a sample of massive bulgeless red sequence galaxies from the SDSS-DR7, based on the NYU-VAGC catalogue. Multivavelength data were retrieved using EURO-VO tools, and the objects are characterised in terms of degree of star formation and the presence of an AGN. We have found seven objects that are quenched massive galaxies, that have no prominent bulge and that show signs of extra activity in their nuclei, five of them being central in their halo. These objects are rather robust candidates for rare systems that, though devoid of a significant bulge, harbor a supermassive black hole with an activity level likely capable of having halted the star formation through feedback.