
Estimating distributions of node characteristics (labels) such as number of
connections or citizenship of users in a social network via edge and node
sampling is a vital part of the study of complex networks. Due to its low cost,
sampling via a random walk (RW) has been proposed as an attractive solution to
this task. Most RW methods assume either that the network is undirected or that
walkers can traverse edges regardless of their direction. Some RW methods have
been designed for directed networks where edges coming into a node are not
directly observable. In this work, we propose Directed Unbiased Frontier
Sampling (DUFS), a sampling method based on a large number of coordinated
walkers, each starting from a node chosen uniformly at random. It is applicable
to directed networks with invisible incoming edges because it constructs, in
realtime, an undirected graph consistent with the walkers trajectories, and
due to the use of random jumps which prevent walkers from being trapped. DUFS
generalizes previous RW methods and is suited for undirected networks and to
directed networks regardless of inedges visibility. We also propose an
improved estimator of node label distributions that combines information from
the initial walker locations with subsequent RW observations. We evaluate DUFS,
compare it to other RW methods, investigate the impact of its parameters on
estimation accuracy and provide practical guidelines for choosing them. In
estimating outdegree distributions, DUFS yields significantly better estimates
of the head of the distribution than other methods, while matching or exceeding
estimation accuracy of the tail. Last, we show that DUFS outperforms uniform
node sampling when estimating distributions of node labels of the top 10%
largest degree nodes, even when sampling a node uniformly has the same cost as
RW steps.

We investigate the restframe UV morphologies of a large sample of Lymana
emitters (LAEs) from z~2 to z~6, selected in a uniform way with 16 different
narrow and mediumbands over the full COSMOS field. We use 3045 LAEs with HST
coverage in a stacking analysis and find that they have M_UV~20, below M*_UV
at these redshifts. We also focus our analysis on a subsample of 780 individual
galaxies with i_AB<25 for which GALFIT converges for 429 of them. The
individual median size (re~1 kpc), ellipticities (slightly elongated with
(b/a)~0.45), S\'ersic index (disklike with n<2) and light concentration
(comparable to that of disk or irregular galaxies, with C~2.7) of LAEs show
mild evolution from z~2 to z~6. LAEs with the highest restframe equivalent
widths (EW) are the smallest/most compact (re~0.8 kpc, compared to re~1.5 kpc
for the lower EW LAEs). When stacking our samples in bins of fixed Lya
luminosity and Lya EW we find evidence for redshift evolution in n and C, but
not in galaxy sizes. The evolution seems to be stronger for LAEs with 25<EW<100
{\AA}. When compared to other SFGs, LAEs are found to be smaller at all
redshifts. The difference between the two populations changes with redshift,
from a factor of ~1 at z>5 to SFGs being a factor of ~24 larger than LAEs for
z<2. This means that at the highest redshifts, where typical sizes approach
those of LAEs, the fraction of galaxies showing Lya in emission (and with a
high Lya escape fraction) should be much higher, consistent with observations.

We present and explore deep narrow and mediumband data obtained with the
Subaru and the Isaac Newton telescopes in the ~2 deg$^2$ COSMOS field. We use
these data as an extremely wide, lowresolution (R~2080) IFU survey to slice
through the COSMOS field and obtain a large sample of ~4000 Lyman$\alpha$
(Lya) emitters from z~2 to z~6 in 16 redshift slices (SC4K). We present new Lya
luminosity functions (LFs) covering a comoving volume of ~10$^8$Mpc$^3$. SC4K
extensively complements ultradeep surveys, jointly covering over 4 dex in Lya
luminosity and revealing a global (2.5<z<6) synergy LF with
$\alpha=1.93\pm0.12$, $\log_{10}\Phi^*=3.45^{+0.22}_{0.29}$ Mpc$^{3}$ and
$\log_{10}L^*=42.93^{+0.15}_{0.11}$ erg/s. The Schechter component of the Lya
LF reveals a factor ~5 rise in $L^*$ and a ~7x decline in $\Phi^*$ from z~2 to
z~6. The data reveal an extra powerlaw (or Schechter) component above
L~10$^{43.3}$ erg/s at z~2.23.5 and we show that it is partially driven by
Xray and radio AGN, as their Lya LF resembles the excess. The powerlaw
component vanishes and/or is below our detection limits above z>3.5, likely
linked with the evolution of the AGN population. The Lya luminosity density
rises by a factor ~2 from z~2 to z~3 but is then found to be roughly constant
(~$1.1\times10^{40}$ erg s$^{1}$ Mpc$^{3}$) to z~6, despite the ~0.7 dex drop
in UV luminosity density. The Lya/UV luminosity density ratio rises from
$4\pm1$% to $30\pm6$% from z~2.2 to z~6. Our results imply a rise of a factor
of ~2 in the global ionisation efficiency ($\xi_{\rm ion}$) and a factor
~$4\pm1$ in the Lya escape fraction from z~2 to z~6, hinting for evolution in
both the typical burstiness/stellar populations and even more so in the typical
ISM conditions allowing Ly$\alpha$ photons to escape.

A major branch of anomaly detection methods rely on dynamic networks: raw
sequential data is first converted to a series of networks, then critical
change points are identified in the evolving network structure. However,
existing approaches use the firstorder network (FON) to represent the
underlying raw data, which may lose important higherorder sequence patterns,
making higherorder anomalies undetectable in subsequent analysis. By replacing
FON with higherorder network (HONs), we show that existing anomaly detection
algorithms can better capture higherorder anomalies that may otherwise be
ignored. We show that the existing HON construction algorithm cannot be used
for the anomaly detection task due to the extra parameters and poor
scalability; we introduce a parameterfree algorithm that constructs HON in big
data sets. Using a largescale synthetic data set with 11 billion web
clickstreams, we demonstrate how the proposed method can capture variable
orders of anomalies. Using a realworld taxi trajectory data, we show how the
proposed method amplifies higherorder anomaly signals. Finally, we provide
complexity analysis and benchmarking to show how one can incorporating
higherorder dependencies with a small overhead.

We propose a Las Vegas transformation of Markov Chain Monte Carlo (MCMC)
estimators of Restricted Boltzmann Machines (RBMs). We denote our approach
Markov Chain Las Vegas (MCLV). MCLV gives statistical guarantees in exchange
for random running times. MCLV uses a stopping set built from the training data
and has maximum number of Markov chain steps K (referred as MCLVK). We present
a MCLVK gradient estimator (LVSK) for RBMs and explore the correspondence and
differences between LVSK and Contrastive Divergence (CDK), with LVSK
significantly outperforming CDK training RBMs over the MNIST dataset,
indicating MCLV to be a promising direction in learning generative models.

Applications in various domains rely on processing graph streams, e.g.,
communication logs of a cloudtroubleshooting system, roadnetwork traffic
updates, and interactions on a social network. A labeledgraph stream refers to
a sequence of streamed edges that form a labeled graph. Labelaware
applications need to filter the graph stream before performing a graph
operation. Due to the large volume and high velocity of these streams, it is
often more practical to incrementally build a lossycompressed version of the
graph, and use this lossy version to approximately evaluate graph queries.
Challenges arise when the queries are unknown in advance but are associated
with filtering predicates based on edge labels. Surprisingly common, and
especially challenging, are labeledgraph streams that have highly skewed label
distributions that might also vary over time. This paper introduces
SelfBalanced Graph Sketch (SBGSketch, for short), a graphical sketch for
summarizing and querying labeledgraph streams that can cope with all these
challenges. SBGSketch maintains synopsis for both the edge attributes (e.g.,
edge weight) as well as the topology of the streamed graph. SBGSketch allows
efficient processing of graphtraversal queries, e.g., reachability queries.
Experimental results over a variety of real graph streams show SBGSketch to
reduce the estimation errors of stateoftheart methods by up to 99%.

Research in statistical relational learning has produced a number of methods
for learning relational models from largescale network data. While these
methods have been successfully applied in various domains, they have been
developed under the unrealistic assumption of full data access. In practice,
however, the data are often collected by crawling the network, due to
proprietary access, limited resources, and privacy concerns. Recently, we
showed that the parameter estimates for relational Bayes classifiers computed
from network samples collected by existing network crawlers can be quite
inaccurate, and developed a crawlaware estimation method for such models
(Yang, Ribeiro, and Neville, 2017). In this work, we extend the methodology to
learning relational logistic regression models via stochastic gradient descent
from partial network crawls, and show that the proposed method yields accurate
parameter estimates and confidence intervals.

Cumulative advantage (CA) refers to the notion that accumulated resources
foster the accumulation of further resources in competitions, a phenomenon that
has been empirically observed in various contexts. The oldest and arguably
simplest mathematical model that embodies this general principle is the P\'olya
urn process, which finds applications in a myriad of problems. The original
model captures the dynamics of competitions between two equally fit agents
under linear CA effects, which can be readily generalized to incorporate
different fitnesses and nonlinear CA effects. We study two statistics of
competitions under the generalized model, namely duration (i.e., time of the
last tie) and intensity (i.e., number of ties). We give rigorous mathematical
characterizations of the tail distributions of both duration and intensity
under the various regimes for fitness and nonlinearity, which reveal very
interesting behaviors. For example, fitness superiority induces much shorter
competitions in the sublinear regime while much longer competitions in the
superlinear regime. Our findings can shed light on the application of P\'olya
urn processes in more general contexts where fitness and nonlinearity may be
present.

Active search (AS) on graphs focuses on collecting certain labeled nodes
(targets) given global knowledge of the network topology and its edge weights
under a query budget. However, in most networks, nodes, topology and edge
weights are all initially unknown. We introduce selective harvesting, a variant
of AS where the next node to be queried must be chosen among the neighbors of
the current queried node set; the available training data for deciding which
node to query is restricted to the subgraph induced by the queried set (and
their node attributes) and their neighbors (without any node or edge
attributes). Therefore, selective harvesting is a sequential decision problem,
where we must decide which node to query at each step. A classifier trained in
this scenario suffers from a tunnel vision effect: without recourse to
independent sampling, the urge to query promising nodes forces classifiers to
gather increasingly biased training data, which we show significantly hurts the
performance of AS methods and standard classifiers. We find that it is possible
to collect a much larger set of targets by using multiple classifiers, not by
combining their predictions as an ensemble, but switching between classifiers
used at each step, as a way to ease the tunnel vision effect. We discover that
switching classifiers collects more targets by (a) diversifying the training
data and (b) broadening the choices of nodes that can be queried next. This
highlights an exploration, exploitation, and diversification tradeoff in our
problem that goes beyond the exploration and exploitation duality found in
classic sequential decision problems. From these observations we propose D3TS,
a method based on multiarmed bandits for nonstationary stochastic processes
that enforces classifier diversity, matching or exceeding the performance of
competing methods on seven real network datasets in our evaluation.

The goal of cluster analysis in survival data is to identify clusters that
are decidedly associated with the survival outcome. Previous research has
explored this problem primarily in the medical domain with relatively small
datasets, but the need for such a clustering methodology could arise in other
domains with large datasets, such as social networks. Concretely, we wish to
identify different survival classes in a social network by clustering the users
based on their lifespan in the network. In this paper, we propose a decision
tree based algorithm that uses a global normalization of $p$values to identify
clusters with significantly different survival distributions. We evaluate the
clusters from our model with the help of a simple survival prediction task and
show that our model outperforms other competing methods.

Dynamic networks are increasingly being usedd to model real world datasets. A
challenging task in their analysis is to detect and characterize clusters. It
is useful for analyzing realworld data such as detecting evolving communities
in networks. We propose a temporal clustering framework based on a set of
network generative models to address this problem. We use PARAFAC decomposition
to learn network models from datasets.We then use $K$means for clustering, the
Silhouette criterion to determine the number of clusters, and a similarity
score to order the clusters and retain the significant ones. In order to
address the timedependent aspect of these clusters, we propose a segmentation
algorithm to detect their formations, dissolutions and lifetimes. Synthetic
networks with ground truth and realworld datasets are used to test our method
against stateoftheart, and the results show that our method has better
performance in clustering and lifetime detection than previous methods.

Which song will Smith listen to next? Which restaurant will Alice go to
tomorrow? Which product will John click next? These applications have in common
the prediction of user trajectories that are in a constant state of flux over a
hidden network (e.g. website links, geographic location). What users are doing
now may be unrelated to what they will be doing in an hour from now. Mindful of
these challenges we propose TribeFlow, a method designed to cope with the
complex challenges of learning personalized predictive models of
nonstationary, transient, and timeheterogeneous user trajectories. TribeFlow
is a general method that can perform next product recommendation, next song
recommendation, next location prediction, and general arbitrarylength user
trajectory prediction without domainspecific knowledge. TribeFlow is more
accurate and up to 413x faster than top competitors.

Online social networks (OSN) contain extensive amount of information about
the underlying society that is yet to be explored. One of the most feasible
technique to fetch information from OSN, crawling through Application
Programming Interface (API) requests, poses serious concerns over the the
guarantees of the estimates. In this work, we focus on making reliable
statistical inference with limited API crawls. Based on regenerative properties
of the random walks, we propose an unbiased estimator for the aggregated sum of
functions over edges and proved the connection between variance of the
estimator and spectral gap. In order to facilitate Bayesian inference on the
true value of the estimator, we derive the approximate posterior distribution
of the estimate. Later the proposed ideas are validated with numerical
experiments on inference problems in realworld networks.

The role of skill (fitness) and luck (randomness) as driving forces on the
dynamics of resource accumulation in a myriad of systems have long puzzled
scientists. Fueled by undisputed inequalities that emerge from actual
competitions, there is a pressing need for better understanding the effects of
skill and luck in resource accumulation. When such competitions are driven by
externalities such as cumulative advantage (CA), the richgetricher effect,
little is known with respect to fundamental properties such as their duration
and intensity. In this work we provide a mathematical understanding of how CA
exacerbates the role of luck in detriment of skill in simple and wellstudied
competition models. We show, for instance, that if two agents are competing for
resources that arrive sequentially at each time unit, an early stroke of luck
can place the less skilled in the lead for an extremely long period of time, a
phenomenon we call "struggle of the fittest". In the absence of CA, the more
skilled quickly prevails despite any early stroke of luck that the less skilled
may have. We prove that duration of a simple skill and luck competition model
exhibit power law tails when CA is present, regardless of skill difference,
which is in sharp contrast to exponential tails when CA is absent. Our findings
have important implications to competitions not only in complex social systems
but also in contexts that leverage such models.

How does a new startup drive the popularity of competing websites into
oblivion like Facebook famously did to MySpace? This question is of great
interest to academics, technologists, and financial investors alike. In this
work we exploit the singular way in which Facebook wiped out the popularity of
MySpace, Hi5, Friendster, and Multiply to guide the design of a new popularity
competition model. Our model provides new insights into what Nobel Laureate
Herbert A. Simon called the "marketplace of attention," which we recast as the
attentionactivity marketplace. Our model design is further substantiated by
userlevel activity of 250,000 MySpace users obtained between 2004 and 2009.
The resulting model not only accurately fits the observed Daily Active Users
(DAU) of Facebook and its competitors but also predicts their fate four years
into the future.

How many listens will an artist receive on a online radio? How about plays on
a YouTube video? How many of these visits are new or returning users? Modeling
and mining popularity dynamics of social activity has important implications
for researchers, content creators and providers. We here investigate the effect
of revisits (successive visits from a single user) on content popularity. Using
four datasets of social activity, with up to tens of millions media objects
(e.g., YouTube videos, Twitter hashtags or LastFM artists), we show the effect
of revisits in the popularity evolution of such objects. Secondly, we propose
the PhoenixR model which captures the popularity dynamics of individual
objects. PhoenixR has the desired properties of being: (1) parsimonious, being
based on the minimum description length principle, and achieving lower root
mean squared error than stateoftheart baselines; (2) applicable, the model
is effective for predicting future popularity values of objects.

Preferential attachment (PA) models of network structure are widely used due
to their explanatory power and conceptual simplicity. PA models are able to
account for the scalefree degree distributions observed in many realworld
large networks through the remarkably simple mechanism of sequentially
introducing nodes that attach preferentially to highdegree nodes. The ability
to efficiently generate instances from PA models is a key asset in
understanding both the models themselves and the real networks that they
represent. Surprisingly, little attention has been paid to the problem of
efficient instance generation. In this paper, we show that the complexity of
generating network instances from a PA model depends on the preference function
of the model, provide efficient data structures that work under any preference
function, and present empirical results from an implementation based on these
data structures. We demonstrate that, by indexing growing networks with a
simple augmented heap, we can implement a network generator which scales many
orders of magnitude beyond existing capabilities ($10^6$  $10^8$ nodes). We
show the utility of an efficient and general PA network generator by
investigating the consequences of varying the preference functions of an
existing model. We also provide "quicknet", a freelyavailable opensource
implementation of the methods described in this work.

Exploring statistics of locally connected subgraph patterns (also known as
network motifs) has helped researchers better understand the structure and
function of biological and online social networks (OSNs). Nowadays the massive
size of some critical networks  often stored in already overloaded relational
databases  effectively limits the rate at which nodes and edges can be
explored, making it a challenge to accurately discover subgraph statistics. In
this work, we propose sampling methods to accurately estimate subgraph
statistics from as few queried nodes as possible. We present sampling
algorithms that efficiently and accurately estimate subgraph properties of
massive networks. Our algorithms require no precomputation or complete network
topology information. At the same time, we provide theoretical guarantees of
convergence. We perform experiments using widely known data sets, and show that
for the same accuracy, our algorithms require an order of magnitude less
queries (samples) than the current stateoftheart algorithms.

Recommendation systems for online dating have recently attracted much
attention from the research community. In this paper we proposed a twoside
matching framework for online dating recommendations and design an LDA model to
learn the user preferences from the observed user messaging behavior and user
profile features. Experimental results using data from a large online dating
website shows that twosided matching improves significantly the rate of
successful matches by as much as 45%. Finally, using simulated matchings we
show that the the LDA model can correctly capture user preferences.

Algorithms for identifying the infection states of nodes in a network are
crucial for understanding and containing infections. Often, however, only a
relatively small set of nodes have a known infection state. Moreover, the
length of time that each node has been infected is also unknown. This missing
data  infection state of most nodes and infection time of the unobserved
infected nodes  poses a challenge to the study of realworld cascades.
In this work, we develop techniques to identify the latent infected nodes in
the presence of missing infection timeandstate data. Based on the likely
epidemic paths predicted by the simple susceptibleinfected epidemic model, we
propose a measure (Infection Betweenness) for uncovering these unknown
infection states. Our experimental results using machine learning algorithms
show that Infection Betweenness is the most effective feature for identifying
latent infected nodes.

Driven by outstanding success stories of Internet startups such as Facebook
and The Huffington Post, recent studies have thoroughly described their growth.
These highly visible online success stories, however, overshadow an untold
number of similar ventures that fail. The study of website popularity is
ultimately incomplete without general mechanisms that can describe both
successes and failures. In this work we present six years of the daily number
of users (DAU) of twentytwo membershipbased websites  encompassing online
social networks, grassroots movements, online forums, and membershiponly
Internet stores  well balanced between successes and failures. We then propose
a combination of reactiondiffusiondecay processes whose resulting equations
seem not only to describe well the observed DAU time series but also provide
means to roughly predict their evolution. This model allows an approximate
automatic DAUbased classification of websites into selfsustainable v.s.
unsustainable and whether the startup growth is mostly driven by marketing &
media campaigns or wordofmouth adoptions.

Online dating sites have become popular platforms for people to look for
potential romantic partners. It is important to understand users' dating
preferences in order to make better recommendations on potential dates. The
message sending and replying actions of a user are strong indicators for what
he/she is looking for in a potential date and reflect the user's actual dating
preferences. We study how users' online dating behaviors correlate with various
user attributes using a large realworld dateset from a major online dating
site in China. Many of our results on user messaging behavior align with
notions in social and evolutionary psychology: males tend to look for younger
females while females put more emphasis on the socioeconomic status (e.g.,
income, education level) of a potential date. In addition, we observe that the
geographic distance between two users and the photo count of users play an
important role in their dating behaviors. Our results show that it is important
to differentiate between users' true preferences and random selection. Some
user behaviors in choosing attributes in a potential date may largely be a
result of random selection. We also find that both males and females are more
likely to reply to users whose attributes come closest to the stated
preferences of the receivers, and there is significant discrepancy between a
user's stated dating preference and his/her actual online dating behavior.
These results can provide valuable guidelines to the design of a recommendation
engine for potential dates.

Characterizing large online social networks (OSNs) through node querying is a
challenging task. OSNs often impose severe constraints on the query rate, hence
limiting the sample size to a small fraction of the total network. Various
adhoc subgraph sampling methods have been proposed, but many of them give
biased estimates and no theoretical basis on the accuracy. In this work, we
focus on developing sampling methods for OSNs where querying a node also
reveals partial structural information about its neighbors. Our methods are
optimized for NoSQL graph databases (if the database can be accessed directly),
or utilize Web API available on most major OSNs for graph sampling. We show
that our sampling method has provable convergence guarantees on being an
unbiased estimator, and it is more accurate than current stateoftheart
methods. We characterize metrics such as node label density estimation and edge
label density estimation, two of the most fundamental network characteristics
from which other network characteristics can be derived. We evaluate our
methods onthefly over several live networks using their native APIs. Our
simulation studies over a variety of offline datasets show that by including
neighborhood information, our method drastically (4fold) reduces the number of
samples required to achieve the same estimation accuracy of stateoftheart
methods.

Timevarying networks describe a wide array of systems whose constituents and
interactions evolve over time. They are defined by an ordered stream of
interactions between nodes, yet they are often represented in terms of a
sequence of static networks, each aggregating all edges and nodes present in a
time interval of size \Delta t. In this work we quantify the impact of an
arbitrary \Delta t on the description of a dynamical process taking place upon
a timevarying network. We focus on the elementary random walk, and put forth a
simple mathematical framework that well describes the behavior observed on real
datasets. The analytical description of the bias introduced by time integrating
techniques represents a step forward in the correct characterization of
dynamical processes on timevarying graphs.

With the main goal of finding bulgeless galaxies harbouring super massive
black holes and showing, at most, just residual star formation activity, we
have selected a sample of massive bulgeless red sequence galaxies from the
SDSSDR7, based on the NYUVAGC catalogue. Multivavelength data were retrieved
using EUROVO tools, and the objects are characterised in terms of degree of
star formation and the presence of an AGN. We have found seven objects that are
quenched massive galaxies, that have no prominent bulge and that show signs of
extra activity in their nuclei, five of them being central in their halo. These
objects are rather robust candidates for rare systems that, though devoid of a
significant bulge, harbor a supermassive black hole with an activity level
likely capable of having halted the star formation through feedback.