
Given a webscale graph that grows over time, how should its edges be stored
and processed on multiple machines for rapid and accurate estimation of the
count of triangles? The count of triangles (i.e., cliques of size three) has
proven useful in many applications, including anomaly detection, community
detection, and link recommendation. For triangle counting in large and dynamic
graphs, recent work has focused largely on streaming algorithms and distributed
algorithms. To achieve the advantages of both approaches, we propose DiSLR, a
distributed streaming algorithm that estimates the counts of global triangles
and local triangles associated with each node. Making one pass over the input
stream, DiSLR carefully processes and stores the edges across multiple machines
so that the redundant use of computational and storage resources is minimized.
Compared to its best competitors, DiSLR is (a) Accurate: giving up to 39X
smaller estimation error, (b) Fast: up to 10.4X faster, scaling linearly with
the number of edges in the input stream, and (c) Theoretically sound: yielding
unbiased estimates with variances decreasing faster as the number of machines
is scaled up.

How can we detect fraudulent lockstep behavior in largescale multiaspect
data (i.e., tensors)? Can we detect it when data are too large to fit in memory
or even on a disk? Past studies have shown that dense subtensors in realworld
tensors (e.g., social media, Wikipedia, TCP dumps, etc.) signal anomalous or
fraudulent behavior such as retweet boosting, bot activities, and network
attacks. Thus, various approaches, including tensor decomposition and search,
have been proposed for detecting dense subtensors rapidly and accurately.
However, existing methods have low accuracy, or they assume that tensors are
small enough to fit in main memory, which is unrealistic in many realworld
applications such as social media and web. To overcome these limitations, we
propose DCUBE, a diskbased densesubtensor detection method, which also can
run in a distributed manner across multiple machines. Compared to
stateoftheart methods, DCUBE is (1) Memory Efficient: requires up to 1,600X
less memory and handles 1,000X larger data (2.6TB), (2) Fast: up to 7X faster
due to its nearlinear scalability, (3) Provably Accurate: gives a guarantee on
the densities of the detected subtensors, and (4) Effective: spotted network
attacks from TCP dumps and synchronized behavior in rating data most
accurately.

Why is a given node in a timeevolving graph ($t$graph) marked as an anomaly
by an offtheshelf detection algorithm? Is it because of the number of its
outgoing or incoming edges, or their timings? How can we best convince a human
analyst that the node is anomalous? Our work aims to provide succinct,
interpretable, and simple explanations of anomalous behavior in $t$graphs
(communications, IPIP interactions, etc.) while respecting the limited
attention of human analysts. Specifically, we extract key features from such
graphs, and propose to output a few pair (scatter) plots from this feature
space which "best" explain known anomalies. To this end, our work has four main
contributions: (a) problem formulation: we introduce an "analystfriendly"
problem formulation for explaining anomalies via pair plots, (b) explanation
algorithm: we propose a plotselection objective and the LookOut algorithm to
approximate it with optimality guarantees, (c) generality: our explanation
algorithm is both domain and detectoragnostic, and (d) scalability: we show
that LookOut scales linearly on the number of edges of the input graph. Our
experiments show that LookOut performs nearideally in terms of maximizing
explanation objective on several real datasets including Enron email and DBLP
coauthorship. Furthermore, LookOut produces fast, visually interpretable and
intuitive results in explaining "groundtruth" anomalies from Enron, DBLP and
LBNL (computer network) data.

Most past work on social network link fraud detection tries to separate
genuine users from fraudsters, implicitly assuming that there is only one type
of fraudulent behavior. But is this assumption true? And, in either case, what
are the characteristics of such fraudulent behaviors? In this work, we set up
honeypots ("dummy" social network accounts), and buy fake followers (after
careful IRB approval). We report the signs of such behaviors including oddities
in local network connectivity, account attributes, and similarities and
differences across fraud providers. Most valuably, we discover and characterize
several types of fraud behaviors. We discuss how to leverage our insights in
practice by engineering strongly performing entropybased features and
demonstrating high classification accuracy. Our contributions are (a)
instrumentation: we detail our experimental setup and carefully engineered data
collection process to scrape Twitter data while respecting API ratelimits, (b)
observations on fraud multimodality: we analyze our honeypot fraudster
ecosystem and give surprising insights into the multifaceted behaviors of these
fraudster types, and (c) features: we propose novel features that give strong
(>0.95 precision/recall) discriminative power on groundtruth Twitter data.

Information cascades are ubiquitous in both physical society and online
social media, taking on large variations in structures, dynamics and semantics.
Although the dynamics and semantics of information cascades have been studied,
the structural patterns and their correlations with dynamics and semantics are
largely unknown. Here we explore a largescale dataset including $432$ million
information cascades with explicit records of spreading traces, spreading
behaviors, information content as well as user profiles. We find that the
structural complexity of information cascades is far beyond the previous
conjectures. We first propose a tendimensional metric to quantify the
structural characteristics of information cascades, reflecting cascade size,
silhouette, direction and activity aspects. We find that bimodal law governs
majority of the metrics, information flows in cascades have four directions,
and the selfloop number and average activity of cascades follows power law. We
then analyze the highorder structural patterns of information cascades.
Finally, we evaluate to what extent the structural features of information
cascades can explain its dynamic patterns and semantics, and finally uncover
some notable implications of structural patterns in information cascades. Our
discoveries also provide a foundation for the microscopic mechanisms for
information spreading, potentially leading to implications for cascade
prediction and outlier detection.

Consider a stream of retweet events  how can we spot fraudulent lockstep
behavior in such multiaspect data (i.e., tensors) evolving over time? Can we
detect it in real time, with an accuracy guarantee? Past studies have shown
that dense subtensors tend to indicate anomalous or even fraudulent behavior in
many tensor data, including social media, Wikipedia, and TCP dumps. Thus,
several algorithms have been proposed for detecting dense subtensors rapidly
and accurately. However, existing algorithms assume that tensors are static,
while many realworld tensors, including those mentioned above, evolve over
time.
We propose DenseStream, an incremental algorithm that maintains and updates a
dense subtensor in a tensor stream (i.e., a sequence of changes in a tensor),
and DenseAlert, an incremental algorithm spotting the sudden appearances of
dense subtensors. Our algorithms are: (1) Fast and 'any time': updates by our
algorithms are up to a million times faster than the fastest batch algorithms,
(2) Provably accurate: our algorithms guarantee a lower bound on the density of
the subtensor they maintain, and (3) Effective: our DenseAlert successfully
spots anomalies in realworld tensors, especially those overlooked by existing
algorithms.

As online fraudsters invest more resources, including purchasing large pools
of fake user accounts and dedicated IPs, fraudulent attacks become less obvious
and their detection becomes increasingly challenging. Existing approaches such
as average degree maximization suffer from the bias of including more nodes
than necessary, resulting in lower accuracy and increased need for manual
verification. Hence, we propose HoloScope, which uses information from graph
topology and temporal spikes to more accurately detect groups of fraudulent
users. In terms of graph topology, we introduce "contrast suspiciousness," a
dynamic weighting approach, which allows us to more accurately detect
fraudulent blocks, particularly lowdensity blocks. In terms of temporal
spikes, HoloScope takes into account the sudden bursts and drops of fraudsters'
attacking patterns. In addition, we provide theoretical bounds for how much
this increases the time cost needed for fraudsters to conduct adversarial
attacks. Additionally, from the perspective of ratings, HoloScope incorporates
the deviation of rating scores in order to catch fraudsters more accurately.
Moreover, HoloScope has a concise framework and subquadratic time complexity,
making the algorithm reproducible and scalable. Extensive experiments showed
that HoloScope achieved significant accuracy improvements on synthetic and real
data, compared with stateoftheart fraud detection methods.

Rating platforms enable largescale collection of user opinion about items
(products, other users, etc.). However, many untrustworthy users give
fraudulent ratings for excessive monetary gains. In the paper, we present
FairJudge, a system to identify such fraudulent users. We propose three
metrics: (i) the fairness of a user that quantifies how trustworthy the user is
in rating the products, (ii) the reliability of a rating that measures how
reliable the rating is, and (iii) the goodness of a product that measures the
quality of the product. Intuitively, a user is fair if it provides reliable
ratings that are close to the goodness of the product. We formulate a mutually
recursive definition of these metrics, and further address cold start problems
and incorporate behavioral properties of users and products in the formulation.
We propose an iterative algorithm, FairJudge, to predict the values of the
three metrics. We prove that FairJudge is guaranteed to converge in a bounded
number of iterations, with linear time complexity. By conducting five different
experiments on five rating platforms, we show that FairJudge significantly
outperforms nine existing algorithms in predicting fair and unfair users. We
reported the 100 most unfair users in the Flipkart network to their review
fraud investigators, and 80 users were correctly identified (80% accuracy). The
FairJudge algorithm is already being deployed at Flipkart.

What is the best way to describe a user in a social network with just a few
numbers? Mathematically, this is equivalent to assigning a vector
representation to each node in a graph, a process called graph embedding. We
propose a novel framework, GEMD that unifies most of the past algorithms such
as LapEigs, DeepWalk and node2vec. GEMD achieves its goal by decomposing any
graph embedding algorithm into three building blocks: node proximity function,
warping function and loss function. Based on thorough analysis of GEMD, we
propose a novel algorithm, called UltimateWalk, which outperforms the
mostrecently proposed stateoftheart DeepWalk and node2vec. The
contributions of this work are: (1) The proposed framework, GEMD unifies the
past graph embedding algorithms and provides a general recipe of how to design
a graph embedding; (2) the nonlinearlity in the warping function contributes
significantly to the quality of embedding and the exponential function is
empirically optimal; (3) the proposed algorithm, UltimateWalk is oneclick (no
userdefined parameters), scalable and has a closedform solution.

Tensors or {\em multiway arrays} are functions of three or more indices
$(i,j,k,\cdots)$  similar to matrices (twoway arrays), which are functions
of two indices $(r,c)$ for (row,column). Tensors have a rich history,
stretching over almost a century, and touching upon numerous disciplines; but
they have only recently become ubiquitous in signal and data analytics at the
confluence of signal processing, statistics, data mining and machine learning.
This overview article aims to provide a good starting point for researchers and
practitioners interested in learning about and working with tensors. As such,
it focuses on fundamentals and motivation (using various application examples),
aiming to strike an appropriate balance of breadth {\em and depth} that will
enable someone having taken first graduate courses in matrix algebra and
probability to get started doing research and/or developing tensor algorithms
and software. Some background in applied optimization is useful but not
strictly required. The material covered includes tensor rank and rank
decomposition; basic tensor factorization models and their relationships and
properties (including fairly good coverage of identifiability); broad coverage
of algorithms ranging from alternating optimization to stochastic gradient;
statistical performance analysis; and applications ranging from source
separation to collaborative filtering, mixture and topic modeling,
classification, and multilinear subspace learning.

How can we design a product or movie that will attract, for example, the
interest of Pennsylvania adolescents or liberal newspaper critics? What should
be the genre of that movie and who should be in the cast? In this work, we seek
to identify how we can design new movies with features tailored to a specific
user population. We formulate the movie design as an optimization problem over
the inference of userfeature scores and selection of the features that
maximize the number of attracted users. Our approach, PNP, is based on a
heterogeneous, tripartite graph of users, movies and features (e.g., actors,
directors, genres), where users rate movies and features contribute to movies.
We learn the preferences by leveraging user similarities defined through
different types of relations, and show that our method outperforms
stateoftheart approaches, including matrix factorization and other
heterogeneous graphbased analysis. We evaluate PNP on publicly available
realworld data and show that it is highly scalable and effectively provides
movie designs oriented towards different groups of users, including men, women,
and adolescents.

'Alice' is submitting one web search per five minutes, for three hours in a
row  is it normal? How to detect abnormal search behaviors, among Alice and
other users? Is there any distinct pattern in Alice's (or other users') search
behavior? We studied what is probably the largest, publicly available, query
log that contains more than 30 million queries from 0.6 million users. In this
paper, we present a novel, userand grouplevel framework, M3A: Model,
MetaModel and Anomaly detection. For each user, we discover and explain a
surprising, bimodal pattern of the interarrival time (IAT) of landed queries
(queries with user clickthrough). Specifically, the model CamelLog is
proposed to describe such an IAT distribution; we then notice the correlations
among its parameters at the group level. Thus, we further propose the metamodel
MetaClick, to capture and explain the twodimensional, heavytail distribution
of the parameters. Combining CamelLog and MetaClick, the proposed M3A has the
following strong points: (1) the accurate modeling of marginal IAT
distribution, (2) quantitative interpretations, and (3) anomaly detection.

Review fraud is a pervasive problem in online commerce, in which fraudulent
sellers write or purchase fake reviews to manipulate perception of their
products and services. Fake reviews are often detected based on several signs,
including 1) they occur in short bursts of time; 2) fraudulent user accounts
have skewed rating distributions. However, these may both be true in any given
dataset. Hence, in this paper, we propose an approach for detecting fraudulent
reviews which combines these 2 approaches in a principled manner, allowing
successful detection even when one of these signs is not present. To combine
these 2 approaches, we formulate our Bayesian Inference for Rating Data (BIRD)
model, a flexible Bayesian model of user rating behavior. Based on our model we
formulate a likelihoodbased suspiciousness metric, Normalized Expected
Surprise Total (NEST). We propose a lineartime algorithm for performing
Bayesian inference using our model and computing the metric. Experiments on
real data show that BIRDNEST successfully spots review fraud in large,
realworld graphs: the 50 most suspicious users of the Flipkart platform
flagged by our algorithm were investigated and all identified as fraudulent by
domain experts at Flipkart.

Which song will Smith listen to next? Which restaurant will Alice go to
tomorrow? Which product will John click next? These applications have in common
the prediction of user trajectories that are in a constant state of flux over a
hidden network (e.g. website links, geographic location). What users are doing
now may be unrelated to what they will be doing in an hour from now. Mindful of
these challenges we propose TribeFlow, a method designed to cope with the
complex challenges of learning personalized predictive models of
nonstationary, transient, and timeheterogeneous user trajectories. TribeFlow
is a general method that can perform next product recommendation, next song
recommendation, next location prediction, and general arbitrarylength user
trajectory prediction without domainspecific knowledge. TribeFlow is more
accurate and up to 413x faster than top competitors.

Given a network with attributed edges, how can we identify anomalous
behavior? Networks with edge attributes are commonplace in the real world. For
example, edges in ecommerce networks often indicate how users rated products
and services in terms of number of stars, and edges in online social and
phonecall networks contain temporal information about when friendships were
formed and when users communicated with each other  in such cases, edge
attributes capture information about how the adjacent nodes interact with other
entities in the network. In this paper, we aim to utilize exactly this
information to discern suspicious from typical node behavior. Our work has a
number of notable contributions, including (a) formulation: while most other
graphbased anomaly detection works use structural graph connectivity or node
information, we focus on the new problem of leveraging edge information, (b)
methodology: we introduce EdgeCentric, an intuitive and scalable
compressionbased approach for detecting edgeattributed graph anomalies, and
(c) practicality: we show that EdgeCentric successfully spots numerous such
anomalies in several large, edgeattributed realworld graphs, including the
Flipkart ecommerce graph with over 3 million product reviews between 1.1
million users and 545 thousand products, where it achieved 0.87 precision over
the top 100 results.

Given a large social or computer network, how can we visualize it, find
patterns, outliers, communities? Although several graph visualization tools
exist, they cannot handle large graphs with hundred thousand nodes and possibly
million edges. Such graphs bring two challenges: interactive visualization
demands prohibitive processing power and, even if we could interactively update
the visualization, the user would be overwhelmed by the excessive number of
graphical items. To cope with this problem, we propose a formal innovation on
the use of graph hierarchies that leads to GMine system. GMine promotes
scalability using a hierarchy of graph partitions, promotes concomitant
presentation for the graph hierarchy and for the original graph, and extends
analytical possibilities with the integration of the graph partitions in an
interactive environment.

Several graph visualization tools exist. However, they are not able to handle
large graphs, and/or they do not allow interaction. We are interested on large
graphs, with hundreds of thousands of nodes. Such graphs bring two challenges:
the first one is that any straightforward interactive manipulation will be
prohibitively slow. The second one is sensory overload: even if we could plot
and replot the graph quickly, the user would be overwhelmed with the vast
volume of information because the screen would be too cluttered as nodes and
edges overlap each other. GMine system addresses both these issues, by using
summarization and multiresolution. GMine offers multiresolution graph
exploration by partitioning a given graph into a hierarchy of
communitieswithincommunities and storing it into a novel Rtreelike
structure which we name GTree. GMine offers summarization by implementing an
innovative subgraph extraction algorithm and then visualizing its output.

Current applications have produced graphs on the order of hundreds of
thousands of nodes and millions of edges. To take advantage of such graphs, one
must be able to find patterns, outliers and communities. These tasks are better
performed in an interactive environment, where human expertise can guide the
process. For large graphs, though, there are some challenges: the excessive
processing requirements are prohibitive, and drawing hundredthousand nodes
results in cluttered images hard to comprehend. To cope with these problems, we
propose an innovative framework suited for any kind of treelike graph visual
design. GMine integrates (a) a representation for graphs organized as
hierarchies of partitions  the concepts of SuperGraph and GraphTree; and (b)
a graph summarization methodology  CEPS. Our graph representation deals with
the problem of tracing the connection aspects of a graph hierarchy with sub
linear complexity, allowing one to grasp the neighborhood of a single node or
of a group of nodes in a single click. As a proof of concept, the visual
environment of GMine is instantiated as a system in which large graphs can be
investigated globally and locally.

How can we tell when accounts are fake or real in a social network? And how
can we tell which accounts belong to liberal, conservative or centrist users?
Often, we can answer such questions and label nodes in a network based on the
labels of their neighbors and appropriate assumptions of homophily ("birds of a
feather flock together") or heterophily ("opposites attract"). One of the most
widely used methods for this kind of inference is Belief Propagation (BP) which
iteratively propagates the information from a few nodes with explicit labels
throughout a network until convergence. One main problem with BP, however, is
that there are no known exact guarantees of convergence in graphs with loops.
This paper introduces Linearized Belief Propagation (LinBP), a linearization
of BP that allows a closedform solution via intuitive matrix equations and,
thus, comes with convergence guarantees. It handles homophily, heterophily, and
more general cases that arise in multiclass settings. Plus, it allows a
compact implementation in SQL. The paper also introduces Singlepass Belief
Propagation (SBP), a "localized" version of LinBP that propagates information
across every edge at most once and for which the final class assignments depend
only on the nearest labeled neighbors. In addition, SBP allows fast incremental
updates in dynamic networks. Our runtime experiments show that LinBP and SBP
are orders of magnitude faster than standard

How can we detect suspicious users in large online networks? Online
popularity of a user or product (via follows, pagelikes, etc.) can be
monetized on the premise of higher ad clickthrough rates or increased sales.
Web services and social networks which incentivize popularity thus suffer from
a major problem of fake connections from link fraudsters looking to make a
quick buck. Typical methods of catching this suspicious behavior use spectral
techniques to spot large groups of often blatantly fraudulent (but sometimes
honest) users. However, smallscale, stealthy attacks may go unnoticed due to
the nature of lowrank eigenanalysis used in practice.
In this work, we take an adversarial approach to find and prove claims about
the weaknesses of modern, stateoftheart spectral methods and propose fBox,
an algorithm designed to catch smallscale, stealth attacks that slip below the
radar. Our algorithm has the following desirable properties: (a) it has
theoretical underpinnings, (b) it is shown to be highly effective on real data
and (c) it is scalable (linear on the input size). We evaluate fBox on a large,
public 41.7 million node, 1.5 billion edge whofollowswhom social graph from
Twitter in 2010 and with high precision identify many suspicious accounts which
have persisted without suspension even to this day.

It is our view that the state of the art in constructing a large collection
of graph algorithms in terms of linear algebraic operations is mature enough to
support the emergence of a standard set of primitive building blocks. This
paper is a position paper defining the problem and announcing our intention to
launch an open effort to define this standard.

How does a new startup drive the popularity of competing websites into
oblivion like Facebook famously did to MySpace? This question is of great
interest to academics, technologists, and financial investors alike. In this
work we exploit the singular way in which Facebook wiped out the popularity of
MySpace, Hi5, Friendster, and Multiply to guide the design of a new popularity
competition model. Our model provides new insights into what Nobel Laureate
Herbert A. Simon called the "marketplace of attention," which we recast as the
attentionactivity marketplace. Our model design is further substantiated by
userlevel activity of 250,000 MySpace users obtained between 2004 and 2009.
The resulting model not only accurately fits the observed Daily Active Users
(DAU) of Facebook and its competitors but also predicts their fate four years
into the future.

How many listens will an artist receive on a online radio? How about plays on
a YouTube video? How many of these visits are new or returning users? Modeling
and mining popularity dynamics of social activity has important implications
for researchers, content creators and providers. We here investigate the effect
of revisits (successive visits from a single user) on content popularity. Using
four datasets of social activity, with up to tens of millions media objects
(e.g., YouTube videos, Twitter hashtags or LastFM artists), we show the effect
of revisits in the popularity evolution of such objects. Secondly, we propose
the PhoenixR model which captures the popularity dynamics of individual
objects. PhoenixR has the desired properties of being: (1) parsimonious, being
based on the minimum description length principle, and achieving lower root
mean squared error than stateoftheart baselines; (2) applicable, the model
is effective for predicting future popularity values of objects.

How can we succinctly describe a millionnode graph with a few simple
sentences? How can we measure the "importance" of a set of discovered subgraphs
in a large graph? These are exactly the problems we focus on. Our main ideas
are to construct a "vocabulary" of subgraphtypes that often occur in real
graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the
most succinct description of a graph in terms of this vocabulary. We measure
success in a wellfounded way by means of the Minimum Description Length (MDL)
principle: a subgraph is included in the summary if it decreases the total
description length of the graph.
Our contributions are threefold: (a) formulation: we provide a principled
encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop
\method, an efficient method to minimize the description cost, and (c)
applicability: we report experimental results on multimillionedge real
graphs, including Flickr and the Notre Dame web graph.

With the advancement of information systems, means of communications are
becoming cheaper, faster and more available. Today, millions of people carrying
smartphones or tablets are able to communicate at practically any time and
anywhere they want. Among others, they can access their emails, comment on
weblogs, watch and post comments on videos, make phone calls or text messages
almost ubiquitously. Given this scenario, in this paper we tackle a fundamental
aspect of this new era of communication: how the time intervals between
communication events behave for different technologies and means of
communications? Are there universal patterns for the interevent time
distribution (IED)? In which ways interevent times behave differently among
particular technologies? To answer these questions, we analyze eight different
datasets from real and modern communication data and we found four well defined
patterns that are seen in all the eight datasets. Moreover, we propose the use
of the SelfFeeding Process (SFP) to generate interevent times between
communications. The SFP is extremely parsimonious point process that requires
at most two parameters and is able to generate interevent times with all the
universal properties we observed in the data. We show the potential application
of SFP by proposing a framework to generate a synthetic dataset containing
realistic communication events of any one of the analyzed means of
communications (e.g. phone calls, emails, comments on blogs) and an algorithm
to detect anomalies.