
With the success of modern internet based platform, such as Amazon Mechanical
Turk, it is now normal to collect a large number of hand labeled samples from
nonexperts. The Dawid Skene algorithm, which is based on Expectation
Maximization update, has been widely used for inferring the true labels from
noisy crowdsourced labels. However, DawidSkene scheme requires all the data to
perform each EM iteration, and can be infeasible for streaming data or large
scale data. In this paper, we provide an online version of Dawid Skene
algorithm that only requires one data frame for each iteration. Further, we
prove that under mild conditions, the online DawidSkene scheme with projection
converges to a stationary point of the marginal loglikelihood of the observed
data. Our experiments demonstrate that the online Dawid Skene scheme achieves
state of the art performance comparing with other methods based on the Dawid
Skene scheme.

In many problems in machine learning and operations research, we need to
optimize a function whose input is a random variable or a probability density
function, i.e. to solve optimization problems in an infinite dimensional space.
On the other hand, online learning has the advantage of dealing with streaming
examples, and better model a changing environ ment. In this paper, we extend
the celebrated online gradient descent algorithm to Hilbert spaces (function
spaces), and analyze the convergence guarantee of the algorithm. Finally, we
demonstrate that our algorithms can be useful in several important problems.

We consider the problem of minimizing the sum of an average function of a
large number of smooth convex components and a general, possibly
nondifferentiable, convex function. Although many methods have been proposed
to solve this problem with the assumption that the sum is strongly convex, few
methods support the nonstrongly convex case. Adding a small quadratic
regularization is a common devise used to tackle nonstrongly convex problems;
however, it may cause loss of sparsity of solutions or weaken the performance
of the algorithms. Avoiding this devise, we propose an accelerated randomized
mirror descent method for solving this problem without the strongly convex
assumption. Our method extends the deterministic accelerated proximal gradient
methods of Paul Tseng and can be applied even when proximal points are computed
inexactly. We also propose a scheme for solving the problem when the component
functions are nonsmooth.

In this paper, a multistate diagnosis and prognosis (MDP) framework is
proposed for tool condition monitoring via a deep belief network based
multistate approach (DBNMS). For fault diagnosis, a costsensitive deep belief
network (namely ECSDBN) is applied to deal with the imbalanced data problem
for tool state estimation. An appropriate prognostic degradation model is then
applied for tool wear estimation based on the different tool states. The
proposed framework has the advantage of automatic feature representation
learning and shows better performance in accuracy and robustness. The
effectiveness of the proposed DBNMS is validated using a realworld dataset
obtained from the gun drilling process. This dataset contains a large amount of
measured signals involving different tool geometries under various operating
conditions. The DBNMS is examined for both the tool state estimation and tool
wear estimation tasks. In the experimental studies, the prediction results are
evaluated and compared with popular machine learning approaches, which show the
superior performance of the proposed DBNMS approach.

We consider the problem of representing collective behavior of large
populations and predicting the evolution of a population distribution over a
discrete state space. A discrete time mean field game (MFG) is motivated as an
interpretable model founded on game theory for understanding the aggregate
effect of individual actions and predicting the temporal evolution of
population distributions. We achieve a synthesis of MFG and Markov decision
processes (MDP) by showing that a special MFG is reducible to an MDP. This
enables us to broaden the scope of mean field game theory and infer MFG models
of large realworld systems via deep inverse reinforcement learning. Our method
learns both the reward function and forward dynamics of an MFG from real data,
and we report the first empirical test of a mean field game model of a
realworld social media population.

While optimizing convex objective (loss) functions has been a powerhouse for
machine learning for at least two decades, nonconvex loss functions have
attracted fast growing interests recently, due to many desirable properties
such as superior robustness and classification accuracy, compared with their
convex counterparts. The main obstacle for nonconvex estimators is that it is
in general intractable to find the optimal solution. In this paper, we study
the computational issues for some nonconvex Mestimators. In particular, we
show that the stochastic variance reduction methods converge to the global
optimal with linear rate, by exploiting the statistical property of the
population loss. En route, we improve the convergence analysis for the batch
gradient method in \cite{mei2016landscape}.

We study poolbased active learning with abstention feedbacks, where a
labeler can abstain from labeling a queried example with some unknown
abstention rate. This is an important problem with many useful applications. We
take a Bayesian approach to the problem and develop two new greedy algorithms
that learn both the classification problem and the unknown abstention rate at
the same time. These are achieved by simply incorporating the estimated
abstention rate into the greedy criteria. We prove that both of our algorithms
have nearoptimality guarantees: they respectively achieve a
${(1\frac{1}{e})}$ constant factor approximation of the optimal expected or
worstcase value of a useful utility function. Our experiments show the
algorithms perform well in various practical scenarios.

Sequential hypothesis test and changepoint detection when the distribution
parameters are unknown is a fundamental problem in statistics and machine
learning. We show that for such problems, detection procedures based on
sequential likelihood ratios with simple onesample update estimates such as
online mirror descent are nearly secondorder optimal. This means that the
upper bound for the algorithm performance meets the lower bound asymptotically
up to a loglog factor in the falsealarm rate when it tends to zero. This is a
blessing, since although the generalized likelihood ratio(GLR) statistics are
optimal theoretically, but they cannot be computed recursively, and their exact
computation usually requires infinite memory of historical data. We prove the
nearly secondorder optimality by making a connection between sequential
analysis and online convex optimization and leveraging the logarithmic regret
bound property of online mirror descent algorithm. Numerical examples validate
our theory.

We investigate a projection free method, namely conditional gradient sliding
on batched, stochastic and finitesum nonconvex problem. CGS is a smart
combination of Nesterov's accelerated gradient method and FrankWolfe (FW)
method, and outperforms FW in the convex setting by saving gradient
computations. However, the study of CGS in the nonconvex setting is limited.
In this paper, we propose the nonconvex conditional gradient sliding (NCGS)
which surpasses the nonconvex FrankWolfe method in batched, stochastic and
finitesum setting.

SVRG and its variants are among the state of art optimization algorithms for
large scale machine learning problems. It is well known that SVRG converges
linearly when the objective function is strongly convex. However this setup can
be restrictive, and does not include several important formulations such as
Lasso, group Lasso, logistic regression, and some nonconvex models including
corrected Lasso and SCAD. In this paper, we prove that, for a class of
statistical Mestimators covering examples mentioned above, SVRG solves the
formulation with {\em a linear convergence rate} without strong convexity or
even convexity. Our analysis makes use of {\em restricted strong convexity},
under which we show that SVRG converges linearly to the fundamental statistical
precision of the model, i.e., the difference between true unknown parameter
$\theta^*$ and the optimal solution $\hat{\theta}$ of the model.

We propose the first multistage intervention framework that tackles fake news
in social networks by combining reinforcement learning with a point process
network activity model. The spread of fake news and mitigation events within
the network is modeled by a multivariate Hawkes process with additional
exogenous control terms. By choosing a feature representation of states,
defining mitigation actions and constructing reward functions to measure the
effectiveness of mitigation activities, we map the problem of fake news
mitigation into the reinforcement learning framework. We develop a policy
iteration method unique to the multivariate networked point process, with the
goal of optimizing the actions for maximal total reward under budget
constraints. Our method shows promising performance in realtime intervention
experiments on a Twitter network to mitigate a surrogate fake news campaign,
and outperforms alternatives on synthetic datasets.

We study reinforcement learning under model misspecification, where we do not
have access to the true environment but only to a reasonably close
approximation to it. We address this problem by extending the framework of
robust MDPs to the modelfree Reinforcement Learning setting, where we do not
have access to the model parameters, but can only sample states from it. We
define robust versions of Qlearning, SARSA, and TDlearning and prove
convergence to an approximately optimal robust policy and approximate value
function respectively. We scale up the robust algorithms to large MDPs via
function approximation and prove convergence under two different settings. We
prove convergence of robust approximate policy iteration and robust approximate
value iteration for linear architectures (under mild assumptions). We also
define a robust loss function, the mean squared robust projected Bellman error
and give stochastic gradient descent algorithms that are guaranteed to converge
to a local minimum.

We study the worstcase adaptive optimization problem with budget constraint
that is useful for modeling various practical applications in artificial
intelligence and machine learning. We investigate the nearoptimality of greedy
algorithms for this problem with both modular and nonmodular cost functions.
In both cases, we prove that two simple greedy algorithms are not nearoptimal
but the best between them is nearoptimal if the utility function satisfies
pointwise submodularity and pointwise costsensitive submodularity
respectively. This implies a combined algorithm that is nearoptimal with
respect to the optimal algorithm that uses half of the budget. We discuss
applications of our theoretical results and also report experiments comparing
the greedy algorithms on the active learning problem.

We consider largescale Markov decision processes (MDPs) with a risk measure
of variability in cost, under the riskaware MDPs paradigm. Previous studies
showed that riskaware MDPs, based on a minimax approach to handling risk, can
be solved using dynamic programming for small to medium sized problems.
However, due to the "curse of dimensionality", MDPs that model reallife
problems are typically prohibitively large for such approaches. In this paper,
we employ an approximate dynamic programming approach, and develop a family of
simulationbased algorithms to approximately solve largescale riskaware MDPs.
In parallel, we develop a unified convergence analysis technique to derive
sample complexity bounds for this new family of algorithms.

In this paper, we consider stochastic dual coordinate (SDCA) {\em without}
strongly convex assumption or convex assumption. We show that SDCA converges
linearly under mild conditions termed restricted strong convexity. This covers
a wide array of popular statistical models including Lasso, group Lasso, and
logistic regression with $\ell_1$ regularization, corrected Lasso and linear
regression with SCAD regularizer. This significantly improves previous
convergence results on SDCA for problems that are not strongly convex. As a by
product, we derive a dual free form of SDCA that can handle general
regularization term, which is of interest by itself.

SAGA is a fast incremental gradient method on the finite sum problem and its
effectiveness has been tested on a vast of applications. In this paper, we
analyze SAGA on a class of nonstrongly convex and nonconvex statistical
problem such as Lasso, group Lasso, Logistic regression with $\ell_1$
regularization, linear regression with SCAD regularization and Correct Lasso.
We prove that SAGA enjoys the linear convergence rate up to the statistical
estimation accuracy, under the assumption of restricted strong convexity (RSC).
It significantly extends the applicability of SAGA in convex and nonconvex
optimization.

We consider the problem of learning from noisy data in practical settings
where the size of data is too large to store on a single machine. More
challenging, the data coming from the wild may contain malicious outliers. To
address the scalability and robustness issues, we present an online robust
learning (ORL) approach. ORL is simple to implement and has provable robustness
guarantee  in stark contrast to existing online learning approaches that are
generally fragile to outliers. We specialize the ORL approach for two concrete
cases: online robust principal component analysis and online linear regression.
We demonstrate the efficiency and robustness advantages of ORL through
comprehensive simulations and predicting image tags on a largescale data set.
We also discuss extension of the ORL to distributed learning and provide
experimental evaluations.

We develop a unified and systematic framework for performing online
nonnegative matrix factorization under a wide variety of important divergences.
The online nature of our algorithm makes it particularly amenable to
largescale data. We prove that the sequence of learned dictionaries converges
almost surely to the set of critical points of the expected loss function. We
do so by leveraging the theory of stochastic approximations and projected
dynamical systems. This result substantially generalizes the previous results
obtained only for the squared$\ell_2$ loss. Moreover, the novel techniques
involved in our analysis open new avenues for analyzing similar matrix
factorization problems. The computational efficiency and the quality of the
learned dictionary of our algorithm are verified empirically on both synthetic
and real datasets. In particular, on the tasks of topic learning, shadow
removal and image denoising, our algorithm achieves superior tradeoffs between
the quality of learned dictionary and running time over the batch and other
online NMF algorithms.

The question why deep learning algorithms perform so well in practice has
attracted increasing research interest. However, most of wellestablished
approaches, such as hypothesis capacity, robustness or sparseness, have not
provided complete explanations, due to the high complexity of the deep learning
algorithms and their inherent randomness. In this work, we introduce a new
approach~\textendash~ensemble robustness~\textendash~towards characterizing the
generalization performance of generic deep learning algorithms. Ensemble
robustness concerns robustness of the \emph{population} of the hypotheses that
may be output by a learning algorithm. Through the lens of ensemble robustness,
we reveal that a stochastic learning algorithm can generalize well as long as
its sensitiveness to adversarial perturbation is bounded in average, or
equivalently, the performance variance of the algorithm is small. Quantifying
ensemble robustness of various deep learning algorithms may be difficult
analytically. However, extensive simulations for seven common deep learning
algorithms for different network architectures provide supporting evidence for
our claims. Furthermore, our work explains the good performance of several
published deep learning algorithms.

LowRank Representation~(LRR) has been a significant method for segmenting
data that are generated from a union of subspaces. It is also known that
solving LRR is challenging in terms of time complexity and memory footprint, in
that the size of the nuclear norm regularized matrix is $n$by$n$ (where $n$
is the number of samples). In this paper, we thereby develop a novel online
implementation of LRR that reduces the memory cost from $O(n^2)$ to $O(pd)$,
with $p$ being the ambient dimension and $d$ being some estimated rank~($d < p
\ll n$). We also establish the theoretical guarantee that the sequence of
solutions produced by our algorithm converges to a stationary point of the
expected loss function asymptotically. Extensive experiments on synthetic and
realistic datasets further substantiate that our algorithm is fast, robust and
memory efficient.

Maxnorm regularizer has been extensively studied in the last decade as it
promotes an effective lowrank estimation for the underlying data. However,
such maxnorm regularized problems are typically formulated and solved in a
batch manner, which prevents it from processing big data due to possible memory
budget. In this paper, hence, we propose an online algorithm that is scalable
to largescale setting. Particularly, we consider the matrix decomposition
problem as an example, although a simple variant of the algorithm and analysis
can be adapted to other important problems such as matrix completion. The
crucial technique in our implementation is to reformulating the maxnorm to an
equivalent matrix factorization form, where the factors consist of a (possibly
overcomplete) basis component and a coefficients one. In this way, we may
maintain the basis component in the memory and optimize over it and the
coefficients for each sample alternatively. Since the memory footprint of the
basis component is independent of the sample size, our algorithm is appealing
when manipulating a large collection of samples. We prove that the sequence of
the solutions (i.e., the basis component) produced by our algorithm converges
to a stationary point of the expected loss function asymptotically. Numerical
study demonstrates encouraging results for the efficacy and robustness of our
algorithm compared to the widely used nuclear norm solvers.

This paper considers the problem of matrix completion when some number of the
columns are completely and arbitrarily corrupted, potentially by a malicious
adversary. It is wellknown that standard algorithms for matrix completion can
return arbitrarily poor results, if even a single column is corrupted. One
direct application comes from robust collaborative filtering. Here, some number
of users are socalled manipulators who try to skew the predictions of the
algorithm by calibrating their inputs to the system. In this paper, we develop
an efficient algorithm for this problem based on a combination of a trimming
procedure and a convex program that minimizes the nuclear norm and the
$\ell_{1,2}$ norm. Our theoretical results show that given a vanishing fraction
of observed entries, it is nevertheless possible to complete the underlying
matrix even when the number of corrupted columns grows. Significantly, our
results hold without any assumptions on the locations or values of the observed
entries of the manipulated columns. Moreover, we show by an
informationtheoretic argument that our guarantees are nearly optimal in terms
of the fraction of sampled entries on the authentic columns, the fraction of
corrupted columns, and the rank of the underlying matrix. Our results therefore
sharply characterize the tradeoffs between sample, robustness and rank in
matrix completion.

Whereas classical Markov decision processes maximize the expected reward, we
consider minimizing the risk. We propose to evaluate the risk associated to a
given policy over a longenough time horizon with the help of a central limit
theorem. The proposed approach works whether the transition probabilities are
known or not. We also provide a gradientbased policy improvement algorithm
that converges to a local optimum of the risk objective.

In this paper, we first prove the local wellposedness of the 2D
incompressible NavierStokes equations with variable viscosity in critical
Besov spaces with negative regularity indices, without smallness assumption on
the variation of the density. The key is to prove for $p\in(1,4)$ and
$a\in\dot{B}_{p,1}^{\frac2p}(\mathbb{R}^2)$ that the solution mapping
$\mathcal{H}_a:F\mapsto\nabla\Pi$ to the 2D elliptic equation
$\mathrm{div}\big((1+a)\nabla\Pi\big)=\mathrm{div} F$ is bounded on
$\dot{B}_{p,1}^{\frac2p1}(\mathbb{R}^2)$. More precisely, we prove that
$$\\nabla\Pi\_{\dot{B}_{p,1}^{\frac2p1}}\leq
C\big(1+\a\_{\dot{B}_{p,1}^{\frac2p}}\big)^2\F\_{\dot{B}_{p,1}^{\frac2p1}}.$$
The proof of the uniqueness of solution to (1.2) relies on a Lagrangian
approach [15][17]. When the viscosity coefficient $\mu(\rho)$ is a positive
constant, we prove that (1.2) is globally wellposed.

This paper studies Markov Decision Processes under parameter uncertainty. We
adapt the distributionally robust optimization framework, and assume that the
uncertain parameters are random variables following an unknown distribution,
and seeks the strategy which maximizes the expected performance under the most
adversarial distribution. In particular, we generalize previous study
\cite{xu2012distributionally} which concentrates on distribution sets with very
special structure to much more generic class of distribution sets, and show
that the optimal strategy can be obtained efficiently under mild technical
condition. This significantly extends the applicability of distributionally
robust MDP to incorporate probabilistic information of uncertainty in a more
flexible way.