• ### Online Crowdsourcing(1512.02393)

Feb. 8, 2019 cs.LG
With the success of modern internet based platform, such as Amazon Mechanical Turk, it is now normal to collect a large number of hand labeled samples from non-experts. The Dawid- Skene algorithm, which is based on Expectation- Maximization update, has been widely used for inferring the true labels from noisy crowdsourced labels. However, Dawid-Skene scheme requires all the data to perform each EM iteration, and can be infeasible for streaming data or large scale data. In this paper, we provide an online version of Dawid- Skene algorithm that only requires one data frame for each iteration. Further, we prove that under mild conditions, the online Dawid-Skene scheme with projection converges to a stationary point of the marginal log-likelihood of the observed data. Our experiments demonstrate that the online Dawid- Skene scheme achieves state of the art performance comparing with other methods based on the Dawid- Skene scheme.
• ### Online Gradient Descent in Function Space(1512.02394)

Feb. 8, 2019 cs.LG
In many problems in machine learning and operations research, we need to optimize a function whose input is a random variable or a probability density function, i.e. to solve optimization problems in an infinite dimensional space. On the other hand, online learning has the advantage of dealing with streaming examples, and better model a changing environ- ment. In this paper, we extend the celebrated online gradient descent algorithm to Hilbert spaces (function spaces), and analyze the convergence guarantee of the algorithm. Finally, we demonstrate that our algorithms can be useful in several important problems.
• ### Accelerated Randomized Mirror Descent Algorithms For Composite Non-strongly Convex Optimization(1605.06892)

Dec. 31, 2018 math.OC, stat.ML
We consider the problem of minimizing the sum of an average function of a large number of smooth convex components and a general, possibly non-differentiable, convex function. Although many methods have been proposed to solve this problem with the assumption that the sum is strongly convex, few methods support the non-strongly convex case. Adding a small quadratic regularization is a common devise used to tackle non-strongly convex problems; however, it may cause loss of sparsity of solutions or weaken the performance of the algorithms. Avoiding this devise, we propose an accelerated randomized mirror descent method for solving this problem without the strongly convex assumption. Our method extends the deterministic accelerated proximal gradient methods of Paul Tseng and can be applied even when proximal points are computed inexactly. We also propose a scheme for solving the problem when the component functions are non-smooth.
• ### A Multi-State Diagnosis and Prognosis Framework with Feature Learning for Tool Condition Monitoring(1805.00367)

April 30, 2018 cs.LG, eess.SP
In this paper, a multi-state diagnosis and prognosis (MDP) framework is proposed for tool condition monitoring via a deep belief network based multi-state approach (DBNMS). For fault diagnosis, a cost-sensitive deep belief network (namely ECS-DBN) is applied to deal with the imbalanced data problem for tool state estimation. An appropriate prognostic degradation model is then applied for tool wear estimation based on the different tool states. The proposed framework has the advantage of automatic feature representation learning and shows better performance in accuracy and robustness. The effectiveness of the proposed DBNMS is validated using a real-world dataset obtained from the gun drilling process. This dataset contains a large amount of measured signals involving different tool geometries under various operating conditions. The DBNMS is examined for both the tool state estimation and tool wear estimation tasks. In the experimental studies, the prediction results are evaluated and compared with popular machine learning approaches, which show the superior performance of the proposed DBNMS approach.
• ### Learning Deep Mean Field Games for Modeling Large Population Behavior(1711.03156)

April 22, 2018 cs.LG, cs.CE
We consider the problem of representing collective behavior of large populations and predicting the evolution of a population distribution over a discrete state space. A discrete time mean field game (MFG) is motivated as an interpretable model founded on game theory for understanding the aggregate effect of individual actions and predicting the temporal evolution of population distributions. We achieve a synthesis of MFG and Markov decision processes (MDP) by showing that a special MFG is reducible to an MDP. This enables us to broaden the scope of mean field game theory and infer MFG models of large real-world systems via deep inverse reinforcement learning. Our method learns both the reward function and forward dynamics of an MFG from real data, and we report the first empirical test of a mean field game model of a real-world social media population.
• ### Fast Global Convergence via Landscape of Empirical Loss(1802.04617)

Feb. 13, 2018 cs.LG, stat.ML
While optimizing convex objective (loss) functions has been a powerhouse for machine learning for at least two decades, non-convex loss functions have attracted fast growing interests recently, due to many desirable properties such as superior robustness and classification accuracy, compared with their convex counterparts. The main obstacle for non-convex estimators is that it is in general intractable to find the optimal solution. In this paper, we study the computational issues for some non-convex M-estimators. In particular, we show that the stochastic variance reduction methods converge to the global optimal with linear rate, by exploiting the statistical property of the population loss. En route, we improve the convergence analysis for the batch gradient method in \cite{mei2016landscape}.
• ### Bayesian Pool-based Active Learning With Abstention Feedbacks(1705.08481)

Sept. 27, 2017 cs.LG, stat.ML
We study pool-based active learning with abstention feedbacks, where a labeler can abstain from labeling a queried example with some unknown abstention rate. This is an important problem with many useful applications. We take a Bayesian approach to the problem and develop two new greedy algorithms that learn both the classification problem and the unknown abstention rate at the same time. These are achieved by simply incorporating the estimated abstention rate into the greedy criteria. We prove that both of our algorithms have near-optimality guarantees: they respectively achieve a ${(1-\frac{1}{e})}$ constant factor approximation of the optimal expected or worst-case value of a useful utility function. Our experiments show the algorithms perform well in various practical scenarios.
• ### On near optimality of one-sample update for joint detection and estimation(1705.06995)

Sept. 1, 2017 math.ST, stat.TH, cs.LG
Sequential hypothesis test and change-point detection when the distribution parameters are unknown is a fundamental problem in statistics and machine learning. We show that for such problems, detection procedures based on sequential likelihood ratios with simple one-sample update estimates such as online mirror descent are nearly second-order optimal. This means that the upper bound for the algorithm performance meets the lower bound asymptotically up to a log-log factor in the false-alarm rate when it tends to zero. This is a blessing, since although the generalized likelihood ratio(GLR) statistics are optimal theoretically, but they cannot be computed recursively, and their exact computation usually requires infinite memory of historical data. We prove the nearly second-order optimality by making a connection between sequential analysis and online convex optimization and leveraging the logarithmic regret bound property of online mirror descent algorithm. Numerical examples validate our theory.
• ### Non-convex Conditional Gradient Sliding(1708.04783)

Aug. 16, 2017 math.OC
We investigate a projection free method, namely conditional gradient sliding on batched, stochastic and finite-sum non-convex problem. CGS is a smart combination of Nesterov's accelerated gradient method and Frank-Wolfe (FW) method, and outperforms FW in the convex setting by saving gradient computations. However, the study of CGS in the non-convex setting is limited. In this paper, we propose the non-convex conditional gradient sliding (NCGS) which surpasses the non-convex Frank-Wolfe method in batched, stochastic and finite-sum setting.
• ### Linear Convergence of SVRG in Statistical Estimation(1611.01957)

July 27, 2017 cs.LG, stat.ML
SVRG and its variants are among the state of art optimization algorithms for large scale machine learning problems. It is well known that SVRG converges linearly when the objective function is strongly convex. However this setup can be restrictive, and does not include several important formulations such as Lasso, group Lasso, logistic regression, and some non-convex models including corrected Lasso and SCAD. In this paper, we prove that, for a class of statistical M-estimators covering examples mentioned above, SVRG solves the formulation with {\em a linear convergence rate} without strong convexity or even convexity. Our analysis makes use of {\em restricted strong convexity}, under which we show that SVRG converges linearly to the fundamental statistical precision of the model, i.e., the difference between true unknown parameter $\theta^*$ and the optimal solution $\hat{\theta}$ of the model.
• ### Fake News Mitigation via Point Process Based Intervention(1703.07823)

June 19, 2017 cs.SI, cs.LG
We propose the first multistage intervention framework that tackles fake news in social networks by combining reinforcement learning with a point process network activity model. The spread of fake news and mitigation events within the network is modeled by a multivariate Hawkes process with additional exogenous control terms. By choosing a feature representation of states, defining mitigation actions and constructing reward functions to measure the effectiveness of mitigation activities, we map the problem of fake news mitigation into the reinforcement learning framework. We develop a policy iteration method unique to the multivariate networked point process, with the goal of optimizing the actions for maximal total reward under budget constraints. Our method shows promising performance in real-time intervention experiments on a Twitter network to mitigate a surrogate fake news campaign, and outperforms alternatives on synthetic datasets.
• ### Reinforcement Learning under Model Mismatch(1706.04711)

June 15, 2017 cs.LG, stat.ML
We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation to it. We address this problem by extending the framework of robust MDPs to the model-free Reinforcement Learning setting, where we do not have access to the model parameters, but can only sample states from it. We define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy and approximate value function respectively. We scale up the robust algorithms to large MDPs via function approximation and prove convergence under two different settings. We prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the mean squared robust projected Bellman error and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.
• ### Adaptive Maximization of Pointwise Submodular Functions With Budget Constraint(1603.09029)

May 22, 2017 cs.AI, cs.DM, math.OC, stat.ML
We study the worst-case adaptive optimization problem with budget constraint that is useful for modeling various practical applications in artificial intelligence and machine learning. We investigate the near-optimality of greedy algorithms for this problem with both modular and non-modular cost functions. In both cases, we prove that two simple greedy algorithms are not near-optimal but the best between them is near-optimal if the utility function satisfies pointwise submodularity and pointwise cost-sensitive submodularity respectively. This implies a combined algorithm that is near-optimal with respect to the optimal algorithm that uses half of the budget. We discuss applications of our theoretical results and also report experiments comparing the greedy algorithms on the active learning problem.
• ### Approximate Value Iteration for Risk-aware Markov Decision Processes(1701.01290)

May 16, 2017 math.OC, cs.SY
We consider large-scale Markov decision processes (MDPs) with a risk measure of variability in cost, under the risk-aware MDPs paradigm. Previous studies showed that risk-aware MDPs, based on a minimax approach to handling risk, can be solved using dynamic programming for small to medium sized problems. However, due to the "curse of dimensionality", MDPs that model real-life problems are typically prohibitively large for such approaches. In this paper, we employ an approximate dynamic programming approach, and develop a family of simulation-based algorithms to approximately solve large-scale risk-aware MDPs. In parallel, we develop a unified convergence analysis technique to derive sample complexity bounds for this new family of algorithms.
• ### Linear convergence of SDCA in statistical estimation(1701.07808)

April 2, 2017 cs.LG, stat.ML
In this paper, we consider stochastic dual coordinate (SDCA) {\em without} strongly convex assumption or convex assumption. We show that SDCA converges linearly under mild conditions termed restricted strong convexity. This covers a wide array of popular statistical models including Lasso, group Lasso, and logistic regression with $\ell_1$ regularization, corrected Lasso and linear regression with SCAD regularizer. This significantly improves previous convergence results on SDCA for problems that are not strongly convex. As a by product, we derive a dual free form of SDCA that can handle general regularization term, which is of interest by itself.
• ### SAGA and Restricted Strong Convexity(1702.05683)

Feb. 27, 2017 stat.ML
SAGA is a fast incremental gradient method on the finite sum problem and its effectiveness has been tested on a vast of applications. In this paper, we analyze SAGA on a class of non-strongly convex and non-convex statistical problem such as Lasso, group Lasso, Logistic regression with $\ell_1$ regularization, linear regression with SCAD regularization and Correct Lasso. We prove that SAGA enjoys the linear convergence rate up to the statistical estimation accuracy, under the assumption of restricted strong convexity (RSC). It significantly extends the applicability of SAGA in convex and non-convex optimization.
• ### Outlier Robust Online Learning(1701.00251)

Jan. 1, 2017 cs.LG, stat.ML
We consider the problem of learning from noisy data in practical settings where the size of data is too large to store on a single machine. More challenging, the data coming from the wild may contain malicious outliers. To address the scalability and robustness issues, we present an online robust learning (ORL) approach. ORL is simple to implement and has provable robustness guarantee -- in stark contrast to existing online learning approaches that are generally fragile to outliers. We specialize the ORL approach for two concrete cases: online robust principal component analysis and online linear regression. We demonstrate the efficiency and robustness advantages of ORL through comprehensive simulations and predicting image tags on a large-scale data set. We also discuss extension of the ORL to distributed learning and provide experimental evaluations.
• ### Online Nonnegative Matrix Factorization with General Divergences(1608.00075)

Aug. 2, 2016 cs.IT, math.IT, cs.NA, math.OC, stat.ML
We develop a unified and systematic framework for performing online nonnegative matrix factorization under a wide variety of important divergences. The online nature of our algorithm makes it particularly amenable to large-scale data. We prove that the sequence of learned dictionaries converges almost surely to the set of critical points of the expected loss function. We do so by leveraging the theory of stochastic approximations and projected dynamical systems. This result substantially generalizes the previous results obtained only for the squared-$\ell_2$ loss. Moreover, the novel techniques involved in our analysis open new avenues for analyzing similar matrix factorization problems. The computational efficiency and the quality of the learned dictionary of our algorithm are verified empirically on both synthetic and real datasets. In particular, on the tasks of topic learning, shadow removal and image denoising, our algorithm achieves superior trade-offs between the quality of learned dictionary and running time over the batch and other online NMF algorithms.
• ### Ensemble Robustness of Deep Learning Algorithms(1602.02389)

June 29, 2016 cs.CV, cs.LG, stat.ML
The question why deep learning algorithms perform so well in practice has attracted increasing research interest. However, most of well-established approaches, such as hypothesis capacity, robustness or sparseness, have not provided complete explanations, due to the high complexity of the deep learning algorithms and their inherent randomness. In this work, we introduce a new approach~\textendash~ensemble robustness~\textendash~towards characterizing the generalization performance of generic deep learning algorithms. Ensemble robustness concerns robustness of the \emph{population} of the hypotheses that may be output by a learning algorithm. Through the lens of ensemble robustness, we reveal that a stochastic learning algorithm can generalize well as long as its sensitiveness to adversarial perturbation is bounded in average, or equivalently, the performance variance of the algorithm is small. Quantifying ensemble robustness of various deep learning algorithms may be difficult analytically. However, extensive simulations for seven common deep learning algorithms for different network architectures provide supporting evidence for our claims. Furthermore, our work explains the good performance of several published deep learning algorithms.
• ### Online Low-Rank Subspace Clustering by Basis Dictionary Pursuit(1503.08356)

May 26, 2016 stat.ML
Low-Rank Representation~(LRR) has been a significant method for segmenting data that are generated from a union of subspaces. It is also known that solving LRR is challenging in terms of time complexity and memory footprint, in that the size of the nuclear norm regularized matrix is $n$-by-$n$ (where $n$ is the number of samples). In this paper, we thereby develop a novel online implementation of LRR that reduces the memory cost from $O(n^2)$ to $O(pd)$, with $p$ being the ambient dimension and $d$ being some estimated rank~($d < p \ll n$). We also establish the theoretical guarantee that the sequence of solutions produced by our algorithm converges to a stationary point of the expected loss function asymptotically. Extensive experiments on synthetic and realistic datasets further substantiate that our algorithm is fast, robust and memory efficient.
• ### Online Optimization for Large-Scale Max-Norm Regularization(1406.3190)

May 14, 2016 cs.LG, stat.ML
Max-norm regularizer has been extensively studied in the last decade as it promotes an effective low-rank estimation for the underlying data. However, such max-norm regularized problems are typically formulated and solved in a batch manner, which prevents it from processing big data due to possible memory budget. In this paper, hence, we propose an online algorithm that is scalable to large-scale setting. Particularly, we consider the matrix decomposition problem as an example, although a simple variant of the algorithm and analysis can be adapted to other important problems such as matrix completion. The crucial technique in our implementation is to reformulating the max-norm to an equivalent matrix factorization form, where the factors consist of a (possibly overcomplete) basis component and a coefficients one. In this way, we may maintain the basis component in the memory and optimize over it and the coefficients for each sample alternatively. Since the memory footprint of the basis component is independent of the sample size, our algorithm is appealing when manipulating a large collection of samples. We prove that the sequence of the solutions (i.e., the basis component) produced by our algorithm converges to a stationary point of the expected loss function asymptotically. Numerical study demonstrates encouraging results for the efficacy and robustness of our algorithm compared to the widely used nuclear norm solvers.
• ### Matrix completion with column manipulation: Near-optimal sample-robustness-rank tradeoffs(1102.2254)

April 24, 2016 cs.IT, math.IT, stat.ML
This paper considers the problem of matrix completion when some number of the columns are completely and arbitrarily corrupted, potentially by a malicious adversary. It is well-known that standard algorithms for matrix completion can return arbitrarily poor results, if even a single column is corrupted. One direct application comes from robust collaborative filtering. Here, some number of users are so-called manipulators who try to skew the predictions of the algorithm by calibrating their inputs to the system. In this paper, we develop an efficient algorithm for this problem based on a combination of a trimming procedure and a convex program that minimizes the nuclear norm and the $\ell_{1,2}$ norm. Our theoretical results show that given a vanishing fraction of observed entries, it is nevertheless possible to complete the underlying matrix even when the number of corrupted columns grows. Significantly, our results hold without any assumptions on the locations or values of the observed entries of the manipulated columns. Moreover, we show by an information-theoretic argument that our guarantees are nearly optimal in terms of the fraction of sampled entries on the authentic columns, the fraction of corrupted columns, and the rank of the underlying matrix. Our results therefore sharply characterize the tradeoffs between sample, robustness and rank in matrix completion.
• ### Central-limit approach to risk-aware Markov decision processes(1512.00583)

Dec. 2, 2015 math.OC, cs.SY
Whereas classical Markov decision processes maximize the expected reward, we consider minimizing the risk. We propose to evaluate the risk associated to a given policy over a long-enough time horizon with the help of a central limit theorem. The proposed approach works whether the transition probabilities are known or not. We also provide a gradient-based policy improvement algorithm that converges to a local optimum of the risk objective.
• ### On the Well-posedness of 2-D Incompressible Navier-Stokes Equations with Variable Viscosity in Critical Spaces(1510.08196)

Oct. 28, 2015 math.AP
In this paper, we first prove the local well-posedness of the 2-D incompressible Navier-Stokes equations with variable viscosity in critical Besov spaces with negative regularity indices, without smallness assumption on the variation of the density. The key is to prove for $p\in(1,4)$ and $a\in\dot{B}_{p,1}^{\frac2p}(\mathbb{R}^2)$ that the solution mapping $\mathcal{H}_a:F\mapsto\nabla\Pi$ to the 2-D elliptic equation $\mathrm{div}\big((1+a)\nabla\Pi\big)=\mathrm{div} F$ is bounded on $\dot{B}_{p,1}^{\frac2p-1}(\mathbb{R}^2)$. More precisely, we prove that $$\|\nabla\Pi\|_{\dot{B}_{p,1}^{\frac2p-1}}\leq C\big(1+\|a\|_{\dot{B}_{p,1}^{\frac2p}}\big)^2\|F\|_{\dot{B}_{p,1}^{\frac2p-1}}.$$ The proof of the uniqueness of solution to (1.2) relies on a Lagrangian approach [15]-[17]. When the viscosity coefficient $\mu(\rho)$ is a positive constant, we prove that (1.2) is globally well-posed.
• ### Distributionally Robust Counterpart in Markov Decision Processes(1501.07418)

May 13, 2015 math.OC, cs.SY
This paper studies Markov Decision Processes under parameter uncertainty. We adapt the distributionally robust optimization framework, and assume that the uncertain parameters are random variables following an unknown distribution, and seeks the strategy which maximizes the expected performance under the most adversarial distribution. In particular, we generalize previous study \cite{xu2012distributionally} which concentrates on distribution sets with very special structure to much more generic class of distribution sets, and show that the optimal strategy can be obtained efficiently under mild technical condition. This significantly extends the applicability of distributionally robust MDP to incorporate probabilistic information of uncertainty in a more flexible way.