• ### Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization(1603.06560)

June 18, 2018 cs.LG, stat.ML
Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce a novel algorithm, Hyperband, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems.

Feb. 27, 2018 cs.LG, stat.ML
Federated learning poses new statistical and systems challenges in training machine learning models over distributed networks of devices. In this work, we show that multi-task learning is naturally suited to handle the statistical challenges of this setting, and propose a novel systems-aware optimization method, MOCHA, that is robust to practical systems issues. Our method and theory for the first time consider issues of high communication cost, stragglers, and fault tolerance for distributed multi-task learning. The resulting method achieves significant speedups compared to alternatives in the federated setting, as we demonstrate through simulations on real-world federated datasets.
• ### Parle: parallelizing stochastic gradient descent(1707.00424)

Sept. 10, 2017 cs.DC, cs.LG, stat.ML
We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been shown to lead to improved generalization error for deep networks. Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed implementations.
• ### MLlib: Machine Learning in Apache Spark(1505.06807)

May 26, 2015 cs.DC, cs.MS, cs.LG, stat.ML
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
• ### TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries(1502.00068)

March 8, 2015 cs.DC, cs.DB, cs.LG
The proliferation of massive datasets combined with the development of sophisticated analytical techniques have enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. These and many other applications can be supported by Predictive Analytic Queries (PAQs). A major obstacle to supporting PAQs is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single node implementations and have assumed that model training itself is a black box, thus limiting the effectiveness of such approaches on large-scale problems. In this work, we build upon these recent efforts and propose an integrated PAQ planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The result is TuPAQ, a component of the MLbase system, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.
• ### Non-stochastic Best Arm Identification and Hyperparameter Optimization(1502.07943)

Feb. 27, 2015 cs.LG, stat.ML
Motivated by the task of hyperparameter optimization, we introduce the non-stochastic best-arm identification problem. Within the multi-armed bandit literature, the cumulative regret objective enjoys algorithms and analyses for both the non-stochastic and stochastic settings while to the best of our knowledge, the best-arm identification framework has only been considered in the stochastic setting. We introduce the non-stochastic setting under this framework, identify a known algorithm that is well-suited for this setting, and analyze its behavior. Next, by leveraging the iterative nature of standard machine learning algorithms, we cast hyperparameter optimization as an instance of non-stochastic best-arm identification, and empirically evaluate our proposed algorithm on this task. Our empirical results show that, by allocating more resources to promising hyperparameter settings, we typically achieve comparable test accuracies an order of magnitude faster than baseline methods.
• ### Matrix Coherence and the Nystrom Method(1408.2044)

Aug. 9, 2014 cs.LG, stat.ML
The Nystrom method is an efficient technique used to speed up large-scale learning applications by generating low-rank approximations. Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns. In this work we relate this assumption to the concept of matrix coherence, connecting coherence to the performance of the Nystrom method. Making use of related work in the compressed sensing and the matrix completion literature, we derive novel coherence-based bounds for the Nystrom method in the low-rank setting. We then present empirical results that corroborate these theoretical bounds. Finally, we present more general empirical results for the full-rank setting that convincingly demonstrate the ability of matrix coherence to measure the degree to which information can be extracted from a subset of columns.
• ### SMaSH: A Benchmarking Toolkit for Human Genome Variant Calling(1310.8420)

Jan. 5, 2014 q-bio.QM, q-bio.GN
Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.
• ### Distributed Matrix Completion and Robust Factorization(1107.0789)

Oct. 28, 2013 math.NA, cs.NA, cs.DS, cs.LG, stat.ML
If learning methods are to scale to the massive sizes of modern datasets, it is essential for the field of machine learning to embrace parallel and distributed computing. Inspired by the recent development of matrix factorization methods with rich theory but poor computational complexity and by the relative ease of mapping matrices onto distributed architectures, we introduce a scalable divide-and-conquer framework for noisy matrix factorization. We present a thorough theoretical analysis of this framework in which we characterize the statistical errors introduced by the "divide" step and control their magnitude in the "conquer" step, so that the overall algorithm enjoys high-probability estimation guarantees comparable to those of its base algorithm. We also present experiments in collaborative filtering and video background modeling that demonstrate the near-linear to superlinear speed-ups attainable with this approach.
• ### MLI: An API for Distributed Machine Learning(1310.5426)

Oct. 25, 2013 cs.DC, cs.LG, stat.ML
MLI is an Application Programming Interface designed to address the challenges of building Machine Learn- ing algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
• ### Distributed Low-rank Subspace Segmentation(1304.5583)

Oct. 16, 2013 cs.CV, cs.DC, cs.LG, stat.ML
Vision problems ranging from image clustering to motion segmentation to semi-supervised learning can naturally be framed as subspace segmentation problems, in which one aims to recover multiple low-dimensional subspaces from noisy and corrupted input data. Low-Rank Representation (LRR), a convex formulation of the subspace segmentation problem, is provably and empirically accurate on small problems but does not scale to the massive sizes of modern vision datasets. Moreover, past work aimed at scaling up low-rank matrix factorization is not applicable to LRR given its non-decomposable constraints. In this work, we propose a novel divide-and-conquer algorithm for large-scale subspace segmentation that can cope with LRR's non-decomposable constraints and maintains LRR's strong recovery guarantees. This has immediate implications for the scalability of subspace segmentation, which we demonstrate on a benchmark face recognition dataset and in simulations. We then introduce novel applications of LRR-based subspace segmentation to large-scale semi-supervised learning for multimedia event detection, concept detection, and image tagging. In each case, we obtain state-of-the-art results and order-of-magnitude speed ups.
• ### A Scalable Bootstrap for Massive Data(1112.5016)

June 28, 2012 stat.CO, stat.ME, stat.ML
The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets---which are increasingly prevalent---the computation of bootstrap-based quantities can be prohibitively demanding computationally. While variants such as subsampling and the $m$ out of $n$ bootstrap can be used in principle to reduce the cost of bootstrap computations, we find that these methods are generally not robust to specification of hyperparameters (such as the number of subsampled data points), and they often require use of more prior information (such as rates of convergence of estimators) than the bootstrap. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate BLB's favorable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing BLB to the bootstrap, the $m$ out of $n$ bootstrap, and subsampling. In addition, we present results from a large-scale distributed implementation of BLB demonstrating its computational superiority on massive data, a method for adaptively selecting BLB's hyperparameters, an empirical study applying BLB to several real datasets, and an extension of BLB to time series data.
• ### Jointly Predicting Links and Inferring Attributes using a Social-Attribute Network (SAN)(1112.3265)

June 22, 2012 physics.soc-ph, cs.SI
The effects of social influence and homophily suggest that both network structure and node attribute information should inform the tasks of link prediction and node attribute inference. Recently, Yin et al. proposed Social-Attribute Network (SAN), an attribute-augmented social network, to integrate network structure and node attributes to perform both link prediction and attribute inference. They focused on generalizing the random walk with restart algorithm to the SAN framework and showed improved performance. In this paper, we extend the SAN framework with several leading supervised and unsupervised link prediction algorithms and demonstrate performance improvement for each algorithm on both link prediction and attribute inference. Moreover, we make the novel observation that attribute inference can help inform link prediction, i.e., link prediction accuracy is further improved by first inferring missing attributes. We comprehensively evaluate these algorithms and compare them with other existing algorithms using a novel, large-scale Google+ dataset, which we make publicly available.
• ### On the Estimation of Coherence(1009.0861)

Sept. 4, 2010 cs.AI, cs.LG, stat.ML
Low-rank matrix approximations are often used to help scale standard machine learning algorithms to large-scale problems. Recently, matrix coherence has been used to characterize the ability to extract global information from a subset of matrix entries in the context of these low-rank approximations and other sampling-based algorithms, e.g., matrix com- pletion, robust PCA. Since coherence is defined in terms of the singular vectors of a matrix and is expensive to compute, the practical significance of these results largely hinges on the following question: Can we efficiently and accurately estimate the coherence of a matrix? In this paper we address this question. We propose a novel algorithm for estimating coherence from a small number of columns, formally analyze its behavior, and derive a new coherence-based matrix approximation bound based on this analysis. We then present extensive experimental results on synthetic and real datasets that corroborate our worst-case theoretical analysis, yet provide strong support for the use of our proposed algorithm whenever low-rank approximation is being considered. Our algorithm efficiently and accurately estimates matrix coherence across a wide range of datasets, and these coherence estimates are excellent predictors of the effectiveness of sampling-based matrix approximation on a case-by-case basis.
• ### Matrix Coherence and the Nystrom Method(1004.2008)

April 12, 2010 cs.AI
The Nystrom method is an efficient technique to speed up large-scale learning applications by generating low-rank approximations. Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns. In this work we relate this assumption to the concept of matrix coherence and connect matrix coherence to the performance of the Nystrom method. Making use of related work in the compressed sensing and the matrix completion literature, we derive novel coherence-based bounds for the Nystrom method in the low-rank setting. We then present empirical results that corroborate these theoretical bounds. Finally, we present more general empirical results for the full-rank setting that convincingly demonstrate the ability of matrix coherence to measure the degree to which information can be extracted from a subset of columns.