• ### The Complexity of Computing (Almost) Unitary Matrices With $\eps$-Copies of the Fourier Transform(1604.02557)

April 17, 2019 cs.CC
The complexity of computing the Fourier transform is a longstanding open problem. Very recently, Ailon (2013, 2014, 2015) showed in a collection of papers that, roughly speaking, a speedup of the Fourier transform computation implies numerical ill-condition. The papers also quantify this tradeoff. The main method for proving these results is via a potential function called quasi-entropy, reminiscent of Shannon entropy. The quasi-entropy method opens new doors to understanding the computational complexity of the important Fourier transformation. However, it suffers from various obvious limitations. This paper, motivated by one such limitation, partly overcomes it, while at the same time sheds llight on new interesting, and problems on the intersection of computational complexity and group theory. The paper also explains why this research direction, if fruitful, has a chance of solving much bigger questions about the complexity of the Fourier transform.
• ### Semi-supervised deep learning by metric embedding(1611.01449)

Dec. 4, 2018 cs.LG
Deep networks are successfully used as classification models yielding state-of-the-art results when trained on a large number of labeled samples. These models, however, are usually much less suited for semi-supervised problems because of their tendency to overfit easily when trained on small amounts of data. In this work we will explore a new training objective that is targeting a semi-supervised regime with only a small subset of labeled data. This criterion is based on a deep metric embedding over distance relations within the set of labeled samples, together with constraints over the embeddings of the unlabeled set. The final learned representations are discriminative in euclidean space, and hence can be used with subsequent nearest-neighbor classification using the labeled samples.
• ### Deep unsupervised learning through spatial contrasting(1610.00243)

Dec. 4, 2018 cs.AI, cs.LG, stat.ML
Convolutional networks have marked their place over the last few years as the best performing model for various visual tasks. They are, however, most suited for supervised learning from large amounts of labeled data. Previous attempts have been made to use unlabeled data to improve model performance by applying unsupervised techniques. These attempts require different architectures and training methods. In this work we present a novel approach for unsupervised training of Convolutional networks that is based on contrasting between spatial regions within images. This criterion can be employed within conventional neural networks and trained using standard techniques such as SGD and back-propagation, thus complementing supervised methods.
• ### Deep metric learning using Triplet network(1412.6622)

Dec. 4, 2018 cs.CV, cs.LG, stat.ML
Deep learning has proven itself as a successful set of models for learning useful semantic representations of data. These, however, are mostly implicitly learned as part of a classification task. In this paper we propose the triplet network model, which aims to learn useful representations by distance comparisons. A similar model was defined by Wang et al. (2014), tailor made for learning a ranking for image information retrieval. Here we demonstrate using various datasets that our model learns a better representation than that of its immediate competitor, the Siamese network. We also discuss future possible usage as a framework for unsupervised learning.
• ### Paraunitary Matrices, Entropy, Algebraic Condition Number and Fourier Computation(1609.03278)

Nov. 8, 2018 cs.CC
The Fourier Transform is one of the most important linear transformations used in science and engineering. Cooley and Tukey's Fast Fourier Transform (FFT) from 1964 is a method for computing this transformation in time $O(n\log n)$. From a lower bound perspective, relatively little is known. Ailon shows in 2013 an $\Omega(n\log n)$ bound for computing the normalized Fourier Transform assuming only unitary operations on two coordinates are allowed at each step, and no extra memory is allowed. In 2014, Ailon then improved the result to show that, in a $\kappa$-well conditioned computation, Fourier computation can be sped up by no more than $O(\kappa)$. The main conjecture is that Ailon's result can be exponentially improved, in the sense that $\kappa$-well condition cannot admit $\omega(\log \kappa)$ speedup. The main result here is that `algebraic' $\kappa$-well condition admits no more than $O(\sqrt \kappa)$ speedup. The definition of algebraic condition number is obtained by formally viewing multiplication by constants, as performed by the algorithm, as multiplication by indeterminates, giving rise to computation over polynomials. The algebraic condition number is related to the degree of these polynomials. Using the maximum modulus theorem from complex analysis, we show that algebraic condition number upper bounds standard condition number, and equals it in certain cases. Algebraic condition number is an interesting measure of numerical computation stability in its own right. Moreover, we believe that the approach of algebraic condition number has a good chance of establishing an algebraic version of the main conjecture.
• ### Approximate Clustering with Same-Cluster Queries(1704.01862)

Oct. 4, 2017 cs.DS
Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?" There are many clustering contexts where such same-cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the $k$-means clustering problem, with additional margin assumption, can be solved efficiently if one is allowed $O(k^2 \log{k} + k \log{n})$ same-cluster queries. This is interesting since the $k$-means problem, even with the margin assumption, is $\mathsf{NP}$-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting showing that a few of such same-cluster queries enables one to get a polynomial-time $(1 + \varepsilon)$-approximation algorithm for the $k$-means problem without any margin assumption on the input dataset. Again, this is interesting since the $k$-means problem is $\mathsf{NP}$-hard to approximate within a factor $(1 + c)$ for a fixed constant $0 < c < 1$. The number of same-cluster queries used is $\textrm{poly}(k/\varepsilon)$ which is independent of the size $n$ of the dataset. Our algorithm is based on the $D^2$-sampling technique. We also give a conditional lower bound on the number of same-cluster queries showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make $\Omega \left(\frac{k}{poly \log k} \right)$ same-cluster queries. Our algorithm can be extended for the case when the oracle is faulty. Another result we show with respect to the $k$-means++ seeding algorithm is that a small modification to the $k$-means++ seeding algorithm within the SSAC framework converts it to a constant factor approximation algorithm instead of the well known $O(\log{k})$-approximation algorithm.
• ### Spatial contrasting for deep unsupervised learning(1611.06996)

Nov. 21, 2016 cs.LG, stat.ML
Convolutional networks have marked their place over the last few years as the best performing model for various visual tasks. They are, however, most suited for supervised learning from large amounts of labeled data. Previous attempts have been made to use unlabeled data to improve model performance by applying unsupervised techniques. These attempts require different architectures and training methods. In this work we present a novel approach for unsupervised training of Convolutional networks that is based on contrasting between spatial regions within images. This criterion can be employed within conventional neural networks and trained using standard techniques such as SGD and back-propagation, thus complementing supervised methods.
• ### Tighter Fourier Transform Complexity Tradeoffs(1404.1741)

April 15, 2015 cs.CC
The Fourier Transform is one of the most important linear transformations used in science and engineering. Cooley and Tukey's Fast Fourier Transform (FFT) from 1964 is a method for computing this transformation in time $O(n\log n)$. Achieving a matching lower bound in a reasonable computational model is one of the most important open problems in theoretical computer science. In 2014, improving on his previous work, Ailon showed that if an algorithm speeds up the FFT by a factor of $b=b(n)\geq 1$, then it must rely on computing, as an intermediate "bottleneck" step, a linear mapping of the input with condition number $\Omega(b(n))$. Our main result shows that a factor $b$ speedup implies existence of not just one but $\Omega(n)$ $b$-ill conditioned bottlenecks occurring at $\Omega(n)$ different steps, each causing information from independent (orthogonal) components of the input to either overflow or underflow. This provides further evidence that beating FFT is hard. Our result also gives the first quantitative tradeoff between computation speed and information loss in Fourier computation on fixed word size architectures. The main technical result is an entropy analysis of the Fourier transform under transformations of low trace, which is interesting in its own right.
• ### An n\log n Lower Bound for Fourier Transform Computation in the Well Conditioned Model(1403.1307)

July 24, 2014 cs.CC
Obtaining a non-trivial (super-linear) lower bound for computation of the Fourier transform in the linear circuit model has been a long standing open problem for over 40 years. An early result by Morgenstern from 1973, provides an $\Omega(n \log n)$ lower bound for the unnormalized Fourier transform when the constants used in the computation are bounded. The proof uses a potential function related to a determinant. The result does not explain why the normalized Fourier transform (of unit determinant) should be difficult to compute in the same model. Hence, the result is not scale insensitive. More recently, Ailon (2013) showed that if only unitary 2-by-2 gates are used, and additionally no extra memory is allowed, then the normalized Fourier transform requires $\Omega(n\log n)$ steps. This rather limited result is also sensitive to scaling, but highlights the complexity inherent in the Fourier transform arising from introducing entropy, unlike, say, the identity matrix (which is as complex as the Fourier transform using Morgenstern's arguments, under proper scaling). In this work we extend the arguments of Ailon (2013). In the first extension, which is also the main contribution, we provide a lower bound for computing any scaling of the Fourier transform. Our restriction is that, the composition of all gates up to any point must be a well conditioned linear transformation. The lower bound is $\Omega(R^{-1}n\log n)$, where $R$ is the uniform condition number. Second, we assume extra space is allowed, as long as it contains information of bounded norm at the end of the computation. The main technical contribution is an extension of matrix entropy used in Ailon (2013) for unitary matrices to a potential function computable for any matrix, using Shannon entropy on "quasi-probabilities".
• ### Bandit Online Optimization Over the Permutahedron(1312.1530)

July 6, 2014 cs.LG
The permutahedron is the convex polytope with vertex set consisting of the vectors $(\pi(1),\dots, \pi(n))$ for all permutations (bijections) $\pi$ over $\{1,\dots, n\}$. We study a bandit game in which, at each step $t$, an adversary chooses a hidden weight weight vector $s_t$, a player chooses a vertex $\pi_t$ of the permutahedron and suffers an observed loss of $\sum_{i=1}^n \pi(i) s_t(i)$. A previous algorithm CombBand of Cesa-Bianchi et al (2009) guarantees a regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$. Unfortunately, CombBand requires at each step an $n$-by-$n$ matrix permanent approximation to within improved accuracy as $T$ grows, resulting in a total running time that is super linear in $T$, making it impractical for large time horizons. We provide an algorithm of regret $O(n^{3/2}\sqrt{T})$ with total time complexity $O(n^3T)$. The ideas are a combination of CombBand and a recent algorithm by Ailon (2013) for online optimization over the permutahedron in the full information setting. The technical core is a bound on the variance of the Plackett-Luce noisy sorting process's "pseudo loss". The bound is obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices generated from rational functions of exponentials of 3 parameters.
• ### Reducing Dueling Bandits to Cardinal Bandits(1405.3396)

May 14, 2014 cs.LG
We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form "A is preferred to B" (as opposed to cardinal feedback like "A has value 2.5"), giving it wide applicability in learning from implicit user feedback and revealed and stated preferences. In contrast to existing algorithms for the Dueling Bandits problem, our reductions -- named $\Doubler$, $\MultiSbm$ and $\DoubleSbm$ -- provide a generic schema for translating the extensive body of known results about conventional Multi-Armed Bandit algorithms to the Dueling Bandits setting. For $\Doubler$ and $\MultiSbm$ we prove regret upper bounds in both finite and infinite settings, and conjecture about the performance of $\DoubleSbm$ which empirically outperforms the other two as well as previous algorithms in our experiments. In addition, we provide the first almost optimal regret bound in terms of second order terms, such as the differences between the values of the arms.
• ### A tight lower bound instance for k-means++ in constant dimension(1401.2912)

Jan. 14, 2014 cs.DS
The k-means++ seeding algorithm is one of the most popular algorithms that is used for finding the initial $k$ centers when using the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: Pick the first center randomly from the given points. For $i > 1$, pick a point to be the $i^{th}$ center with probability proportional to the square of the Euclidean distance of this point to the closest previously $(i-1)$ chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an $O(\log{k})$ approximation in expectation as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of $\Omega(\log{k})$ in expectation. However, it is not clear from these results if the algorithm achieves good approximation factor with reasonably high probability (say $1/poly(k)$). Brunsch and R\"{o}glin gave a dataset where the k-means++ seeding algorithm achieves an $O(\log{k})$ approximation ratio with probability that is exponentially small in $k$. However, this and all other known lower-bound examples are high dimensional. So, an open problem was to understand the behavior of the algorithm on low dimensional datasets. In this work, we give a simple two dimensional dataset on which the seeding algorithm achieves an $O(\log{k})$ approximation ratio with probability exponentially small in $k$. This solves open problems posed by Mahajan et al. and by Brunsch and R\"{o}glin.
• ### Online Ranking: Discrete Choice, Spearman Correlation and Other Feedback(1308.6797)

Oct. 14, 2013 cs.GT, cs.LG, stat.ML
Given a set $V$ of $n$ objects, an online ranking system outputs at each time step a full ranking of the set, observes a feedback of some form and suffers a loss. We study the setting in which the (adversarial) feedback is an element in $V$, and the loss is the position (0th, 1st, 2nd...) of the item in the outputted ranking. More generally, we study a setting in which the feedback is a subset $U$ of at most $k$ elements in $V$, and the loss is the sum of the positions of those elements. We present an algorithm of expected regret $O(n^{3/2}\sqrt{Tk})$ over a time horizon of $T$ steps with respect to the best single ranking in hindsight. This improves previous algorithms and analyses either by a factor of either $\Omega(\sqrt{k})$, a factor of $\Omega(\sqrt{\log n})$ or by improving running time from quadratic to $O(n\log n)$ per round. We also prove a matching lower bound. Our techniques also imply an improved regret bound for online rank aggregation over the Spearman correlation measure, and to other more complex ranking loss functions.
• ### A Lower Bound for Fourier Transform Computation in a Linear Model Over 2x2 Unitary Gates Using Matrix Entropy(1305.4745)

May 21, 2013 cs.CC
Obtaining a non-trivial (super-linear) lower bound for computation of the Fourier transform in the linear circuit model has been a long standing open problem. All lower bounds so far have made strong restrictions on the computational model. One of the most well known results, by Morgenstern from 1973, provides an $\Omega(n \log n)$ lower bound for the \emph{unnormalized} FFT when the constants used in the computation are bounded. The proof uses a potential function related to a determinant. The determinant of the unnormalized Fourier transform is $n^{n/2}$, and thus by showing that it can grow by at most a constant factor after each step yields the result. This classic result, however, does not explain why the \emph{normalized} Fourier transform, which has a unit determinant, should take $\Omega(n\log n)$ steps to compute. In this work we show that in a layered linear circuit model restricted to unitary $2\times 2$ gates, one obtains an $\Omega(n\log n)$ lower bound. The well known FFT works in this model. The main argument concluded from this work is that a potential function that might eventually help proving the $\Omega(n\log n)$ conjectured lower bound for computation of Fourier transform is not related to matrix determinant, but rather to a notion of matrix entropy.
• ### Breaking the Small Cluster Barrier of Graph Clustering(1302.4549)

Feb. 20, 2013 cs.LG, stat.ML
This paper investigates graph clustering in the planted cluster model in the presence of {\em small clusters}. Traditional results dictate that for an algorithm to provably correctly recover the clusters, {\em all} clusters must be sufficiently large (in particular, $\tilde{\Omega}(\sqrt{n})$ where $n$ is the number of nodes of the graph). We show that this is not really a restriction: by a more refined analysis of the trace-norm based recovery approach proposed in Jalali et al. (2011) and Chen et al. (2012), we prove that small clusters, under certain mild assumptions, do not hinder recovery of large ones. Based on this result, we further devise an iterative algorithm to recover {\em almost all clusters} via a "peeling strategy", i.e., recover large clusters first, leading to a reduced problem, and repeat this procedure. These results are extended to the {\em partial observation} setting, in which only a (chosen) part of the graph is observed.The peeling strategy gives rise to an active learning algorithm, in which edges adjacent to smaller clusters are queried more often as large clusters are learned (and removed). From a high level, this paper sheds novel insights on high-dimensional statistics and learning structured data, by presenting a structured matrix learning problem for which a one shot convex relaxation approach necessarily fails, but a carefully constructed sequence of convex relaxationsdoes the job.
• ### Fast and RIP-optimal transforms(1301.0878)

Feb. 17, 2013 cs.IT, math.IT, cs.NA
We study constructions of $k \times n$ matrices $A$ that both (1) satisfy the restricted isometry property (RIP) at sparsity $s$ with optimal parameters, and (2) are efficient in the sense that only $O(n\log n)$ operations are required to compute $Ax$ given a vector $x$. Our construction is based on repeated application of independent transformations of the form $DH$, where $H$ is a Hadamard or Fourier transform and $D$ is a diagonal matrix with random $\{+1,-1\}$ elements on the diagonal, followed by any $k \times n$ matrix of orthonormal rows (e.g.\ selection of $k$ coordinates). We provide guarantees (1) and (2) for a larger regime of parameters for which such constructions were previously unknown. Additionally, our construction does not suffer from the extra poly-logarithmic factor multiplying the number of observations $k$ as a function of the sparsity $s$, as present in the currently best known RIP estimates for partial random Fourier matrices and other classes of structured random matrices.
• ### Active Learning Using Smooth Relative Regret Approximations with Applications(1110.2136)

June 20, 2012 cs.LG
The disagreement coefficient of Hanneke has become a central data independent invariant in proving active learning rates. It has been shown in various ways that a concept class with low complexity together with a bound on the disagreement coefficient at an optimal solution allows active learning rates that are superior to passive learning ones. We present a different tool for pool based active learning which follows from the existence of a certain uniform version of low disagreement coefficient, but is not equivalent to it. In fact, we present two fundamental active learning problems of significant interest for which our approach allows nontrivial active learning bounds. However, any general purpose method relying on the disagreement coefficient bounds only fails to guarantee any useful bounds for these problems. The tool we use is based on the learner's ability to compute an estimator of the difference between the loss of any hypotheses and some fixed "pivotal" hypothesis to within an absolute error of at most $\eps$ times the
• ### A note on: No need to choose: How to get both a PTAS and Sublinear Query Complexity(1204.6588)

April 30, 2012 cs.CC
We revisit various PTAS's (Polynomial Time Approximation Schemes) for minimization versions of dense problems, and show that they can be performed with sublinear query complexity. This means that not only do we obtain a (1+eps)-approximation to the NP-Hard problems in polynomial time, but also avoid reading the entire input. This setting is particularly advantageous when the price of reading parts of the input is high, as is the case, for examples, where humans provide the input. Trading off query complexity with approximation is the raison d'etre of the field of learning theory, and of the ERM (Empirical Risk Minimization) setting in particular. A typical ERM result, however, does not deal with computational complexity. We discuss two particular problems for which (a) it has already been shown that sublinear querying is sufficient for obtaining a (1 + eps)-approximation using unlimited computational power (an ERM result), and (b) with full access to input, we could get a (1+eps)-approximation in polynomial time (a PTAS). Here we show that neither benefit need be sacrificed. We get a PTAS with efficient query complexity.
• ### Active Learning of Custering with Side Information Using $\eps$-Smooth Relative Regret Approximations(1201.6462)

Jan. 31, 2012 cs.LG
Clustering is considered a non-supervised learning setting, in which the goal is to partition a collection of data points into disjoint clusters. Often a bound $k$ on the number of clusters is given or assumed by the practitioner. Many versions of this problem have been defined, most notably $k$-means and $k$-median. An underlying problem with the unsupervised nature of clustering it that of determining a similarity function. One approach for alleviating this difficulty is known as clustering with side information, alternatively, semi-supervised clustering. Here, the practitioner incorporates side information in the form of "must be clustered" or "must be separated" labels for data point pairs. Each such piece of information comes at a "query cost" (often involving human response solicitation). The collection of labels is then incorporated in the usual clustering algorithm as either strict or as soft constraints, possibly adding a pairwise constraint penalty function to the chosen clustering objective. Our work is mostly related to clustering with side information. We ask how to choose the pairs of data points. Our analysis gives rise to a method provably better than simply choosing them uniformly at random. Roughly speaking, we show that the distribution must be biased so as more weight is placed on pairs incident to elements in smaller clusters in some optimal solution. Of course we do not know the optimal solution, hence we don't know the bias. Using the recently introduced method of $\eps$-smooth relative regret approximations of Ailon, Begleiter and Ezra, we can show an iterative process that improves both the clustering and the bias in tandem. The process provably converges to the optimal solution faster (in terms of query cost) than an algorithm selecting pairs uniformly.
• ### An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity(1011.0108)

May 17, 2011 cs.DS