
The complexity of computing the Fourier transform is a longstanding open
problem. Very recently, Ailon (2013, 2014, 2015) showed in a collection of
papers that, roughly speaking, a speedup of the Fourier transform computation
implies numerical illcondition. The papers also quantify this tradeoff. The
main method for proving these results is via a potential function called
quasientropy, reminiscent of Shannon entropy. The quasientropy method opens
new doors to understanding the computational complexity of the important
Fourier transformation. However, it suffers from various obvious limitations.
This paper, motivated by one such limitation, partly overcomes it, while at the
same time sheds llight on new interesting, and problems on the intersection of
computational complexity and group theory. The paper also explains why this
research direction, if fruitful, has a chance of solving much bigger questions
about the complexity of the Fourier transform.

Deep networks are successfully used as classification models yielding
stateoftheart results when trained on a large number of labeled samples.
These models, however, are usually much less suited for semisupervised
problems because of their tendency to overfit easily when trained on small
amounts of data. In this work we will explore a new training objective that is
targeting a semisupervised regime with only a small subset of labeled data.
This criterion is based on a deep metric embedding over distance relations
within the set of labeled samples, together with constraints over the
embeddings of the unlabeled set. The final learned representations are
discriminative in euclidean space, and hence can be used with subsequent
nearestneighbor classification using the labeled samples.

Convolutional networks have marked their place over the last few years as the
best performing model for various visual tasks. They are, however, most suited
for supervised learning from large amounts of labeled data. Previous attempts
have been made to use unlabeled data to improve model performance by applying
unsupervised techniques. These attempts require different architectures and
training methods. In this work we present a novel approach for unsupervised
training of Convolutional networks that is based on contrasting between spatial
regions within images. This criterion can be employed within conventional
neural networks and trained using standard techniques such as SGD and
backpropagation, thus complementing supervised methods.

Deep learning has proven itself as a successful set of models for learning
useful semantic representations of data. These, however, are mostly implicitly
learned as part of a classification task. In this paper we propose the triplet
network model, which aims to learn useful representations by distance
comparisons. A similar model was defined by Wang et al. (2014), tailor made for
learning a ranking for image information retrieval. Here we demonstrate using
various datasets that our model learns a better representation than that of its
immediate competitor, the Siamese network. We also discuss future possible
usage as a framework for unsupervised learning.

The Fourier Transform is one of the most important linear transformations
used in science and engineering. Cooley and Tukey's Fast Fourier Transform
(FFT) from 1964 is a method for computing this transformation in time $O(n\log
n)$. From a lower bound perspective, relatively little is known. Ailon shows in
2013 an $\Omega(n\log n)$ bound for computing the normalized Fourier Transform
assuming only unitary operations on two coordinates are allowed at each step,
and no extra memory is allowed. In 2014, Ailon then improved the result to show
that, in a $\kappa$well conditioned computation, Fourier computation can be
sped up by no more than $O(\kappa)$. The main conjecture is that Ailon's result
can be exponentially improved, in the sense that $\kappa$well condition cannot
admit $\omega(\log \kappa)$ speedup.
The main result here is that `algebraic' $\kappa$well condition admits no
more than $O(\sqrt \kappa)$ speedup. The definition of algebraic condition
number is obtained by formally viewing multiplication by constants, as
performed by the algorithm, as multiplication by indeterminates, giving rise to
computation over polynomials. The algebraic condition number is related to the
degree of these polynomials. Using the maximum modulus theorem from complex
analysis, we show that algebraic condition number upper bounds standard
condition number, and equals it in certain cases. Algebraic condition number is
an interesting measure of numerical computation stability in its own right.
Moreover, we believe that the approach of algebraic condition number has a good
chance of establishing an algebraic version of the main conjecture.

Ashtiani et al. proposed a SemiSupervised Active Clustering framework
(SSAC), where the learner is allowed to make adaptive queries to a domain
expert. The queries are of the kind "do two given points belong to the same
optimal cluster?" There are many clustering contexts where such samecluster
queries are feasible. Ashtiani et al. exhibited the power of such queries by
showing that any instance of the $k$means clustering problem, with additional
margin assumption, can be solved efficiently if one is allowed $O(k^2 \log{k} +
k \log{n})$ samecluster queries. This is interesting since the $k$means
problem, even with the margin assumption, is $\mathsf{NP}$hard.
In this paper, we extend the work of Ashtiani et al. to the approximation
setting showing that a few of such samecluster queries enables one to get a
polynomialtime $(1 + \varepsilon)$approximation algorithm for the $k$means
problem without any margin assumption on the input dataset. Again, this is
interesting since the $k$means problem is $\mathsf{NP}$hard to approximate
within a factor $(1 + c)$ for a fixed constant $0 < c < 1$. The number of
samecluster queries used is $\textrm{poly}(k/\varepsilon)$ which is
independent of the size $n$ of the dataset. Our algorithm is based on the
$D^2$sampling technique. We also give a conditional lower bound on the number
of samecluster queries showing that if the Exponential Time Hypothesis (ETH)
holds, then any such efficient query algorithm needs to make $\Omega
\left(\frac{k}{poly \log k} \right)$ samecluster queries. Our algorithm can be
extended for the case when the oracle is faulty. Another result we show with
respect to the $k$means++ seeding algorithm is that a small modification to
the $k$means++ seeding algorithm within the SSAC framework converts it to a
constant factor approximation algorithm instead of the well known
$O(\log{k})$approximation algorithm.

Convolutional networks have marked their place over the last few years as the
best performing model for various visual tasks. They are, however, most suited
for supervised learning from large amounts of labeled data. Previous attempts
have been made to use unlabeled data to improve model performance by applying
unsupervised techniques. These attempts require different architectures and
training methods. In this work we present a novel approach for unsupervised
training of Convolutional networks that is based on contrasting between spatial
regions within images. This criterion can be employed within conventional
neural networks and trained using standard techniques such as SGD and
backpropagation, thus complementing supervised methods.

The Fourier Transform is one of the most important linear transformations
used in science and engineering. Cooley and Tukey's Fast Fourier Transform
(FFT) from 1964 is a method for computing this transformation in time $O(n\log
n)$. Achieving a matching lower bound in a reasonable computational model is
one of the most important open problems in theoretical computer science.
In 2014, improving on his previous work, Ailon showed that if an algorithm
speeds up the FFT by a factor of $b=b(n)\geq 1$, then it must rely on
computing, as an intermediate "bottleneck" step, a linear mapping of the input
with condition number $\Omega(b(n))$. Our main result shows that a factor $b$
speedup implies existence of not just one but $\Omega(n)$ $b$ill conditioned
bottlenecks occurring at $\Omega(n)$ different steps, each causing information
from independent (orthogonal) components of the input to either overflow or
underflow. This provides further evidence that beating FFT is hard. Our result
also gives the first quantitative tradeoff between computation speed and
information loss in Fourier computation on fixed word size architectures. The
main technical result is an entropy analysis of the Fourier transform under
transformations of low trace, which is interesting in its own right.

Obtaining a nontrivial (superlinear) lower bound for computation of the
Fourier transform in the linear circuit model has been a long standing open
problem for over 40 years.
An early result by Morgenstern from 1973, provides an $\Omega(n \log n)$
lower bound for the unnormalized Fourier transform when the constants used in
the computation are bounded. The proof uses a potential function related to a
determinant. The result does not explain why the normalized Fourier transform
(of unit determinant) should be difficult to compute in the same model. Hence,
the result is not scale insensitive.
More recently, Ailon (2013) showed that if only unitary 2by2 gates are
used, and additionally no extra memory is allowed, then the normalized Fourier
transform requires $\Omega(n\log n)$ steps. This rather limited result is also
sensitive to scaling, but highlights the complexity inherent in the Fourier
transform arising from introducing entropy, unlike, say, the identity matrix
(which is as complex as the Fourier transform using Morgenstern's arguments,
under proper scaling).
In this work we extend the arguments of Ailon (2013). In the first extension,
which is also the main contribution, we provide a lower bound for computing any
scaling of the Fourier transform. Our restriction is that, the composition of
all gates up to any point must be a well conditioned linear transformation. The
lower bound is $\Omega(R^{1}n\log n)$, where $R$ is the uniform condition
number. Second, we assume extra space is allowed, as long as it contains
information of bounded norm at the end of the computation.
The main technical contribution is an extension of matrix entropy used in
Ailon (2013) for unitary matrices to a potential function computable for any
matrix, using Shannon entropy on "quasiprobabilities".

The permutahedron is the convex polytope with vertex set consisting of the
vectors $(\pi(1),\dots, \pi(n))$ for all permutations (bijections) $\pi$ over
$\{1,\dots, n\}$. We study a bandit game in which, at each step $t$, an
adversary chooses a hidden weight weight vector $s_t$, a player chooses a
vertex $\pi_t$ of the permutahedron and suffers an observed loss of
$\sum_{i=1}^n \pi(i) s_t(i)$.
A previous algorithm CombBand of CesaBianchi et al (2009) guarantees a
regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$. Unfortunately,
CombBand requires at each step an $n$by$n$ matrix permanent approximation to
within improved accuracy as $T$ grows, resulting in a total running time that
is super linear in $T$, making it impractical for large time horizons.
We provide an algorithm of regret $O(n^{3/2}\sqrt{T})$ with total time
complexity $O(n^3T)$. The ideas are a combination of CombBand and a recent
algorithm by Ailon (2013) for online optimization over the permutahedron in the
full information setting. The technical core is a bound on the variance of the
PlackettLuce noisy sorting process's "pseudo loss". The bound is obtained by
establishing positive semidefiniteness of a family of 3by3 matrices
generated from rational functions of exponentials of 3 parameters.

We present algorithms for reducing the Dueling Bandits problem to the
conventional (stochastic) MultiArmed Bandits problem. The Dueling Bandits
problem is an online model of learning with ordinal feedback of the form "A is
preferred to B" (as opposed to cardinal feedback like "A has value 2.5"),
giving it wide applicability in learning from implicit user feedback and
revealed and stated preferences. In contrast to existing algorithms for the
Dueling Bandits problem, our reductions  named $\Doubler$, $\MultiSbm$ and
$\DoubleSbm$  provide a generic schema for translating the extensive body of
known results about conventional MultiArmed Bandit algorithms to the Dueling
Bandits setting. For $\Doubler$ and $\MultiSbm$ we prove regret upper bounds in
both finite and infinite settings, and conjecture about the performance of
$\DoubleSbm$ which empirically outperforms the other two as well as previous
algorithms in our experiments. In addition, we provide the first almost optimal
regret bound in terms of second order terms, such as the differences between
the values of the arms.

The kmeans++ seeding algorithm is one of the most popular algorithms that is
used for finding the initial $k$ centers when using the kmeans heuristic. The
algorithm is a simple sampling procedure and can be described as follows: Pick
the first center randomly from the given points. For $i > 1$, pick a point to
be the $i^{th}$ center with probability proportional to the square of the
Euclidean distance of this point to the closest previously $(i1)$ chosen
centers.
The kmeans++ seeding algorithm is not only simple and fast but also gives an
$O(\log{k})$ approximation in expectation as shown by Arthur and Vassilvitskii.
There are datasets on which this seeding algorithm gives an approximation
factor of $\Omega(\log{k})$ in expectation. However, it is not clear from these
results if the algorithm achieves good approximation factor with reasonably
high probability (say $1/poly(k)$). Brunsch and R\"{o}glin gave a dataset where
the kmeans++ seeding algorithm achieves an $O(\log{k})$ approximation ratio
with probability that is exponentially small in $k$. However, this and all
other known lowerbound examples are high dimensional. So, an open problem was
to understand the behavior of the algorithm on low dimensional datasets. In
this work, we give a simple two dimensional dataset on which the seeding
algorithm achieves an $O(\log{k})$ approximation ratio with probability
exponentially small in $k$. This solves open problems posed by Mahajan et al.
and by Brunsch and R\"{o}glin.

Given a set $V$ of $n$ objects, an online ranking system outputs at each time
step a full ranking of the set, observes a feedback of some form and suffers a
loss. We study the setting in which the (adversarial) feedback is an element in
$V$, and the loss is the position (0th, 1st, 2nd...) of the item in the
outputted ranking. More generally, we study a setting in which the feedback is
a subset $U$ of at most $k$ elements in $V$, and the loss is the sum of the
positions of those elements.
We present an algorithm of expected regret $O(n^{3/2}\sqrt{Tk})$ over a time
horizon of $T$ steps with respect to the best single ranking in hindsight. This
improves previous algorithms and analyses either by a factor of either
$\Omega(\sqrt{k})$, a factor of $\Omega(\sqrt{\log n})$ or by improving running
time from quadratic to $O(n\log n)$ per round. We also prove a matching lower
bound. Our techniques also imply an improved regret bound for online rank
aggregation over the Spearman correlation measure, and to other more complex
ranking loss functions.

Obtaining a nontrivial (superlinear) lower bound for computation of the
Fourier transform in the linear circuit model has been a long standing open
problem. All lower bounds so far have made strong restrictions on the
computational model. One of the most well known results, by Morgenstern from
1973, provides an $\Omega(n \log n)$ lower bound for the \emph{unnormalized}
FFT when the constants used in the computation are bounded. The proof uses a
potential function related to a determinant. The determinant of the
unnormalized Fourier transform is $n^{n/2}$, and thus by showing that it can
grow by at most a constant factor after each step yields the result.
This classic result, however, does not explain why the \emph{normalized}
Fourier transform, which has a unit determinant, should take $\Omega(n\log n)$
steps to compute. In this work we show that in a layered linear circuit model
restricted to unitary $2\times 2$ gates, one obtains an $\Omega(n\log n)$ lower
bound. The well known FFT works in this model. The main argument concluded from
this work is that a potential function that might eventually help proving the
$\Omega(n\log n)$ conjectured lower bound for computation of Fourier transform
is not related to matrix determinant, but rather to a notion of matrix entropy.

This paper investigates graph clustering in the planted cluster model in the
presence of {\em small clusters}. Traditional results dictate that for an
algorithm to provably correctly recover the clusters, {\em all} clusters must
be sufficiently large (in particular, $\tilde{\Omega}(\sqrt{n})$ where $n$ is
the number of nodes of the graph). We show that this is not really a
restriction: by a more refined analysis of the tracenorm based recovery
approach proposed in Jalali et al. (2011) and Chen et al. (2012), we prove that
small clusters, under certain mild assumptions, do not hinder recovery of large
ones.
Based on this result, we further devise an iterative algorithm to recover
{\em almost all clusters} via a "peeling strategy", i.e., recover large
clusters first, leading to a reduced problem, and repeat this procedure. These
results are extended to the {\em partial observation} setting, in which only a
(chosen) part of the graph is observed.The peeling strategy gives rise to an
active learning algorithm, in which edges adjacent to smaller clusters are
queried more often as large clusters are learned (and removed).
From a high level, this paper sheds novel insights on highdimensional
statistics and learning structured data, by presenting a structured matrix
learning problem for which a one shot convex relaxation approach necessarily
fails, but a carefully constructed sequence of convex relaxationsdoes the job.

We study constructions of $k \times n$ matrices $A$ that both (1) satisfy the
restricted isometry property (RIP) at sparsity $s$ with optimal parameters, and
(2) are efficient in the sense that only $O(n\log n)$ operations are required
to compute $Ax$ given a vector $x$. Our construction is based on repeated
application of independent transformations of the form $DH$, where $H$ is a
Hadamard or Fourier transform and $D$ is a diagonal matrix with random
$\{+1,1\}$ elements on the diagonal, followed by any $k \times n$ matrix of
orthonormal rows (e.g.\ selection of $k$ coordinates). We provide guarantees
(1) and (2) for a larger regime of parameters for which such constructions were
previously unknown. Additionally, our construction does not suffer from the
extra polylogarithmic factor multiplying the number of observations $k$ as a
function of the sparsity $s$, as present in the currently best known RIP
estimates for partial random Fourier matrices and other classes of structured
random matrices.

The disagreement coefficient of Hanneke has become a central data independent
invariant in proving active learning rates. It has been shown in various ways
that a concept class with low complexity together with a bound on the
disagreement coefficient at an optimal solution allows active learning rates
that are superior to passive learning ones.
We present a different tool for pool based active learning which follows from
the existence of a certain uniform version of low disagreement coefficient, but
is not equivalent to it. In fact, we present two fundamental active learning
problems of significant interest for which our approach allows nontrivial
active learning bounds. However, any general purpose method relying on the
disagreement coefficient bounds only fails to guarantee any useful bounds for
these problems.
The tool we use is based on the learner's ability to compute an estimator of
the difference between the loss of any hypotheses and some fixed "pivotal"
hypothesis to within an absolute error of at most $\eps$ times the

We revisit various PTAS's (Polynomial Time Approximation Schemes) for
minimization versions of dense problems, and show that they can be performed
with sublinear query complexity. This means that not only do we obtain a
(1+eps)approximation to the NPHard problems in polynomial time, but also
avoid reading the entire input. This setting is particularly advantageous when
the price of reading parts of the input is high, as is the case, for examples,
where humans provide the input. Trading off query complexity with approximation
is the raison d'etre of the field of learning theory, and of the ERM (Empirical
Risk Minimization) setting in particular. A typical ERM result, however, does
not deal with computational complexity. We discuss two particular problems for
which (a) it has already been shown that sublinear querying is sufficient for
obtaining a (1 + eps)approximation using unlimited computational power (an ERM
result), and (b) with full access to input, we could get a
(1+eps)approximation in polynomial time (a PTAS). Here we show that neither
benefit need be sacrificed. We get a PTAS with efficient query complexity.

Clustering is considered a nonsupervised learning setting, in which the goal
is to partition a collection of data points into disjoint clusters. Often a
bound $k$ on the number of clusters is given or assumed by the practitioner.
Many versions of this problem have been defined, most notably $k$means and
$k$median.
An underlying problem with the unsupervised nature of clustering it that of
determining a similarity function. One approach for alleviating this difficulty
is known as clustering with side information, alternatively, semisupervised
clustering. Here, the practitioner incorporates side information in the form of
"must be clustered" or "must be separated" labels for data point pairs. Each
such piece of information comes at a "query cost" (often involving human
response solicitation). The collection of labels is then incorporated in the
usual clustering algorithm as either strict or as soft constraints, possibly
adding a pairwise constraint penalty function to the chosen clustering
objective.
Our work is mostly related to clustering with side information. We ask how to
choose the pairs of data points. Our analysis gives rise to a method provably
better than simply choosing them uniformly at random. Roughly speaking, we show
that the distribution must be biased so as more weight is placed on pairs
incident to elements in smaller clusters in some optimal solution. Of course we
do not know the optimal solution, hence we don't know the bias. Using the
recently introduced method of $\eps$smooth relative regret approximations of
Ailon, Begleiter and Ezra, we can show an iterative process that improves both
the clustering and the bias in tandem. The process provably converges to the
optimal solution faster (in terms of query cost) than an algorithm selecting
pairs uniformly.

We study the problem of learning to rank from pairwise preferences, and solve
a longstanding open problem that has led to development of many heuristics but
no provable results for our particular problem. Given a set $V$ of $n$
elements, we wish to linearly order them given pairwise preference labels. A
pairwise preference label is obtained as a response, typically from a human, to
the question "which if preferred, u or v?$ for two elements $u,v\in V$. We
assume possible nontransitivity paradoxes which may arise naturally due to
human mistakes or irrationality. The goal is to linearly order the elements
from the most preferred to the least preferred, while disagreeing with as few
pairwise preference labels as possible. Our performance is measured by two
parameters: The loss and the query complexity (number of pairwise preference
labels we obtain). This is a typical learning problem, with the exception that
the space from which the pairwise preferences is drawn is finite, consisting of
${n\choose 2}$ possibilities only. We present an active learning algorithm for
this problem, with query bounds significantly beating general (non active)
bounds for the same error guarantee, while almost achieving the information
theoretical lower bound. Our main construct is a decomposition of the input
s.t. (i) each block incurs high loss at optimum, and (ii) the optimal solution
respecting the decomposition is not much worse than the true opt. The
decomposition is done by adapting a recent result by Kenyon and Schudy for a
related combinatorial optimization problem to the query efficient setting. We
thus settle an open problem posed by learningtorank theoreticians and
practitioners: What is a provably correct way to sample preference labels? To
further show the power and practicality of our solution, we show how to use it
in concert with an SVM relaxation.

Bipartite Correlation clustering is the problem of generating a set of
disjoint bicliques on a set of nodes while minimizing the symmetric difference
to a bipartite input graph. The number or size of the output clusters is not
constrained in any way. The best known approximation algorithm for this problem
gives a factor of 11. This result and all previous ones involve solving large
linear or semidefinite programs which become prohibitive even for modestly
sized tasks. In this paper we present an improved factor 4 approximation
algorithm to this problem using a simple combinatorial algorithm which does not
require solving large convex programs. The analysis extends a method developed
by Ailon, Charikar and Alantha in 2008, where a randomized pivoting algorithm
was analyzed for obtaining a 3approximation algorithm for Correlation
Clustering, which is the same problem on graphs (not bipartite). The analysis
for Correlation Clustering there required defining events for structures
containing 3 vertices and using the probability of these events to produce a
feasible solution to a dual of a certain natural LP bounding the optimal cost.
It is tempting here to use sets of 4 vertices, which are the smallest
structures for which contradictions arise for Bipartite Correlation Clustering.
This simple idea, however, appears to be evasive. We show that, by modifying
the LP, we can analyze algorithms which take into consideration subgraph
structures of unbounded size. We believe our techniques are interesting in
their own right, and may be used for other problems as well.

We investigate ways in which an algorithm can improve its expected
performance by finetuning itself automatically with respect to an unknown
input distribution D. We assume here that D is of product type. More precisely,
suppose that we need to process a sequence I_1, I_2, ... of inputs I = (x_1,
x_2, ..., x_n) of some fixed length n, where each x_i is drawn independently
from some arbitrary, unknown distribution D_i. The goal is to design an
algorithm for these inputs so that eventually the expected running time will be
optimal for the input distribution D = D_1 * D_2 * ... * D_n.
We give such selfimproving algorithms for two problems: (i) sorting a
sequence of numbers and (ii) computing the Delaunay triangulation of a planar
point set. Both algorithms achieve optimal expected limiting complexity. The
algorithms begin with a training phase during which they collect information
about the input distribution, followed by a stationary regime in which the
algorithms settle to their optimized incarnations.

The problems of random projections and sparse reconstruction have much in
common and individually received much attention. Surprisingly, until now they
progressed in parallel and remained mostly separate. Here, we employ new tools
from probability in Banach spaces that were successfully used in the context of
sparse reconstruction to advance on an open problem in random pojection. In
particular, we generalize and use an intricate result by Rudelson and Vershynin
for sparse reconstruction which uses Dudley's theorem for bounding Gaussian
processes. Our main result states that any set of $N = \exp(\tilde{O}(n))$ real
vectors in $n$ dimensional space can be linearly mapped to a space of dimension
$k=O(\log N\polylog(n))$, while (1) preserving the pairwise distances among the
vectors to within any constant distortion and (2) being able to apply the
transformation in time $O(n\log n)$ on each vector. This improves on the best
known $N = \exp(\tilde{O}(n^{1/2}))$ achieved by Ailon and Liberty and $N =
\exp(\tilde{O}(n^{1/3}))$ by Ailon and Chazelle.
The dependence in the distortion constant however is believed to be
suboptimal and subject to further investigation. For constant distortion, this
settles the open question posed by these authors up to a $\polylog(n)$ factor
while considerably simplifying their constructions.

The LETOR website contains three information retrieval datasets used as a
benchmark for testing machine learning ideas for ranking. Algorithms
participating in the challenge are required to assign score values to search
results for a collection of queries, and are measured using standard IR ranking
measures (NDCG, precision, MAP) that depend only the relative scoreinduced
order of the results. Similarly to many of the ideas proposed in the
participating algorithms, we train a linear classifier. In contrast with other
participating algorithms, we define an additional free variable (intercept, or
benchmark) for each query. This allows expressing the fact that results for
different queries are incomparable for the purpose of determining relevance.
The cost of this idea is the addition of relatively few nuisance parameters.
Our approach is simple, and we used a standard logistic regression library to
test it. The results beat the reported participating algorithms. Hence, it
seems promising to combine our approach with other more complex ideas.

This paper describes an efficient reduction of the learning problem of
ranking to binary classification. The reduction guarantees an average pairwise
misranking regret of at most that of the binary classifier regret, improving a
recent result of Balcan et al which only guarantees a factor of 2. Moreover,
our reduction applies to a broader class of ranking loss functions, admits a
simpler proof, and the expected running time complexity of our algorithm in
terms of number of calls to a classifier or preference function is improved
from $\Omega(n^2)$ to $O(n \log n)$. In addition, when the top $k$ ranked
elements only are required ($k \ll n$), as in many applications in information
extraction or search engines, the time complexity of our algorithm can be
further reduced to $O(k \log k + n)$. Our reduction and algorithm are thus
practical for realistic applications where the number of points to rank exceeds
several thousands. Much of our results also extend beyond the bipartite case
previously studied.
Our rediction is a randomized one. To complement our result, we also derive
lower bounds on any deterministic reduction from binary (preference)
classification to ranking, implying that our use of a randomized reduction is
essentially necessary for the guarantees we provide.