
A recent model for property testing of probability distributions (Chakraborty
et al., ITCS 2013, Canonne et al., SICOMP 2015) enables tremendous savings in
the sample complexity of testing algorithms, by allowing them to condition the
sampling on subsets of the domain. In particular, Canonne, Ron, and Servedio
(SICOMP 2015) showed that, in this setting, testing identity of an unknown
distribution $D$ (whether $D=D^\ast$ for an explicitly known $D^\ast$) can be
done with a constant number of queries, independent of the support size $n$ 
in contrast to the required $\Omega(\sqrt{n})$ in the standard sampling model.
It was unclear whether the same stark contrast exists for the case of testing
equivalence, where both distributions are unknown. While Canonne et al.
established a $\mathrm{poly}(\log n)$query upper bound for equivalence
testing, very recently brought down to $\tilde O(\log\log n)$ by Falahatgar et
al. (COLT 2015), whether a dependence on the domain size $n$ is necessary was
still open, and explicitly posed by Fischer at the Bertinoro Workshop on
Sublinear Algorithms (2014). We show that any testing algorithm for equivalence
must make $\Omega(\sqrt{\log\log n})$ queries in the conditional sampling
model. This demonstrates a gap between identity and equivalence testing, absent
in the standard sampling model (where both problems have sampling complexity
$n^{\Theta(1)}$).
We also obtain results on the query complexity of uniformity testing and
supportsize estimation with conditional samples. We answer a question of
Chakraborty et al. (ITCS 2013) showing that nonadaptive uniformity testing
indeed requires $\Omega(\log n)$ queries in the conditional model. For the
related problem of supportsize estimation, we provide both adaptive and
nonadaptive algorithms, with query complexities $\mathrm{poly}(\log\log n)$
and $\mathrm{poly}(\log n)$, respectively.

Independent samples from an unknown probability distribution $\mathbf{p}$ on
a domain of size $k$ are distributed across $n$ players, with each player
holding one sample. Each player can communicate $\ell$ bits to a central
referee in a simultaneous message passing (SMP) model of communication to help
the referee infer a property of the unknown $\mathbf{p}$. When $\ell\geq\log k$
bits, the problem reduces to the wellstudied collocated case where all the
samples are available in one place. In this work, we focus on the
communicationstarved setting of $\ell < \log k$, in which the landscape may
change drastically. We propose a general formulation for inference problems in
this distributed setting, and instantiate it to two prototypical inference
questions: learning and uniformity testing.

We develop differentially private methods for estimating various
distributional properties. Given a sample from a discrete distribution $p$,
some functional $f$, and accuracy and privacy parameters $\alpha$ and
$\varepsilon$, the goal is to estimate $f(p)$ up to accuracy $\alpha$, while
maintaining $\varepsilon$differential privacy of the sample.
We prove almosttight bounds on the sample size required for this problem for
several functionals of interest, including support size, support coverage, and
entropy. We show that the cost of privacy is negligible in a variety of
settings, both theoretically and experimentally. Our methods are based on a
sensitivity analysis of several stateoftheart methods for estimating these
properties with sublinear sample complexities.

We consider discrete distribution estimation over $k$ elements under
$\varepsilon$local differential privacy from $n$ samples. The samples are
distributed across users who send privatized versions of their sample to the
server. All previously known sample optimal algorithms require linear (in $k$)
communication complexity in the high privacy regime $(\varepsilon<1)$, and have
a running time that grows as $n\cdot k$, which can be prohibitive for large
domain size $k$.
We study the task simultaneously under four resource constraints, privacy,
sample complexity, computational complexity, and communication complexity. We
propose \emph{Hadamard Response (HR)}, a local noninteractive privatization
mechanism with order optimal sample complexity (for all privacy regimes), a
communication complexity of $\log k+2$ bits, and runs in nearly linear time.
Our encoding and decoding mechanisms are based on Hadamard matrices, and are
simple to implement. The gain in sample complexity comes from the large Hamming
distance between rows of Hadamard matrices, and the gain in time complexity is
achieved by using the Fast WalshHadamard transform.
We compare our approach with Randomized Response (RR), RAPPOR, and
subsetselection mechanisms (SS), theoretically, and experimentally. For
$k=10000$, our algorithm runs about 100x faster than SS, and RAPPOR.

We study the fundamental problems of identity testing (goodness of fit), and
closeness testing (two sample test) of distributions over $k$ elements, under
differential privacy. While the problems have a long history in statistics,
finite sample bounds for these problems have only been established recently.
In this work, we derive upper and lower bounds on the sample complexity of
both the problems under $\varepsilon$differential privacy. Our results improve
over the best known algorithms for identity testing, and are the first results
for differentially private closeness testing.
Our bounds are tight up to a constant factor whenever the number of samples
is $O(k)$, a regime which has garnered much attention over the last decade
since it allows property testing even when only a fraction of the domain is
observed.
Our upper bounds are obtained by converting (and sometimes combining) the
known nonprivate testing algorithms into differentially private algorithms. We
propose a simple, yet general procedure based on coupling of distributions, to
establish sample complexity lower bounds for differentially private algorithms.

Unlike compressive sensing where the measurement outputs are assumed to be
realvalued and have infinite precision, in "onebit compressive sensing",
measurements are quantized to one bit, their signs. In this work, we show how
to recover the support of sparse highdimensional vectors in the onebit
compressive sensing framework with an asymptotically nearoptimal number of
measurements. We also improve the bounds on the number of measurements for
approximately recovering vectors from onebit compressive sensing measurements.
Our results are universal, namely the same measurement scheme works
simultaneously for all sparse vectors.
Our proof of optimality for support recovery is obtained by showing an
equivalence between the task of support recovery using 1bit compressive
sensing and a wellstudied combinatorial object known as Union Free Families.

The advent of data science has spurred interest in estimating properties of
distributions over large alphabets. Fundamental symmetric properties such as
support size, support coverage, entropy, and proximity to uniformity, received
most attention, with each property estimated using a different technique and
often intricate analysis tools.
We prove that for all these properties, a single, simple, plugin
estimatorprofile maximum likelihood (PML)performs as well as the best
specialized techniques. This raises the possibility that PML may optimally
estimate many other symmetric properties.

We study the fixed design segmented regression problem: Given noisy samples
from a piecewise linear function $f$, we want to recover $f$ up to a desired
accuracy in meansquared error.
Previous rigorous approaches for this problem rely on dynamic programming
(DP) and, while sample efficient, have running time quadratic in the sample
size. As our main contribution, we provide new sample nearlinear time
algorithms for the problem that  while not being minimax optimal  achieve a
significantly better sampletime tradeoff on large datasets compared to the DP
approach. Our experimental evaluation shows that, compared with the DP
approach, our algorithms provide a convergence rate that is only off by a
factor of $2$ to $4$, while achieving speedups of three orders of magnitude.

We study maximum selection and sorting of $n$ numbers using pairwise
comparators that output the larger of their two inputs if the inputs are more
than a given threshold apart, and output an adversariallychosen input
otherwise. We consider two adversarial models. A nonadaptive adversary that
decides on the outcomes in advance based solely on the inputs, and an adaptive
adversary that can decide on the outcome of each query depending on previous
queries and outcomes.
Against the nonadaptive adversary, we derive a maximumselection algorithm
that uses at most $2n$ comparisons in expectation, and a sorting algorithm that
uses at most $2n \ln n$ comparisons in expectation. These numbers are within
small constant factors from the best possible. Against the adaptive adversary,
we propose a maximumselection algorithm that uses $\Theta(n\log
(1/{\epsilon}))$ comparisons to output a correct answer with probability at
least $1\epsilon$. The existence of this algorithm affirmatively resolves an
open problem of Ajtai, Feldman, Hassadim, and Nelson.
Our study was motivated by a densityestimation problem where, given samples
from an unknown underlying distribution, we would like to find a distribution
in a known class of $n$ candidate distributions that is close to underlying
distribution in $\ell_1$ distance. Scheffe's algorithm outputs a distribution
at an $\ell_1$ distance at most 9 times the minimum and runs in time
$\Theta(n^2\log n)$. Using maximum selection, we propose an algorithm with the
same approximation guarantee but run time of $\Theta(n\log n)$.

It was recently shown that estimating the Shannon entropy $H({\rm p})$ of a
discrete $k$symbol distribution ${\rm p}$ requires $\Theta(k/\log k)$ samples,
a number that grows nearlinearly in the support size. In many applications
$H({\rm p})$ can be replaced by the more general R\'enyi entropy of order
$\alpha$, $H_\alpha({\rm p})$. We determine the number of samples needed to
estimate $H_\alpha({\rm p})$ for all $\alpha$, showing that $\alpha < 1$
requires a superlinear, roughly $k^{1/\alpha}$ samples, noninteger $\alpha>1$
requires a nearlinear $k$ samples, but, perhaps surprisingly, integer
$\alpha>1$ requires only $\Theta(k^{11/\alpha})$ samples. Furthermore,
developing on a recently established connection between polynomial
approximation and estimation of additive functions of the form $\sum_{x} f({\rm
p}_x)$, we reduce the sample complexity for noninteger values of $\alpha$ by a
factor of $\log k$ compared to the empirical estimator. The estimators
achieving these bounds are simple and run in time linear in the number of
samples. Our lower bounds provide explicit constructions of distributions with
different R\'enyi entropies that are hard to distinguish.

Given samples from an unknown distribution $p$, is it possible to distinguish
whether $p$ belongs to some class of distributions $\mathcal{C}$ versus $p$
being far from every distribution in $\mathcal{C}$? This fundamental question
has received tremendous attention in statistics, focusing primarily on
asymptotic analysis, and more recently in information theory and theoretical
computer science, where the emphasis has been on small sample size and
computational complexity. Nevertheless, even for basic properties of
distributions such as monotonicity, logconcavity, unimodality, independence,
and monotonehazard rate, the optimal sample complexity is unknown.
We provide a general approach via which we obtain sampleoptimal and
computationally efficient testers for all these distribution families. At the
core of our approach is an algorithm which solves the following problem: Given
samples from an unknown distribution $p$, and a known distribution $q$, are $p$
and $q$ close in $\chi^2$distance, or far in total variation distance?
The optimality of our testers is established by providing matching lower
bounds with respect to both $n$ and $\varepsilon$. Finally, a necessary
building block for our testers and an important byproduct of our work are the
first known computationally efficient proper learners for discrete logconcave
and monotone hazard rate distributions.

We design a new, fast algorithm for agnostically learning univariate
probability distributions whose densities are well approximated by piecewise
polynomial functions. Let $f$ be the density function of an arbitrary
univariate distribution, and suppose that $f$ is $\mathrm{OPT}$close in
$L_1$distance to an unknown piecewise polynomial function with $t$ interval
pieces and degree $d$. Our algorithm draws $n = O(t(d+1)/\epsilon^2)$ samples
from $f$, runs in time $\tilde{O}(n \cdot \mathrm{poly}(d))$, and with
probability at least $9/10$ outputs an $O(t)$piecewise degree$d$ hypothesis
$h$ that is $4 \cdot \mathrm{OPT} +\epsilon$ close to $f$.
Our general algorithm yields (nearly) sampleoptimal and nearlylinear time
estimators for a wide range of structured distribution families over both
continuous and discrete domains in a unified way. For most of our applications,
these are the first sampleoptimal and nearlylinear time estimators in the
literature. As a consequence, our work resolves the sample and computational
complexities of a broad class of inference tasks via a single "metaalgorithm".
Moreover, we experimentally demonstrate that our algorithm performs very well
in practice.
Our algorithm consists of three "levels": (i) At the top level, we employ an
iterative greedy algorithm for finding a good partition of the real line into
the pieces of a piecewise polynomial. (ii) For each piece, we show that the
subproblem of finding a good polynomial fit on the current interval can be
solved efficiently with a separation oracle method. (iii) We reduce the task of
finding a separating hyperplane to a combinatorial problem and give an
efficient algorithm for this problem. Combining these three procedures gives a
density estimation algorithm with the claimed guarantees.

A Poisson Binomial distribution over $n$ variables is the distribution of the
sum of $n$ independent Bernoullis. We provide a sample nearoptimal algorithm
for testing whether a distribution $P$ supported on $\{0,...,n\}$ to which we
have sample access is a Poisson Binomial distribution, or far from all Poisson
Binomial distributions. The sample complexity of our algorithm is $O(n^{1/4})$
to which we provide a matching lower bound. We note that our sample complexity
improves quadratically upon that of the naive "learn followed by toleranttest"
approach, while instance optimal identity testing [VV14] is not applicable
since we are looking to simultaneously test against a whole family of
distributions.

The Poissonsampling technique eliminates dependencies among symbol
appearances in a random sequence. It has been used to simplify the analysis and
strengthen the performance guarantees of randomized algorithms. Applying this
method to universal compression, we relate the redundancies of fixedlength and
Poissonsampled sequences, use the relation to derive a simple singleletter
formula that approximates the redundancy of any envelope class to within an
additive logarithmic term. As a first application, we consider i.i.d.
distributions over a small alphabet as a stepenvelope class, and provide a
short proof that determines the redundancy of discrete distributions over a
small al phabet up to the first order terms. We then show the strength of our
method by applying the formula to tighten the existing bounds on the redundancy
of exponential and powerlaw classes, in particular answering a question posed
by Boucheron, Garivier and Gassiat.

Motivated by massspectrometry protein sequencing, we consider a
simplystated problem of reconstructing a string from the multiset of its
substring compositions. We show that all strings of length 7, one less than a
prime, or one less than twice a prime, can be reconstructed uniquely up to
reversal. For all other lengths we show that reconstruction is not always
possible and provide sometimestight bounds on the largest number of strings
with given substring compositions. The lower bounds are derived by
combinatorial arguments and the upper bounds by algebraic considerations that
precisely characterize the set of strings with the same substring compositions
in terms of the factorization of bivariate polynomials. The problem can be
viewed as a combinatorial simplification of the turnpike problem, and its
solution may shed light on this longstanding problem as well. Using well known
results on transience of multidimensional random walks, we also provide a
reconstruction algorithm that reconstructs random strings over alphabets of
size $\ge4$ in optimal nearquadratic time.

Statistical and machinelearning algorithms are frequently applied to
highdimensional data. In many of these applications data is scarce, and often
much more costly than computation time. We provide the first sampleefficient
polynomialtime estimator for highdimensional spherical Gaussian mixtures.
For mixtures of any $k$ $d$dimensional spherical Gaussians, we derive an
intuitive spectralestimator that uses
$\mathcal{O}_k\bigl(\frac{d\log^2d}{\epsilon^4}\bigr)$ samples and runs in time
$\mathcal{O}_{k,\epsilon}(d^3\log^5 d)$, both significantly lower than
previously known. The constant factor $\mathcal{O}_k$ is polynomial for sample
complexity and is exponential for the time complexity, again much smaller than
what was previously known. We also show that
$\Omega_k\bigl(\frac{d}{\epsilon^2}\bigr)$ samples are needed for any
algorithm. Hence the sample complexity is nearoptimal in the number of
dimensions.
We also derive a simple estimator for onedimensional mixtures that uses
$\mathcal{O}\bigl(\frac{k \log \frac{k}{\epsilon} }{\epsilon^2} \bigr)$ samples
and runs in time
$\widetilde{\mathcal{O}}\left(\bigl(\frac{k}{\epsilon}\bigr)^{3k+1}\right)$.
Our other technical contributions include a faster algorithm for choosing a
density estimate from a set of distributions, that minimizes the $\ell_1$
distance to an unknown underlying distribution.

This paper proposes a novel method for segmentation of images by hierarchical
multilevel thresholding. The method is global, agglomerative in nature and
disregards pixel locations. It involves the optimization of the ratio of the
unbiased estimators of within class to between class variances. We obtain a
recursive relation at each step for the variances which expedites the process.
The efficacy of the method is shown in a comparison with some wellknown
methods.

A novel algorithm is proposed for segmenting an image into multiple levels
using its mean and variance. Starting from the extreme pixel values at both
ends of the histogram plot, the algorithm is applied recursively on subranges
computed from the previous step, so as to find a threshold level and a new
subrange for the next step, until no significant improvement in image quality
can be achieved. The method makes use of the fact that a number of
distributions tend towards Dirac delta function, peaking at the mean, in the
limiting condition of vanishing variance. The procedure naturally provides for
variable size segmentation with bigger blocks near the extreme pixel values and
finer divisions around the mean or other chosen value for better visualization.
Experiments on a variety of images show that the new algorithm effectively
segments the image in computationally very less time.