
Many algorithms in machine learning and computational geometry require, as
input, the intrinsic dimension of the manifold that supports the probability
distribution of the data. This parameter is rarely known and therefore has to
be estimated. We characterize the statistical difficulty of this problem by
deriving upper and lower bounds on the minimax rate for estimating the
dimension. First, we consider the problem of testing the hypothesis that the
support of the datagenerating probability distribution is a wellbehaved
manifold of intrinsic dimension $d_1$ versus the alternative that it is of
dimension $d_2$, with $d_{1}<d_{2}$. With an i.i.d. sample of size $n$, we
provide an upper bound on the probability of choosing the wrong dimension of
$O\left( n^{\left(d_{2}/d_{1}1\epsilon\right)n} \right)$, where $\epsilon$
is an arbitrarily small positive number. The proof is based on bounding the
length of the traveling salesman path through the data points. We also
demonstrate a lower bound of $\Omega \left( n^{(2d_{2}2d_{1}+\epsilon)n}
\right)$, by applying Le Cam's lemma with a specific set of $d_{1}$dimensional
probability distributions. We then extend these results to get minimax rates
for estimating the dimension of wellbehaved manifolds. We obtain an upper
bound of order $O \left( n^{(\frac{1}{m1}\epsilon)n} \right)$ and a lower
bound of order $\Omega \left( n^{(2+\epsilon)n} \right)$, where $m$ is the
embedding dimension.

In most classification tasks there are observations that are ambiguous and
therefore difficult to correctly label. Setvalued classifiers output sets of
plausible labels rather than a single label, thereby giving a more appropriate
and informative treatment to the labeling of ambiguous instances. We introduce
a framework for multiclass setvalued classification, where the classifiers
guarantee userdefined levels of coverage or confidence (the probability that
the true label is contained in the set) while minimizing the ambiguity (the
expected size of the output). We first derive oracle classifiers assuming the
true distribution to be known. We show that the oracle classifiers are obtained
from level sets of the functions that define the conditional probability of
each class. Then we develop estimators with good asymptotic and finite sample
properties. The proposed estimators build on existing singlelabel classifiers.
The optimal classifier can sometimes output the empty set, but we provide two
solutions to fix this issue that are suitable for various practical needs.

Modebased clustering methods define clusters to be the basins of attraction
of the modes of a density estimate. The most common version is mean shift clus
tering which uses a gradient ascent algorithm to find the basins. Rodriguez and
Laio (2014) introduced a new method that is faster and simpler than mean shift
clustering. Furthermore, they define a clustering diagram that provides a sim
ple, twodimensional summary of the mode clustering information. We study the
statistical properties of this diagram and we propose some improvements and
extensions. In particular, we show a connection between the diagram and robust
linear regression.

Several new methods have been proposed for performing valid inference after
model selection. An older method is sampling splitting: use part of the data
for model selection and part for inference. In this paper we revisit sample
splitting combined with the bootstrap (or the Normal approximation). We show
that this leads to a simple, assumptionfree approach to inference and we
establish results on the accuracy of the method. In fact, we find new bounds on
the accuracy of the bootstrap and the Normal approximation for general
nonlinear parameters with increasing dimension which we then use to assess the
accuracy of regression inference. We show that an alternative, called the image
bootstrap, has higher coverage accuracy at the cost of more computation. We
define new parameters that measure variable importance and that can be inferred
with greater accuracy than the usual regression coefficients. There is a
inferenceprediction tradeoff: splitting increases the accuracy and robustness
of inference but can decrease the accuracy of the predictions.

In this work, we generalize the Cram\'ervon Mises statistic via projection
pursuit to obtain robust tests for the multivariate twosample problem. The
proposed tests are consistent against all fixed alternatives, robust to
heavytailed data and minimax rate optimal. Our test statistics are completely
free of tuning parameters and are computationally efficient even in high
dimensions. When the dimension tends to infinity, the proposed test is shown to
have identical power to that of the existing highdimensional mean tests under
certain location models. As a byproduct of our approach, we introduce a new
metric called the angular distance which can be thought of as a robust
alternative to the Euclidean distance. Using the angular distance, we connect
the proposed to the reproducing kernel Hilbert space approach. In addition to
the Cram\'ervon Mises statistic, we show that the projection pursuit technique
can be used to define robust, multivariate tests in many other problems.

Various problems in manifold estimation make use of a quantity called the
reach, denoted by $\tau\_M$, which is a measure of the regularity of the
manifold. This paper is the first investigation into the problem of how to
estimate the reach. First, we study the geometry of the reach through an
approximation perspective. We derive new geometric results on the reach for
submanifolds without boundary. An estimator $\hat{\tau}$ of $\tau\_{M}$ is
proposed in a framework where tangent spaces are known, and bounds assessing
its efficiency are derived. In the case of i.i.d. random point cloud
$\mathbb{X}\_{n}$, $\hat{\tau}(\mathbb{X}\_{n})$ is showed to achieve uniform
expected loss bounds over a $\mathcal{C}^3$like model. Finally, we obtain
upper and lower bounds on the minimax rate for estimating the reach.

Recently, Tibshirani et al. (2016) proposed a method for making inferences
about parameters defined by model selection, in a typical regression setting
with normally distributed errors. Here, we study the large sample properties of
this method, without assuming normality. We prove that the test statistic of
Tibshirani et al. (2016) is asymptotically valid, as the number of samples n
grows and the dimension d of the regression problem stays fixed. Our asymptotic
result holds uniformly over a wide class of nonnormal error distributions. We
also propose an efficient bootstrap version of this test that is provably
(asymptotically) conservative, and in practice, often delivers shorter
intervals than those from the original normalitybased approach. Finally, we
prove that the test statistic of Tibshirani et al. (2016) does not enjoy
uniform validity in a highdimensional setting, when the dimension d is allowed
grow.

We consider the goodnessoffit testing problem of distinguishing whether the
data are drawn from a specified distribution, versus a composite alternative
separated from the null in the total variation metric. In the discrete case, we
consider goodnessoffit testing when the null distribution has a possibly
growing or unbounded number of categories. In the continuous case, we consider
testing a Lipschitz density, with possibly unbounded support, in the
lowsmoothness regime where the Lipschitz parameter is not assumed to be
constant. In contrast to existing results, we show that the minimax rate and
critical testing radius in these settings depend strongly, and in a precise
way, on the null distribution being tested and this motivates the study of the
(local) minimax rate as a function of the null distribution. For multinomials
the local minimax rate was recently studied in the work of Valiant and Valiant.
We revisit and extend their results and develop two modifications to the
chisquared test whose performance we characterize. For testing Lipschitz
densities, we show that the usual binning tests are inadequate in the
lowsmoothness regime and we design a spatially adaptive partitioning scheme
that forms the basis for our locally minimax optimal tests. Furthermore, we
provide the first local minimax lower bounds for this problem which yield a
sharp characterization of the dependence of the critical radius on the null
hypothesis being tested. In the lowsmoothness regime we also provide adaptive
tests, that adapt to the unknown smoothness parameter. We illustrate our
results with a variety of simulations that demonstrate the practical utility of
our proposed tests.

The MorseSmale complex of a function $f$ decomposes the sample space into
cells where $f$ is increasing or decreasing. When applied to nonparametric
density estimation and regression, it provides a way to represent, visualize,
and compare multivariate functions. In this paper, we present some statistical
results on estimating MorseSmale complexes. This allows us to derive new
results for two existing methods: mode clustering and MorseSmale regression.
We also develop two new methods based on the MorseSmale complex: a
visualization technique for multivariate functions and a twosample,
multivariate hypothesis test.

We develop a general framework for distributionfree predictive inference in
regression, using conformal inference. The proposed methodology allows for the
construction of a prediction band for the response variable using any estimator
of the regression function. The resulting prediction band preserves the
consistency properties of the original estimator under standard assumptions,
while guaranteeing finitesample marginal coverage even when these assumptions
do not hold. We analyze and compare, both empirically and theoretically, the
two major variants of our conformal framework: full conformal inference and
split conformal inference, along with a related jackknife method. These methods
offer different tradeoffs between statistical accuracy (length of resulting
prediction intervals) and computational efficiency. As extensions, we develop a
method for constructing valid insample prediction intervals called {\it
rankoneout} conformal inference, which has essentially the same computational
efficiency as split conformal inference. We also describe an extension of our
procedures for producing prediction bands with locally varying length, in order
to adapt to heteroskedascity in the data. Finally, we propose a modelfree
notion of variable importance, called {\it leaveonecovariateout} or LOCO
inference. Accompanying this paper is an R package {\tt conformalInference}
that implements all of the proposals we have introduced. In the spirit of
reproducibility, all of our empirical results can also be easily (re)generated
using this package.

A cluster tree provides a highlyinterpretable summary of a density function
by representing the hierarchy of its highdensity clusters. It is estimated
using the empirical tree, which is the cluster tree constructed from a density
estimator. This paper addresses the basic question of quantifying our
uncertainty by assessing the statistical significance of topological features
of an empirical cluster tree. We first study a variety of metrics that can be
used to compare different trees, analyze their properties and assess their
suitability for inference. We then propose methods to construct and summarize
confidence sets for the unknown true cluster tree. We introduce a partial
ordering on cluster trees which we use to prune some of the statistically
insignificant features of the empirical tree, yielding interpretable and
parsimonious cluster trees. Finally, we illustrate the proposed methods on a
variety of synthetic examples and furthermore demonstrate their utility in the
analysis of a GraftversusHost Disease (GvHD) data set.

We study the effects of filaments on galaxy properties in the Sloan Digital
Sky Survey (SDSS) Data Release 12 using filaments from the `Cosmic Web
Reconstruction' catalogue (Chen et al. 2016), a publicly available filament
catalogue for SDSS. Since filaments are tracers of mediumtohigh density
regions, we expect that galaxy properties associated with the environment are
dependent on the distance to the nearest filament. Our analysis demonstrates
that a red galaxy or a highmass galaxy tend to reside closer to filaments than
a blue or lowmass galaxy. After adjusting the effect from stellar mass, on
average, earlyforming galaxies or large galaxies have a shorter distance to
filaments than lateforming galaxies or small galaxies. For the Main galaxy
sample (MGS), all signals are very significant ($>6\sigma$). For the LOWZ and
CMASS sample, the stellar mass and size are significant ($>2 \sigma$). The
filament effects we observe persist until $z = 0.7$ (the edge of the CMASS
sample). Comparing our results to those using the galaxy distances from
redMaPPer galaxy clusters as a reference, we find a similar result between
filaments and clusters. Moreover, we find that the effect of clusters on the
stellar mass of nearby galaxies depends on the galaxy's filamentary
environment. Our findings illustrate the strong correlation of galaxy
properties with proximity to density ridges, strongly supporting the claim that
density ridges are good tracers of filaments.

Topological Data Analysis (TDA) can broadly be described as a collection of
data analysis methods that find structure in data. This includes: clustering,
manifold estimation, nonlinear dimension reduction, mode estimation, ridge
estimation and persistent homology. This paper reviews some of these methods.

We derive asymptotic theory for the plugin estimate for density level sets
under Hausdoff loss. Based on the asymptotic theory, we propose two bootstrap
confidence regions for level sets. The confidence regions can be used to
perform tests for anomaly detection and clustering. We also introduce a
technique to visualize high dimensional density level sets by combining mode
clustering and multidimensional scaling.

We present results derived from the first multichord stellar occultation by
the transNeptunian object (229762) 2007 UK$_{126}$, observed on 2014 November
15. The event was observed by the Research and Education Collaborative
Occultation Network (RECON) project and International Occultation Timing
Association (IOTA) collaborators throughout the United States. Use of two
different data analysis methods obtain a satisfactory fit to seven chords,
yelding an elliptical fit to the chords with an equatorial radius of
$R=338_{10} ^{+15}$ km and equivalent radius of $R_{eq}=319_{7} ^{+14}$ km. A
circular fit also gives a radius of $R=324_{23} ^{+30}$ km. Assuming that the
object is a Maclaurin spheroid with indeterminate aspect angle, and using two
published absolute magnitudes for the body, we derive possible ranges for
geometric albedo between $p_{V}=0.159_{0.013} ^{+0.007}$ and
$p_{R}=0.189_{0.015}^{+0.009}$, and for the body oblateness between
$\epsilon=0.105_{0.040} ^{+0.050}$ and $\epsilon=0.118_{0.048} ^{+0.055}$.
For a nominal rotational period of 11.05 h, an upper limit for density of
$\rho=1740$ kg~m$^{3}$ is estimated for the body.

We present a method for finding high density, lowdimensional structures in
noisy point clouds. These structures are sets with zero Lebesgue measure with
respect to the $D$dimensional ambient space and belong to a $d<D$ dimensional
space. We call them "singular features." Hunting for singular features
corresponds to finding unexpected or unknown structures hidden in point clouds
belonging to $\R^D$. Our method outputs well defined sets of dimensions $d<D$.
Unlike spectral clustering, the method works well in the presence of noise. We
show how to find singular features by first finding ridges in the estimated
density, followed by a filtering step based on the eigenvalues of the Hessian
of the density.

Modal regression estimates the local modes of the distribution of $Y$ given
$X=x$, instead of the mean, as in the usual regression sense, and can hence
reveal important structure missed by usual regression methods. We study a
simple nonparametric method for modal regression, based on a kernel density
estimate (KDE) of the joint distribution of $Y$ and $X$. We derive asymptotic
error bounds for this method, and propose techniques for constructing
confidence sets and prediction sets. The latter is used to select the smoothing
bandwidth of the underlying KDE. The idea behind modal regression is connected
to many others, such as mixture regression and density ridge estimation, and we
discuss these ties as well.

When data analysts train a classifier and check if its accuracy is
significantly different from a half, they are implicitly performing a
twosample test. We investigate the statistical optimality of this indirect but
flexible method in the highdimensional setting of $d/n \to c \in (0,\infty)$.
We provide a concrete answer for the case of distinguishing Gaussians with
meandifference $\delta$ and common (known or unknown) covariance $\Sigma$, by
contrasting the indirect approach using variants of linear discriminant
analysis (LDA) such as naive Bayes, with the direct approach using
corresponding variants of Hotelling's test. Somewhat surprisingly, the indirect
approach achieves the same power as the direct approach in terms of
$n,d,\delta,\Sigma$, and is only worse by a constant factor, achieving an
asymptotic relative efficiency of $1/\pi$ for the balanced sample case. Other
results of independent interest are provided, such as minimax lower bounds, and
optimality of Hotelling's test when $d=o(n)$. Simulation results validate our
theory, and we present practical takeaway messages along with several open
problems.

Linear independence testing is a fundamental informationtheoretic and
statistical problem that can be posed as follows: given $n$ points
$\{(X_i,Y_i)\}^n_{i=1}$ from a $p+q$ dimensional multivariate distribution
where $X_i \in \mathbb{R}^p$ and $Y_i \in\mathbb{R}^q$, determine whether $a^T
X$ and $b^T Y$ are uncorrelated for every $a \in \mathbb{R}^p, b\in
\mathbb{R}^q$ or not. We give minimax lower bound for this problem (when $p+q,n
\to \infty$, $(p+q)/n \leq \kappa < \infty$, without sparsity assumptions). In
summary, our results imply that $n$ must be at least as large as $\sqrt
{pq}/\\Sigma_{XY}\_F^2$ for any procedure (test) to have nontrivial power,
where $\Sigma_{XY}$ is the crosscovariance matrix of $X,Y$. We also provide
some evidence that the lower bound is tight, by connections to twosample
testing and regression in specific settings.

Mode clustering is a nonparametric method for clustering that defines
clusters using the basins of attraction of a density estimator's modes. We
provide several enhancements to mode clustering: (i) a soft variant of cluster
assignment, (ii) a measure of connectivity between clusters, (iii) a technique
for choosing the bandwidth, (iv) a method for denoising small clusters, and (v)
an approach to visualizing the clusters. Combining all these enhancements gives
us a complete procedure for clustering in multivariate problems. We also
compare mode clustering to other clustering methods in several examples

The large sample theory of estimators for density modes is well understood.
In this paper we consider density ridges, which are a higherdimensional
extension of modes. Modes correspond to zerodimensional, local highdensity
regions in point clouds. Density ridges correspond to $s$dimensional, local
highdensity regions in point clouds. We establish three main results. First we
show that under appropriate regularity conditions, the local variation of the
estimated ridge can be approximated by an empirical process. Second, we show
that the distribution of the estimated ridge converges to a Gaussian process.
Third, we establish that the bootstrap leads to valid confidence sets for
density ridges.

Persistence diagrams are twodimensional plots that summarize the topological
features of functions and are an important part of topological data analysis. A
problem that has received much attention is how deal with sets of persistence
diagrams. How do we summarize them, average them or cluster them? One approach
 the persistence intensity function  was introduced informally by
Edelsbrunner, Ivanov, and Karasev (2012). Here we provide a modification and
formalization of this approach. Using the persistence intensity function, we
can visualize multiple diagrams, perform clustering and conduct twosample
tests.

We construct a catalogue for filaments using a novel approach called SCMS
(subspace constrained mean shift; Ozertem & Erdogmus 2011; Chen et al. 2015).
SCMS is a gradientbased method that detects filaments through density ridges
(smooth curves tracing highdensity regions). A great advantage of SCMS is its
uncertainty measure, which allows an evaluation of the errors for the detected
filaments. To detect filaments, we use data from the Sloan Digital Sky Survey,
which consist of three galaxy samples: the NYU main galaxy sample (MGS), the
LOWZ sample and the CMASS sample. Each of the three dataset covers different
redshift regions so that the combined sample allows detection of filaments up
to z = 0.7. Our filament catalogue consists of a sequence of twodimensional
filament maps at different redshifts that provide several useful statistics on
the evolution cosmic web. To construct the maps, we select spectroscopically
confirmed galaxies within 0.050 < z < 0.700 and partition them into 130 bins.
For each bin, we ignore the redshift, treating the galaxy observations as a 2D
data and detect filaments using SCMS. The filament catalogue consists of 130
individual 2D filament maps, and each map comprises points on the detected
filaments that describe the filamentary structures at a particular redshift. We
also apply our filament catalogue to investigate galaxy luminosity and its
relation with distance to filament. Using a volumelimited sample, we find
strong evidence (6.1$\sigma$  12.3$\sigma$) that galaxies close to filaments
are generally brighter than those at significant distance from filaments.

The detection and characterization of filamentary structures in the cosmic
web allows cosmologists to constrain parameters that dictates the evolution of
the Universe. While many filament estimators have been proposed, they generally
lack estimates of uncertainty, reducing their inferential power. In this paper,
we demonstrate how one may apply the Subspace Constrained Mean Shift (SCMS)
algorithm (Ozertem and Erdogmus (2011); Genovese et al. (2012)) to uncover
filamentary structure in galaxy data. The SCMS algorithm is a gradient ascent
method that models filaments as density ridges, onedimensional smooth curves
that trace highdensity regions within the point cloud. We also demonstrate how
augmenting the SCMS algorithm with bootstrapbased methods of uncertainty
estimation allows one to place uncertainty bands around putative filaments. We
apply the SCMS method to datasets sampled from the P3M Nbody simulation, with
galaxy number densities consistent with SDSS and WFIRSTAFTA and to LOWZ and
CMASS data from the Baryon Oscillation Spectroscopic Survey (BOSS). To further
assess the efficacy of SCMS, we compare the relative locations of BOSS
filaments with galaxy clusters in the redMaPPer catalog, and find that
redMaPPer clusters are significantly closer (with pvalues $< 10^{9}$) to
SCMSdetected filaments than to randomly selected galaxies.

In this paper, we study the filamentary structures and the galaxy alignment
along filaments at redshift $z=0.06$ in the MassiveBlackII simulation, a
stateoftheart, highresolution hydrodynamical cosmological simulation which
includes stellar and AGN feedback in a volume of (100 Mpc$/h$)$^3$. The
filaments are constructed using the subspace constrained mean shift (SCMS;
Ozertem & Erdogmus (2011) and Chen et al. (2015a)). First, we show that
reconstructed filaments using galaxies and reconstructed filaments using dark
matter particles are similar to each other; over $50\%$ of the points on the
galaxy filaments have a corresponding point on the dark matter filaments within
distance $0.13$ Mpc$/h$ (and vice versa) and this distance is even smaller at
highdensity regions. Second, we observe the alignment of the major principal
axis of a galaxy with respect to the orientation of its nearest filament and
detect a $2.5$ Mpc$/h$ critical radius for filament's influence on the
alignment when the subhalo mass of this galaxy is between $10^9M_\odot/h$ and
$10^{12}M_\odot/h$. Moreover, we find the alignment signal to increase
significantly with the subhalo mass. Third, when a galaxy is close to filaments
(less than $0.25$ Mpc$/h$), the galaxy alignment toward the nearest galaxy
group depends on the galaxy subhalo mass. Finally, we find that galaxies close
to filaments or groups tend to be rounder than those away from filaments or
groups.