
The support vector machine (SVM) is a powerful and widely used classification
algorithm. This paper uses the Karush-Kuhn-Tucker conditions to provide
rigorous mathematical proof for new insights into the behavior of SVM. These
insights provide perhaps unexpected relationships between SVM and two other
linear classifiers: the mean difference and the maximal data piling direction.
For example, we show that in many cases SVM can be viewed as a cropped version
of these classifiers. By carefully exploring these connections we show how SVM
tuning behavior is affected by characteristics including: balanced vs.
unbalanced classes, low vs. high dimension, separable vs. nonseparable data.
These results provide further insights into tuning SVM via cross-validation by
explaining observed pathological behavior and motivating improved
cross-validation methodology. Finally, we also provide new results on the
geometry of complete data piling directions in high dimensional space.
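
The two linear classifiers named above can be sketched numerically. The following is a minimal illustration, assuming the usual definitions: the mean difference direction is the normalized difference of class means, and the maximal data piling (MDP) direction is the pseudo-inverse of the global sample covariance applied to that difference. The simulated data, the function names, and the `rcond` cutoff are assumptions made for this sketch, not taken from the paper.

```python
import numpy as np

def mean_difference_direction(X, y):
    """Unit vector along the difference of the two class means."""
    w = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    return w / np.linalg.norm(w)

def mdp_direction(X, y):
    """Pseudo-inverse of the global sample covariance times the mean difference."""
    diff = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    Xc = X - X.mean(axis=0)                      # center at the global mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # Explicit cutoff so numerically-zero eigenvalues are not inverted.
    w = np.linalg.pinv(cov, rcond=1e-10) @ diff
    return w / np.linalg.norm(w)

# Hypothetical high dimension, low sample size data: 20 points in 200 dimensions.
rng = np.random.default_rng(0)
n_per_class, d = 10, 200
X = rng.normal(size=(2 * n_per_class, d))
X[:n_per_class] += 0.5                           # shift the positive class
y = np.array([1] * n_per_class + [-1] * n_per_class)

w_md = mean_difference_direction(X, y)
w_mdp = mdp_direction(X, y)

proj_md = X @ w_md
proj_mdp = X @ w_mdp

# Within-class spread of the projections along each direction.
md_spread = proj_md[y == 1].std()
mdp_spread_pos = proj_mdp[y == 1].std()
mdp_spread_neg = proj_mdp[y == -1].std()
```

When the dimension exceeds the sample size, the MDP projections of each class collapse to a single value (complete data piling), while the mean difference projections keep a visible within-class spread.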

Data science is the business of learning from data, which is traditionally
the business of statistics. Data science, however, is often understood as a
broader, task-driven and computationally-oriented version of statistics. Both
the term data science and the broader idea it conveys have origins in
statistics and are a reaction to a narrower view of data analysis. Expanding
upon the views of a number of statisticians, this paper encourages a big-tent
view of data analysis. We examine how evolving approaches to modern data
analysis relate to the existing discipline of statistics (e.g. exploratory
analysis, machine learning, reproducibility, computation, communication and the
role of theory). Finally, we discuss what these trends mean for the future of
statistics by highlighting promising directions for communication, education
and research.

High dimension low sample size statistical analysis is important in a wide
range of applications. In such situations, the highly appealing discrimination
method, support vector machine, can be improved to alleviate data piling at the
margin. This leads naturally to the development of distance weighted
discrimination (DWD), which can be modeled as a second-order cone programming
problem and solved by interior-point methods when the scale (in sample size and
feature dimension) of the data is moderate. Here, we design a scalable and
robust algorithm for solving large scale generalized DWD problems. Numerical
experiments on real data sets from the UCI repository demonstrate that our
algorithm is highly efficient in solving large scale problems, and sometimes
even more efficient than the highly optimized LIBLINEAR and LIBSVM for solving
the corresponding SVM problems.

The issue of robustness to family relationships in computing genotype
ancestry scores such as eigenvector projections has received increased
attention in genetic association, as the scores are widely used to control
spurious association. We use a motivational example from the North American
Cystic Fibrosis (CF) Consortium genetic association study with 3444 individuals
and 898 family members to illustrate the challenge of computing ancestry scores
when sets of both unrelated individuals and closely-related family members are
included. We propose novel methods to obtain ancestry scores and demonstrate
that the proposed methods outperform existing methods. The current standard is
to compute loadings (left singular vectors) using unrelated individuals and to
compute projected scores for remaining family members. However, projected
ancestry scores from this approach suffer from shrinkage toward zero. We
consider in turn alternate strategies: (i) within-family data
orthogonalization, (ii) matrix substitution based on decomposition of a target
family-orthogonalized covariance matrix, (iii) covariance-preserving whitening,
retaining covariances between unrelated pairs while orthogonalizing family
members, and (iv) using family-averaged data to obtain loadings. Except for
within-family orthogonalization, our proposed approaches offer similar
performance and are superior to the standard approaches. We illustrate the
performance via simulation and analysis of the CF dataset.
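
The standard approach described above (loadings from unrelated individuals, projected scores for the rest) can be sketched as follows. This is a simplified illustration under stated assumptions: the genotype matrices are hypothetical pure-noise stand-ins arranged as SNPs by individuals, no actual family structure is simulated, and the shrinkage shown is only the generic high-dimensional effect on held-out projections.

```python
import numpy as np

rng = np.random.default_rng(1)
n_snps, n_unrelated, n_family = 2000, 50, 30

# Hypothetical standardized genotype matrices (SNPs x individuals).
G_unrelated = rng.normal(size=(n_snps, n_unrelated))
G_family = rng.normal(size=(n_snps, n_family))

# Center each SNP using the unrelated individuals only.
center = G_unrelated.mean(axis=1, keepdims=True)

# Loadings are the leading left singular vectors of the unrelated data.
U, s, Vt = np.linalg.svd(G_unrelated - center, full_matrices=False)
loadings = U[:, :2]

# Scores of the individuals used to fit the loadings.
scores_unrelated = loadings.T @ (G_unrelated - center)
# Projected scores of individuals not used to fit the loadings.
scores_projected = loadings.T @ (G_family - center)

spread_unrelated = scores_unrelated.std(axis=1)
spread_projected = scores_projected.std(axis=1)
```

In this setup the projected scores have a much smaller spread than the scores of the individuals used to fit the loadings, which is the shrinkage toward zero that motivates the alternative strategies.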

New representations of tree-structured data objects, using ideas from
topological data analysis, enable improved statistical analyses of a population
of brain artery trees. A number of representations of each data tree arise from
persistence diagrams that quantify branching and looping of vessels at multiple
scales. Novel approaches to the statistical analysis, through various summaries
of the persistence diagrams, lead to heightened correlations with covariates
such as age and sex, relative to earlier analyses of this data set. The
correlation with age continues to be significant even after controlling for
correlations from earlier significant summaries.

Collaborative forecasting involves exchanging information on how much of an
item will be needed by a buyer and how much can be supplied by a seller or
manufacturer in a supply chain. This exchange allows parties to plan their
operations based on the needs and limitations of their supply chain partner.
The success of this system critically depends on the healthy flow of
information. This paper focuses on methods to easily analyze and visualize this
process. To understand how the information travels on this network and how
parties react to new information from their partners, this paper proposes a
Gaussian Graphical Model-based method, and finds certain inefficiencies in the
system. To simplify and better understand the update structure, a Continuum
Canonical Correlation-based method is proposed. The analytical tools introduced
in this article are implemented as a part of a forecasting solution software
developed to aid the forecasting practice of a large company.
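
The core quantity behind a Gaussian graphical model is the partial correlation between two variables given all the others, which can be read off the inverse covariance matrix. The sketch below shows that computation on a hypothetical three-step chain; the variable names (`buyer`, `seller`, `shipment`) and the simulated data are assumptions for illustration and do not come from the paper's supply chain data.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix from the inverse sample covariance.

    data: array of shape (n_samples, n_variables).
    """
    cov = np.cov(data, rowvar=False)
    precision = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Hypothetical chain: buyer forecast -> seller plan -> shipment.
rng = np.random.default_rng(4)
n = 5000
buyer = rng.normal(size=n)
seller = buyer + rng.normal(size=n)
shipment = seller + rng.normal(size=n)

data = np.column_stack([buyer, seller, shipment])
pcorr = partial_correlations(data)
```

In this chain, `buyer` and `shipment` are correlated marginally, but their partial correlation given `seller` is close to zero, so the graph has no direct edge between them. Adjacent variables keep a clearly nonzero partial correlation.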

Motivated by the prevalence of high dimensional low sample size datasets in
modern statistical applications, we propose a general nonparametric framework,
Direction-Projection-Permutation (DiProPerm), for testing high dimensional
hypotheses. The method is aimed at rigorous testing of whether lower
dimensional visual differences are statistically significant. Theoretical
analysis under the nonclassical asymptotic regime of dimension going to
infinity for fixed sample size reveals that certain natural variations of
DiProPerm can have very different behaviors. An empirical power study both
confirms the theoretical results and suggests DiProPerm is a powerful test in
many settings. Finally, DiProPerm is applied to a high dimensional gene
expression dataset.
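
The three steps in the name (direction, projection, permutation) can be sketched as below. This is a minimal version under stated assumptions: the direction is the mean difference, the univariate statistic is the difference of projected class means, and the direction is recomputed for every relabeling. These are choices made for this sketch; the function names and the simulated data are hypothetical.

```python
import numpy as np

def md_statistic(X, labels):
    """Find the mean difference direction, project, and compare projected means."""
    diff = X[labels].mean(axis=0) - X[~labels].mean(axis=0)
    direction = diff / np.linalg.norm(diff)          # direction step
    proj = X @ direction                             # projection step
    return proj[labels].mean() - proj[~labels].mean()

def diproperm(X, labels, n_perm=200, seed=0):
    """Permutation p-value, recomputing the direction for each relabeling."""
    rng = np.random.default_rng(seed)
    observed = md_statistic(X, labels)
    perm_stats = np.array(
        [md_statistic(X, rng.permutation(labels)) for _ in range(n_perm)]
    )
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(perm_stats >= observed)) / (1 + n_perm)
    return observed, perm_stats, p_value

# Hypothetical data: 40 observations in 50 dimensions, first class shifted.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 50))
X[:20] += 1.0
labels = np.array([True] * 20 + [False] * 20)

observed, perm_stats, p_value = diproperm(X, labels)
```

Recomputing the direction inside each permutation matters: in high dimension, a direction fitted to the labels can separate even unrelated groups, so the null distribution has to include that fitting step.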

Object Oriented Data Analysis is a new area in statistics that studies
populations of general data objects. In this article we consider populations of
tree-structured objects as our focus of interest. We develop improved analysis
tools for data lying in a binary tree space analogous to classical Principal
Component Analysis methods in Euclidean space. Our extensions of PCA are
analogs of one dimensional subspaces that best fit the data. Previous work was
based on the notion of treelines.
In this paper, a generalization of the previous treeline notion is proposed:
k-treelines. Previously proposed treelines are k-treelines where k=1. New
subcases of k-treelines studied in this work are the 2-treelines and
treecurves, which explain much more variation per principal component than
treelines. The optimal principal component treelines were computable in
linear time. Because 2-treelines and treecurves are more complex, they are
computationally more expensive, but yield improved data analysis results.
We provide a comparative study of all these methods on a motivating data set
consisting of brain vessel structures of 98 subjects.

Sparse Principal Component Analysis (PCA) methods are efficient tools to
reduce the dimension (or the number of variables) of complex data. Sparse
principal components (PCs) are easier to interpret than conventional PCs,
because most loadings are zero. We study the asymptotic properties of these
sparse PC directions for scenarios with fixed sample size and increasing
dimension (i.e. High Dimension, Low Sample Size (HDLSS)). Under the previously
studied spike covariance assumption, we show that Sparse PCA remains consistent
under the same large spike condition that was previously established for
conventional PCA. Under a broad range of small spike conditions, we find a
large set of sparsity assumptions where Sparse PCA is consistent, but PCA is
strongly inconsistent. The boundaries of the consistent region are clarified
using an oracle result.
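
A small simulation can illustrate why sparsity helps in the HDLSS setting. The sketch below uses simple hard thresholding of the leading PC loadings as a stand-in sparse estimator; that choice, the spike model parameters, and all names are assumptions for illustration and are not claimed to be the estimator or the conditions analyzed in the paper.

```python
import numpy as np

def leading_pc(X):
    """First principal component direction of the rows of X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def threshold_pc(v, k):
    """Keep the k largest-magnitude loadings, zero the rest, renormalize."""
    keep = np.argsort(np.abs(v))[-k:]
    sparse = np.zeros_like(v)
    sparse[keep] = v[keep]
    return sparse / np.linalg.norm(sparse)

# Hypothetical single-spike model with a sparse true direction u.
rng = np.random.default_rng(3)
n, d, k, spike = 40, 1000, 5, 100.0
u = np.zeros(d)
u[:k] = 1.0 / np.sqrt(k)
X = np.sqrt(spike) * rng.normal(size=(n, 1)) * u + rng.normal(size=(n, d))

v_pca = leading_pc(X)
v_sparse = threshold_pc(v_pca, k)

# Absolute cosine of the angle to the true direction (sign is arbitrary).
cos_pca = abs(v_pca @ u)
cos_sparse = abs(v_sparse @ u)
```

Here the conventional PC direction spreads a noticeable share of its length over the many noise coordinates, whereas the thresholded direction discards them and lies closer to `u`.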

This study introduces a new method of visualizing complex tree-structured
objects. The usefulness of this method is illustrated in the context of
detecting unexpected features in a data set of very large trees. The major
contribution is a novel two-dimensional graphical representation of each tree,
with a covariate coded by color. The motivating data set contains
three-dimensional representations of brain artery systems of 105 subjects. Due to
inaccuracies inherent in the medical imaging techniques, issues with the
reconstruction algorithms and inconsistencies introduced by manual
adjustment, various discrepancies are present in the data. The proposed
representation enables quick visual detection of the most common discrepancies.
For our driving example, this tool led to the modification of 10% of the artery
trees and deletion of 6.7%. The benefits of our cleaning method are
demonstrated through a statistical hypothesis test on the effects of aging on
vessel structure. The data cleaning resulted in improved significance levels.

A set of curves or images of similar shape is an increasingly common
functional data set collected in the sciences. Principal Component Analysis
(PCA) is the most widely used technique to decompose variation in functional
data. However, the linear modes of variation found by PCA are not always
interpretable by the experimenters. In addition, the modes of variation of
interest to the experimenter are not always linear. We present in this paper a
new analysis of variance for functional data. Our method was motivated by
decomposing the variation in the data into predetermined and interpretable
directions (i.e. modes) of interest. Since some of these modes could be
nonlinear, we develop a newly defined ratio of sums of squares which takes into
account the curvature of the space of variation. We discuss, in the general
case, consistency of our estimates of variation, using mathematical tools from
differential geometry and shape statistics. We successfully applied our method
to a motivating example of biological data. This decomposition allows
biologists to compare the prevalence of different genetic tradeoffs in a
population and to quantify the effect of selection on evolution.