Object Oriented Data Analysis is a new area in statistics that studies
populations of general data objects. In this article we focus on populations of
tree-structured objects. We develop improved analysis
tools for data lying in a binary tree space, analogous to classical Principal
Component Analysis (PCA) methods in Euclidean space. Our extensions of PCA are
analogs of one-dimensional subspaces that best fit the data. Previous work was
based on the notion of tree-lines.
In this paper, a generalization of the previous tree-line notion is proposed:
k-tree-lines. Previously proposed tree-lines are k-tree-lines where k=1. New
sub-cases of k-tree-lines studied in this work are the 2-tree-lines and
tree-curves, which explain much more variation per principal component than
tree-lines. The optimal principal component tree-lines are computable in
linear time. Because 2-tree-lines and tree-curves are more complex, they are
computationally more expensive, but yield improved data analysis results.
We provide a comparative study of all these methods on a motivating data set
consisting of brain vessel structures of 98 subjects.
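
As a rough illustration of what "a tree-line that best fits the data" can mean, the sketch below makes several assumptions that are not stated in the text above: each binary tree is encoded as a set of level-order node indices (root 1, children of node v are 2v and 2v+1), the distance between two trees is the size of their symmetric difference, and a tree-line is a sequence of trees grown from a starting tree by adding one node at a time along a single downward path. All function names are hypothetical, and the search is a brute-force enumeration for small examples, not the linear-time algorithm mentioned above.

```python
from itertools import chain


def distance(a, b):
    """Assumed metric: number of nodes present in exactly one of the two trees."""
    return len(a ^ b)


def tree_line(start, path):
    """Assumed tree-line: start, start+{v1}, start+{v1,v2}, ... along a downward path."""
    line = [frozenset(start)]
    for v in path:
        line.append(line[-1] | {v})
    return line


def project(tree, line):
    """Closest point on the tree-line to the given tree (brute force)."""
    return min(line, key=lambda point: distance(tree, point))


def candidate_paths(start, support):
    """Yield every downward path inside `support` whose first node lies
    outside `start` but has its parent in `start`."""
    def extend(path):
        yield tuple(path)
        for child in (2 * path[-1], 2 * path[-1] + 1):
            if child in support:
                yield from extend(path + [child])

    for v in sorted(support - start):
        if v // 2 in start:
            yield from extend([v])


def best_tree_line(trees, start):
    """Path whose tree-line minimizes the summed distance from each data
    tree to its projection (exhaustive search, for illustration only)."""
    support = frozenset(chain.from_iterable(trees))

    def cost(path):
        line = tree_line(start, path)
        return sum(distance(t, project(t, line)) for t in trees)

    return min(candidate_paths(start, support), key=cost)


# Toy data set of three small binary trees (made up for this example).
trees = [frozenset({1, 2, 4}), frozenset({1, 2, 4, 8}), frozenset({1, 3})]
start = frozenset({1})
best_path = best_tree_line(trees, start)
```

In this toy example the selected path runs through the nodes shared by most trees, which mirrors the intuition of a one-dimensional direction capturing the most variation. The encoding and metric are only one possible choice.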

The statistical analysis of tree-structured data is a new topic in statistics
with wide application areas. Some Principal Component Analysis (PCA) ideas were
previously developed for binary tree spaces. In this study, we extend these
ideas to the more general space of rooted and labeled trees. We re-define
concepts such as tree-line and forward principal component tree-line for this
more general space, and generalize the optimal algorithm that finds them.
We then develop an analog of the classical dimension reduction technique in PCA
for the tree space. To do this, we define the components that carry the least
amount of variation of a tree data set, called backward principal components.
We present an optimal algorithm to find them. Furthermore, we investigate the
relationship of these to the forward principal components, and prove a
path-independence property between the forward and backward techniques.
We apply our methods to a brain artery data set of 98 subjects.
Using our techniques, we investigate how aging affects the brain artery
structure of males and females. We also analyze a data set describing the
organizational structure of a large US company and explore the structural
differences across different types of departments within the company.

This study introduces a new method of visualizing complex tree-structured
objects. The usefulness of this method is illustrated in the context of
detecting unexpected features in a data set of very large trees. The major
contribution is a novel two-dimensional graphical representation of each tree,
with a covariate coded by color. The motivating data set contains
three-dimensional representations of brain artery systems of 105 subjects. Due to
inaccuracies inherent in the medical imaging techniques, issues with the
reconstruction algorithms and inconsistencies introduced by manual
adjustment, various discrepancies are present in the data. The proposed
representation enables quick visual detection of the most common discrepancies.
For our driving example, this tool led to the modification of 10% of the artery
trees and deletion of 6.7%. The benefits of our cleaning method are
demonstrated through a statistical hypothesis test on the effects of aging on
vessel structure. The data cleaning resulted in improved significance levels.