
Latent Dirichlet allocation (LDA) is useful in document analysis, image
processing, and many information systems; however, its generalization
performance has remained unknown because it is a singular learning machine to
which regular statistical theory cannot be applied.
Stochastic matrix factorization (SMF) is a restricted matrix factorization in
which the matrix factors are stochastic; that is, each column of a factor lies
in a simplex.
SMF is being applied to image recognition and text mining. We can understand
SMF as a statistical model by which a stochastic matrix of given data is
represented by a product of two stochastic matrices, whose generalization
performance has also been left unknown because of nonregularity.
In this paper, by using an algebraic and geometric method, we show the
analytic equivalence of LDA and SMF: both have the same real log canonical
threshold (RLCT), so they asymptotically have the same Bayesian generalization
error and the same log marginal likelihood. Moreover, we derive an upper bound
of the RLCT and prove that it is smaller than the dimension of the parameter
divided by two; hence their Bayesian generalization errors are smaller than
those of regular statistical models.
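
As a small illustration of the SMF setting described above, the following
sketch builds two column-stochastic factors and checks that their product is
again column-stochastic. The sizes M, H, N and the helper names are
hypothetical and are not taken from the paper. The free-parameter count assumes
that each column loses one degree of freedom to the sum-to-one constraint; it
is shown only to indicate the regular-model baseline (dimension divided by two)
that the RLCT is compared against.

```python
import numpy as np

def random_column_stochastic(rows, cols, rng):
    """Draw a matrix whose columns each lie in the probability simplex."""
    # Each Dirichlet draw is a probability vector of length `rows`;
    # the draws are stacked as rows and then transposed into columns.
    return rng.dirichlet(np.ones(rows), size=cols).T

def is_column_stochastic(mat, tol=1e-10):
    """Check nonnegativity and that every column sums to one."""
    return bool(np.all(mat >= 0) and np.allclose(mat.sum(axis=0), 1.0, atol=tol))

rng = np.random.default_rng(0)
M, H, N = 5, 3, 4                        # hypothetical sizes (assumption)
A = random_column_stochastic(M, H, rng)  # M x H stochastic factor
B = random_column_stochastic(H, N, rng)  # H x N stochastic factor
C = A @ B                                # M x N product, again column-stochastic

# Assumed free-parameter count: one constraint per column of each factor.
dim = (M - 1) * H + (H - 1) * N
regular_baseline = dim / 2               # the value the RLCT is compared against
```

The product stays stochastic because the all-ones row vector is preserved by
each factor: the column sums of A @ B equal the column sums of B.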

Nonnegative matrix factorization (NMF) is a new knowledge discovery method
that is used for text mining, signal processing, bioinformatics, and consumer
analysis. However, its basic properties as a learning machine have not yet
been clarified because it is not a regular statistical model, and as a result
no theoretical optimization method for NMF has been established. In this
paper, we study the real log canonical threshold of NMF and give an upper bound
of the generalization error in Bayesian learning. The results show that the
generalization error of the matrix factorization can be made smaller than that
of regular statistical models if Bayesian learning is applied.

Prior design is one of the most important problems in both statistics and
machine learning. Cross-validation (CV) and the widely applicable
information criterion (WAIC) are predictive measures of Bayesian
estimation; however, it has been difficult to apply them to find the optimal
prior because their mathematical properties in prior evaluation have been
unknown and the region of the hyperparameters is too wide to be examined. In
this paper, we derive a new formula by which the theoretical relation among CV,
WAIC, and the generalization loss is clarified and the optimal hyperparameter
can be directly found.
Using this formula, three facts about predictive prior design are clarified.
First, CV and WAIC have the same second-order asymptotic expansion; hence
they are asymptotically equivalent to each other as optimizers of the
hyperparameter. Second, the hyperparameter that minimizes CV or WAIC
asymptotically minimizes the average generalization loss but not the
random generalization loss. Lastly, by using the mathematical relation
between priors, the variances of the hyperparameters optimized by CV and WAIC
can be made smaller at small computational cost. We also show that the
hyperparameter optimized by DIC or the marginal likelihood does not minimize
the average or random generalization loss in general.
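
To make the idea of choosing a hyperparameter by WAIC concrete, here is a
minimal sketch using a plain grid search. It is not the formula derived in the
paper. It assumes a toy model (a normal likelihood with unknown mean and unit
variance, and a zero-mean normal prior whose variance is the hyperparameter),
and the function name, grid, and sample size are all hypothetical.

```python
import numpy as np

def waic_for_prior_var(x, prior_var, n_draws=4000, seed=0):
    """WAIC of the toy model N(mu, 1) with prior mu ~ N(0, prior_var)."""
    n = x.size
    # Conjugate posterior of mu is Gaussian.
    precision = 1.0 / prior_var + n
    post_mean = x.sum() / precision
    rng = np.random.default_rng(seed)  # same seed for every hyperparameter value
    draws = rng.normal(post_mean, np.sqrt(1.0 / precision), size=n_draws)
    # log_lik[s, i] = log p(x_i | mu_s)
    log_lik = -0.5 * (x[None, :] - draws[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)
    train_loss = -np.mean(np.log(np.mean(np.exp(log_lik), axis=0)))
    func_var = np.sum(np.var(log_lik, axis=0))
    return train_loss + func_var / n

rng = np.random.default_rng(1)
x = rng.normal(0.2, 1.0, size=20)                # synthetic data (assumption)
grid = np.array([0.01, 0.1, 1.0, 10.0, 100.0])   # candidate prior variances
scores = np.array([waic_for_prior_var(x, v) for v in grid])
best_prior_var = grid[np.argmin(scores)]
```

Reusing one seed across the grid makes the Monte Carlo noise common to all
candidates, so differences in the scores mostly reflect the prior.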

A statistical model or a learning machine is called regular if the map taking
a parameter to a probability distribution is one-to-one and if its Fisher
information matrix is always positive definite. Otherwise, it is called
singular. In regular statistical models, the Bayes free energy, which is
defined as the negative logarithm of the Bayes marginal likelihood, can be
asymptotically approximated by the Schwarz Bayes information criterion (BIC),
whereas in singular models such an approximation does not hold.
Recently, it was proved that the Bayes free energy of a singular model is
asymptotically given by a generalized formula using a birational invariant, the
real log canonical threshold (RLCT), instead of half the number of parameters
in BIC. Theoretical values of RLCTs in several statistical models are now being
discovered based on algebraic geometrical methodology. However, it has been
difficult to estimate the Bayes free energy using only training samples,
because an RLCT depends on an unknown true distribution.
In the present paper, we define a widely applicable Bayesian information
criterion (WBIC) as the average of the log likelihood function over the
posterior distribution with inverse temperature $1/\log n$, where $n$ is the
number of training samples. We mathematically prove that WBIC has the same
asymptotic expansion as the Bayes free energy, even if the true distribution is
singular for and unrealizable by a statistical model. Since WBIC can be
numerically calculated without any information about the true distribution, it
is a generalization of BIC to singular statistical models.
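
The definition above can be sketched numerically. The example below is a
minimal illustration under stated assumptions, not the paper's procedure: it
uses a regular toy model (a normal likelihood with unknown mean and unit
variance, and a normal prior) so that the tempered posterior at inverse
temperature $1/\log n$ is Gaussian and can be sampled directly. The function
names and the prior variance are hypothetical. WBIC is taken here as the
posterior average of the minus log likelihood, on the same scale as the Bayes
free energy.

```python
import numpy as np

def neg_log_lik(mu, x):
    """Minus log likelihood of N(mu, 1) on data x, for one or many values of mu."""
    mu = np.atleast_1d(mu)
    resid = x[None, :] - mu[:, None]
    return 0.5 * np.sum(resid ** 2, axis=1) + 0.5 * x.size * np.log(2 * np.pi)

def wbic_normal_mean(x, prior_var=100.0, n_draws=4000, seed=0):
    """Average minus log likelihood over the posterior tempered at 1/log n."""
    n = x.size
    beta = 1.0 / np.log(n)
    # Tempered posterior: prior(mu) * likelihood(mu)^beta, Gaussian by conjugacy.
    precision = 1.0 / prior_var + beta * n
    mean = beta * x.sum() / precision
    rng = np.random.default_rng(seed)
    draws = rng.normal(mean, np.sqrt(1.0 / precision), size=n_draws)
    return neg_log_lik(draws, x).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, size=200)        # synthetic data (assumption)
wbic = wbic_normal_mean(x)
min_nll = neg_log_lik(x.mean(), x)[0]     # minus log likelihood at the MLE
bic = min_nll + 0.5 * np.log(x.size)      # BIC on the free-energy scale, d = 1
```

In this regular one-parameter case the two values should be close, which is
consistent with WBIC reducing to BIC when the model is regular. For singular
models the tempered posterior is generally not Gaussian and would need MCMC.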

Many learning machines such as normal mixtures and layered neural networks
are not regular but singular statistical models, because the map from a
parameter to a probability distribution is not one-to-one. Conventional
statistical asymptotic theory cannot be applied to such learning machines
because the likelihood function cannot be approximated by any normal
distribution. Recently, a new statistical theory has been established based on
algebraic geometry and it was clarified that the generalization and training
errors are determined by two birational invariants, the real log canonical
threshold and the singular fluctuation. However, their concrete values have
remained unknown. In the present paper, we propose a new concept, the
quasi-regular case, in statistical learning theory. A quasi-regular case is not
a regular case but a singular case; however, it has the same properties as a
regular case. In fact, we prove that, in a quasi-regular case, the two
birational invariants are equal to each other, so that the symmetry of the
generalization and training errors holds. Moreover, because the concrete values
of the two birational invariants are obtained explicitly, the quasi-regular
case is useful for studying statistical learning theory.

In regular statistical models, leave-one-out cross-validation is
asymptotically equivalent to the Akaike information criterion. However, since
many learning machines are singular statistical models, the asymptotic behavior
of cross-validation remains unknown. In previous studies, we established
the singular learning theory and proposed a widely applicable information
criterion, the expectation value of which is asymptotically equal to the
average Bayes generalization loss. In the present paper, we theoretically
compare the Bayes cross-validation loss and the widely applicable information
criterion and prove two theorems. First, the Bayes cross-validation loss is
asymptotically equivalent to the widely applicable information criterion as a
random variable. Therefore, model selection and hyperparameter optimization
using these two values are asymptotically equivalent. Second, the sum of the
Bayes generalization error and the Bayes cross-validation error is
asymptotically equal to $2\lambda/n$, where $\lambda$ is the real log canonical
threshold and $n$ is the number of training samples. Therefore, the relation
between the cross-validation error and the generalization error is determined
by the algebraic geometrical structure of a learning machine. We also clarify
that the deviance information criteria are different from the Bayes
cross-validation and the widely applicable information criterion.
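
The first theorem can be checked numerically on a toy problem. The sketch below
is an illustration under stated assumptions, not the paper's proof or
experiment. It computes WAIC as the Bayes training loss plus the functional
variance divided by $n$, and the Bayes cross-validation loss by the
importance-sampling form that reuses draws from the full posterior. The model
(a normal likelihood with unknown mean and unit variance, and a normal prior),
the function names, and the sample sizes are hypothetical.

```python
import numpy as np

def log_mean_exp(a, axis):
    """Numerically stable log of the mean of exp(a) along an axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.mean(np.exp(a - a_max), axis=axis))

def waic_and_cv(log_lik):
    """log_lik[s, i] = log p(x_i | w_s) for posterior draws w_s."""
    n = log_lik.shape[1]
    train_loss = -np.mean(log_mean_exp(log_lik, axis=0))  # Bayes training loss
    func_var = np.sum(np.var(log_lik, axis=0))            # functional variance
    waic = train_loss + func_var / n
    # Importance-sampling leave-one-out: log of the posterior mean of 1/p(x_i|w).
    cv = np.mean(log_mean_exp(-log_lik, axis=0))
    return waic, cv, train_loss

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)            # synthetic data (assumption)
prior_var = 10.0                              # hypothetical prior variance
precision = 1.0 / prior_var + x.size          # conjugate Gaussian posterior
draws = rng.normal(x.sum() / precision, np.sqrt(1.0 / precision), size=4000)
log_lik = -0.5 * (x[None, :] - draws[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)
waic, cv, train_loss = waic_and_cv(log_lik)
```

Both quantities exceed the training loss by roughly the same small correction,
so their difference is much smaller than either correction term.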

Bayes statistics and statistical physics share a common mathematical
structure, in which the log likelihood function corresponds to the random
Hamiltonian. Recently, it was discovered that the asymptotic learning curves in
Bayes estimation are subject to a universal law, even if the log likelihood
function cannot be approximated by any quadratic form. However, it remains
unknown what mathematical property ensures such a universal law. In this paper,
we define a renormalizable condition of the statistical estimation problem, and
show that, under such a condition, the asymptotic learning curves are ensured
to be subject to the universal law, even if the true distribution is
unrealizable and singular for a statistical model. We also study a
nonrenormalizable case, in which the learning curves have asymptotic
behaviors different from the universal law.

Many learning machines that have hierarchical structure or hidden variables
are now being used in information science, artificial intelligence, and
bioinformatics. However, several learning machines used in such fields are not
regular but singular statistical models; hence their generalization performance
is still unknown. To overcome this problem, in previous papers, we
proved new equations in statistical learning, by which we can estimate the
Bayes generalization loss from the Bayes training loss and the functional
variance, on the condition that the true distribution is a singularity
contained in a learning machine. In this paper, we prove that the same
equations hold even if a true distribution is not contained in a parametric
model. We also prove that the proposed equations in a regular case are
asymptotically equivalent to the Takeuchi information criterion. Therefore, the
proposed equations are always applicable without any condition on the unknown
true distribution.

Learning machines which have hierarchical structures or hidden variables are
singular statistical models because they are nonidentifiable and their Fisher
information matrices are singular. In singular statistical models, the Bayes
a posteriori distribution does not converge to the normal distribution, nor does
the maximum likelihood estimator satisfy asymptotic normality. This is the main
reason why it has been difficult to predict their generalization performance
from trained states. In this paper, we study four errors, (1) Bayes
generalization error, (2) Bayes training error, (3) Gibbs generalization error,
and (4) Gibbs training error, and prove that there are mathematical relations
among these errors. The formulas proved in this paper are equations of states
in statistical estimation because they hold for any true distribution, any
parametric model, and any a priori distribution. We also show that the Bayes
and Gibbs generalization errors can be estimated from the Bayes and Gibbs
training errors, and propose widely applicable information criteria that can be
applied to both regular and singular statistical models.

In statistical problems, a set of parameterized probability distributions is
used to estimate the true probability distribution. If the Fisher information
matrix at the true distribution is singular, it has remained unknown what
we can estimate about the true distribution from random samples. In this paper,
we study a singular regression problem and prove a limit theorem which shows
the relation between the singular regression problem and two birational
invariants, a real log canonical threshold and a singular fluctuation. The
obtained theorem has an important application to statistics, because it enables
us to estimate the generalization error from the training error without any
knowledge of the true probability distribution.