• Latent Dirichlet allocation (LDA) is useful in document analysis, image processing, and many information systems; however, its generalization performance has been left unknown because it is a singular learning machine to which regular statistical theory can not be applied. Stochastic matrix factorization (SMF) is a restricted matrix factorization in which matrix factors are stochastic; the column of the matrix is in a simplex. SMF is being applied to image recognition and text mining. We can understand SMF as a statistical model by which a stochastic matrix of given data is represented by a product of two stochastic matrices, whose generalization performance has also been left unknown because of non-regularity. In this paper, by using an algebraic and geometric method, we show the analytic equivalence of LDA and SMF, both of which have the same real log canonical threshold (RLCT), resulting in that they asymptotically have the same Bayesian generalization error and the same log marginal likelihood. Moreover, we derive the upper bound of the RLCT and prove that it is smaller than the dimension of the parameter divided by two, hence the Bayesian generalization errors of them are smaller than those of regular statistical models.
  • Non-negative matrix factorization (NMF) is a new knowledge discovery method that is used for text mining, signal processing, bioinformatics, and consumer analysis. However, its basic property as a learning machine is not yet clarified, as it is not a regular statistical model, resulting that theoretical optimization method of NMF has not yet established. In this paper, we study the real log canonical threshold of NMF and give an upper bound of the generalization error in Bayesian learning. The results show that the generalization error of the matrix factorization can be made smaller than regular statistical models if Bayesian learning is applied.
  • Prior design is one of the most important problems in both statistics and machine learning. The cross validation (CV) and the widely applicable information criterion (WAIC) are predictive measures of the Bayesian estimation, however, it has been difficult to apply them to find the optimal prior because their mathematical properties in prior evaluation have been unknown and the region of the hyperparameters is too wide to be examined. In this paper, we derive a new formula by which the theoretical relation among CV, WAIC, and the generalization loss is clarified and the optimal hyperparameter can be directly found. By the formula, three facts are clarified about predictive prior design. Firstly, CV and WAIC have the same second order asymptotic expansion, hence they are asymptotically equivalent to each other as the optimizer of the hyperparameter. Secondly, the hyperparameter which minimizes CV or WAIC makes the average generalization loss to be minimized asymptotically but does not the random generalization loss. And lastly, by using the mathematical relation between priors, the variances of the optimized hyperparameters by CV and WAIC are made smaller with small computational costs. Also we show that the optimized hyperparameter by DIC or the marginal likelihood does not minimize the average or random generalization loss in general.
  • A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature $1/\log n$, where $n$ is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for and unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models.
  • Many learning machines such as normal mixtures and layered neural networks are not regular but singular statistical models, because the map from a parameter to a probability distribution is not one-to-one. The conventional statistical asymptotic theory can not be applied to such learning machines because the likelihood function can not be approximated by any normal distribution. Recently, new statistical theory has been established based on algebraic geometry and it was clarified that the generalization and training errors are determined by two birational invariants, the real log canonical threshold and the singular fluctuation. However, their concrete values are left unknown. In the present paper, we propose a new concept, a quasi-regular case in statistical learning theory. A quasi-regular case is not a regular case but a singular case, however, it has the same property as a regular case. In fact, we prove that, in a quasi-regular case, two birational invariants are equal to each other, resulting that the symmetry of the generalization and training errors holds. Moreover, the concrete values of two birational invariants are explicitly obtained, the quasi-regular case is useful to study statistical learning theory.
  • In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to $2\lambda/n$, where $\lambda$ is the real log canonical threshold and $n$ is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.
  • Bayes statistics and statistical physics have the common mathematical structure, where the log likelihood function corresponds to the random Hamiltonian. Recently, it was discovered that the asymptotic learning curves in Bayes estimation are subject to a universal law, even if the log likelihood function can not be approximated by any quadratic form. However, it is left unknown what mathematical property ensures such a universal law. In this paper, we define a renormalizable condition of the statistical estimation problem, and show that, under such a condition, the asymptotic learning curves are ensured to be subject to the universal law, even if the true distribution is unrealizable and singular for a statistical model. Also we study a nonrenormalizable case, in which the learning curves have the different asymptotic behaviors from the universal law.
  • Many learning machines that have hierarchical structure or hidden variables are now being used in information science, artificial intelligence, and bioinformatics. However, several learning machines used in such fields are not regular but singular statistical models, hence their generalization performance is still left unknown. To overcome these problems, in the previous papers, we proved new equations in statistical learning, by which we can estimate the Bayes generalization loss from the Bayes training loss and the functional variance, on the condition that the true distribution is a singularity contained in a learning machine. In this paper, we prove that the same equations hold even if a true distribution is not contained in a parametric model. Also we prove that, the proposed equations in a regular case are asymptotically equivalent to the Takeuchi information criterion. Therefore, the proposed equations are always applicable without any condition on the unknown true distribution.
  • Learning machines which have hierarchical structures or hidden variables are singular statistical models because they are nonidentifiable and their Fisher information matrices are singular. In singular statistical models, neither the Bayes a posteriori distribution converges to the normal distribution nor the maximum likelihood estimator satisfies asymptotic normality. This is the main reason why it has been difficult to predict their generalization performances from trained states. In this paper, we study four errors, (1) Bayes generalization error, (2) Bayes training error, (3) Gibbs generalization error, and (4) Gibbs training error, and prove that there are mathematical relations among these errors. The formulas proved in this paper are equations of states in statistical estimation because they hold for any true distribution, any parametric model, and any a priori distribution. Also we show that Bayes and Gibbs generalization errors are estimated by Bayes and Gibbs training errors, and propose widely applicable information criteria which can be applied to both regular and singular statistical models.
  • In statistical problems, a set of parameterized probability distributions is used to estimate the true probability distribution. If Fisher information matrix at the true distribution is singular, then it has been left unknown what we can estimate about the true distribution from random samples. In this paper, we study a singular regression problem and prove a limit theorem which shows the relation between the singular regression problem and two birational invariants, a real log canonical threshold and a singular fluctuation. The obtained theorem has an important application to statistics, because it enables us to estimate the generalization error from the training error without any knowledge of the true probability distribution.