• The traditional measurement theory interprets the variance as the dispersion of a measured value, which is actually contrary to a general mathematical concept that the variance of a constant is 0. This paper will fully demonstrate that the variance in measurement theory is actually the evaluation of probability interval of an error instead of the dispersion of a measured value, point out the key point of mistake in the traditional interpretation, and fully interpret a series of changes in conceptual logic and processing method brought about by this new concept.
  • We consider the problem of selecting a portfolio of entries of fixed cardinality for contests with top-heavy payoff structures, i.e. most of the winnings go to the top-ranked entries. This framework is general and can be used to model a variety of problems, such as movie studios selecting movies to produce, venture capital firms picking start-up companies to invest in, or individuals selecting lineups for daily fantasy sports contests, which is the example we focus on here. We model the portfolio selection task as a combinatorial optimization problem with a submodular objective function, which is given by the probability of at least one entry winning. We then show that this probability can be approximated using only pairwise marginal probabilities of the entries winning when there is a certain structure on their joint distribution. We consider a model where the entries are jointly Gaussian random variables and present a closed form approximation to the objective function. Building on this, we then consider a scenario where the entries are given by sums of constrained resources and present an integer programming formulation to construct the entries. Our formulation uses principles based on our theoretical analysis to construct entries: we maximize the expected score of an entry subject to a lower bound on its variance and an upper bound on its correlation with previously constructed entries. To demonstrate the effectiveness of our integer programming approach, we apply it to daily fantasy sports contests that have top-heavy payoff structures. We find that our approach performs well in practice. Using our integer programming approach, we are able to rank in the top-ten multiple times in hockey and baseball contests with thousands of competing entries. Our approach can easily be extended to other problems with constrained resources and a top-heavy payoff structure.
  • A simple, intuitive approach to the assessment of probabilistic inferences is introduced. The Shannon information metrics are translated to the probability domain. The translation shows that the negative logarithmic score and the geometric mean are equivalent measures of the accuracy of a probabilistic inference. Thus there is both a quantitative reduction in perplexity, which is the inverse of the geometric mean of the probabilities, as good inference algorithms reduce the uncertainty and a qualitative reduction due to the increased clarity between the original set of probabilistic forecasts and their central tendency, the geometric mean. Further insight is provided by showing that the R\'enyi and Tsallis entropy functions translated to the probability domain are both the weighted generalized mean of the distribution. The generalized mean of probabilistic forecasts forms a spectrum of performance metrics referred to as a Risk Profile. The arithmetic mean is used to measure the decisiveness, while the -2/3 mean is used to measure the robustness.
  • We investigate a Poisson sampling design in the presence of unknown selection probabilities when applied to a population of unknown size for multiple sampling occasions. The fixed-population model is adopted and extended upon for inference. The complete minimal sufficient statistic is derived for the sampling model parameters and fixed-population parameter vector. The Rao-Blackwell version of population quantity estimators is detailed. An application is applied to an emprical population. The extended inferential framework is found to have much potential and utility for empirical studies.
  • Quality control in industrial processes is increasingly making use of prior scientific knowledge, often encoded in physical models that require numerical approximation. Statistical prediction, and subsequent optimization, is key to ensuring the process output meets a specification target. However, the numerical expense of approximating the models poses computational challenges to the identification of combinations of the process factors where there is confidence in the quality of the response. Recent work in Bayesian computation and statistical approximation (emulation) of expensive computational models is exploited to develop a novel strategy for optimizing the posterior probability of a process meeting specification. The ensuing methodology is motivated by, and demonstrated on, a chemical synthesis process to manufacture a pharmaceutical product, within which an initial set of substances evolve according to chemical reactions, under certain process conditions, into a series of new substances. One of these substances is a target pharmaceutical product and two are unwanted by-products. The aim is to determine the combinations of process conditions and amounts of initial substances that maximize the probability of obtaining sufficient target pharmaceutical product whilst ensuring unwanted by-products do not exceed a given level. The relationship between the factors and amounts of substances of interest is theoretically described by the solution to a system of ordinary differential equations incorporating temperature dependence. Using data from a small experiment, it is shown how the methodology can approximate the multivariate posterior predictive distribution of the pharmaceutical target and by-products, and therefore identify suitable operating values. Materials to replicate the analysis can be found at www.github.com/amo105/chemicalkinetics.
  • This study demonstrates how to use "spmoran", an R package estimating spatial additive mixed models and other spatial regression models for Gaussian and non-Gaussian data. Moran eigenvectors are used to an approximate Gaussian process modeling which is interpretable in terms of the Moran coefficient. The GP is used for modeling the spatial processes in residuals and regression coefficients. All these models are estimated computationally efficiently. For the sample code used in this paper, see https://github.com/dmuraka/spmoran.
  • Objective prior distributions represent an important tool that allows one to have the advantages of using the Bayesian framework even when information about the parameters of a model is not available. The usual objective approaches work off the chosen statistical model and in the majority of cases the resulting prior is improper, which can pose limitations to a practical implementation, even when the complexity of the model is moderate. In this paper we propose to take a novel look at the construction of objective prior distributions, where the connection with a chosen sampling distribution model is removed. We explore the notion of defining objective prior distributions which allow one to have some degree of flexibility, in particular in exhibiting some desirable features, such as being proper, or centered on specific values which would be of interest in nested model comparisons. The basic tool we use are proper scoring rules and the main result is a class of objective prior distributions that can be employed in scenarios where the usual model based priors fail, such as mixture models and model selection via Bayes factors. In addition, we show that the proposed class of priors is the result of minimising the information it contains, providing solid interpretation to the method.
  • New results on functional prediction of the Ornstein-Uhlenbeck process in an autoregressive Hilbert-valued and Banach-valued frameworks are derived. Specifically, consistency of the maximum likelihood estimator of the autocorrelation operator, and of the associated plug-in predictor is obtained in both frameworks.
  • This paper presents new results on prediction of linear processes in function spaces. The autoregressive Hilbertian process framework of order one (ARH(1) process framework) is adopted. A componentwise estimator of the autocorrelation operator is formulated, from the moment-based estimation of its diagonal coefficients, with respect to the orthogonal eigenvectors of the auto-covariance operator, which are assumed to be known. Mean-square convergence to the theoretical autocorrelation operator, in the space of Hilbert-Schmidt operators, is proved. Consistency then follows in that space. For the associated ARH(1) plug-in predictor, mean absolute convergence to the corresponding conditional expectation, in the considered Hilbert space, is obtained. Hence, consistency in that space also holds. A simulation study is undertaken to illustrate the finite-large sample behavior of the formulated componentwise estimator and predictor. The performance of the presented approach is compared with alternative approaches in the previous and current ARH(1) framework literature, including the case of unknown eigenvectors.
  • A special class of standard Gaussian Autoregressive Hilbertian processes of order one (Gaussian ARH(1) processes), with bounded linear autocorrelation operator, which does not satisfy the usual Hilbert-Schmidt assumption, is considered. To compensate the slow decay of the diagonal coefficients of the autocorrelation operator, a faster decay velocity of the eigenvalues of the trace autocovariance operator of the innovation process is assumed. As usual, the eigenvectors of the autocovariance operator of the ARH(1) process are considered for projection, since, here, they are assumed to be known. Diagonal componentwise classical and bayesian estimation of the autocorrelation operator is studied for prediction. The asymptotic efficiency and equivalence of both estimators is proved, as well as of their associated componentwise ARH(1) plugin predictors. A simulation study is undertaken to illustrate the theoretical results derived.
  • Possible parameter values in a random sampling model are shown by definition to have uniform base-rate prior probabilities. This allows a frequentist posterior probability distribution to be calculated for such possible parameter values conditional solely on actual study observations. If the likelihood probability distribution of a random selection is modelled with a symmetrical continuous function then the frequentist posterior probability of something equal to or more extreme than the null hypothesis will be equal to the P-value; otherwise the P value would be an approximation. An idealistic probability of replication based on an assumption of perfect study methodological reproducibility can be used as the upper bound of a realistic probability of replication that may be affected by various confounding factors. Bayesian distributions can be combined with these frequentist distributions. The idealistic frequentist posterior probability of replication may be easier than the P-value for non-statisticians to understand and to interpret.
  • Regression for count data is widely performed by models such as Poisson, negative binomial (NB) and zero-inflated regression. A challenge often faced by practitioners is the selection of the right model to take into account dispersion, which typically occurs in count datasets. It is highly desirable to have a unified model that can automatically adapt to the underlying dispersion and that can be easily implemented in practice. In this paper, a discrete Weibull regression model is shown to be able to adapt in a simple way to different types of dispersions relative to Poisson regression: overdispersion, underdispersion and covariate-specific dispersion. Maximum likelihood can be used for efficient parameter estimation. The description of the model, parameter inference and model diagnostics is accompanied by simulated and real data analyses.
  • Consider a real-valued function that can only be observed with stochastic noise at a finite set of design points within a Euclidean space. We wish to determine whether there exists a convex function that goes through the true function values at the design points. We develop an asymptotically consistent Bayesian sequential sampling procedure that estimates the posterior probability of this being true. In each iteration, the posterior probability is estimated using Monte Carlo simulation. We offer three variance reduction methods -- change of measure, acceptance-rejection, and conditional Monte Carlo. Numerical experiments suggest that the conditional Monte Carlo method should be preferred.
  • We developed a simulation game to study the effectiveness of decision-makers in overcoming two complexities in building cybersecurity capabilities: potential delays in capability development; and uncertainties in predicting cyber incidents. Analyzing 1,479 simulation runs, we compared the performances of a group of experienced professionals with those of an inexperienced control group. Experienced subjects did not understand the mechanisms of delays any better than inexperienced subjects; however, experienced subjects were better able to learn the need for proactive decision-making through an iterative process. Both groups exhibited similar errors when dealing with the uncertainty of cyber incidents. Our findings highlight the importance of training for decision-makers with a focus on systems thinking skills, and lay the groundwork for future research on uncovering mental biases about the complexities of cybersecurity.
  • In several literatures, the authors give a new thinking of measurement theory system based on error non-classification philosophy, which completely overthrows the existing measurement concept system of precision, trueness and accuracy. In this paper, by focusing on the issues of error's regularities and effect characteristics, the authors will do a thematic interpretation, and prove that the error's regularities actually come from different cognitive perspectives, are also unable to be used for classifying errors, and that the error's effect characteristics actually depend on artificial condition rules of repeated measurement, and are still unable to be used for classifying errors. Thus, from the perspectives of error's regularities and effect characteristics, the existing error classification philosophy is still incorrect; and an uncertainty concept system, which must be interpreted by the error non-classification philosophy, naturally becomes the only way out of measurement theory.
  • Here we define and study the properties of retrodictive inference. We derive equations relating retrodiction entropy and thermodynamic entropy, and as a special case, show that under equilibrium conditions, the two are identical. We demonstrate relations involving the KL-divergence and retrodiction probability, and bound the time rate of change of retrodiction entropy. As a specific case, we invert various Langevin processes, inferring the initial condition of \(N\) particles given their final positions at some later time. We evaluate the retrodiction entropy for Langevin dynamics exactly for special cases, and find that one's ability to infer the initial state of a system can exhibit two possible qualitative behaviors depending on the potential energy landscape, either decreasing indefinitely, or asymptotically approaching a fixed value. We also study how well we can retrodict points that evolve based on the logistic map. We find singular changes in the retrodictivity near bifurcations. Counterintuitively, the transition to chaos is accompanied by maximal retrodictability.
  • Numerical (and experimental) data analysis often requires the restoration of a smooth function from a set of sampled integrals over finite bins. We present the bin hierarchy method that efficiently computes the maximally smooth function from the sampled integrals using essentially all the information contained in the data. We perform extensive tests with different classes of functions and levels of data quality, including Monte Carlo data suffering from a severe sign problem and physical data for the Green's function of the Fr\"ohlich polaron.
  • In this paper we develop an Expectation Maximization(EM) algorithm to estimate the parameter of a Yule-Simon distribution. The Yule-Simon distribution exhibits the "rich get richer" effect whereby an 80-20 type of rule tends to dominate. These distributions are ubiquitous in industrial settings. The EM algorithm presented provides both frequentist and Bayesian estimates of the $\lambda$ parameter. By placing the estimation method within the EM framework we are able to derive Standard errors of the resulting estimate. Additionally, we prove convergence of the Yule-Simon EM algorithm and study the rate of convergence. An explicit, closed form solution for the rate of convergence of the algorithm is given.
  • The analysis of adverse events (AEs) is a key component in the assessment of a drug's safety profile. Inappropriate analysis methods may result in misleading conclusions about a therapy's safety and consequently its benefit-risk ratio. The statistical analysis of AEs is complicated by the fact that the follow-up times can vary between the patients included in a clinical trial. This paper takes as its focus the analysis of AE data in the presence of varying follow-up times within the benefit assessment of therapeutic interventions. Instead of approaching this issue directly and solely from an analysis point of view, we first discuss what should be estimated in the context of safety data, leading to the concept of estimands. Although the current discussion on estimands is mainly related to efficacy evaluation, the concept is applicable to safety endpoints as well. Within the framework of estimands, we present statistical methods for analysing AEs with the focus being on the time to the occurrence of the first AE of a specific type. We give recommendations which estimators should be used for the estimands described. Furthermore, we state practical implications of the analysis of AEs in clinical trials and give an overview of examples across different indications. We also provide a review of current practices of health technology assessment (HTA) agencies with respect to the evaluation of safety data. Finally, we describe problems with meta-analyses of AE data and sketch possible solutions.
  • Consider a finite population of N items, where item i has a probability p_i to be defective. The goal is to identify all items by means of group testing. This is the generalized group testing problem (GGTP hereafter). In the case of p_1=...=p_N=p Yao and Hwang (1990) proved that the pairwise testing algorithm (PTA hereafter) is the optimal nested algorithm for all N if and only if p in [1-1/\sqrt{2},\,(3-\sqrt{5})/2] (R-range hereafter) (an optimal at the boundary values). In this note, we present a result that helps to define the generalized pairwise testing algorithm (GPTA hereafter) for GGTP. We conjecture that in GGTP when all p_i, i=1,...,N belong to the R-range the optimal nested procedure is GPTA. Although this conjecture is logically reasonable, we only were able to verify it empirically up to a particular level of N. As a byproduct, a slight improvement of the algorithm by Kurtz and Sidi (1988) was obtained.
  • In this paper, we will see that the proportion of d as p th digit, where p > 1 and d $\in$ 0, 9, in data (obtained thanks to the hereunder developed model) is more likely to follow a law whose probability distribution is determined by a specific upper bound, rather than the generalization of Benford's Law to digits beyond the first one. These probability distributions fluctuate around theoretical values determined by Hill in 1995. Knowing beforehand the value of the upper bound can be a way to find a better adjusted law than Hill's one.
  • Data science is the business of learning from data, which is traditionally the business of statistics. Data science, however, is often understood as a broader, task-driven and computationally-oriented version of statistics. Both the term data science and the broader idea it conveys have origins in statistics and are a reaction to a narrower view of data analysis. Expanding upon the views of a number of statisticians, this paper encourages a big-tent view of data analysis. We examine how evolving approaches to modern data analysis relate to the existing discipline of statistics (e.g. exploratory analysis, machine learning, reproducibility, computation, communication and the role of theory). Finally, we discuss what these trends mean for the future of statistics by highlighting promising directions for communication, education and research.
  • We introduce a framework for updating large scale geospatial processes using a model-data synthesis method based on Bayesian hierarchical modelling. Two major challenges come from updating large-scale Gaussian process and modelling non-stationarity. To address the first, we adopt the SPDE approach that uses a sparse Gaussian Markov random fields (GMRF) approximation to reduce the computational cost and implement the Bayesian inference by using the INLA method. For non-stationary global processes, we propose two general models that accommodate commonly-seen geospatial problems. Finally, we show an example of updating an estimate of global glacial isostatic adjustment (GIA) using GPS measurements.
  • The R package BNSP implements Markov chain Monte Carlo algorithms for fitting non- and semi-parametric Bayesian models. In this paper we present the implemented methods for fitting semiparametric, heteroscedastic Gaussian models. The statistical model that we present utilizes basis function expansions to represent semiparametric covariate effects in the mean and variance functions, and it utilizes spike-slab priors to perform selection and to regularize estimated effects. In addition to the main function that performs posterior sampling, the package includes functions for assessing convergence of the sampler and for visualizing covariate effects.
  • The rational solution of the Monty Hall problem unsettles many people. Most people, including the authors, think it feels wrong to switch the initial choice of one of the three doors, despite having fully accepted the mathematical proof for its superiority. Many people, if given the choice to switch, think the chances are fifty-fifty between their options, but still strongly prefer to stay with their initial choice. Is there some ratio behind these irrational feelings? We argue that intuition solves the problem of how to behave in a real game show, not in the abstracted textbook version of the Monty Hall problem. A real show master sometimes plays evil, either to make the show more interesting, to save money, or because having a bad mood. A moody show master erases any information advantage the guest could extract from him opening other doors, driving the chance for the car being behind the chosen door towards fifty percent. Furthermore, the show master could try to read or manipulate the guest's strategy to the guest's disadvantage. Given this, the preference to stay with the initial choice is a very rational mental defense strategy of the show's guest against the threat of being manipulated by its host. Folding these realistic possibilities into the considerations confirms that the intuitive feelings most people have on the Monty Hall problem are indeed very rational.