• ### Adversarial Variational Optimization of Non-Differentiable Simulators(1707.07113)

April 16, 2020 cs.LG, stat.ML
Complex computer simulators are increasingly used across fields of science as generative models tying parameters of an underlying theory to experimental observations. Inference in this setup is often difficult, as simulators rarely admit a tractable density or likelihood function. We introduce Adversarial Variational Optimization (AVO), a likelihood-free inference algorithm for fitting a non-differentiable generative model incorporating ideas from generative adversarial networks, variational optimization and empirical Bayes. We adapt the training procedure of generative adversarial networks by replacing the differentiable generative network with a domain-specific simulator. We solve the resulting non-differentiable minimax problem by minimizing variational upper bounds of the two adversarial objectives. Effectively, the procedure results in learning a proposal distribution over simulator parameters, such that the JS divergence between the marginal distribution of the synthetic data and the empirical distribution of observed data is minimized. We evaluate and compare the method with simulators producing both discrete and continuous data.
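The variational optimization ingredient named in this abstract can be illustrated in isolation. The following is a minimal sketch, not the AVO algorithm itself: it omits the adversarial discriminator and instead minimizes a fixed, piecewise-constant toy objective. The function `simulator_loss`, the Gaussian proposal with fixed width `sigma`, and all hyperparameters are assumptions chosen for illustration.

```python
import random


def simulator_loss(theta):
    """Hypothetical stand-in for a non-differentiable objective.

    It is piecewise constant in theta, so its ordinary gradient is zero
    almost everywhere and undefined at the jumps.
    """
    return abs(round(theta * 10) / 10 - 3.0)


def variational_optimization(steps=2000, batch=64, lr=0.05, sigma=0.5, seed=0):
    rng = random.Random(seed)
    mu = 0.0  # mean of the Gaussian proposal q(theta | mu) = N(mu, sigma^2)
    for _ in range(steps):
        thetas = [rng.gauss(mu, sigma) for _ in range(batch)]
        losses = [simulator_loss(t) for t in thetas]
        baseline = sum(losses) / len(losses)  # variance-reduction baseline
        # Score-function estimate of d/dmu E_q[loss]:
        #   mean of (loss - baseline) * d/dmu log q(theta | mu),
        # where d/dmu log q(theta | mu) = (theta - mu) / sigma^2.
        grad = sum(
            (loss - baseline) * (t - mu) / sigma**2
            for loss, t in zip(losses, thetas)
        ) / batch
        mu -= lr * grad  # gradient descent on the smoothed objective
    return mu


fitted_mu = variational_optimization()
```

Because the expectation over the proposal is smooth in `mu` even though `simulator_loss` is not, plain gradient descent on `mu` moves the proposal toward the minimizer of the toy objective near 3.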
• ### QCD-Aware Recursive Neural Networks for Jet Physics(1702.00748)

July 13, 2018 hep-ph, physics.data-an, stat.ML
Recent progress in applying machine learning for jet physics has been built upon an analogy between calorimeters and images. In this work, we present a novel class of recursive neural networks built instead upon an analogy between QCD and natural languages. In the analogy, four-momenta are like words and the clustering history of sequential recombination jet algorithms is like the parsing of a sentence. Our approach works directly with the four-momenta of a variable-length set of particles, and the jet-based tree structure varies on an event-by-event basis. Our experiments highlight the flexibility of our method for building task-specific jet embeddings and show that recursive architectures are significantly more accurate and data efficient than previous image-based networks. We extend the analogy from individual jets (sentences) to full events (paragraphs), and show for the first time an event-level classifier operating on all the stable particles produced in an LHC event.
• ### Constraining Effective Field Theories with Machine Learning(1805.00013)

April 30, 2018 hep-ph, physics.data-an, stat.ML
We present powerful new analysis techniques to constrain effective field theories at the LHC. By leveraging the structure of particle physics processes, we extract extra information from Monte-Carlo simulations, which can be used to train neural network models that estimate the likelihood ratio. These methods scale well to processes with many observables and theory parameters, do not require any approximations of the parton shower or detector response, and can be evaluated in microseconds. We show that they allow us to put significantly stronger bounds on dimension-six operators than existing methods, demonstrating their potential to improve the precision of the LHC legacy constraints.
• ### A Guide to Constraining Effective Field Theories with Machine Learning(1805.00020)

April 30, 2018 hep-ph, physics.data-an, stat.ML
We develop, discuss, and compare several inference techniques to constrain theory parameters in collider experiments. By harnessing the latent-space structure of particle physics processes, we extract extra information from the simulator. This augmented data can be used to train neural networks that precisely estimate the likelihood ratio. The new methods scale well to many observables and high-dimensional parameter spaces, do not require any approximations of the parton shower and detector response, and can be evaluated in microseconds. Using weak-boson-fusion Higgs production as an example process, we compare the performance of several techniques. The best results are found for likelihood ratio estimators trained with extra information about the score, the gradient of the log likelihood function with respect to the theory parameters. The score also provides sufficient statistics that contain all the information needed for inference in the neighborhood of the Standard Model. These methods enable us to put significantly stronger bounds on effective dimension-six operators than the traditional approach based on histograms. They also outperform generic machine learning methods that do not make use of the particle physics structure, demonstrating their potential to substantially improve the new physics reach of the LHC legacy results.
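For reference, the score mentioned in this abstract can be written out explicitly. The symbols $x$ (observables), $\theta$ (theory parameters), $\theta_{\mathrm{SM}}$ (the Standard Model point) and $t$ are notation assumed here, not taken from the abstract:

$$
t(x \mid \theta_0) \;=\; \nabla_\theta \log p(x \mid \theta)\,\Big|_{\theta = \theta_0} .
$$

A first-order Taylor expansion of the log likelihood around $\theta_{\mathrm{SM}}$ suggests why the score carries the locally relevant information:

$$
\log \frac{p(x \mid \theta)}{p(x \mid \theta_{\mathrm{SM}})} \;\approx\; t(x \mid \theta_{\mathrm{SM}}) \cdot (\theta - \theta_{\mathrm{SM}}) .
$$

To this order, the likelihood ratio depends on $x$ only through $t(x \mid \theta_{\mathrm{SM}})$.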
• ### HEP Software Foundation Community White Paper Working Group - Data Analysis and Interpretation(1804.03983)

April 9, 2018 hep-ex, physics.comp-ph
At the heart of experimental high energy physics (HEP) is the development of facilities and instrumentation that provide sensitivity to new phenomena. Our understanding of nature at its most fundamental level is advanced through the analysis and interpretation of data from sophisticated detectors in HEP experiments. The goal of data analysis systems is to realize the maximum possible scientific potential of the data within the constraints of computing and human resources in the least time. To achieve this goal, future analysis systems should empower physicists to access the data with a high level of interactivity, reproducibility and throughput capability. As part of the HEP Software Foundation Community White Paper process, a working group on Data Analysis and Interpretation was formed to assess the challenges and opportunities in HEP data analysis and develop a roadmap for activities in this area over the next decade. In this report, the key findings and recommendations of the Data Analysis and Interpretation Working Group are presented.
• Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the sheer amounts of data to be recorded. In planning for the HL-LHC in particular, it is critical that all of the collaborating stakeholders agree on the software goals and priorities, and that the efforts complement each other. In this spirit, this white paper describes the R&D activities required to prepare for this software upgrade.
• ### Modeling Smooth Backgrounds and Generic Localized Signals with Gaussian Processes(1709.05681)

Sept. 17, 2017 hep-ph, hep-ex, physics.data-an
We describe a procedure for constructing a model of a smooth data spectrum using Gaussian processes rather than the historical parametric description. This approach considers a fuller space of possible functions, is robust at increasing luminosity, and allows us to incorporate our understanding of the underlying physics. We demonstrate the application of this approach to modeling the background to searches for dijet resonances at the Large Hadron Collider and describe how the approach can be used in the search for generic localized signals.
• ### Yadage and Packtivity - analysis preservation using parametrized workflows(1706.01878)

June 6, 2017 hep-ex, physics.data-an
Preserving data analyses produced by the collaborations at the LHC in a parametrized fashion is crucial in order to maintain reproducibility and re-usability. We argue for a declarative description in terms of individual processing steps - packtivities - linked through a dynamic directed acyclic graph (DAG) and present an initial set of JSON schemas for such a description and an implementation - yadage - capable of executing workflows of analysis preserved via Linux containers.
• ### Learning to Pivot with Adversarial Networks(1611.01046)

Several techniques for domain adaptation have been proposed to account for differences in the distribution of the data used for training and testing. The majority of this work focuses on a binary domain label. Similar problems occur in a scientific context where there may be a continuous family of plausible data generation processes associated to the presence of systematic uncertainties. Robust inference is possible if it is based on a pivot -- a quantity whose distribution does not depend on the unknown values of the nuisance parameters that parametrize this family of data generation processes. In this work, we introduce and derive theoretical results for a training procedure based on adversarial networks for enforcing the pivotal property (or, equivalently, fairness with respect to continuous attributes) on a predictive model. The method includes a hyperparameter to control the trade-off between accuracy and robustness. We demonstrate the effectiveness of this approach with a toy example and examples from particle physics.
• ### Better Higgs Measurements Through Information Geometry(1612.05261)

March 30, 2017 hep-ph, physics.data-an
Information geometry can be used to understand and optimize Higgs measurements at the LHC. The Fisher information encodes the maximum sensitivity of observables to model parameters for a given experiment. Applied to higher-dimensional operators, it defines the new physics reach of any LHC signature. We calculate the Fisher information for Higgs production in weak boson fusion with decays into tau pairs and four leptons, and for Higgs production in association with a single top quark. In a next step we analyze how the differential information is distributed over phase space, which defines optimal event selections. Conversely, we consider the information in the distribution of a subset of the kinematic variables, showing which production and decay observables are the most powerful and how much information is lost in traditional histogram-based analysis methods compared to fully multivariate ones.
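As a reminder of the quantity this abstract is built on, the Fisher information and the Cramér-Rao bound for an unbiased estimator $\hat\theta$ can be written as follows. The notation ($p(x \mid \theta)$, $I_{ij}$, $n_b$) is assumed here for illustration:

$$
I_{ij}(\theta) \;=\; \mathbb{E}_{x \sim p(x \mid \theta)}\!\left[ \frac{\partial \log p(x \mid \theta)}{\partial \theta_i}\, \frac{\partial \log p(x \mid \theta)}{\partial \theta_j} \right],
\qquad
\operatorname{cov}(\hat\theta) \;\succeq\; I^{-1}(\theta) .
$$

For a histogram of independent Poisson bins with expected counts $n_b(\theta)$, this reduces to

$$
I_{ij}(\theta) \;=\; \sum_b \frac{1}{n_b(\theta)}\, \frac{\partial n_b}{\partial \theta_i}\, \frac{\partial n_b}{\partial \theta_j} ,
$$

which is one way to quantify how much information a given binning retains compared with the full multivariate distribution.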
• ### Approximating Likelihood Ratios with Calibrated Discriminative Classifiers(1506.02169)

March 18, 2016 physics.data-an, stat.AP, stat.ML
In many fields of science, generalized likelihood ratio tests are established tools for statistical inference. At the same time, it has become increasingly common that a simulator (or generative model) is used to describe complex processes that tie parameters $\theta$ of an underlying theory and measurement apparatus to high-dimensional observations $\mathbf{x}\in \mathbb{R}^p$. However, simulators often do not provide a way to evaluate the likelihood function for a given observation $\mathbf{x}$, which motivates a new class of likelihood-free inference algorithms. In this paper, we show that likelihood ratios are invariant under a specific class of dimensionality reduction maps $\mathbb{R}^p \mapsto \mathbb{R}$. As a direct consequence, we show that discriminative classifiers can be used to approximate the generalized likelihood ratio statistic when only a generative model for the data is available. This leads to a new machine learning-based approach to likelihood-free inference that is complementary to Approximate Bayesian Computation, and which does not require a prior on the model parameters. Experimental results on artificial problems with known exact likelihoods illustrate the potential of the proposed method.
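The link between a classifier output and a likelihood ratio can be checked on a toy problem with known densities. The sketch below assumes two unit-width 1D Gaussians and balanced classes, and uses the Bayes-optimal classifier in closed form rather than a trained one. The function names and the convention $s = p_1/(p_0+p_1)$ are assumptions made for this illustration. A trained classifier would only approximate $s$ and would generally need calibration.

```python
import math


def gaussian_pdf(x, mean, std=1.0):
    """Density of a 1D Gaussian, used here as a tractable toy model."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def optimal_classifier(x, mean0=0.0, mean1=1.0):
    """Bayes-optimal output s(x) = p1 / (p0 + p1) for balanced classes."""
    p0 = gaussian_pdf(x, mean0)
    p1 = gaussian_pdf(x, mean1)
    return p1 / (p0 + p1)


def ratio_from_classifier(s):
    """Invert s = p1 / (p0 + p1) to recover the likelihood ratio p1 / p0."""
    return s / (1.0 - s)


x = 0.7
estimated = ratio_from_classifier(optimal_classifier(x))
exact = gaussian_pdf(x, 1.0) / gaussian_pdf(x, 0.0)
```

Here `estimated` and `exact` agree up to floating-point error, because the map from $x$ to $s(x)$ is monotonic in the true ratio and therefore loses no information about it.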
• ### Parameterized Machine Learning for High-Energy Physics(1601.07913)

Jan. 28, 2016 hep-ph, hep-ex, cs.LG
We investigate a new structure for machine learning classifiers applied to problems in high-energy physics by expanding the inputs to include not only measured features but also physics parameters. The physics parameters represent a smoothly varying learning task, and the resulting parameterized classifier can smoothly interpolate between them and replace sets of classifiers trained at individual values. This simplifies the training process and gives improved performance at intermediate values, even for complex problems requiring deep learning. Applications include tools parameterized in terms of theoretical model parameters, such as the mass of a particle, which allow for a single network to provide improved discrimination across a range of masses. This concept is simple to implement and allows for optimized interpolatable results.
• ### Observing Ultra-High Energy Cosmic Rays with Smartphones(1410.2895)

We propose a novel approach for observing cosmic rays at ultra-high energy ($>10^{18}$ eV) by repurposing the existing network of smartphones as a ground detector array. Extensive air showers generated by cosmic rays produce muons and high-energy photons, which can be detected by the CMOS sensors of smartphone cameras. The small size and low efficiency of each sensor are compensated by the large number of active phones. We show that if user adoption targets are met, such a network will have significant observing power at the highest energies.
• ### Decoupling Theoretical Uncertainties from Measurements of the Higgs Boson(1401.0080)

April 1, 2015 hep-ph, physics.data-an
We develop a technique for presenting Higgs coupling measurements that decouples the poorly defined theoretical uncertainties associated with inclusive and exclusive cross section predictions. The technique simplifies the combination of multiple measurements and can be used in a more general setting. We illustrate the approach with toy LHC Higgs coupling measurements and a collection of new physics models.
• ### Practical Statistics for the LHC(1503.07622)

March 26, 2015 hep-ex, physics.data-an
This document is a pedagogical introduction to statistics for particle physics. Emphasis is placed on the terminology, concepts, and methods being used at the Large Hadron Collider. The document addresses both the statistical tests applied to a model of the data and the modeling itself.
• ### 10 Simple Rules for the Care and Feeding of Scientific Data(1401.2134)

Jan. 9, 2014 cs.CY, cs.DL, astro-ph.IM
This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to "care for and feed" data, with some practical advice on how to do that.
• ### Asymptotic formulae for likelihood-based tests of new physics(1007.1727)

June 24, 2013 hep-ex, physics.data-an
We describe likelihood-based statistical tests for use in high energy physics for the discovery of new phenomena and for construction of confidence intervals on model parameters. We focus on the properties of the test procedures that allow one to account for systematic uncertainties. Explicit formulae for the asymptotic distributions of test statistics are derived using results of Wilks and Wald. We motivate and justify the use of a representative data set, called the "Asimov data set", which provides a simple method to obtain the median experimental sensitivity of a search or measurement as well as fluctuations about this expectation.
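One commonly quoted consequence of the Asimov approach is a closed-form median discovery significance for a single-bin counting experiment. The sketch below assumes expected signal `s`, a perfectly known expected background `b`, and no systematic uncertainties. The function name is hypothetical, and the example covers only this special case rather than the general formalism of the paper.

```python
import math


def asimov_significance(s, b):
    """Median discovery significance for a single-bin counting experiment.

    Assumes expected signal s >= 0 and known expected background b > 0,
    with no systematic uncertainty on b.
    """
    if s < 0 or b <= 0:
        raise ValueError("require s >= 0 and b > 0")
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))


z_asimov = asimov_significance(10.0, 100.0)
z_naive = 10.0 / math.sqrt(100.0)  # the familiar s / sqrt(b) approximation
```

For $s \ll b$ the expression approaches $s/\sqrt{b}$, while for larger $s/b$ the simple approximation overestimates the significance.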
• ### Asymptotic distribution for two-sided tests with lower and upper boundaries on the parameter of interest(1210.6948)

Oct. 25, 2012 hep-ex, physics.data-an
We present the asymptotic distribution for two-sided tests based on the profile likelihood ratio with lower and upper boundaries on the parameter of interest. This situation is relevant for branching ratios and the elements of unitary matrices such as the CKM matrix.
• ### Status Report of the DPHEP Study Group: Towards a Global Effort for Sustainable Data Preservation in High Energy Physics(1205.4667)

May 21, 2012 hep-ex, cs.DL
Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are mostly unique. An inter-experimental study group on HEP data preservation and long-term analysis was convened as a panel of the International Committee for Future Accelerators (ICFA). The group was formed by large collider-based experiments and investigated the technical and organisational aspects of HEP data preservation. An intermediate report was released in November 2009 addressing the general issues of data preservation in HEP. This paper includes and extends the intermediate report. It provides an analysis of the research case for data preservation and a detailed description of the various projects at experiment, laboratory and international levels. In addition, the paper provides a concrete proposal for an international organisation in charge of the data management and policies in high-energy physics.
• ### Power-Constrained Limits(1105.3166)

May 16, 2011 hep-ex, physics.data-an
We propose a method for setting limits that avoids excluding parameter values for which the sensitivity falls below a specified threshold. These "power-constrained" limits (PCL) address the issue that motivated the widely used CLs procedure, but do so in a way that makes more transparent the properties of the statistical test to which each value of the parameter is subjected. A case of particular interest is for upper limits on parameters that are proportional to the cross section of a process whose existence is not yet established. The basic idea of the power constraint can easily be applied, however, to other types of limits.
• ### The RooStats Project(1009.1003)

Feb. 1, 2011 physics.data-an
RooStats is a project to create advanced statistical tools required for the analysis of LHC data, with emphasis on discoveries, confidence intervals, and combined measurements. The idea is to provide the major statistical techniques as a set of C++ classes with coherent interfaces, so that they can be used on arbitrary models and datasets in a common way. The classes are built on top of the RooFit package, which provides functionality for easily creating probability models, for analysis combinations and for digital publications of the results. We will present in detail the design and the implementation of the different statistical methods of RooStats. We will describe the various classes for interval estimation and for hypothesis testing, which rely on different statistical techniques such as those based on the likelihood function, or on frequentist or Bayesian statistics. These methods can be applied in complex problems, including cases with multiple parameters of interest and various nuisance parameters.
• ### RECAST: Extending the Impact of Existing Analyses(1010.2506)

Oct. 12, 2010 hep-ph, hep-ex, physics.data-an
Searches for new physics by experimental collaborations represent a significant investment in time and resources. Often these searches are sensitive to a broader class of models than they were originally designed to test. We aim to extend the impact of existing searches through a technique we call 'recasting'. After considering several examples, which illustrate the issues and subtleties involved, we present RECAST, a framework designed to facilitate the usage of this technique.
• ### Natural Priors, CMSSM Fits and LHC Weather Forecasts(0705.0487)

July 5, 2007 hep-ph, hep-ex
Previous LHC forecasts for the constrained minimal supersymmetric standard model (CMSSM), based on current astrophysical and laboratory measurements, have used priors that are flat in the parameter tan beta, while being constrained to postdict the central experimental value of MZ. We construct a different, new and more natural prior with a measure in mu and B (the more fundamental MSSM parameters from which tan beta and MZ are actually derived). We find that as a consequence this choice leads to a well defined fine-tuning measure in the parameter space. We investigate the effect of this prior on global CMSSM fits to indirect constraints, providing posterior probability distributions for Large Hadron Collider (LHC) sparticle production cross sections. The change in priors has a significant effect, strongly suppressing the pseudoscalar Higgs boson dark matter annihilation region, and diminishing the probable values of sparticle masses. We also show how to interpret fit information from a Markov Chain Monte Carlo in a frequentist fashion; namely by using the profile likelihood. Bayesian and frequentist interpretations of CMSSM fits are compared and contrasted.
• ### Maximum Significance at the LHC and Higgs Decays to Muons(hep-ph/0605268)

March 30, 2007 hep-ph
We present a new way to define and compute the maximum significance achievable for signal and background processes at the LHC, using all available phase space information. As an example, we show that a light Higgs boson produced in weak-boson fusion with a subsequent decay into muons can be extracted from the backgrounds. The method, aimed at phenomenological studies, can be incorporated in parton-level event generators and accommodate parametric descriptions of detector effects for selected observables.
• ### Statistical Challenges for Searches for New Physics at the LHC(physics/0511028)

Jan. 4, 2006 physics.data-an
Because the emphasis of the LHC is on 5 sigma discoveries and the LHC environment induces high systematic errors, many of the common statistical procedures used in High Energy Physics are not adequate. I review the basic ingredients of LHC searches, the sources of systematics, and the performance of several methods. Finally, I indicate the methods that seem most promising for the LHC and areas that are in need of further study.