
Complex computer simulators are increasingly used across fields of science as
generative models tying parameters of an underlying theory to experimental
observations. Inference in this setup is often difficult, as simulators rarely
admit a tractable density or likelihood function. We introduce Adversarial
Variational Optimization (AVO), a likelihood-free inference algorithm for
fitting a non-differentiable generative model, incorporating ideas from
generative adversarial networks, variational optimization and empirical Bayes.
We adapt the training procedure of generative adversarial networks by replacing
the differentiable generative network with a domain-specific simulator. We
solve the resulting non-differentiable minimax problem by minimizing
variational upper bounds of the two adversarial objectives. Effectively, the
procedure results in learning a proposal distribution over simulator
parameters, such that the Jensen-Shannon (JS) divergence between the marginal distribution of
the synthetic data and the empirical distribution of observed data is
minimized. We evaluate and compare the method with simulators producing both
discrete and continuous data.
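
The variational-optimization ingredient can be illustrated in isolation. The
following is a minimal sketch, not the AVO algorithm itself: there is no
discriminator, and the objective `simulator_loss`, the fixed-width Gaussian
proposal, and all hyperparameters are assumptions chosen for illustration. It
minimizes the expectation of a non-differentiable objective under a proposal
q(theta | mu) using the score-function gradient estimator.

```python
import random


def simulator_loss(theta):
    # Hypothetical non-differentiable objective with a kink at theta = 3.
    return abs(theta - 3.0)


def variational_optimization(loss, mu=0.0, sigma=1.0, lr=0.05,
                             steps=500, batch=100, seed=0):
    """Minimize E_{theta ~ N(mu, sigma^2)}[loss(theta)] with respect to mu."""
    rng = random.Random(seed)
    for _ in range(steps):
        thetas = [rng.gauss(mu, sigma) for _ in range(batch)]
        losses = [loss(t) for t in thetas]
        # Batch-mean baseline: reduces the variance of the estimator
        # (at the cost of a small O(1/batch) bias).
        baseline = sum(losses) / batch
        # Score-function estimate of the gradient, using
        # d/dmu log N(theta | mu, sigma^2) = (theta - mu) / sigma^2.
        grad = sum((l - baseline) * (t - mu) / sigma ** 2
                   for l, t in zip(losses, thetas)) / batch
        mu -= lr * grad
    return mu


mu_hat = variational_optimization(simulator_loss)
print(f"proposal mean after optimization: {mu_hat:.2f}")
```

Only evaluations of `loss` are needed, never its derivative, which is why the
same idea applies when the objective involves a black-box simulator.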

Recent progress in applying machine learning for jet physics has been built
upon an analogy between calorimeters and images. In this work, we present a
novel class of recursive neural networks built instead upon an analogy between
QCD and natural languages. In the analogy, four-momenta are like words and the
clustering history of sequential recombination jet algorithms is like the
parsing of a sentence. Our approach works directly with the four-momenta of a
variable-length set of particles, and the jet-based tree structure varies on an
event-by-event basis. Our experiments highlight the flexibility of our method
for building task-specific jet embeddings and show that recursive architectures
are significantly more accurate and data efficient than previous image-based
networks. We extend the analogy from individual jets (sentences) to full events
(paragraphs), and show for the first time an event-level classifier operating
on all the stable particles produced in an LHC event.

We present powerful new analysis techniques to constrain effective field
theories at the LHC. By leveraging the structure of particle physics processes,
we extract extra information from Monte Carlo simulations, which can be used to
train neural network models that estimate the likelihood ratio. These methods
scale well to processes with many observables and theory parameters, do not
require any approximations of the parton shower or detector response, and can
be evaluated in microseconds. We show that they allow us to put significantly
stronger bounds on dimension-six operators than existing methods, demonstrating
their potential to improve the precision of the LHC legacy constraints.

We develop, discuss, and compare several inference techniques to constrain
theory parameters in collider experiments. By harnessing the latent-space
structure of particle physics processes, we extract extra information from the
simulator. This augmented data can be used to train neural networks that
precisely estimate the likelihood ratio. The new methods scale well to many
observables and high-dimensional parameter spaces, do not require any
approximations of the parton shower and detector response, and can be evaluated
in microseconds. Using weak-boson-fusion Higgs production as an example
process, we compare the performance of several techniques. The best results are
found for likelihood ratio estimators trained with extra information about the
score, the gradient of the log likelihood function with respect to the theory
parameters. The score also provides sufficient statistics that contain all the
information needed for inference in the neighborhood of the Standard Model.
These methods enable us to put significantly stronger bounds on effective
dimension-six operators than the traditional approach based on histograms. They
also outperform generic machine learning methods that do not make use of the
particle physics structure, demonstrating their potential to substantially
improve the new physics reach of the LHC legacy results.

At the heart of experimental high energy physics (HEP) is the development of
facilities and instrumentation that provide sensitivity to new phenomena. Our
understanding of nature at its most fundamental level is advanced through the
analysis and interpretation of data from sophisticated detectors in HEP
experiments. The goal of data analysis systems is to realize the maximum
possible scientific potential of the data within the constraints of computing
and human resources in the least time. To achieve this goal, future analysis
systems should empower physicists to access the data with a high level of
interactivity, reproducibility and throughput capability. As part of the HEP
Software Foundation Community White Paper process, a working group on Data
Analysis and Interpretation was formed to assess the challenges and
opportunities in HEP data analysis and develop a roadmap for activities in this
area over the next decade. In this report, the key findings and recommendations
of the Data Analysis and Interpretation Working Group are presented.

Particle physics has an ambitious and broad experimental programme for the
coming decades. This programme requires large investments in detector hardware,
either to build new facilities and experiments, or to upgrade existing ones.
Similarly, it requires commensurate investment in the R&D of software to
acquire, manage, process, and analyse the sheer amounts of data to be recorded.
In planning for the HL-LHC in particular, it is critical that all of the
collaborating stakeholders agree on the software goals and priorities, and that
the efforts complement each other. In this spirit, this white paper describes
the R&D activities required to prepare for this software upgrade.

We describe a procedure for constructing a model of a smooth data spectrum
using Gaussian processes rather than the historical parametric description.
This approach considers a fuller space of possible functions, is robust at
increasing luminosity, and allows us to incorporate our understanding of the
underlying physics. We demonstrate the application of this approach to modeling
the background to searches for dijet resonances at the Large Hadron Collider
and describe how the approach can be used in the search for generic localized
signals.

Preserving data analyses produced by the collaborations at the LHC in a
parametrized fashion is crucial in order to maintain reproducibility and
reusability. We argue for a declarative description in terms of individual
processing steps (packtivities) linked through a dynamic directed acyclic
graph (DAG) and present an initial set of JSON schemas for such a description
and an implementation (yadage) capable of executing workflows of analysis
preserved via Linux containers.
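
The idea of steps linked through a DAG can be sketched with the Python standard
library. This is only an illustration under stated assumptions: the step names
and the `run_step` placeholder are hypothetical, the graph is static rather
than dynamic, and nothing here reflects the actual packtivity or yadage schemas.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Hypothetical analysis steps; each key maps to the steps it depends on.
workflow = {
    "generate": [],
    "simulate": ["generate"],
    "reconstruct": ["simulate"],
    "select": ["reconstruct"],
    "fit": ["select"],
    "plot": ["select", "fit"],
}


def run_step(name, results):
    # Placeholder for launching one containerized processing step.
    results[name] = f"{name}:done"


def execute(graph):
    """Run every step after all of its dependencies have completed."""
    results = {}
    order = list(TopologicalSorter(graph).static_order())
    for step in order:
        missing = [dep for dep in graph[step] if dep not in results]
        if missing:
            raise RuntimeError(f"{step} scheduled before {missing}")
        run_step(step, results)
    return order, results


order, results = execute(workflow)
print(" -> ".join(order))
```

Declaring only the dependencies, and leaving the execution order to the
scheduler, is what makes such a description declarative.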

Several techniques for domain adaptation have been proposed to account for
differences in the distribution of the data used for training and testing. The
majority of this work focuses on a binary domain label. Similar problems occur
in a scientific context where there may be a continuous family of plausible
data generation processes associated with the presence of systematic
uncertainties. Robust inference is possible if it is based on a pivot, that is, a
quantity whose distribution does not depend on the unknown values of the
nuisance parameters that parametrize this family of data generation processes.
In this work, we introduce and derive theoretical results for a training
procedure based on adversarial networks for enforcing the pivotal property (or,
equivalently, fairness with respect to continuous attributes) on a predictive
model. The method includes a hyperparameter to control the trade-off between
accuracy and robustness. We demonstrate the effectiveness of this approach with
a toy example and examples from particle physics.

Information geometry can be used to understand and optimize Higgs
measurements at the LHC. The Fisher information encodes the maximum sensitivity
of observables to model parameters for a given experiment. Applied to
higher-dimensional operators, it defines the new physics reach of any LHC
signature. We calculate the Fisher information for Higgs production in weak
boson fusion with decays into tau pairs and four leptons, and for Higgs
production in association with a single top quark. As a next step, we analyze
how the differential information is distributed over phase space, which defines
optimal event selections. Conversely, we consider the information in the
distribution of a subset of the kinematic variables, showing which production
and decay observables are the most powerful and how much information is lost in
traditional histogram-based analysis methods compared to fully multivariate
ones.
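
As a toy illustration of how binning affects the Fisher information, consider
independent Poisson bins with expected counts nu_i = b_i + theta * s_i, for
which I(theta) = sum_i s_i^2 / nu_i. The sketch below assumes this simple
signal-strength model; the yields are invented for illustration and are not
taken from the analysis described above.

```python
def fisher_information(signal, background, theta=0.0):
    """Fisher information on theta for independent Poisson bins
    with expected counts nu_i = b_i + theta * s_i."""
    return sum(s * s / (b + theta * s) for s, b in zip(signal, background))


# Invented yields for a two-bin histogram.
signal = [10.0, 5.0]
background = [100.0, 25.0]

info_binned = fisher_information(signal, background)
# Merging the two bins into one discards shape information.
info_merged = fisher_information([sum(signal)], [sum(background)])
# Cramer-Rao bound: no unbiased estimator of theta has a smaller spread.
min_std = info_binned ** -0.5

print(info_binned, info_merged, min_std)
```

The merged histogram carries less information than the two-bin one, which is
the kind of loss the comparison with fully multivariate methods quantifies.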

In many fields of science, generalized likelihood ratio tests are established
tools for statistical inference. At the same time, it has become increasingly
common that a simulator (or generative model) is used to describe complex
processes that tie parameters $\theta$ of an underlying theory and measurement
apparatus to high-dimensional observations $\mathbf{x}\in \mathbb{R}^p$.
However, simulators often do not provide a way to evaluate the likelihood
function for a given observation $\mathbf{x}$, which motivates a new class of
likelihood-free inference algorithms. In this paper, we show that likelihood
ratios are invariant under a specific class of dimensionality reduction maps
$\mathbb{R}^p \mapsto \mathbb{R}$. As a direct consequence, we show that
discriminative classifiers can be used to approximate the generalized
likelihood ratio statistic when only a generative model for the data is
available. This leads to a new machine-learning-based approach to
likelihood-free inference that is complementary to Approximate Bayesian
Computation, and which does not require a prior on the model parameters.
Experimental results on artificial problems with known exact likelihoods
illustrate the potential of the proposed method.
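
The link between a classifier and the likelihood ratio can be sketched on a toy
problem with a known answer. The example below is an illustration under stated
assumptions, not the method as evaluated above: the two "simulators" are unit
Gaussians with means 0 and 1, the classifier is a hand-written logistic
regression, and no calibration step is applied. With balanced training samples,
the logit of the classifier output estimates log p(x | theta_1) / p(x | theta_0),
which here is exactly x - 1/2.

```python
import math
import random


def train_logistic(xs, ys, lr=1.0, steps=200):
    """Fit s(x) = sigmoid(w * x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            s = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (s - y) * x
            grad_b += s - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


rng = random.Random(1)
n = 2000
x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]  # samples from theta_0, label 0
x1 = [rng.gauss(1.0, 1.0) for _ in range(n)]  # samples from theta_1, label 1
w, b = train_logistic(x0 + x1, [0.0] * n + [1.0] * n)


def log_ratio_estimate(x):
    # Logit of the classifier output, i.e. log(s / (1 - s)).
    return w * x + b


def log_ratio_exact(x):
    # log N(x; 1, 1) - log N(x; 0, 1)
    return x - 0.5


for x in (-1.0, 0.5, 2.0):
    print(x, round(log_ratio_estimate(x), 2), log_ratio_exact(x))
```

Only samples from the two simulators are used; the densities themselves are
never evaluated during training.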

We investigate a new structure for machine learning classifiers applied to
problems in high-energy physics by expanding the inputs to include not only
measured features but also physics parameters. The physics parameters represent
a smoothly varying learning task, and the resulting parameterized classifier
can smoothly interpolate between them and replace sets of classifiers trained
at individual values. This simplifies the training process and gives improved
performance at intermediate values, even for complex problems requiring deep
learning. Applications include tools parameterized in terms of theoretical
model parameters, such as the mass of a particle, which allow for a single
network to provide improved discrimination across a range of masses. This
concept is simple to implement and allows for optimized interpolatable results.

We propose a novel approach for observing cosmic rays at ultra-high energy
($>10^{18}$~eV) by repurposing the existing network of smartphones as a ground
detector array. Extensive air showers generated by cosmic rays produce muons
and high-energy photons, which can be detected by the CMOS sensors of
smartphone cameras. The small size and low efficiency of each sensor are
compensated by the large number of active phones. We show that if user adoption
targets are met, such a network will have significant observing power at the
highest energies.

We develop a technique for presenting Higgs coupling measurements that decouples
the poorly defined theoretical uncertainties associated with inclusive and
exclusive cross section predictions. The technique simplifies the combination
of multiple measurements and can be used in a more general setting. We
illustrate the approach with toy LHC Higgs coupling measurements and a
collection of new physics models.

This document is a pedagogical introduction to statistics for particle
physics. Emphasis is placed on the terminology, concepts, and methods being
used at the Large Hadron Collider. The document addresses both the statistical
tests applied to a model of the data and the modeling itself.

This article offers a short guide to the steps scientists can take to ensure
that their data and associated analyses continue to be of value and to be
recognized. In just the past few years, hundreds of scholarly papers and
reports have been written on questions of data sharing, data provenance,
research reproducibility, licensing, attribution, privacy, and more, but our
goal here is not to review that literature. Instead, we present a short guide
intended for researchers who want to know why it is important to "care for and
feed" data, with some practical advice on how to do that.

We describe likelihood-based statistical tests for use in high energy physics
for the discovery of new phenomena and for construction of confidence intervals
on model parameters. We focus on the properties of the test procedures that
allow one to account for systematic uncertainties. Explicit formulae for the
asymptotic distributions of test statistics are derived using results of Wilks
and Wald. We motivate and justify the use of a representative data set, called
the "Asimov data set", which provides a simple method to obtain the median
experimental sensitivity of a search or measurement as well as fluctuations
about this expectation.
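
For the simplest case, a single-bin counting experiment with expected signal s
and a background b assumed to be known exactly, evaluating the discovery test
statistic on the Asimov data set n = s + b gives the median significance
Z = sqrt(2 ((s + b) ln(1 + s/b) - s)). The sketch below assumes this idealized
setting with no systematic uncertainty; the yields are invented for
illustration.

```python
import math


def asimov_discovery_significance(s, b):
    """Median discovery significance for a counting experiment with
    expected signal s and exactly known background b."""
    if s < 0 or b <= 0:
        raise ValueError("require s >= 0 and b > 0")
    n_asimov = s + b  # Asimov data: the observed count equals its expectation
    # Discovery test statistic q0 evaluated on the Asimov data set.
    q0 = 2.0 * (n_asimov * math.log(n_asimov / b) - s)
    return math.sqrt(max(q0, 0.0))


z_asimov = asimov_discovery_significance(10.0, 100.0)
z_naive = 10.0 / math.sqrt(100.0)  # the familiar s / sqrt(b) approximation
print(f"Asimov: {z_asimov:.3f}, s/sqrt(b): {z_naive:.3f}")
```

For s much smaller than b, the result approaches s / sqrt(b); the two differ
when s is not small compared with b.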

We present the asymptotic distribution for two-sided tests based on the
profile likelihood ratio with lower and upper boundaries on the parameter of
interest. This situation is relevant for branching ratios and the elements of
unitary matrices such as the CKM matrix.

Data from high-energy physics (HEP) experiments are collected with
significant financial and human effort and are mostly unique. An
inter-experimental study group on HEP data preservation and long-term analysis
was convened as a panel of the International Committee for Future Accelerators
(ICFA). The group was formed by large collider-based experiments and
investigated the technical and organisational aspects of HEP data preservation.
An intermediate report was released in November 2009 addressing the general
issues of data preservation in HEP. This paper includes and extends the
intermediate report. It provides an analysis of the research case for data
preservation and a detailed description of the various projects at experiment,
laboratory and international levels. In addition, the paper provides a concrete
proposal for an international organisation in charge of the data management and
policies in high-energy physics.

We propose a method for setting limits that avoids excluding parameter values
for which the sensitivity falls below a specified threshold. These
"power-constrained" limits (PCL) address the issue that motivated the widely
used CLs procedure, but do so in a way that makes more transparent the
properties of the statistical test to which each value of the parameter is
subjected. A case of particular interest is for upper limits on parameters that
are proportional to the cross section of a process whose existence is not yet
established. The basic idea of the power constraint can easily be applied,
however, to other types of limits.
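
The power constraint can be sketched for a Gaussian measurement x with mean mu
and known standard deviation sigma. The unconstrained one-sided upper limit is
x + sigma * z, with z the standard normal quantile at the chosen confidence
level, and the test of mu has power Phi(mu / sigma - z) with respect to the
mu = 0 alternative. The code below is a minimal sketch under these assumptions;
the default power threshold Phi(-1), about 0.16, and the function name are
choices made for this illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_constrained_upper_limit(x_obs, sigma, cl=0.95, min_power=None):
    """Upper limit on mu for x ~ N(mu, sigma^2), never reported below the
    value of mu at which the test's power reaches min_power."""
    if min_power is None:
        min_power = _STD_NORMAL.cdf(-1.0)  # assumed threshold, about 0.16
    z_cl = _STD_NORMAL.inv_cdf(cl)
    mu_up = x_obs + sigma * z_cl  # unconstrained upper limit
    # Smallest mu with sufficient power: Phi(mu / sigma - z_cl) = min_power.
    mu_min = sigma * (z_cl + _STD_NORMAL.inv_cdf(min_power))
    return max(mu_up, mu_min)


limit_low = power_constrained_upper_limit(-2.0, 1.0)   # downward fluctuation
limit_high = power_constrained_upper_limit(1.0, 1.0)   # constraint inactive
print(round(limit_low, 3), round(limit_high, 3))
```

A strong downward fluctuation therefore cannot push the reported limit below
the region where the experiment has sensitivity.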

RooStats is a project to create advanced statistical tools required for the
analysis of LHC data, with emphasis on discoveries, confidence intervals, and
combined measurements. The idea is to provide the major statistical techniques
as a set of C++ classes with coherent interfaces, so that they can be used on
arbitrary models and datasets in a common way. The classes are built on top of
the RooFit package, which provides functionality for easily creating
probability models, for analysis combinations and for digital publications of
the results. We will present in detail the design and the implementation of the
different statistical methods of RooStats. We will describe the various classes
for interval estimation and for hypothesis tests depending on different
statistical techniques such as those based on the likelihood function, or on
frequentist or Bayesian statistics. These methods can be applied in complex
problems, including cases with multiple parameters of interest and various
nuisance parameters.

Searches for new physics by experimental collaborations represent a
significant investment in time and resources. Often these searches are
sensitive to a broader class of models than they were originally designed to
test. We aim to extend the impact of existing searches through a technique we
call 'recasting'. After considering several examples, which illustrate the
issues and subtleties involved, we present RECAST, a framework designed to
facilitate the usage of this technique.

Previous LHC forecasts for the constrained minimal supersymmetric standard
model (CMSSM), based on current astrophysical and laboratory measurements, have
used priors that are flat in the parameter tan beta, while being constrained to
postdict the central experimental value of MZ. We construct a different, new
and more natural prior with a measure in mu and B (the more fundamental MSSM
parameters from which tan beta and MZ are actually derived). We find that as a
consequence this choice leads to a well-defined fine-tuning measure in the
parameter space. We investigate the effect of such a prior on global CMSSM fits to
indirect constraints, providing posterior probability distributions for Large
Hadron Collider (LHC) sparticle production cross sections. The change in priors
has a significant effect, strongly suppressing the pseudoscalar Higgs boson
dark matter annihilation region, and diminishing the probable values of
sparticle masses. We also show how to interpret fit information from a Markov
Chain Monte Carlo in a frequentist fashion; namely by using the profile
likelihood. Bayesian and frequentist interpretations of CMSSM fits are compared
and contrasted.

We present a new way to define and compute the maximum significance
achievable for signal and background processes at the LHC, using all available
phase space information. As an example, we show that a light Higgs boson
produced in weak-boson fusion with a subsequent decay into muons can be
extracted from the backgrounds. The method, aimed at phenomenological studies,
can be incorporated in parton-level event generators and accommodate
parametric descriptions of detector effects for selected observables.

Because the emphasis of the LHC is on 5 sigma discoveries and the LHC
environment induces high systematic errors, many of the common statistical
procedures used in High Energy Physics are not adequate. I review the basic
ingredients of LHC searches, the sources of systematics, and the performance of
several methods. Finally, I indicate the methods that seem most promising for
the LHC and areas that are in need of further study.