At the heart of experimental high energy physics (HEP) is the development of
facilities and instrumentation that provide sensitivity to new phenomena. Our
understanding of nature at its most fundamental level is advanced through the
analysis and interpretation of data from sophisticated detectors in HEP
experiments. The goal of data analysis systems is to realize the maximum
possible scientific potential of the data in the least time, within the
constraints of computing and human resources. To achieve this goal, future analysis
systems should empower physicists to access the data with a high level of
interactivity, reproducibility and throughput capability. As part of the HEP
Software Foundation Community White Paper process, a working group on Data
Analysis and Interpretation was formed to assess the challenges and
opportunities in HEP data analysis and develop a roadmap for activities in this
area over the next decade. In this report, the key findings and recommendations
of the Data Analysis and Interpretation Working Group are presented.
Particle physics has an ambitious and broad experimental programme for the
coming decades. This programme requires large investments in detector hardware,
either to build new facilities and experiments, or to upgrade existing ones.
Similarly, it requires commensurate investment in the R&D of software to
acquire, manage, process, and analyse the sheer amounts of data to be recorded.
In planning for the HL-LHC in particular, it is critical that all of the
collaborating stakeholders agree on the software goals and priorities, and that
the efforts complement each other. In this spirit, this white paper describes
the R&D activities required to prepare for this software upgrade.
Historically, high energy physics computing has been performed on large
purpose-built computing systems. These began as single-site compute facilities,
but have evolved into the distributed computing grids used today. Recently,
there has been an exponential increase in the capacity and capability of
commercial clouds. Cloud resources are highly virtualized and designed to be
deployed flexibly for a variety of computing tasks. There is growing
interest among cloud providers in demonstrating the capability to
perform large-scale scientific computing. In this paper, we discuss results
from the CMS experiment using the Fermilab HEPCloud facility, which utilized
both local Fermilab resources and virtual machines in the Amazon Web Services
Elastic Compute Cloud. We discuss the planning, technical challenges, and
lessons learned in running physics workflows on a large-scale set
of virtualized resources. In addition, we will discuss the economics and
operational efficiencies when executing workflows both in the cloud and on
Experimental Particle Physics has been at the forefront of analyzing the
world's largest datasets for decades. The HEP community was the first to develop
suitable software and computing tools for this task. In recent times, new
toolkits and systems collectively called Big Data technologies have emerged to
support the analysis of Petabyte and Exabyte datasets in industry. While the
principles of data analysis in HEP have not changed (filtering and transforming
experiment-specific data formats), these new technologies use different
approaches and promise a fresh look at the analysis of very large datasets;
they could also reduce the time-to-physics through increased interactivity. In
this talk, we present an active LHC Run 2 analysis, searching for dark matter
with the CMS detector, as a testbed for Big Data technologies. We directly
compare the traditional NTuple-based analysis with an equivalent analysis using
Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the
analysis with the official experiment data formats and produce publication
physics plots. We will discuss advantages and disadvantages of each approach
and give an outlook on further studies needed.
Computing plays an essential role in all aspects of high energy physics. As
computational technology evolves rapidly in new directions, and data throughput
and volume continue to follow a steep trend-line, it is important for the HEP
community to develop an effective response to a series of expected challenges.
In order to help shape the desired response, the HEP Forum for Computational
Excellence (HEP-FCE) initiated a roadmap planning activity with two key
overlapping drivers -- 1) software effectiveness, and 2) infrastructure and
expertise advancement. The HEP-FCE formed three working groups: 1) Applications
Software, 2) Software Libraries and Tools, and 3) Systems (including systems
software), to provide an overview of the current status of HEP computing and to
present findings and opportunities for the desired HEP computational roadmap.
The final versions of the reports are combined in this document, and are
presented along with introductory material.