Particle physics has an ambitious and broad experimental programme for the
coming decades. This programme requires large investments in detector hardware,
either to build new facilities and experiments, or to upgrade existing ones.
Similarly, it requires commensurate investment in the R&D of software to
acquire, manage, process, and analyse the sheer amount of data to be recorded.
In planning for the HL-LHC in particular, it is critical that all of the
collaborating stakeholders agree on the software goals and priorities, and that
the efforts complement each other. In this spirit, this white paper describes
the R&D activities required to prepare for this software upgrade.
The computing systems used by LHC experiments have historically consisted of
the federation of hundreds to thousands of distributed resources, ranging from
small to mid-size. In spite of the impressive scale of the existing
distributed computing solutions, the federation of small to mid-size resources
will be insufficient to meet projected future demands. This paper is a case
study of how the ATLAS experiment has embraced Titan, a DOE leadership
facility, in conjunction with traditional distributed high-throughput computing
to reach sustained production scales of approximately 51M core-hours a year.
The three main contributions of this paper are: (i) a critical evaluation of
design and operational considerations to support sustained, scalable
production usage of Titan; (ii) a preliminary characterization of a
next-generation executor for PanDA to support new workloads and advanced execution
modes; and (iii) early lessons for how current and future experimental and
observational systems can be integrated with production supercomputers and
other platforms in a general and extensible manner.
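To give a feel for the quoted production scale, the short sketch below converts a yearly core-hour total into the average number of cores kept busy around the clock. It assumes a 365-day year and uniform usage, which is an illustrative simplification; the function name `average_busy_cores` is hypothetical and not part of PanDA or any ATLAS tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def average_busy_cores(core_hours_per_year: float) -> float:
    """Average number of cores that must run continuously to deliver
    the given number of core-hours in one year (uniform usage assumed)."""
    return core_hours_per_year / HOURS_PER_YEAR


if __name__ == "__main__":
    # 51M core-hours a year, the figure quoted above.
    print(round(average_busy_cores(51e6)))  # prints 5822
```

Under these assumptions, 51M core-hours a year corresponds to roughly 5,800 cores in continuous use.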
The physics goals of the next Large Hadron Collider run include high
precision tests of the Standard Model and searches for new physics. These goals
require detailed comparison of data with computational models simulating the
expected data behavior. To highlight the role that modeling and simulation
play in future scientific discovery, we report on use cases and experience
with a unified system built to process both real and simulated data of growing
volume and variety.
Grid computing consists of the coordinated use of large sets of diverse,
geographically distributed resources for high-performance computation.
Effective monitoring of these computing resources is extremely important to
allow efficient use of the Grid. The large number of heterogeneous computing
entities available in Grids makes the task challenging. In this work, we
describe a Grid monitoring system, called GridMonitor, that captures and makes
available the most important information from a large computing facility. The
Grid monitoring system consists of four tiers: local monitoring, archiving,
publishing and harnessing. This architecture was applied to a large-scale Linux
farm and network infrastructure. It can be used by many higher-level Grid
services including scheduling services and resource brokering.
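The four tiers named above can be pictured as a small pipeline. The following is a minimal sketch, not GridMonitor's actual design or API: every class, function, and field name is hypothetical, the archive is an in-memory list, publishing is a JSON string, and "harnessing" is read here as a higher-level service (a toy scheduler) consuming the published data.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One measurement from one node (hypothetical schema)."""
    host: str
    metric: str
    value: float
    timestamp: float


class LocalMonitor:
    """Tier 1, local monitoring: collect raw samples on a node."""

    def __init__(self, host: str, probes: Dict[str, Callable[[], float]]):
        self.host = host
        self.probes = probes

    def collect(self) -> List[Sample]:
        now = time.time()
        return [Sample(self.host, name, probe(), now)
                for name, probe in self.probes.items()]


class Archive:
    """Tier 2, archiving: keep the history of samples (in memory here)."""

    def __init__(self) -> None:
        self.samples: List[Sample] = []

    def store(self, samples: List[Sample]) -> None:
        self.samples.extend(samples)

    def latest(self) -> Dict[str, Dict[str, float]]:
        # Later samples overwrite earlier ones for the same host and metric.
        view: Dict[str, Dict[str, float]] = {}
        for s in self.samples:
            view.setdefault(s.host, {})[s.metric] = s.value
        return view


def publish(archive: Archive) -> str:
    """Tier 3, publishing: expose the latest values as JSON."""
    return json.dumps(archive.latest(), sort_keys=True)


def least_loaded_host(published: str) -> str:
    """Tier 4, harnessing: a consumer such as a scheduler picks a host."""
    view = json.loads(published)
    return min(view, key=lambda host: view[host]["load"])


if __name__ == "__main__":
    archive = Archive()
    # Fixed probe values stand in for real measurements.
    for host, load in [("node01", 0.9), ("node02", 0.2)]:
        archive.store(LocalMonitor(host, {"load": lambda v=load: v}).collect())
    print(least_loaded_host(publish(archive)))  # prints node02
```

Keeping each tier behind a narrow interface means a consumer such as a resource broker depends only on the published format, not on how samples are collected or stored.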