-
Notifications
You must be signed in to change notification settings - Fork 6
/
LS1.tex
172 lines (146 loc) · 12.9 KB
/
LS1.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
\section{Developments during LS1}
\noindent{(Bloom, Dasu, Elmer, Lange)}
During the LS1, the US CMS institutions have developed new solutions for CMS computing
problems that result in more efficient use of both storage resources, using dynamic data
management and seamless wide-area network access to our storage (AAA and XROOTD), and
compute resources, using improved workflows, multi-threaded framework, GPUs and ability to exploit cloud
technologies for CMS workflows.
The original "Monarch" model of organization of CMS computing resources in a tiered structure
is now dated. While we retain Tier-0 at CERN for prompt processing, both calibration and
reconstruction, the functionality at higher tiers is changing. Especially at the Tier-2s, we are
evolving to a set of institutions providing portions of resources, focusing on local expertise and capability development, in
a continuum infrastructure of services. Here we will develop an evolving new model for optimal
use of resources to provide for CMS computing needs.
\subsection{CMS Data Federation (AAA)}
CMS Data Federation is built to provide international-scale data access infrastructure under the name ÒAny Data, Anytime, AnywhereÓ (AAA). AAA removes the requirement of co-location of storage and processing resources. The infrastructure is transparent, in that users have the same experience whether the data they analyze is halfway around the world or in the room next door. It is reliable, in that end users rarely see a failure of data access when they run their application. It enables greater access to the data, in that users no longer have the burden of purchasing and operating complex disk systems. In fact, any data can be accessed anytime from anywhere with an internet connection. The key to success of AAA is the improved wide-area network access due to enhancements made to our dedicated LHC network.
AAA is made possible by XRootd software, which allows the creation of data federations. A data federation serves a global namespace via a tree of XRootd servers. The leaves of this tree are referred to as Òdata sources,Ó as they serve data from the local storage systems. Each storage system is independent of the others, allowing for a broad range of implementations and groups to participate in the federation as long as they expose an agreed-upon namespace through the XRootd software. The non-leaf nodes have no storage, but may redirect client applications to a subscribed data source that has the requested file. Each host is subscribed to at most one redirector, called a manager; loops are disallowed. If the requested file is not present on a server subscribed to the redirector, then the client will be redirected to the current hostÕs manager. The manager continues the process until either a source is found or the client is at the root of the tree. An application may thus be redirected to any host in the federation, irrespective of the branch point it initially accesses.
The CMS Data Federation is now fully deployed across all tiers of its computing infrastructure. Easy access to
this data federation across the wide-area network is democratizing the computing abilities of University
groups across the world. Local campus grids controlled by non-CMS entities are easily integrated in
the CMS computing environment. Temporary access to dedicated large resources can be purchased
on commercial grids or obtained from national or campus research facilities.
\subsection{Dynamic Data Placement and Management}
Underpinning the computing infrastructure in Run-1 was data distribution mechanism implemented
using the PhEDEx software, developed by US CMS. The workflow involved operators moving large chunks of data on command
down the hierarchical grid so that after initial calibration and reconstruction at Tier-0, the raw data
were moved to be archived at Tier-1 and reprocessed as necessary. The (re)reconstructed data from
Tier-1s were further processed to obtain lower volume Analysis Objects Data (AOD) versions whose
copies were transferred to Tier-2s and placed on disk storage for random access for chaotic
analysis workflows. Similarly the Monte Carlo (MC) simulation data was produced at
Tier-2s, aggregated, reconstructed and archived at Tier-1s. The MC AODs were then placed on
command by the operators at various Tier-2s. The net result was multiple copies of data placed
statically at various facilities around the globe. It was observed that much of the disk volume
was occupied by often rarely used data.
Dynamic data placement and management (DDM) software was developed and commissioned during
the LS1 to address these shortcomings. DDM is now deployed at all tiers, and has replaced the manual procedures used in Run 1
with an algorithm to automatically prune the
unused but archived data using well defined policies. For example, the archived full event, i.e., raw
plus reconstructed quantities (FEVT) is pruned from disk to keep sufficient space for the Tier-0 and
Tier-1 reconstruction workflows to execute smoothly. Most importantly we are able to keep at least
one copy of all recently produced AOD on disk somewhere in the CMS data federation, and duplicate multiple times
as needed for popular data to ensure good access by users.
\subsection{Data Tiers for Analysis}
CMS data is organized in several tiers ranging from RAW data acquired from the
detector, to RECO format for reconstructed data, FEVT combining the two,
full set of analysis objects (AOD) and a newly introduced compressed AOD format, the miniAOD.
Physics analysis often involves a much smaller portion of reconstructed data than is
available. While the raw data acquired from CMS is about 1 MB per event in typical CMS and LHC operating conditions, the
reconstructed objects are typically more than double that size. The AOD defined for
Run-1 was successful but was designed in a rather lax way resulting in a 400 kB
size per event. Careful pruning of unused collections of objects, packing them
in appropriately sized containers resulted in a refined data tier, the miniAOD, which is less than
50 kB per event. The rate of event processing matters to users, so the size of event being small is
also beneficial for computational loads. The miniAOD is now envisioned as the main
data format to be used by bulk of CMS analysts, while niche (10-20\% of analysis) use cases involving
the original AOD format will be supported as needed. In rare cases FEVT access may
also be needed. As the miniAOD improves and matures we anticipate AOD replicas will be small.
CMS would like to ensure that the CMS Data Federation hosts the compact
MiniAOD with appropriate level of replication for the popular data and
a fraction of the AOD data. The estimated miniAOD size of the data federation
for 300-fb$^{-1}$ luminosity data sample is:
(50kB)(1kHz)*(10$^7$s/year)*(6y)=3*10$^9$MB=3PB. Typically
to analyze this data requires an associated MC sample which is three times
its size, leading to a 10PB disk storage requirement for the CMS data federation
for an instance of MiniAOD. Since the MiniAOD is highly processed it is
expected that improvements to the reconstruction, calibration and other
changes will necessitate remaking of MiniAODs. Unfortunately, support of
multiple versions of MiniAOD for several versions are needed
in order to accommodate the analyst needs. Not all analysts can migrate
from one version to another immediately. We estimate that two to three
MiniAOD versions are needed at any time, with the older data deprecated
dynamically as they become stale and unused. This results in expected
data volume of 20-30 PB of storage for MiniAOD.
The AOD data tier is currently used by many analysis groups. It is
anticipated there will remain significant usage for certain analysis
which requires detailed understanding of the detector. Complicated
final states with missing transverse energy, multiple low $P_T$ jets,
$\tau$ final states, etc. The AOD is 410 kB per event, and using
30\% as the fraction of data that needs to be hosted on disk in this
format we get 10PB storage.
The tape archived raw data size for 300 fb$^{-1}$ is 60PB. It is
assumed that the reconstructed FEVT objects are only available
in the intermediate stages of processing and not archived. However,
it is anticipated that a fraction of RAW data, at about 10 PB is
expected to be stored on disk for special reconstruction techniques,
e.g., those with heavy slow particles, and for cache.
\subsection{Improvements to CMS Workflows}
The main objectives of the workflows management middleware is to process data
as quickly as possible, maintain uniform load across all resource sites and enable
fast recovery in case of a site service interruption, e.g., by relocating jobs on an
alternate site, while keep track of the integrity of the combined dataset. Recent
developments in workflow management enable CMS to utilize impermanent
opportunistic resources for data production. These workflow changes are enabling
CMS to take better advantage of owned and opportunistic resources. For instance
the ability to use the high-level trigger farm for MC production and reconstruction
during the down periods of time. Recent tests indicate ability to switch to offline
processing workflows within several minutes. Improved data transfer technology
and remote Xrootd access to CMS data federation are enabling technologies.
CMS also used the LS1 period to reduce the resource needs of algorithms that make up the simulation,
digitization and reconstruction applications. Substantial technical improvements, factors of between two and four,
were made in each of these applications while maintaining the physics performance. The primary changes ame from developing
a new technique for sampling low-energy particles in the simulation, reoptimizing the tracking algorithms. In addition,
CMS worked to develop reconstruction for 25 ns operating conditions of the LHC. This brought significant new challenges, primarily
to the tracking and calorimeter detectors.
\subsection{Opportunistic Resources}
CMS is in exploratory phase for smoothly integrating opportunistic resources
for production and routine use. National research computing sites such as
NERSC and SDSC have large resources, but often requiring additional work
in adaptation of our software suite to smoothly operate there. Some access
restrictions are worked around with user level code, e.g., CVMFS through
Parrot and Docker/Shifter containers on Cray supercomputers. Commercial
clouds such as AWS have also been used, but have cost-implications placing
constraints on workflows. For example, the stage-out of data over the network
is expensive. We have adapted by chaining various stages of workflow so that
the smallest useable unit, say miniAOD is the only output that is staged out
to the CMS data federation. Non-owned campus clusters are accessible both
through their OSG connection if existing generically, or by placing suitable
head-nodes at the participating university CMS group facilities. This latter
use is of particularly important for analysis groups at access their home
resources seamlessly processing data from the central CMS data federation
using centrally supported code and conditions repositories using technologies
such as CVMFS and caching SQUIDs. The CMS HLT cloud using virtual machines
technology is able to quickly bring in very large resource during data taking
down periods for offline workflow processing. Some of the innovations made
there are useable elsewhere.
Final stages of physics analysis often involve workflows that are not
centrally managed CMSSW framework jobs. Technologies such as
CMS Connect are able to use campus grid and department level
computer clusters to bring additional opportunistic uses for these
cases.
\subsection{Improvements to CMSSW Framework (multi-threading )}
\noindent{(David - after that Liz)}
Multicore computing systems have become ubiquitous in the past decade. However, given
computing hardware evolution, the continued
efficient use of available resources, especially memory volume and access, requires
adaptation of our software to a suitable multithreaded framework. CMS has developed
a novel event-processing framework and has engaged in
a systematic effort to make the core (simulation, high-level trigger and reconstruction algorithms)
CMSSW thread-safe. CMS has successfully
deployed this framework in the past year, including Run 2 Tier-0 operations.
CMS has already achieved a large effective memory savings and latency reduction in its reconstruction for Run 2.
The potential capabilities of the CMS multithreaded framework are far beyond those of frameworks that exploit multiple processes
This framework provides CMS the basis to exploit future CPU architectures and provides a flexible and scalable
system for continued development. Further tuning of software and operational
procedures to increase the resource usage efficiency is underway both in the framework and targeted algorithm improvements for threading.
\subsection{Evolution of the computing hardware: CPU, Storage and Network}
\noindent{(Elmer)}