Data Preservation at the
Exa-Scale and Beyond
Challenges of the Next Decade(s)
[email protected]
APA Conference, Brussels, October 2014
The Story So Far…
• Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure for Long-Term Data Preservation (LTDP) is achievable – and will hopefully be funded
• Built on standards, certified via agreed
procedures, using the “Cream of DP services”
• In parallel, Business Cases and Cost Models are
increasingly understood, working closely with
Projects, Communities and Funding Agencies
Topic For RDA-4 Joint DP W/S
• The high-level Use Cases we are required to address by Funding Agencies (FAs) are:
1. Open Access (specific samples / purposes);
2. Reproducibility (of data, results, publications);
3. Provision of Data Management Plan(s).
• AFAIK, these “requirements” are not specific to a given community, i.e. ALL disciplines funded by a given FA must address them
Open Questions
• Long-term sustainability is still a technical issue
– Let’s assume that we understand the Business Cases & Cost
Models well enough…
– And (we) even have agreed funding for key aspects
• But can the service providers guarantee a multi-decade
service?
– Is this realistic?
– Is this even desirable?
• I will address these issues at the APA conference next
month in Brussels – with a proposal for “a solution”
Background
• 20 years ago – in 1994 – the first Computing R&D
projects for the LHC were proposed
– About 10 years before the expected startup date
• History shows that these projects didn’t start too
early – even including the LHC startup delays
• We now foresee “next generation” data factories
in the 2020s and beyond
• These will generate Exabytes (e.g. HL-LHC) to
Zettabytes (e.g. FCC, SKA) of data and last decades
5
Technology(?)
• Of course, in 1-2 decades we can expect huge
advances in technology
• At least some of these changes are likely to be
disruptive – just look back!
• But you cannot plan based on the unknown
• Eventually, you will have to make decisions based
either on what exists, or what you can be
confident will be delivered, on the needed
timescale
⇒ Major changes in technology are likely during the active life of current / future projects
6
H2020 EINFRA-1-2014
Managing, preserving and computing with big
research data
7) Proof of concept and prototypes of data infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets scaling to zettabytes and trillions of objects.
Clean slate approaches to data management
targeting 2020+ 'data factory' requirements of
research communities and large scale facilities
(e.g. ESFRI projects) are encouraged
7
Lunatic Fringe
• But this is clearly the lunatic fringe. What exactly does it have
to do with me?
• Quite a lot: implications for efforts such as
– CTRUST / RDA Certification Interest Group
• Can current certification procedures “scale” to such massive data volumes?
• Can multi-site requirements be addressed?
– RDA Active Data Management Plans
– 4C: Costs of Exa / Zetta scale curation must clearly be well understood and
justified
– RDA Reproducibility Interest Group (and many others)
– DPINFRA: next generation requirements
– [ Preservation VRE: some aspects ~independent of total data volume, some
not ]
– APA CoE
– …
⇒ Significant economies of scale in “shared bit repositories”
8
Suppose these guys can build / share
the most cost effective, scalable and
reliable federated storage services,
e.g. for peta- / exa- / zetta- scale
bit preservation?
Can we ignore them?
9
Next Generation Data Factories
• HL-LHC (https://indico.cern.ch/category/4863/)
– Europe’s top priority should be the exploitation of the full
potential of the LHC, including the high-luminosity upgrade of
the machine and detectors with a view to collecting ten times
more data than in the initial design, by around 2030
– (European Strategy for Particle Physics)
• SKA
– The Square Kilometre Array (SKA) project is an international
effort to build the world’s largest radio telescope, with a square
kilometre (one million square metres) of collecting area
⇒ Typified by SCALE in several dimensions:
– Cost; longevity; data rates & volumes
– Last decades; cost O(EUR 10^9); EB / ZB data volumes
10
http://science.energy.gov/fundingopportunities/digital-data-management/
• “The focus of this statement is sharing and preservation of digital research
data”
• All proposals submitted to the Office of Science (after 1 October 2014) for
research funding must include a Data Management Plan (DMP) that
addresses the following requirements:
1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.
11
LHC experiments increasingly talking about:
1. Open Access for Outreach;
2. Reproducibility of Results.
12
These are becoming
mandatory activities,
fully supported at all
levels of the
Collaborations
13
Computing at the HL-LHC (~2025+)
Predrag Buncic
on behalf of the
Trigger/DAQ/Offline/Computing
Preparatory Group
ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic; ATLAS: David Rousseau, Benedetto Gorini, Nikos
Konstantinidis; CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer ; LHCb: Renaud Legac, Niko Neufeld
(Slides: Predrag Buncic, ECFA Workshop, Aix-les-Bains, October 3, 2013)
ATLAS & CMS @ HL-LHC
[Diagram: trigger/DAQ chain for ATLAS and CMS, Level 1 → HLT (AKA “filters”) → storage. The humungous front-end data rates are not relevant for LT DP; what matters is the peak output to storage: 5-10 kHz at 2 MB/event (10-20 GB/s) and 10 kHz at 4 MB/event (40 GB/s).]
Data: Outlook for HL-LHC
[Chart: estimated RAW data volume per year (PB, axis 0-450) for ALICE, ATLAS, CMS and LHCb across Run 1 to Run 4, annotated “We are here!”.]
• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volumes scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…
⇒ At least 0.5 EB / year (× 10 years of data taking) – see the rough cross-check below
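As a rough cross-check of these figures, the minimal sketch below multiplies the peak output rates quoted on the previous slide by an assumed ~10^7 seconds of live data taking per year (a common HEP ballpark, not a number taken from these slides).

```python
# Back-of-the-envelope check of the >= 0.5 EB/year RAW estimate quoted above.
# Assumption (not on the slide): ~1e7 seconds of live data taking per year.

LIVE_SECONDS_PER_YEAR = 1e7

# (event rate in Hz, event size in bytes), as quoted on the ATLAS & CMS slide
outputs = {
    "5-10 kHz at 2 MB/event": (7.5e3, 2e6),   # mid-range of 5-10 kHz
    "10 kHz at 4 MB/event":   (10e3, 4e6),
}

total_bytes = 0.0
for label, (rate_hz, event_bytes) in outputs.items():
    throughput = rate_hz * event_bytes                 # bytes/s to storage
    per_year = throughput * LIVE_SECONDS_PER_YEAR      # bytes/year
    total_bytes += per_year
    print(f"{label}: ~{throughput/1e9:.0f} GB/s peak, ~{per_year/1e18:.2f} EB/year")

print(f"These two streams alone: ~{total_bytes/1e18:.2f} EB/year of RAW data")
```

With derived data (ESD, AOD), simulation and the remaining experiments on top, this comfortably supports the “at least 0.5 EB / year” estimate.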
Data storage issues
• Our data problems may still look small on the scale of the storage needs of the internet giants
• Business e-mail, video, music, smartphones and digital cameras generate an ever-growing need for storage
• The cost of storage will probably continue to go down, but…
– Commodity high-capacity disks may start to look more like tapes: optimized for multimedia storage and sequential access
– They will need to be combined with flash memory disks for fast random access
– The residual cost of disk servers will remain
– While we might be able to write all this data, how long will it take to read it back? Sophisticated parallel I/O and processing will be needed.
+ We have to store this amount of data every year, and for many years to come (Long-Term Data Preservation)
WLCG Collaboration Today
• Distributed infrastructure of 150 computing centers in 40 countries
• 300+ k CPU cores (~2M HEP-SPEC-06)
• The biggest site has ~50k CPU cores; 12 Tier-2 sites have 2-30k CPU cores
• Distributed data, services and operation infrastructure
WLCG Collaboration Tomorrow
• How will this evolve to HL-LHC needs?
• To what extent is it applicable to other comparable scale projects?
• Already evolving, most significantly during Long Shutdowns, but also
during data taking!
Today’s state of the art in
0.1EB scale bit preservation
(or “exabit”)
20
Bit-preservation WG one-slider
• Mandate summary (see w3.hepix.org/bit-preservation)
– Collecting and sharing knowledge on bit preservation across HEP (and beyond)
– Providing technical advice
– Recommendations for sustainable archival storage in HEP
• Survey on Large HEP archive sites carried out and
presented at last HEPiX
– 19 sites; areas such as archive lifetime, reliability, access,
verification, migration
– HEP Archiving has become a reality by fact rather than by design
– Overall positive, but lack of SLAs, metrics, best practices and long-term costing impact
21
Ongoing Work
Two work areas:
1. Preparing a set of best-practice recommendations for bit-level preservation within HEP
– ~10 recommendations
– Concentrating more on “what” to do than on “how” to do it
– To be circulated to WG participants and surveyed sites over the summer
– Feedback will be most appreciated
2. Defining a simple and customisable model to help establish the long-term cost of bit-level preservation
– Useful for site planning / outlook
– Input for DPHEP – a significant fraction of the overall Data Preservation cost!
– The subject of the rest of this presentation
22
Verification & reliability
• Systematic verification of archive data ongoing
– All “historic” data verified between 2010-2013
– All new and repacked data being verified as well
– ~35 PB verified in 2014 – no losses
– “Cold” archive: users only accessed ~20% of the data (2013)
• Data reliability significantly improved over the last 5 years
– From annual bit loss rates of O(10^-12) (2009) to O(10^-16) (2012) – illustrated below
– New drive generations + less strain (HSM mounts, TM “hitchback”) + verification
– Differences between vendors getting small
• Still, room for improvement
– Vendor-quoted bit error rates: O(10^-19..-20)
– But these only refer to media failures
– Errors (e.g. bit flips) appearing in the complete chain
23
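To give a feel for what these rates mean in practice, here is a minimal illustration; the 100 PB archive size is an assumption chosen to match the “0.1 EB scale” mentioned earlier, not a figure from this slide.

```python
# What an annual bit loss rate means for an archive at the ~0.1 EB (100 PB)
# scale discussed earlier. The archive size is an illustrative assumption.

ARCHIVE_BITS = 100e15 * 8          # 100 PB expressed in bits

for label, annual_loss_rate in [("2009 rate, O(1e-12)", 1e-12),
                                ("2012 rate, O(1e-16)", 1e-16),
                                ("vendor media quote, O(1e-19)", 1e-19)]:
    expected_lost_bits = ARCHIVE_BITS * annual_loss_rate
    print(f"{label}: ~{expected_lost_bits:.0e} bits lost per year")

# ~8e+05, ~8e+01 and ~8e-02 bits/year respectively: the 2012 figure is of
# order 100 lost bits per year across 100 PB, while the vendor figure covers
# media failures only, not errors arising elsewhere in the chain.
```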
“LHC Cost Model” (simplified)
• Start with 10 PB, then +50 PB/year, then +50% every 3 years (or roughly +15% / year) – sketched below
[Chart: projected archive volume under this growth model, with the axis spanning 1 EB to 10 EB.]
24
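A minimal sketch of this simplified growth model follows (the 30-year horizon is an illustrative choice, not a figure from the slide):

```python
# Simplified "LHC Cost Model" archive growth: start with 10 PB, add 50 PB in
# the first year, and let the yearly increment grow by 50% every 3 years
# (roughly equivalent to +15% per year compounded).

def projected_archive_pb(years: int) -> list:
    """Cumulative archive size in PB for each year of the model."""
    total, increment = 10.0, 50.0
    sizes = []
    for year in range(1, years + 1):
        total += increment
        sizes.append(total)
        if year % 3 == 0:        # every 3 years the annual growth jumps by 50%
            increment *= 1.5
    return sizes

sizes = projected_archive_pb(30)
for year in (10, 20, 30):
    print(f"year {year}: ~{sizes[year - 1] / 1000:.1f} EB")
```

Under these assumptions the archive passes 1 EB after roughly a decade and is well beyond 10 EB after three; the “Case B” cost of ~$59.9M (~$2M / year) quoted on the following slides would then correspond to roughly three decades of operation.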
Case B) increasing archive growth
25
Case B) increasing archive growth
Total cost: ~$59.9M
(~$2M / year)
26
From Petabytes to Exabytes
• Can the current computing and data management
models scale by orders of magnitude?
• We cannot simply “scale out” in terms of number of
sites and need much greater resilience against data
loss / corruption, including (semi-)automated
recovery + support for adding / removing sites
• Today, this is often done by the experiments: how will
this work after data taking stops?
How will we cope when (not if) sites no longer
“support” a given experiment?
27
History shows that we will need
many years of R&D to reach a
new scale.
Not all paths will be successful
but we cannot postpone
starting as the whole process,
including the necessary service
hardening, will take many
years. (Decade + ?)
28
Future Circular Collider (FCC)
[Map: schematic of an 80-100 km long circular tunnel in the Lake Geneva region (Cantons of Vaud and Geneva, Ain and Haute-Savoie departments), shown alongside the existing LHC ring. © Copyright CERN 2014]
Science case
Convince me that this project is scientifically excellent
Project Plan
Convince me that you know what you are doing:
scope, costs and schedule are under control
“Business case”
Convince me that this is a good use of public money
What did the Tevatron@Fermilab cost?
• Tevatron accelerator
– $120M (1983) = $277M (2012 $)
• Main Injector project
– $290M (1994) = $450M (2012 $)
• Detectors and upgrades
– Guess: 2 x $500M (collider detectors) + $300M (FT)
• Operations
– Say 20 years at $100M/year = $2 billion
• Total cost = $4 billion
PhD Student Training
• Value of a PhD student
– $2.2M (US Census Bureau, 2002) = $2.8M (2012 $)
• Number of students trained at the Tevatron
– 904 (CDF + DØ)
– 492 (Fixed Target)
– 18 (Smaller Collider experiments)
– 1414 total
• Financial Impact = $3.96 billion
Superconducting Magnets
• Current value of SC Magnet Industry
– $1.5 Billion p.a.
• Value of MRI industry (the major customer for SC magnets)
– $5 Billion p.a.
• This industry would probably have succeeded anyway –
what we can realistically claim is that the large scale
investment in this technology at the Tevatron significantly
accelerated its development
– Guess – one to two years faster than otherwise?
• Financial Impact = $5-10 billion
Balance sheet
• 20 year investment in Tevatron: ~ $4B
• Students: $4B
• Magnets and MRI: $5-10B
• Computing: $40B
} ~ $50B total (tallied in the sketch below)
Very rough calculation – but confirms our gut feeling that
investment in fundamental science pays off
I think there is an opportunity for someone to repeat this
exercise more rigorously
cf. STFC study of SRS Impact
http://www.stfc.ac.uk/2428.aspx
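A quick tally of this balance sheet, using only the figures quoted on the preceding slides:

```python
# Tally of the Tevatron "balance sheet" from the figures on the previous
# slides (2012 dollars, order-of-magnitude only).

investment_b = 4.0                         # 20-year investment in the Tevatron, $B

students = 904 + 492 + 18                  # CDF + D0, fixed target, smaller experiments
students_b = students * 2.8e-3             # $2.8M per PhD student, in $B

magnets_low_b, magnets_high_b = 5.0, 10.0  # superconducting magnets / MRI impact, $B
computing_b = 40.0                         # computing impact, $B (quoted on the slide)

total_low = students_b + magnets_low_b + computing_b
total_high = students_b + magnets_high_b + computing_b

print(f"{students} students trained, worth ~${students_b:.2f}B")
print(f"Total impact ~${total_low:.0f}-{total_high:.0f}B vs ~${investment_b:.0f}B invested")
# -> 1414 students (~$3.96B); total ~$49-54B, i.e. the ~$50B on the slide.
```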
We have a good song to
sing in terms of the
scientific, economic and
cultural benefits
of these next generation
data factories.
Data sharing,
Reproducibility and
Measurable Data
Management Plans are
going to be key.
39
Certification
• Next generation data factories will bring new requirements in
terms of certification
• Multi-site certification can be expected to be core
• Today’s “best practices” will need to be extended – possibly
rethought for this new scale
• Room for collaboration with peta- / exa-scale practitioners,
e.g. those from HEPiX WG + RDA IG ???
⇒ Push key storage sites to pursue Certification in a coordinated fashion
40
Data Management Plans
• Often these are “static” – revised at best every few
years (and hence typically out of date with reality) –
e.g. WLCG Technical Design Report
• Can we switch to a “dashboard mode”, whereby the
current reality can be viewed, with the appropriate
level of detail, through a portal?
• This is something that could “come naturally”, combining existing displays from data scrubbing, migration, caching and replication with Reproducibility & Outreach views: tabs for Experts, FAs and the General Public (sketched below)
41
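Purely as an illustration, a minimal sketch of how such a “dashboard-mode” DMP might be structured; every name, feed and field below is hypothetical and stands in for real monitoring sources (scrubbing, migration, caching, replication), not an existing portal API.

```python
# Hypothetical sketch of a "live" Data Management Plan rendered as a dashboard:
# each section is backed by a feed refreshed from operational monitoring,
# with a target audience tag instead of a static, rarely updated document.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


@dataclass
class DashboardSection:
    title: str                                     # e.g. "Bit preservation"
    audience: str                                  # "experts", "funding agencies", "general public"
    fetch_status: Callable[[], Dict[str, object]]  # pulls current numbers from monitoring
    last_updated: Optional[datetime] = None
    status: Dict[str, object] = field(default_factory=dict)

    def refresh(self) -> None:
        self.status = self.fetch_status()
        self.last_updated = datetime.now(timezone.utc)


# Placeholder feeds standing in for real scrubbing / replication monitoring.
def scrubbing_feed() -> Dict[str, object]:
    return {"data_verified_pb": 35, "losses": 0}


def replication_feed() -> Dict[str, object]:
    return {"sites": 150, "copies_per_dataset": 2}


dmp_dashboard = [
    DashboardSection("Bit preservation", "experts", scrubbing_feed),
    DashboardSection("Replication", "funding agencies", replication_feed),
]

for section in dmp_dashboard:
    section.refresh()
    print(f"{section.title} [{section.audience}]: {section.status}")
```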
We’re moving towards capturing the analysis
environment so that Reproducibility is part
of the Approval Process for Publication!
42
CERN aims for 100% Gold Open Access for all
its original HEP results, experimental and
theoretical, by end 2016.
43
Costs of Curation
• Given the scale, duration and expected costs of future
generation data factories, a clear understanding of
the costs and benefits of curation must be built in.
• The costs of “bit preservation” can clearly be reduced
through economies of scale, but then not much
further.
– Is there any other way than “state of the art”?
– Around $1M/year/EB in 2040+ !!!
⇒ The real issues relate to manpower-intensive areas, such as knowledge capture and the ability to fully reuse the data in the long term.
44
Reproducibility
• It is exciting to see such key issues being addressed by “grass-roots” initiatives, such as the recent RDA BoF in this area, with many experts involved!
– Leading hopefully to an Interest Group and concrete
outcomes
– Maybe a “specific” call once mature?
• We have much to learn by sharing expertise and
not repeatedly re-inventing wheels…
45
http://science.energy.gov/fundingopportunities/digital-data-management/
• “The focus of this statement is sharing and preservation of digital research
data”
• All proposals submitted to the Office of Science (after 1 October 2014) for
research funding must include a Data Management Plan (DMP) that
addresses the following requirements:
1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.
46
Surely we can address these generic (scientific) requirements
together, using at least some common services:
SCIDIP-ES outputs, CernVM[FS], Zenodo / Invenio, …
A joint VRE (R&D) proposal?
47
2020 Vision for LT DP in HEP
• Long-term – e.g. FCC timescales: disruptive change
– By 2020, all archived data – e.g. that described in DPHEP Blueprint,
including LHC data – easily findable, fully usable by designated
communities with clear (Open) access policies and possibilities to
annotate further
– Best practices, tools and services well run-in, fully documented and
sustainable; built in common with other disciplines, based on
standards
– DPHEP portal, through which data / tools accessed
⇒ “HEP FAIRport”: Findable, Accessible, Interoperable, Re-usable
⇒ Agree clear targets & metrics with Funding Agencies
48
Summary
• Next generation data factories will bring with
them many challenges for computing, networking
and storage
• Data Preservation – and management in general –
will be key to their success and must be an
integral part of the projects: not an afterthought
• We need to start a range of R&D activities now:
these can bring tangible benefits to existing
projects in addition to preparing us for the future
49