Data Preservation at the Exa-Scale and Beyond: Challenges of the Next Decade(s)
[email protected]
APA Conference, Brussels, October 2014

The Story So Far…
• Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure for LTDP is achievable – and will hopefully be funded
• Built on standards, certified via agreed procedures, using the "Cream of DP services"
• In parallel, Business Cases and Cost Models are increasingly well understood, working closely with Projects, Communities and Funding Agencies

Topic for RDA-4 Joint DP W/S
• The high-level Use Cases we are being required to address by Funding Agencies (FAs) are:
  1. Open Access (specific samples / purposes);
  2. Reproducibility (of data, results, publications);
  3. Provision of Data Management Plan(s).
• AFAIK, these "requirements" are not specific to a given community, i.e. it is ALL disciplines funded by a given FA that must address them

Open Questions
• Long-term sustainability is still a technical issue
  – Let's assume that we understand the Business Cases & Cost Models well enough…
  – And (we) even have agreed funding for key aspects
• But can the service providers guarantee a multi-decade service?
  – Is this realistic?
  – Is this even desirable?
• I will address these issues at the APA conference next month in Brussels – with a proposal for "a solution"

Background
• 20 years ago – in 1994 – the first Computing R&D projects for the LHC were proposed
  – About 10 years before the expected start-up date
• History shows that these projects didn't start too early – even including the LHC start-up delays
• We now foresee "next generation" data factories in the 2020s and beyond
• These will generate Exabytes (e.g. HL-LHC) to Zettabytes (e.g. FCC, SKA) of data and last decades

Technology(?)
• Of course, in 1-2 decades we can expect huge advances in technology
• At least some of these changes are likely to be disruptive – just look back!
• But you cannot plan based on the unknown
• Eventually, you have to make decisions based either on what exists, or on what you can be confident will be delivered, on the needed timescale
Major changes in technology during the active life of current / future projects are likely.

H2020 EINFRA-1-2014: Managing, preserving and computing with big research data
7) Proof of concept and prototypes of data-infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets, scaling to zettabytes and trillions of objects. Clean-slate approaches to data management targeting the 2020+ "data factory" requirements of research communities and large-scale facilities (e.g. ESFRI projects) are encouraged.

Lunatic Fringe
• But this is clearly the lunatic fringe. What exactly does it have to do with me?
• Quite a lot: implications for efforts such as
  – CTRUST / RDA Certification Interest Group
    • Can current certification procedures "scale" to such massive data volumes?
    • Can multi-site requirements be addressed?
  – RDA Active Data Management Plans
  – 4C: costs of exa- / zetta-scale curation must clearly be well understood and justified
  – RDA Reproducibility Interest Group (and many others)
  – DPINFRA: next-generation requirements
  – [ Preservation VRE: some aspects ~independent of total data volume, some not ]
  – APA CoE
  – …

Significant economies of scale in "shared bit repositories"
Suppose these guys can build / share the most cost-effective, scalable and reliable federated storage services, e.g. for peta- / exa- / zetta-scale bit preservation? Can we ignore them?
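To illustrate why "shared bit repositories" can bring significant economies of scale, here is a toy fixed-plus-marginal cost model. All parameter values are hypothetical placeholders, not CERN, HEPiX or 4C figures; the only point is that the fixed per-site cost dominates until an archive gets large, so consolidation drives the per-PB cost down.

    # Toy model: annual cost per PB when a total volume is spread over n archive sites.
    # All parameter values are hypothetical, for illustration only.

    FIXED_COST_PER_SITE = 500_000     # staff, machine room, tooling ($/site/year, assumed)
    MARGINAL_COST_PER_PB = 10_000     # media, power, maintenance ($/PB/year, assumed)

    def cost_per_pb(total_pb: float, n_sites: int) -> float:
        """Annual cost per petabyte when total_pb is spread over n_sites archives."""
        total_cost = n_sites * FIXED_COST_PER_SITE + total_pb * MARGINAL_COST_PER_PB
        return total_cost / total_pb

    # 100 PB spread over 20 small archives vs. consolidated into 2 shared repositories:
    print(cost_per_pb(100, 20))   # ~110,000 $/PB/year
    print(cost_per_pb(100, 2))    # ~ 20,000 $/PB/year

With these made-up parameters, consolidation changes the per-PB cost by a factor of several; the 4C and HEPiX cost work cited later in this talk is what would put actual numbers behind this.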
Next Generation Data Factories
• HL-LHC (https://indico.cern.ch/category/4863/)
  – "Europe's top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors with a view to collecting ten times more data than in the initial design, by around 2030" (European Strategy for Particle Physics)
• SKA
  – The Square Kilometre Array (SKA) project is an international effort to build the world's largest radio telescope, with a square kilometre (one million square metres) of collecting area
• Typified by SCALE in several dimensions:
  – Cost; longevity; data rates & volumes
  – Last decades; cost O(EUR 10^9); EB / ZB data volumes

http://science.energy.gov/fundingopportunities/digital-data-management/
• "The focus of this statement is sharing and preservation of digital research data"
• All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements:
  1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.

LHC experiments are increasingly talking about:
  1. Open Access for Outreach;
  2. Reproducibility of Results.
These are becoming mandatory activities, fully supported at all levels of the Collaborations.

Computing at the HL-LHC (~2025+)
Predrag Buncic, on behalf of the Trigger/DAQ/Offline/Computing Preparatory Group (ECFA Workshop, Aix-les-Bains, 3 October 2013)
ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic; ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis; CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer; LHCb: Renaud Legac, Niko Neufeld

ATLAS & CMS @ HL-LHC
[Diagram: the ATLAS and CMS trigger/DAQ chains – Level 1 and HLT (AKA "filters") feeding Storage. The Level-1/HLT rates are humongous but not relevant for LTDP; what reaches storage is roughly 5-10 kHz at 2 MB/event (10-20 GB/s) for one experiment and 10 kHz at 4 MB/event (40 GB/s peak output) for the other.]

Data: Outlook for HL-LHC
[Chart: very rough projection of new RAW data per year (PB scale, up to ~450 PB) for ALICE, ATLAS, CMS and LHCb across Run 1 to Run 4; "We are here!" marks the present.]
• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volumes scaled by the output rates
• To be added: derived data (ESD, AOD), simulation, user data…
At least 0.5 EB / year (x 10 years of data taking)

Data storage issues
• Our data problems may still look small on the scale of the storage needs of the internet giants
  – Business e-mail, video, music, smartphones and digital cameras generate more and more need for storage
• The cost of storage will probably continue to go down, but…
  – Commodity high-capacity disks may start to look more like tapes: optimized for multimedia storage and sequential access
  – They will need to be combined with flash-memory disks for fast random access
  – The residual cost of disk servers will remain
• While we might be able to write all this data, how long will it take to read it back? Need for sophisticated parallel I/O and processing (a back-of-the-envelope sketch follows below)
+ We have to store this amount of data every year, and for many years to come (Long-Term Data Preservation)
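To make the read-back concern concrete, here is a back-of-the-envelope sketch (my own illustration, not part of the original slides). It uses the ~0.5 EB/year figure above; the 1 GB/s per-stream read rate is a hypothetical round number, not a measured tape or disk figure.

    # Back-of-the-envelope: how long to re-read one year of HL-LHC RAW data?
    # Assumptions (illustrative): 0.5 EB of new data per year (from the outlook above)
    # and a hypothetical sustained read rate of 1 GB/s per stream.

    DATA_PER_YEAR_BYTES = 0.5e18       # 0.5 EB
    STREAM_RATE_B_PER_S = 1e9          # 1 GB/s per stream (assumed)

    seconds_single_stream = DATA_PER_YEAR_BYTES / STREAM_RATE_B_PER_S
    years_single_stream = seconds_single_stream / (3600 * 24 * 365)
    print(f"Single stream: {years_single_stream:.1f} years")        # ~15.9 years

    # To re-read a year's worth of data in about one month,
    # hundreds of parallel streams are needed:
    one_month_seconds = 30 * 24 * 3600
    streams_needed = seconds_single_stream / one_month_seconds
    print(f"Streams for a one-month re-read: {streams_needed:.0f}")  # ~190

Even with these generous per-stream assumptions, re-reading a single year of RAW data serially takes over a decade, which is why parallel I/O and processing are unavoidable at this scale.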
WLCG Collaboration Today
• Distributed infrastructure of ~150 computing centres in 40 countries
• 300k+ CPU cores (~2M HEP-SPEC-06)
• The biggest site has ~50k CPU cores; 12 T2 sites have 2-30k CPU cores each
• Distributed data, services and operations infrastructure

WLCG Collaboration Tomorrow
• How will this evolve to meet HL-LHC needs?
• To what extent is it applicable to other projects of comparable scale?
• It is already evolving – most significantly during the Long Shutdowns, but also during data taking!

Today's state of the art in 0.1 EB-scale bit preservation (or "exabit" scale)

Bit-preservation WG one-slider
• Mandate summary (see w3.hepix.org/bit-preservation)
  – Collecting and sharing knowledge on bit preservation across HEP (and beyond)
  – Providing technical advice
  – Recommendations for sustainable archival storage in HEP
• Survey of large HEP archive sites carried out and presented at the last HEPiX
  – 19 sites; areas such as archive lifetime, reliability, access, verification, migration
  – HEP archiving has become a reality by fact rather than by design
  – Overall positive, but a lack of SLAs, metrics, best practices and long-term costing impacts

Ongoing Work
Two work areas:
1. Preparing a set of best-practice recommendations for bit-level preservation within HEP
  – ~10 recommendations
  – Concentrating more on "what" to do rather than "how" to do it
  – Will be circulated to WG participants and surveyed sites over the summer
  – Feedback will be most appreciated
2. Defining a simple and customisable model to help establish the long-term cost of bit-level preservation
  – Useful for site planning / outlook
  – Input for DPHEP – a significant fraction of the overall Data Preservation cost!
  – The rest of this presentation

Verification & reliability
• Systematic verification of archive data is ongoing
  – All "historic" data verified between 2010-2013
  – All new and repacked data being verified as well
  – ~35 PB verified in 2014
• Data reliability has improved significantly over the last 5 years
  – From annual bit loss rates of O(10^-12) (2009) to O(10^-16) (2012) – read as a per-bit annual rate, the latter corresponds to only of order 100 flipped bits per year in a ~100 PB archive
  – New drive generations + less strain (HSM mounts, TM "hitchback") + verification
  – Differences between vendors are getting small
• Still, room for improvement
  – Vendor-quoted bit error rates are O(10^-19..-20), but these refer only to media failures
  – Errors (e.g. bit flips) appear in the complete chain, not just on the media
  – "Cold" archive: users only accessed ~20% of the data (2013); no losses

"LHC Cost Model" (simplified)
Start with 10 PB, then add 50 PB/year, with the yearly addition growing by 50% every 3 years (or ~15% / year).
[Chart: archive growth under this model, reaching the 1 EB and then the 10 EB scale.]

Case B) increasing archive growth
Total cost: ~$59.9M (~$2M / year)
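A minimal sketch of the simplified growth rule above (my own reading of it, not code from the original slides): the archive starts at 10 PB, the yearly addition starts at 50 PB and grows by 50% every three years. No cost figures are attempted here, since the ~$59.9M total depends on media and operations assumptions not shown on the slide.

    # Archive growth under the simplified "LHC Cost Model":
    # start at 10 PB, add 50 PB in year 1, and grow the yearly addition
    # by 50% every 3 years (roughly +15% per year).

    def archive_size_pb(years: int) -> float:
        size = 10.0        # starting archive size (PB)
        increment = 50.0   # yearly addition (PB)
        for year in range(1, years + 1):
            size += increment
            if year % 3 == 0:
                increment *= 1.5
        return size

    for y in (5, 10, 15, 20, 25, 30):
        print(f"year {y:2d}: {archive_size_pb(y) / 1000:.1f} EB")

Under this reading, the archive crosses ~1 EB after roughly a decade and ~10 EB after roughly 25 years, consistent with the 1 EB and 10 EB markers on the slide.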
From Petabytes to Exabytes
• Can the current computing and data management models scale by orders of magnitude?
• We cannot simply "scale out" in terms of the number of sites: we need much greater resilience against data loss / corruption, including (semi-)automated recovery, plus support for adding / removing sites
• Today, this is often done by the experiments: how will it work after data taking stops? How will we cope when (not if) sites no longer "support" a given experiment?

History shows that we will need many years of R&D to reach a new scale. Not all paths will be successful, but we cannot postpone starting, as the whole process – including the necessary service hardening – will take many years (a decade or more?).

Future Circular Collider (FCC)
[Map: schematic of an 80-100 km long circular tunnel in the Geneva basin, spanning the Cantons of Geneva and Vaud and the neighbouring Ain and Haute-Savoie departments, alongside the existing LHC ring. © Copyright CERN 2014]

Science case: convince me that this project is scientifically excellent.
Project plan: convince me that you know what you are doing – scope, costs and schedule are under control.
"Business case": convince me that this is a good use of public money.

What did the Tevatron@Fermilab cost?
• Tevatron accelerator – $120M (1983) = $277M (2012 $)
• Main Injector project – $290M (1994) = $450M (2012 $)
• Detectors and upgrades – guess: 2 x $500M (collider detectors) + $300M (fixed target)
• Operations – say 20 years at $100M/year = $2 billion
• Total cost = $4 billion

PhD Student Training
• Value of a PhD student – $2.2M (US Census Bureau, 2002) = $2.8M (2012 $)
• Number of students trained at the Tevatron
  – 904 (CDF + DØ)
  – 492 (Fixed Target)
  – 18 (smaller collider experiments)
  – 1414 total
• Financial impact = $3.96 billion

Superconducting Magnets
• Current value of the SC magnet industry – $1.5 billion p.a.
• Value of the MRI industry (the major customer for SC magnets) – $5 billion p.a.
• This industry would probably have succeeded anyway – what we can realistically claim is that the large-scale investment in this technology at the Tevatron significantly accelerated its development
  – Guess: one to two years faster than otherwise?
• Financial impact = $5-10 billion

Balance sheet
• 20-year investment in the Tevatron: ~ $4B
• Students: $4B
• Magnets and MRI: $5-10B
• Computing: $40B
  } ~ $50B total
A very rough calculation – but it confirms our gut feeling that investment in fundamental science pays off. I think there is an opportunity for someone to repeat this exercise more rigorously – cf. the STFC study of SRS impact, http://www.stfc.ac.uk/2428.aspx

We have a good song to sing in terms of the scientific, economic and cultural benefits of these next-generation data factories. Data sharing, Reproducibility and Measurable Data Management Plans are going to be key.

Certification
• Next-generation data factories will bring new requirements in terms of certification
• Multi-site certification can be expected to be core
• Today's "best practices" will need to be extended – possibly rethought – for this new scale
• Room for collaboration with peta- / exa-scale practitioners, e.g. those from the HEPiX WG + RDA IG???
Push key storage sites to pursue Certification in a coordinated fashion.

Data Management Plans
• Often these are "static" – revised at best every few years (and hence typically out of date with respect to reality) – e.g. the WLCG Technical Design Report
• Can we switch to a "dashboard mode", whereby the current reality can be viewed, at the appropriate level of detail, through a portal?
• This is something that could "come naturally", combining existing displays from data scrubbing, migration, caching and replication with Reproducibility & Outreach views: tabs for Experts, FAs & the General Public (an illustrative mock-up follows below)
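To illustrate the "dashboard mode" idea, here is a purely hypothetical sketch of the kind of record such a portal might aggregate per experiment. All field names and numbers are invented for illustration; this is not an existing WLCG, DPHEP or RDA interface.

    # Illustrative only: one entry of a hypothetical "live DMP" dashboard,
    # combining bit-preservation, reproducibility and outreach views.

    dmp_dashboard_entry = {
        "experiment": "EXPERIMENT-X",            # placeholder name
        "bit_preservation": {
            "archived_pb": 120,                  # hypothetical archive volume
            "verified_fraction": 0.97,           # from data-scrubbing displays
            "replicas_per_dataset": 2,           # from replication displays
            "last_media_migration": "2014-06",   # from migration displays
        },
        "reproducibility": {
            "analyses_with_captured_environment": 0.4,   # hypothetical fraction
        },
        "outreach": {
            "open_access_releases": 12,          # hypothetical count
        },
        "views": ["Experts", "Funding Agencies", "General Public"],  # same data, three levels of detail
    }

The point is not the specific fields but that such a view could be generated continuously from operational monitoring, rather than rewritten every few years as a static document.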
We're moving towards capturing the analysis environment so that Reproducibility is part of the Approval Process for Publication!

CERN aims for 100% Gold Open Access for all its original HEP results, experimental and theoretical, by the end of 2016.

Costs of Curation
• Given the scale, duration and expected costs of future-generation data factories, a clear understanding of the costs and benefits of curation must be built in.
• The costs of "bit preservation" can clearly be reduced through economies of scale, but then not much further.
  – Is there any other way than "state of the art"?
  – Around $1M / year / EB in 2040+ !!!
The real issues relate to manpower-intensive areas, such as knowledge capture and the ability to fully reuse the data in the long term.

Reproducibility
• It is exciting to see such key issues being addressed by "grass-roots" initiatives, such as the recent RDA BoF in this area, with many experts involved!
  – Leading, hopefully, to an Interest Group and concrete outcomes
  – Maybe a "specific" call once mature?
• We have much to learn by sharing expertise and not repeatedly re-inventing wheels…

http://science.energy.gov/fundingopportunities/digital-data-management/
• "The focus of this statement is sharing and preservation of digital research data"
• All proposals submitted to the Office of Science (after 1 October 2014) must include a Data Management Plan (DMP) describing whether and how data will be shared and preserved, and how this will enable validation of results (full requirement text earlier in this talk).

Surely we can address these generic (scientific) requirements together, using at least some common services: SCIDIP-ES outputs, CernVM[FS], Zenodo / Invenio, … A joint VRE (R&D) proposal?

2020 Vision for LT DP in HEP
• Long-term – e.g. FCC timescales: disruptive change
• By 2020, all archived data – e.g. that described in the DPHEP Blueprint, including LHC data – easily findable and fully usable by the designated communities, with clear (Open) access policies and possibilities to annotate further
• Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards
• A DPHEP portal, through which data / tools are accessed
"HEP FAIRport": Findable, Accessible, Interoperable, Re-usable
Agree clear targets & metrics with the Funding Agencies.

Summary
• Next-generation data factories will bring with them many challenges for computing, networking and storage
• Data Preservation – and data management in general – will be key to their success and must be an integral part of the projects: not an afterthought
• We need to start a range of R&D activities now: these can bring tangible benefits to existing projects, in addition to preparing us for the future