HiPEAC Vision 2015: "The End of the World As We Know It"
35th ORAP Forum, Marc Duranton, April 2nd, 2015

What is HiPEAC?
HiPEAC is a European Network of Excellence on High Performance and Embedded Architecture and Compilation. Created in 2004, HiPEAC gathers over 370 leading European academic and industrial computing systems researchers from nearly 140 universities and 70 companies in one virtual centre of excellence of 1500 researchers.
Scope: from HPC to embedded systems, from academia to industry, from architecture (HW) to compilation (SW).
Coordinator: Koen De Bosschere (UGent)
2

HiPEAC mission
HiPEAC encourages computing innovation in Europe by providing collaboration grants, internships, sabbaticals, the semi-annual Computing Systems Week, the ACACES summer school, and the yearly HiPEAC conference.
The 10th HiPEAC conference took place in Amsterdam, The Netherlands, January 19-21, 2015 and gathered more than 600 people.
3

The HiPEAC Vision: 2008, 2009, 2011, 2013, 2015
http://www.hipeac.org/vision/
Electronic and paper versions available now. Paper version: sent to the members with the newsletter; was available at the HiPEAC 2015 conference.
Editors: Marc Duranton (FR-CEA), Koen De Bosschere (BE-U Gent), Albert Cohen (FR-INRIA), Jonas Maebe (BE-U Gent), Harm Munk (NL-ASTRON)
4

Glimpse into the HiPEAC Vision 2015
For the first time, we have noticed that the community is really starting to look for disruptive solutions, and that incrementally improving current technologies is considered inadequate to address the challenges that the computing community faces: "The End of the World As We Know It".
5

Structure of the HiPEAC Vision 2015
Market, Society and Technology feed a SWOT analysis for Europe and for HiPEAC, leading to recommendations and a course of actions.
6

Structure of the HiPEAC Vision 2015
(Same structure diagram: Market, Society, Technology; SWOT; Europe, HiPEAC; recommendations and course of actions.)
7

Moore's law: increase in transistor density
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
8

The end of Dennard scaling
Parameter (scale factor = a) | Classic scaling | Current scaling
Dimensions                   | 1/a             | 1/a
Voltage                      | 1/a             | 1
Current                      | 1/a             | 1/a
Capacitance                  | 1/a             | >1/a
Power/Circuit                | 1/a^2           | 1/a
Power Density                | 1               | a
Delay/Circuit                | 1/a             | ~1
Everything was easy: wait for the next technology node, increase the frequency, decrease Vdd. This gave a similar increase in sequential performance, with no need to recompile (except for architectural improvements).
Source: Krisztián Flautner, "From niche to mainstream: can critical systems make the transition?"
9

Limited frequency increase ⇒ more cores
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
10

Limitation by power density and dissipation
GP CPU = 200 W (45 nm), consumer SoC = 10 W, mobile SoC = 1 W
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
11

Why use several compute cores?
1. Using several cores is also an answer to the law of diminishing returns [Pollack's Rule]:
   - Effectiveness per transistor decreases as the size of a single core increases, due to the locality of computation.
   - The cost of control and of data transport across a single larger core grows super-linearly.
   - Smaller cores are more efficient in ops/mm²/W.
2. Large areas of today's microprocessors are devoted to best-effort processing, used to cope with unpredictability (branch prediction, reordering buffers, instruction caches).
12
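As a back-of-the-envelope illustration of Pollack's Rule quoted on the slide above (single-core performance grows only roughly with the square root of the transistors spent on it), the minimal Python sketch below compares one large core against the same transistor budget split into many small cores. The rule is from the slide; the specific transistor budget, core counts and the fully parallel workload are illustrative assumptions.

```python
import math

def pollack_perf(transistors):
    """Pollack's Rule: single-core performance ~ sqrt(transistor count).
    Units are arbitrary; only ratios are meaningful."""
    return math.sqrt(transistors)

BUDGET = 1_000_000_000  # illustrative transistor budget (assumption)

one_big_core = pollack_perf(BUDGET)

for n_cores in (4, 16, 64):
    # Same budget split into n identical smaller cores; aggregate
    # throughput assumes an embarrassingly parallel workload.
    small = pollack_perf(BUDGET / n_cores)
    aggregate = n_cores * small
    print(f"{n_cores:3d} cores: {aggregate / one_big_core:.1f}x "
          f"the throughput of one big core (ideal parallelism)")
```

Under these assumptions the aggregate throughput grows as the square root of the core count; real gains are limited by the parallel fraction of the workload and by data movement, which the following slides address.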
Less than 20% of the area for execution units
Source: Dan Connors, "OpenCL and CUDA Programming for Multicore and GPU Architectures", ACACES 2011
13

Stagnation of performance over the last few years
Source: C. Moore, "Data Processing in ExaScale-Class Computer Systems", Salishan, April 2011
14

Power limits the active silicon area => more efficient specialized units
15

Specialization leads to more efficiency
Type of device | Energy / operation
CPU            | 1690 pJ
GPU            | 140 pJ
Fixed function | 10 pJ
Source: Bill Dally (NVIDIA), "Challenges for Future Computing Systems", HiPEAC conference 2015
16

Energy-efficient technology: FDSOI (Fully Depleted Silicon On Insulator)
UTBB-FDSOI performance gain versus conventional bulk CMOS technology (blue: no body biasing, green: FBB = +1 V).
ISSCC 2014: Ultra-Wide Voltage Range (UWVR) operation, VDD from 0.39 V to 1.3 V; high frequency: Fclk > 2.6 GHz @ 1.3 V, Fclk > 450 MHz @ 0.39 V.
Improved performance, improved performance-per-watt, adaptation to variability of loads.
Prototype chip from Beigné, ISSCC '14; demonstrated by CEA Tech and STMicroelectronics.
17

Potential other optimizations
P_unit = C·V²·f + T_sc·V·I_peak + V·I_leak
Average power, peak power, power density, energy-delay, ... can be optimized at the circuit, architecture, and compiler/OS/application levels.
Source: P. Ranganathan, "System architectures for servers and datacenters", ACACES 2011
18

Energy consumption of ICT
Estimated consumption of 410 TWh in 2020; the 25% used by servers is the equivalent of 10 nuclear power plants. (Breakdown shown for servers, televisions, PCs and displays.)
Source: European Commission DG INFSO, Impact of Information and Communication Technologies on Energy Efficiency, final report, 2008
19

Cost of moving data
Source: Bill Dally, "To ExaScale and Beyond", www.nvidia.com/content/PDF/sc_2010/theater/Dally_SC10.pdf
20

Using the 3rd dimension: 3D stacking
Example: WideIO memory stacked on top of an MPSoC in the same package (TSVs, 80 µm; Cu pillars; FBGA package: size 12x12 mm, ball pitch 0.4 mm, 1.2 mm thickness).
Partnership between LETI, ST-Ericsson, STMicroelectronics and Cadence.
High bandwidth: WideIO provides more than 34 GBytes/s (currently: 17 GBytes/s). Low power: 4x power efficiency compared to LPDDR2/3. Compatible with FD-SOI.
Source: Denis Dutoit
21

Off-chip photonics
[Figure: an optical transceiver (fiber, ferrule, micropillars, photonic IC, driver/TIA IC) mounted with compute chips on a silicon interposer or laminate substrate on a PCB; roadmap over time from off-board optics (S1: AOC, optical modules) to off-chip optical I/O.]
22

In-package photonics
[Figure: computing cores and RAM connected through Cu pillars, proximity lines and through-silicon vias to Tx/Rx modulators and photodiodes on a photonic interposer, with an external laser light source, power delivery and thermal dissipation paths; S2: optical network in package.]
23
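A back-of-the-envelope reading of the specialization figures quoted earlier in this section (CPU about 1690 pJ, GPU about 140 pJ, fixed function about 10 pJ per operation) is sketched below in Python: it converts energy per operation into the throughput achievable under a fixed power budget. The 1 W budget is an illustrative assumption, not a number from the deck.

```python
# Energy per operation, in picojoules, from the specialization slide above.
ENERGY_PER_OP_PJ = {
    "CPU": 1690.0,
    "GPU": 140.0,
    "Fixed function": 10.0,
}

POWER_BUDGET_W = 1.0  # illustrative power cap (assumption)

for device, pj_per_op in ENERGY_PER_OP_PJ.items():
    joules_per_op = pj_per_op * 1e-12
    # Throughput under the cap: power [J/s] divided by energy per op [J/op].
    ops_per_second = POWER_BUDGET_W / joules_per_op
    print(f"{device:15s}: {ops_per_second / 1e9:7.1f} GOPS at {POWER_BUDGET_W} W")
```

Under the same 1 W cap the fixed-function block delivers roughly 170 times the operation rate of the CPU (the 1690 pJ / 10 pJ ratio), which is the quantitative case for the specialized units advocated on the surrounding slides.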
The data transfer challenge
Memory-interconnect density is becoming the bottleneck, and bandwidth demand will keep increasing ("data deluge").
Memory link | Peak bandwidth | Power efficiency | Interface power for 1 TB/s
Multi-core SoC + DDR3 DRAM      | 8.532 GB/s | 30 mW/Gbps | 240 W | 1066 MHz I/O bus clock, 32 bits, 1.5 V, double data rate
Multi-core SoC + LPDDR3 DRAM    | 6.4 GB/s   | 20 mW/Gbps | 160 W | 800 MHz I/O bus clock, 32 bits, 1.2 V, double data rate
Multi-core SoC + Wide I/O DRAM  | 12.8 GB/s  | 4 mW/Gbps  | 32 W  | 200 MHz I/O bus clock, 512 bits, 1.2 V, single data rate
Multi-core SoC + photonics DRAM | >1 TB/s    | 1 mW/Gbps  | 8 W   | assumes 200 MHz, 50K pins connected through SERDES to photonics, plus MUX/DEMUX?
Source: Ahmed Jerraya, CEA-Leti
24

Performance of SRAM hardly increases
Normalized SRAM cell area: 45 nm: 150 F²; 16 nm: 237 F² (×1.6); 14 nm: 300 F²; 10 nm: 450 F².
Source: Joel Hruska, "Stop obsessing over transistor counts: It's a terrible way of comparing chips", http://www.extremetech.com
25

SRAM takes more and more SoC area
26

Flash scaling also hits limits
At 20 nm, only about 70 electrons are stored per cell, and Vt is no longer adequate for multilevel storage.
Source: http://www.anandtech.com/show/7237/samsungs-vnand-hitting-the-reset-button-on-nand-scaling
27

The future will be non-volatile memories
But they are still in development, and which technology will be the winner?
From Coughlin Associates
28

Where are we?
We can still put more transistors per mm² for the coming few years, but energy is a limiting factor; the responses are new technologies (FinFET, FDSOI), 3D stacking, and more efficient architectures and coprocessors.
SRAM and DRAM no longer scale, Flash is running out of electrons, and the same holds for Kryder's law for hard disk drives.
Non-volatile memories are promising, but which technology, at which density and reliability? Time is still needed to completely replace the "old" storage technologies.
29

The cost per transistor is not decreasing anymore
30

And the development cost is increasing
Rock's law: the cost of an IC plant doubles every 4 years, reaching tens or hundreds of billions of dollars...
31

"Main drives in computing": High Performance Computing
Yesterday: 1946, ENIAC, vacuum-tube computer, 5 KOPS, 150 kW; 1965, General Electric GE635, 4 processors, 2 MIPS.
Today: 2012, Bull B510, 10K cores, 4 TB RAM, 200 TFlops; 2015, Tianhe-2 (MilkyWay-2), Intel Xeon E5-2692 12C 2.2 GHz and Intel Xeon Phi 31S1P, 3.12M cores, 1,024 TB RAM, 50 PFlops, 17.8 MW.
Tomorrow: "things".
32
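As a quick cross-check of the HPC timeline above, the minimal Python sketch below derives operations per joule (equivalently, operations per watt of sustained power) for ENIAC and Tianhe-2 from the performance and power figures quoted on the slide; only the derived ratios are added here.

```python
# (performance in operations/s, power in watts), taken from the slide above.
MACHINES = {
    "ENIAC (1946)":    (5e3, 150e3),     # 5 KOPS, 150 kW
    "Tianhe-2 (2015)": (50e15, 17.8e6),  # 50 PFlops, 17.8 MW
}

for name, (ops_per_s, watts) in MACHINES.items():
    ops_per_joule = ops_per_s / watts  # operations per joule = ops per watt-second
    print(f"{name:16s}: {ops_per_joule:10.3g} ops/J")
```

Tianhe-2 works out to roughly 2.8 GFLOPS/W, about eleven orders of magnitude better than ENIAC, yet its 17.8 MW power draw is still the binding constraint; this is in line with the 2 GFLOPS/W quoted for a conventional large die on the interposer cost slide further on.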
$$ = Fuel of innovation?
HPC is no longer the driver for component development.
Timeline: 1970 mainframes (similar to HPC); 1990 PCs (reuse of components); 2000 laptops; 2010 smartphones and tablets (minimum reuse, dedicated SoCs); 2010 servers (energy challenge); 2015 IoT microcontrollers.
How to get specific components for HPC?
33

Specialization with interposer
Lower cost: improved yield, reduced NRE. Lower energy: heterogeneity, shorter wires, photonic-ready. Performance up, energy efficiency up, with memory integrated on the interposer.
34

Many cores: a technology to reduce energy consumption and cost
Traditional chip: one large IC with maximum integration, a single 6 cm² die. 300 mm wafer, $5000 wafer cost, 89 dies of 6 cm², 20% yield, giving a $281 die cost; with a 95% final-test yield the IC cost is about $296, at 2 GFLOPS/W.
Interposer approach: multiple small dies stacked on a large interposer, 25 chiplets of 0.9 cm² plus one 25 cm² interposer. Chiplets: 300 mm wafer, $5000 wafer cost, 714 dies of 0.9 cm², 80% yield, giving an $8.75 die cost. Interposer: $500 wafer cost, 14 dies of 25 cm², 98% yield, giving a $36.44 die cost. Total die cost about $255; with a 90% final-test yield the 3D-IC cost is about $284, with more energy efficiency, 10 GFLOPS/W.
For roughly the same cost: 6 cm² of silicon (1 die) versus a 25 cm² assembly (25 chiplets plus 1 interposer).
Note: test and package costs are not included but are considered equal for both technologies in this exercise.
35

See the presentation on "The Machine" from HP
Source: P. Ranganathan, "Saving the world together, one server at a time...", ACACES 2011
36

Software cost is rapidly increasing
37

Parallelism and specialization are not for free...
The frequency limit pushes toward parallelism, and energy efficiency pushes toward heterogeneity; both come at the expense of ease of programming.
38

Parallelism and specialization are not for free...
(Same slide: frequency limit, parallelism; energy efficiency, heterogeneity; ease of programming.)
39

Managing complexity...
"Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans." (Edward A. Lee, "The Future of Embedded Software", ARTEMIS 2006)
Parallelism seems to be too complex for humans?
40

Time to think differently?
Approximate computing, probabilistic CMOS, neuromorphic computing, declarative programming, graphene, spintronics, quantum...
41

Time to think differently?
Approximate computing, probabilistic CMOS, neuromorphic computing, declarative programming, graphene, spintronics, quantum...
42

Time to think differently?
Adequate computing, probabilistic CMOS, neuromorphic computing, declarative programming, graphene, spintronics, quantum...
43

We are entering a transition period...
44

Highlights of the HiPEAC Vision 2015
For the first time, we have noticed that the community is really starting to look for disruptive solutions, and that incrementally improving current technologies is considered inadequate to address the challenges that the computing community faces: "The End of the World As We Know It".
45

http://www.hipeac.org/vision
46

47
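To make the die-cost arithmetic of the interposer slide above (slide 35) easy to reproduce, here is a minimal Python sketch that uses only the wafer costs, die counts and yields quoted there; the cost model itself (wafer cost spread over the good dies, test and package excluded) simply mirrors the slide's own assumptions.

```python
def die_cost(wafer_cost, dies_per_wafer, yield_fraction):
    """Cost of one good die: wafer cost spread over the good dies only."""
    return wafer_cost / (dies_per_wafer * yield_fraction)

# Monolithic 6 cm^2 die (figures from the slide).
mono = die_cost(5000, 89, 0.20)        # ~281 $
mono_ic = mono / 0.95                  # 95% final-test yield -> ~296 $

# Chiplet approach: 25 chiplets of 0.9 cm^2 on a 25 cm^2 interposer.
chiplet = die_cost(5000, 714, 0.80)    # ~8.75 $
interposer = die_cost(500, 14, 0.98)   # ~36.4 $
assembly = 25 * chiplet + interposer   # ~255 $
threed_ic = assembly / 0.90            # 90% final-test yield -> ~284 $

print(f"Monolithic IC cost : {mono_ic:6.0f} $")
print(f"Chiplet 3D-IC cost : {threed_ic:6.0f} $")
```

Within this model the two options land at almost the same cost (about $296 versus $284), while the chiplet version carries roughly four times the silicon area and, according to the slide, reaches 10 GFLOPS/W instead of 2 GFLOPS/W.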