HiPEAC Vision 2015
“The End of the World As We Know It”
35th ORAP Forum
Marc Duranton
April 2nd, 2015
What is HiPEAC?
—  HiPEAC is a European Network of Excellence on High Performance and Embedded Architecture and Compilation.
—  Created in 2004, HiPEAC gathers over 370 leading European academic and industrial computing system researchers from nearly 140 universities and 70 companies in one virtual centre of excellence of 1,500 researchers.
[Figure: HiPEAC scope spans HPC and Embedded Systems, Academia and Industry, Architecture (HW) and Compilation (SW).]
Coordinator: Koen De Bosschere (UGent)
HiPEAC mission:
HiPEAC encourages computing innovation in Europe by providing:
—  collaboration grants, internships, sabbaticals,
—  the semi-annual Computing Systems Week,
—  the ACACES summer school,
—  the yearly HiPEAC conference.
The 10th HiPEAC conference took place in Amsterdam, The Netherlands, January 19-21, 2015 and gathered more than 600 people.
The HiPEAC Vision
—  Editions: 2008, 2009, 2011, 2013, 2015
—  http://www.hipeac.org/vision/
•  Electronic and paper version available now
•  Paper version:
   •  Sent to the members with the newsletter
   •  Was available at the HiPEAC 2015 conference
Editors: Marc Duranton (FR-CEA), Koen De Bosschere (BE-U Gent), Albert Cohen (FR-INRIA), Jonas Maebe (BE-U Gent), Harm Munk (NL-ASTRON)
Glimpse into the HiPEAC Vision 2015
For the first time, we have noticed that the community is really starting to look for disruptive solutions, and that incrementally improving current technologies is considered inadequate to address the challenges that the computing community faces: “The End of the World As We Know It”
Structure of the HiPEAC vision 2015
[Figure: the vision is structured around Market, Society and Technology trends, a SWOT analysis of Europe and of HiPEAC, a course of actions, and recommendations.]
Moore’s law: increase in transistor density
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
The end of Dennard Scaling
Parameter (scale factor = a)    Classic scaling    Current scaling
Dimensions                      1/a                1/a
Voltage                         1/a                1
Current                         1/a                1/a
Capacitance                     1/a                >1/a
Power/Circuit                   1/a²               1/a
Power Density                   1                  a
Delay/Circuit                   1/a                ~1

Everything was easy:
•  Wait for the next technology node
•  Increase frequency
•  Decrease Vdd
⇒  Similar increase of sequential performance
⇒  No need to recompile (except if architectural improvements)

Source: Krisztián Flautner, “From niche to mainstream: can critical systems make the transition?”
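To make the contrast concrete, here is a small sketch of the two scaling regimes in the table above; the scale factor a = 1.4 per node is an assumption for illustration, not from the slide.

```python
# Sketch: classic vs. post-Dennard scaling of power density (illustrative only).
a = 1.4  # assumed linear scale factor per technology node

# Classic Dennard scaling: V and I both shrink by 1/a, circuit area by 1/a^2.
power_per_circuit_classic = (1 / a) * (1 / a)                    # P = V * I ~ 1/a^2
power_density_classic = power_per_circuit_classic / (1 / a**2)   # ~ 1, i.e. constant

# Post-Dennard: V stays roughly constant, I still ~1/a, area still shrinks by 1/a^2.
power_per_circuit_now = 1.0 * (1 / a)                            # P = V * I ~ 1/a
power_density_now = power_per_circuit_now / (1 / a**2)           # ~ a, grows every node

print(f"classic power density factor per node:      {power_density_classic:.2f}")
print(f"post-Dennard power density factor per node: {power_density_now:.2f}")
```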
Limited frequency increase ⇒ more cores
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
Limitation by power density and dissipation
GP CPU = 200 W (45 nm)
Consumer SoC = 10 W
Mobile SoC = 1 W
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
Why use several compute cores?
1.  Using several cores is also an answer to the Law of Diminishing Returns (Pollack's Rule), as illustrated in the sketch below:
◦  Effectiveness per transistor decreases when the size of a single core is increased, due to the locality of computation
◦  The cost of control and data transport inside a single larger core grows super-linearly
◦  Smaller cores are more efficient in ops/mm²/W
2.  A large area of today's microprocessors is devoted to best-effort processing and used to cope with unpredictability (branch prediction, reordering buffers, instructions, caches).
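A minimal sketch of Pollack's Rule (single-core performance roughly proportional to the square root of core area); the 64 mm² budget and the perfect parallel speed-up are assumptions for illustration.

```python
# Sketch of Pollack's Rule: single-core performance ~ sqrt(core area).
import math

budget_mm2 = 64.0  # assumed total silicon area budget

# One big core using the whole budget:
perf_big = math.sqrt(budget_mm2)                               # ~8 "performance units"

# Sixteen small cores of 4 mm^2 each, assuming the workload parallelizes perfectly:
n_small = 16
perf_small_total = n_small * math.sqrt(budget_mm2 / n_small)   # 16 * 2 = 32

print(f"one large core : {perf_big:.0f}")
print(f"{n_small} small cores : {perf_small_total:.0f} (only if the work is parallel)")
```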
Less than 20% of the area for execution units
Source: Dan Connors, “OpenCL and CUDA Programming for Multicore and GPU Architectures”, ACACES 2011
Stagnation of performance over the past few years
Source: C. Moore, « Data Processing in ExaScale-Class Computer Systems », Salishan, April 2011
Power limits the active silicon area
=> more efficient specialized units
Specialization leads to more efficiency
Type of device     Energy / operation
CPU                1690 pJ
GPU                140 pJ
Fixed function     10 pJ
Source: Bill Dally (NVIDIA), « Challenges for Future Computing Systems », HiPEAC conference 2015
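To make the gap concrete, the sketch below converts the per-operation energies above into power at a sustained rate of 1 GOPS; the rate is my assumption, not from the slide.

```python
# Sketch: per-operation energies from the slide, converted to power at 1 GOPS.
energy_pj_per_op = {"CPU": 1690, "GPU": 140, "Fixed function": 10}  # from the slide
ops_per_second = 1e9  # assumed sustained operation rate

for device, pj in energy_pj_per_op.items():
    watts = pj * 1e-12 * ops_per_second
    ratio = energy_pj_per_op["CPU"] / pj  # energy advantage over the CPU
    print(f"{device:<14}: {watts:5.2f} W at 1 GOPS ({ratio:.0f}x less energy/op than CPU)")
```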
Energy efficient technology: FDSOI
—  Fully Depleted Silicon on Insulator:
◦  Improved performance
◦  Improved performance-per-watt
◦  Adaptation to variability of loads
§  Demonstrated by CEA Tech and STMicroelectronics
§  Prototype chip from Beigné, ISSCC 2014:
   §  Ultra-Wide Voltage Range (UWVR) operation: VDD = [0.39 V – 1.3 V]
   §  High frequency: Fclk > 2.6 GHz @ 1.3 V; Fclk > 450 MHz @ 0.39 V
[Figure: UTBB-FDSOI performance gain versus conventional bulk CMOS technology. Blue: no body biasing; green: FBB = +1 V.]
Potential other optimizations
P_per_unit = C·V²·f + T_sc·V·I_peak + V·I_leak
(dynamic switching power + short-circuit power + static leakage power)
Targets: average power, peak power, power density, energy-delay, ...
Levers: circuits, architecture, compiler, OS, application
Source: P. Ranganathan, “System architectures for servers and datacenters”, ACACES 2011
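A small numeric sketch of this power equation; all parameter values below are assumed for illustration and are not from the slide.

```python
# Sketch: evaluating P = C*V^2*f + Tsc*V*Ipeak + V*Ileak with assumed values.
C     = 1e-9   # switched capacitance per cycle, 1 nF (assumed)
V     = 1.0    # supply voltage in volts (assumed)
f     = 2e9    # clock frequency, 2 GHz (assumed)
Tsc   = 0.0    # short-circuit term neglected in this toy example
Ipeak = 0.0
Ileak = 0.5    # leakage current in amps (assumed)

dynamic = C * V**2 * f            # 2.0 W
leakage = V * Ileak               # 0.5 W
total = dynamic + Tsc * V * Ipeak + leakage

# Halving V roughly quarters the dynamic term, which is why voltage scaling mattered:
dynamic_low_v = C * (V / 2)**2 * f
print(f"total {total:.2f} W, dynamic {dynamic:.2f} W -> {dynamic_low_v:.2f} W at V/2")
```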
Energy consumption of ICT
—  Estimated consumption of 410 TWh in 2020, 25% of it for servers, i.e. the output of 10 nuclear power plants
[Figure: breakdown of ICT electricity consumption across servers, televisions, PCs and displays.]
Source: European Commission DG INFSO, Impact of Information and Communication Technologies on Energy Efficiency, final report, 2008
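A back-of-the-envelope check of the "10 nuclear power plants" figure above; the plant capacity and availability below are my assumptions, not from the slide.

```python
# Sketch: sanity-checking "25% of 410 TWh = 10 nuclear power plants".
ict_twh_2020 = 410.0    # from the slide
server_share = 0.25     # from the slide
server_twh = ict_twh_2020 * server_share                 # ~102.5 TWh/year

plant_gw = 1.3          # assumed net capacity of one large reactor
capacity_factor = 0.90  # assumed availability
plant_twh = plant_gw * 8760 * capacity_factor / 1000     # ~10.2 TWh/year

print(f"servers: {server_twh:.1f} TWh/year, i.e. about {server_twh / plant_twh:.0f} plants")
```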
Cost of moving data
Source: Bill Dally, « To ExaScale and Beyond »
www.nvidia.com/content/PDF/sc_2010/theater/Dally_SC10.pdf
Using the 3rd dimension: 3D stacking
—  Example: WIDEIO memory stacked on top of an MPSoC in the same package
—  Partnership between LETI, ST-Ericsson, STMicroelectronics and Cadence
—  High bandwidth: WideIO provides more than 34 GBytes/s (currently: 17 GBytes/s)
—  Low power: 4x power efficiency compared to LPDDR2/3
—  Compatible with FD-SOI
—  FBGA package:
◦  Size 12x12 mm, ball pitch 0.4 mm, 1.2 mm thickness
[Figure: package cross-section showing the Si WIDEIO memory die stacked over the Si SoC, connected with TSVs (80 µm) and Cu pillars.]
Source: Denis Dutoit
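For intuition on the WideIO bandwidth figures: peak bandwidth is simply bus width × clock × transfers per clock. The 266 MHz SDR parameters below are my assumptions, chosen to reproduce the 17 GBytes/s figure (the table later in the deck uses 200 MHz for 12.8 GBytes/s).

```python
# Sketch: peak bandwidth of a very wide, slow memory interface.
bus_width_bits = 512         # WideIO-style wide interface
clock_hz = 266e6             # assumed I/O clock
transfers_per_clock = 1      # single data rate

bytes_per_s = bus_width_bits / 8 * clock_hz * transfers_per_clock
print(f"{bytes_per_s / 1e9:.1f} GB/s")  # ~17 GB/s; doubling clock or data rate exceeds 34 GB/s
```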
Off-chip photonics
[Figure: optical transceiver implemented off chip: a fiber ferrule coupled to a photonic IC (PIC) with micropillars, a driver/TIA IC and the main chips on a Si interposer or laminate substrate, all on the PCB. Evolution over time: step S1, off board (AOC, optical modules), then off chip (optical I/O).]
In-package photonics
[Figure: optical transceiver integrated in the package: computing cores, RAM and I/O chips mounted with Cu pillars and through-silicon vias on a photonic interposer that integrates Rx/Tx, modulator, photodiode and the laser light source, with separate paths for signal, power and thermal dissipation. Evolution: after S1 (off board: AOC, optical modules; off chip: optical I/O) comes S2, an optical network in package.]
The data transfer challenge
Source: Ahmed Jerraya CEA-Leti
q  Memory-interconnect density is becoming the bottleneck
q  Bandwidth demand will increase (“data deluge”)
Memory link, peak bandwidth and power consumption
efficiency
Cost for
1TBps
memory
bandwidth
Interface
power
consumption
Multi-core
SoC
DDR3
DRAM
8.532 GBps
30 mW/Gbps
240 W
6.4 GBps
20 mW/Gbps
160 W
12.8 GBps
4 mW/Gbps
32 W
1066 MHz I/O bus clock, 32 bits, 1.5 V, Double Data Rate
Multi-core
SoC
LPDDR3
DRAM
800 MHz I/O bus clock, 32 bits, 1.2 V, Double Data Rate
Multo-core
SoC
Wide I/O
DRAM
200 MHz I/O bus clock, 512 bits, 1.2 V, Single Data Rate
Multi-core
SoC
Photonics
DRAM
>TBps
1 mW/Gbps
Assume 200 MHz 50K pins connected to SERDES to Photonics
8W
+
MUXDEMUX?
24
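The "cost for 1 TBps" figures follow directly from the mW/Gbps column, since 1 TByte/s is 8,000 Gbit/s. A quick check (a sketch, not part of the slide):

```python
# Sketch: reproduce the "cost for 1 TBps" column from the mW/Gbps figures.
interface_mw_per_gbps = {"DDR3": 30, "LPDDR3": 20, "Wide I/O": 4, "Photonics": 1}
gbit_per_s_in_1_tbyte_per_s = 8000  # 1 TByte/s = 8000 Gbit/s

for name, mw in interface_mw_per_gbps.items():
    watts = mw * gbit_per_s_in_1_tbyte_per_s / 1000
    print(f"{name:<9}: {watts:4.0f} W to sustain 1 TByte/s")
```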
SRAM performance hardly increases
Node                  45 nm     16 nm     14 nm     10 nm
SRAM cell size (F²)   150 F²    237 F²    300 F²    450 F²   x1.6
Source: Joel Hruska, « Stop obsessing over transistor counts: It’s a terrible way of comparing chips », http://www.extremetech.com
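A rough way to read the table above (a sketch; it assumes the feature size F equals the node name, which is only approximately true at recent nodes): the tripling of the cell size in F² eats most of the ideal area scaling.

```python
# Sketch: how a growing cell size in F^2 erodes SRAM density scaling.
cell_f2 = {45: 150, 16: 237, 14: 300, 10: 450}  # bit-cell size in F^2, from the slide

area_45nm = cell_f2[45] * 45**2    # nm^2 per bit at 45 nm
area_10nm = cell_f2[10] * 10**2    # nm^2 per bit at 10 nm

actual_gain = area_45nm / area_10nm   # ~6.8x denser
ideal_gain = (45 / 10) ** 2           # ~20x if the cell had stayed at 150 F^2
print(f"actual density gain 45 nm -> 10 nm: {actual_gain:.1f}x (ideal: {ideal_gain:.1f}x)")
```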
SRAM takes more and more SoC area
Flash scaling also hits limits
•  At 20 nm, only about 70 electrons are stored per cell
•  Vt margins are no longer adequate for multilevel storage
Source: http://www.anandtech.com/show/7237/samsungs-vnand-hitting-the-reset-button-on-nand-scaling
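To see why ~70 electrons is a problem, the sketch below assumes 3 bits per cell (8 charge levels); the bits-per-cell figure is an assumption, not stated on the slide.

```python
# Sketch: why ~70 stored electrons makes multi-level flash cells fragile.
electrons_per_cell = 70       # from the slide, at ~20 nm
bits_per_cell = 3             # assumed TLC storage
levels = 2 ** bits_per_cell   # 8 distinct charge levels

electrons_between_levels = electrons_per_cell / (levels - 1)
print(f"about {electrons_between_levels:.0f} electrons separate adjacent levels")
# Losing a handful of electrons to leakage or disturb is enough to corrupt a level.
```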
The future will be non-volatile memories
—  But they are still in development, and which technology will be the winner?
Source: Coughlin Associates
Where are we?
—  We can still put more transistors per mm² for the coming few years
—  But energy is a limiting factor:
◦  New technologies (FinFET, FDSOI)
◦  3D stacking
◦  More efficient architectures, coprocessors
—  SRAM and DRAM do not scale anymore
—  Flash is running out of electrons
—  Kryder's law for hard disks is stalling as well
—  Non-volatile memories are promising:
◦  But which technology, at which density and reliability?
◦  Time is still needed to completely replace the “old” storage technologies
The cost per transistor is not decreasing anymore
And the development cost is increasing
Rock’s law: the cost of an IC plant doubles every 4 years
Reaching tens, even hundreds, of billions of dollars…
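A toy projection of that doubling rate; the 2003 starting year and the 3 B$ starting cost are assumptions of mine, not from the slide.

```python
# Sketch: compounding Rock's law from an assumed 2003 baseline.
start_cost_busd = 3.0            # assumed leading-edge fab cost in 2003, in billions of $
doublings = (2015 - 2003) / 4    # Rock's law: doubles every 4 years (from the slide)
print(f"projected 2015 fab cost: about {start_cost_busd * 2**doublings:.0f} B$")
```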
“Main drivers in computing”
High Performance Computing:
—  1946: ENIAC, vacuum tube computer, 5 KOPS, 150 kW
—  1965: General Electric GE635, 4 processors, 2 MIPS
—  2012 (yesterday): Bull B510, 10K cores, 4 TeraBytes RAM, 200 TFlops
—  2015 (today): Tianhe-2 (MilkyWay-2), Intel Xeon E5-2692 12C 2.200 GHz + Intel Xeon Phi 31S1P, 3.12M cores, 1,024 TB RAM, 50 PFlops, 17.8 MW
$$ = Fuel of innovation?
—  HPC is no longer a driver for component development
[Figure: timeline of the main computing markets: 1970 mainframes, 1990 PCs, 2000 laptops, 2010 smartphones/tablets, 2010 servers (similar to HPC, energy challenge), 2015 IoT microcontrollers. Volume markets favour reuse of components, while dedicated SoCs imply minimal reuse: how to get specific components for HPC?]
Specialization with interposer
ð Lower cost: •  Improved yield •  Reduced NRE ð Lower energy: •  Heterogeneity •  Shorter wires •  Photonic ready Performance ↗ Memory Energy efficiency ↗ 34
Many cores: a technology to reduce energy consumption and cost

Traditional chip: large IC with maximum integration
—  1 die of 6 cm² per chip, 300 mm wafer
—  5000 $ wafer cost, 89 dies of 6 cm², 20% yield ⇒ 281 $ die cost
—  95% final test yield ⇒ 296 $ IC cost*
—  2 GFLOPS/W

Multiple small dies: small dies stacked on a large interposer
—  25 chiplets of 0.9 cm² + 1 interposer of 25 cm² per chip, 300 mm wafers
—  Chiplets: 5000 $ wafer cost, 714 dies of 0.9 cm², 80% yield ⇒ 8.75 $ die cost
—  Interposer: 500 $ wafer cost, 14 dies of 25 cm², 98% yield ⇒ 36.44 $ die cost
—  ⇒ 255 $ total die cost; 90% final test yield ⇒ 284 $ 3D-IC cost*
—  More energy efficiency: 10 GFLOPS/W

Same cost: 6 cm² versus 25 cm², with more energy efficiency.
*: test and package costs are not included but considered equal for both technologies in this exercise
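The die-cost arithmetic of this slide can be reproduced from the quoted wafer costs and yields alone (a sketch; like the slide, it ignores test and package costs):

```python
# Sketch: reproduce the monolithic vs. chiplet die-cost comparison above.
def die_cost(wafer_cost, dies_per_wafer, yield_rate):
    """Cost of one good die: the wafer cost spread over the good dies."""
    return wafer_cost / (dies_per_wafer * yield_rate)

# Traditional chip: one 6 cm^2 die per IC.
mono_die = die_cost(5000, 89, 0.20)     # ~281 $
mono_ic = mono_die / 0.95               # 95% final test yield -> ~296 $

# Chiplet approach: 25 chiplets of 0.9 cm^2 plus one 25 cm^2 interposer.
chiplet = die_cost(5000, 714, 0.80)     # ~8.75 $
interposer = die_cost(500, 14, 0.98)    # ~36.44 $
stack_dies = 25 * chiplet + interposer  # ~255 $
stack_ic = stack_dies / 0.90            # 90% final test yield -> ~284 $

print(f"monolithic IC : {mono_ic:4.0f} $ for 2 GFLOPS/W")
print(f"chiplet 3D-IC : {stack_ic:4.0f} $ for 10 GFLOPS/W")
```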
See the presentation on “The Machine” from HP
Source: P. Ranganathan, “Saving the world together, one server at a time…” ACACES 2011
Software cost is rapidly increasing
Parallelism and specialization are not for free…
—  Frequency limit ⇒ parallelism
—  Energy efficiency ⇒ heterogeneity
—  Both come at the expense of ease of programming
Managing complexity…
“Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans”
Edward A. Lee, “The future of embedded software”, ARTEMIS 2006
Parallelism seems to be too complex for humans?
Time to think differently?
—  Approximate computing
—  Probabilistic CMOS
—  Neuromorphic computing
—  Declarative programming
—  Graphene
—  Spintronic
—  Quantum…
Time to think differently?
—  Adequate computing
—  Probabilistic CMOS
—  Neuromorphic computing
—  Declarative programming
—  Graphene
—  Spintronic
—  Quantum…
We are entering a transition period…
Highlights of the HiPEAC Vision 2015
For the first time, we have noticed that the community is really starting to look for disruptive solutions, and that incrementally improving current technologies is considered inadequate to address the challenges that the computing community faces: “The End of the World As We Know It”
http://www.hipeac.org/vision