Slides

Architectures and precision analysis for modelling
atmospheric variables with chaotic behaviour
FCCM 2015
Peter D. D¨
uben‡
Francis P. Russell†
Xinyu Niu† Wayne Luk†
Tim N. Palmer‡
Imperial College London† & Oxford University‡
05/05/2015
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
1 / 16
Weather Modelling
Atmospheric modelling requires simulation of a chaotic system.
Chaotic systems are highly sensitive to initial conditions - the
butterfly effect.
There are many sources of uncertainly in atmospheric simulations.
Sets of multiple simulations, called emsembles are run to build a
distribution of outcomes.
Lorenz’96 is a simple model designed by Lorenz to study predictability
in chaotic systems.
We study the two-scale Lorenz’96 system which models small and
large-scale dynamics.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
2 / 16
The Lorenz’96 system (K=6, J=8)
local-scale values
Y4,6
Y5,6
Y6,6
Y7,6
Y8,6
Y1,1
Y3,6
Y2,1
Y2,6
Y3,1
Y1,6
Y4,1
Y8,5
Y5,1
Y7,5
Y6,1
X6
Y6,5
Y7,1
X1
Y5,5
Y8,1
Y4,5
Y1,2
X5
Y3,5
Y2,2
global-scale values
Y2,5
Y3,2
X2
Y1,5
Y4,2
Y8,4
Y5,2
X4
Y7,4
Y6,2
X3
Y6,4
Y7,2
Y5,4
dXk
= − Xk−1 (Xk−2 − Xk+1 )
dt
J
hc X
− Xk + F −
Yj,k
b
j=1
dYj,k
= − cbYj+1,k (Yj+2,k
dt
− Yj−1,k )
hc
− cYj,k +
Xk
b
Y8,2
Y4,4
Y1,3
Y3,4
Y2,3
Y2,4
Y3,3
Y1,4
Y8,3
Francis P. Russell
Y7,3
Y6,3
Y5,3
Y4,3
Typically, the largest J of interest
≈ 128. We increase the value of K
when increasing system size (up to
≈ 800,000).
Chaotic Systems in Hardware
05/05/2015
3 / 16
Contributions
A hardware design of the two-scale Lorenz’96 system, useful for
studying the effects of multi-scale interactions in weather modelling.
Demonstration of how trade-offs can be made between precision and
throughput for the hardware implementation of a system with chaotic
behaviour.
Present an analysis of the precision reduction impact on a hardware
implementation of Lorenz’96 using metrics appropriate for chaotic
systems.
Show the effects on power and performance, and compare to an
optimised CPU implementation.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
4 / 16
Hardware Design
Off-chip
Memory
Pipelined Runge-Kutta
timestepping.
Different precisions for global
and local-scale quantities.
Strip
Padding
Deinterleave
R-K
Step 1
R-K
Step 2
Local-scale values processed
with customisable vector width.
Periodic boundary conditions
are problematic for a streaming
design.
Periodic properties of the
system are used to enable
in-place updates, “rotating”
the system each update.
Francis P. Russell
R-K
Step 4
R-K
Step 3
Off-chip
Memory
Chaotic Systems in Hardware
Pad
Interleave
05/05/2015
5 / 16
How much precision is enough?
Computational scientists typically only have to choose between single
and double precision.
Choosing double precision is the simplest way to avoid thinking about
the way individual degrees of freedom have been discretised.
We can usually compare a reduction precision implementation against
an analytical solution for simple test cases and a high-precision
implementation for more complex ones.
Neither of these approaches is directly applicable to chaotic systems due to
the Butterfly effect.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
6 / 16
Error metrics
We cannot compare long runs of Lorenz’96 at different precisions
using standard error metrics.
Even when starting from the same initial conditions, we expect the
chaotic nature to magnify effects of different number representations,
order of operations and other implementation artefacts.
We use the Hellinger distance to compare the probability density
functions of the global and local-scale values of the simulation.
For two probability density functions p and q, the Hellinger distance is
defined as:
s Z
2
p
1 p
H(p, q) =
p(x) − q(x) dx
2
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
7 / 16
Estimating the error
Using an arbitrary precision CPU-implementation we vary the precision of
local and global-scale associated values. Distances are compared against
those of a double-precision implementation.
20
15
10
5
5
10
15
20
global-scale mantissa (bits)
0.945
0.840
0.735
0.630
0.525
0.420
0.315
0.210
0.105
0.000
Local-scale (Y) distances
local-scale mantissa (bits)
local-scale mantissa (bits)
Global-scale (X) distances
0.54
0.48
0.42
0.36
0.30
0.24
0.18
0.12
0.06
0.00
20
15
10
5
5
10
15
20
global-scale mantissa (bits)
Precision changes in variables at one scale do not typically affect those at
the other, so we could optimise them independently.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
8 / 16
Estimating appropriate mantissa sizes
Using the CPU-based reduced-precision analysis, we can calculate
minimum mantissa widths for local and global-scale values, depending on
the amount of error we are willing to accept.
Run
Changed initial conditions
c × 0.99, F × 0.99
c × 1.01, F × 1.01
c × 0.9, F × 0.9
c × 1.1, F × 1.1
H
(global)
H
(local)
2.788e-3
4.605e-3
5.042e-3
4.047e-2
3.620e-2
1.939e-4
1.505e-3
1.719e-3
1.625e-2
1.415e-2
Est.
min.
mantissa
(global)
15
13
13
11
11
Est.
min.
mantissa
(local)
12
10
10
9
9
Scaling c and F represents variation in the input parameters by 1 and 10%
due to measurement uncertainty.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
9 / 16
Profiling exponent ranges
We also use the CPU-based implementation to profile exponent ranges for
the global and local-scale values.
local-scale
frequency (normalised)
frequency (normalised)
global-scale
0.25
0.2
0.15
0.1
0.05
0
-15
-10
-5
0
5
base-2 exponent
10
0.2
0.18
0.16
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
-15
-10
-5
0
5
base-2 exponent
10
Most exponents are ≤ 9 so we need at least 5 bits for both global and
local-scale (signed) exponent values.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
10 / 16
The resulting designs
We build designs1 at different precisions, allowing us to process different
number of input elements per cycle (the vector width).
DFE1 Uses single precision.
DFE2 Estimated precision for minimal effect on result.
DFE32 /4 Estimated as similar error to 1% variation in input
parameters.
Build
DFE1
DFE2
DFE3
DFE4
1
2
Global-scale
type
float(8, 24)
float(6, 15)
float(5, 13)
float(5, 13)
Local-scale
type
float(8, 24)
float(5, 12)
float(5, 10)
float(5, 10)
Vector
width
8
16
16
24
Utilisation
Logic DSP
55.00 48.16
69.85 32.84
64.28 32.84
79.15 47.92
(%)
BRAM
24.25
28.67
27.63
33.74
Maxeler MAX3A Vectis Dataflow Engine (Xilinx Virtex6 SXT475).
Used to facilitate accuracy comparisons with the CPU-based implementation.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
11 / 16
Precision Results
For J = 64 (which the CPU-simulations were run with) we have similar
errors to those predicted by the CPU-simulations. For J = 144, we have
slightly larger errors, but still acceptable for use.
Build
Changed initial cond.
c × 0.99, F × 0.99
c × 1.01, F × 1.01
c × 0.9, F × 0.9
c × 1.1, F × 1.1
DFE1 (g=f(8,24), l=f(8,24), w=8)
DFE2 (g=f(6,15), l=f(5,12), w=16)
DFE3 (g=f(5,13), l=f(5,10), w=16)
DFE4 (g=f(5,13), l=f(5,10), w=24)
Francis P. Russell
J = 64
H
H
(global)
(local)
2.788e-3
1.939e-4
4.605e-3
1.505e-3
5.042e-3
1.719e-3
4.047e-2
1.625e-2
3.620e-2
1.415e-2
2.934e-3
2.881e-4
2.776e-3
2.707e-4
3.267e-3
6.742e-4
N/A
N/A
Chaotic Systems in Hardware
J = 144
H
H
(global)
(local)
7.029e-3
6.291e-4
1.399e-2
1.726e-3
1.067e-2
1.861e-3
1.167e-1
1.487e-2
8.826e-2
1.236e-2
7.472e-3
1.180e-3
4.320e-2
2.443e-3
7.289e-2
1.012e-2
7.404e-2
1.005e-2
05/05/2015
12 / 16
Precision Results
For J = 64 (which the CPU-simulations were run with) we have similar
errors to those predicted by the CPU-simulations. For J = 144, we have
slightly larger errors, but still acceptable for use.
Build
Changed initial cond.
c × 0.99, F × 0.99
c × 1.01, F × 1.01
c × 0.9, F × 0.9
c × 1.1, F × 1.1
DFE1 (g=f(8,24), l=f(8,24), w=8)
DFE2 (g=f(6,15), l=f(5,12), w=16)
DFE3 (g=f(5,13), l=f(5,10), w=16)
DFE4 (g=f(5,13), l=f(5,10), w=24)
Francis P. Russell
J = 64
H
H
(global)
(local)
2.788e-3
1.939e-4
4.605e-3
1.505e-3
5.042e-3
1.719e-3
4.047e-2
1.625e-2
3.620e-2
1.415e-2
2.934e-3
2.881e-4
2.776e-3
2.707e-4
3.267e-3
6.742e-4
N/A
N/A
Chaotic Systems in Hardware
J = 144
H
H
(global)
(local)
7.029e-3
6.291e-4
1.399e-2
1.726e-3
1.067e-2
1.861e-3
1.167e-1
1.487e-2
8.826e-2
1.236e-2
7.472e-3
1.180e-3
4.320e-2
2.443e-3
7.289e-2
1.012e-2
7.404e-2
1.005e-2
05/05/2015
12 / 16
Precision Results
For J = 64 (which the CPU-simulations were run with) we have similar
errors to those predicted by the CPU-simulations. For J = 144, we have
slightly larger errors, but still acceptable for use.
Build
Changed initial cond.
c × 0.99, F × 0.99
c × 1.01, F × 1.01
c × 0.9, F × 0.9
c × 1.1, F × 1.1
DFE1 (g=f(8,24), l=f(8,24), w=8)
DFE2 (g=f(6,15), l=f(5,12), w=16)
DFE3 (g=f(5,13), l=f(5,10), w=16)
DFE4 (g=f(5,13), l=f(5,10), w=24)
Francis P. Russell
J = 64
H
H
(global)
(local)
2.788e-3
1.939e-4
4.605e-3
1.505e-3
5.042e-3
1.719e-3
4.047e-2
1.625e-2
3.620e-2
1.415e-2
2.934e-3
2.881e-4
2.776e-3
2.707e-4
3.267e-3
6.742e-4
N/A
N/A
Chaotic Systems in Hardware
J = 144
H
H
(global)
(local)
7.029e-3
6.291e-4
1.399e-2
1.726e-3
1.067e-2
1.861e-3
1.167e-1
1.487e-2
8.826e-2
1.236e-2
7.472e-3
1.180e-3
4.320e-2
2.443e-3
7.289e-2
1.012e-2
7.404e-2
1.005e-2
05/05/2015
12 / 16
Precision Results
For J = 64 (which the CPU-simulations were run with) we have similar
errors to those predicted by the CPU-simulations. For J = 144, we have
slightly larger errors, but still acceptable for use.
Build
Changed initial cond.
c × 0.99, F × 0.99
c × 1.01, F × 1.01
c × 0.9, F × 0.9
c × 1.1, F × 1.1
DFE1 (g=f(8,24), l=f(8,24), w=8)
DFE2 (g=f(6,15), l=f(5,12), w=16)
DFE3 (g=f(5,13), l=f(5,10), w=16)
DFE4 (g=f(5,13), l=f(5,10), w=24)
Francis P. Russell
J = 64
H
H
(global)
(local)
2.788e-3
1.939e-4
4.605e-3
1.505e-3
5.042e-3
1.719e-3
4.047e-2
1.625e-2
3.620e-2
1.415e-2
2.934e-3
2.881e-4
2.776e-3
2.707e-4
3.267e-3
6.742e-4
N/A
N/A
Chaotic Systems in Hardware
J = 144
H
H
(global)
(local)
7.029e-3
6.291e-4
1.399e-2
1.726e-3
1.067e-2
1.861e-3
1.167e-1
1.487e-2
8.826e-2
1.236e-2
7.472e-3
1.180e-3
4.320e-2
2.443e-3
7.289e-2
1.012e-2
7.404e-2
1.005e-2
05/05/2015
12 / 16
Performance (J=144, float system size: 227KiB - 453MiB)
4.5e+09
FPGA1 (g=f(5,13), l=f(5,10), w=24) (DFE4)
FPGA1 (g=f(6,15), l=f(5,12), w=16) (DFE2)
FPGA1 (g=f(8,24), l=f(8,24), w=8) (DFE1)
CPU2 (single-precision, 12 cores)
CPU2 (double-precision, 12 cores)
4e+09
elements/second
3.5e+09
3e+09
6.93×
2.5e+09
2e+09
5.37×
1.5e+09
1e+09
2.83×
5e+08
0
100
1
2
0.514×
1000
10000
K (log-scale)
100000
1e+06
Maxeler MAX3A Vectis DFE (Xilinx Virtex6 SXT475 @ 150MHz).
Dual-socket hyperthreaded Intel Xeon X5660s @ 2.67GHz.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
13 / 16
Power Efficiency
We measure the power requirements of a software implementation on two
different systems and compare against the most power efficient.
Build Throughput
(elements/s)
System 12 (4-cores) 1.63e8
System 23 (6-cores) 2.05e8
System 2 (12-cores) 4.05e8
DFE1 (System 1) 1.14e9
DFE2 (System 1) 2.17e9
DFE4 (System 1) 2.81e9
Power (W)
2081
3991
5031
137
143
146
Efficiency
Relative
(elements/J) Efficiency
7.83e5
0.972
5.14e5
0.638
8.05e5
1.00
8.35e6
10.4
1.52e7
18.9
1.92e7
23.9
1
Power consumption of unused Maxeler cards subtracted.
Hyperthreaded Intel Core i7 870 @ 2.93GHz
3
Dual-socket hyperthreaded Intel Xeon X5660s @ 2.67GHz.
2
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
14 / 16
Future work
More complex models like shallow water.
More sophisticated means for incorporating input-value uncertainty
into a simulation.
Generate designs from a domain-specific languages, allowing domain
specialists to supply uncertainty information directly.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
15 / 16
Conclusions
Obtaining near-identical numerical behaviour to a reference solution is
not always the best goal.
Chaotic systems force us to re-evaluate how we measure the
correctness of a simulation.
Chaos-aware precision reduction enables us to exploit input-parameter
uncertainty and nature of the chaotic behaviour itself.
We demonstrated how to achieve this for simple chaotic system and
the performance and power benefits over a conventional CPU
implementation.
Weather modelling is a problem that drives the purchase of
supercomputers. Only reconfigurable computing enables us to exploit
the properties of chaotic systems to reduce power and improve
performance.
Francis P. Russell
Chaotic Systems in Hardware
05/05/2015
16 / 16