1 - John Wickerson

cla
d
e
i
rif
Remote-scoperectified
promotion verified
John Wickerson (Imperial), Mark Batty (Kent), Ally Donaldson (Imperial), Brad Beckmann (AMD)
REMS workshop, Cambridge
22 April 2015
In brief
•
RSP is a GPGPU language extension from AMD
that enables efficient work-stealing
•
We worked with AMD to formalise their design (at
language and HW level). This led to a corrected
and improved implementation.
•
Formalise early in the design process!
This talk
1. What is RSP?
2. Adding RSP to the OpenCL memory model
3. A formalised implementation of OpenCL+RSP
This talk
1. What is RSP?
2. Adding RSP to the OpenCL memory model
3. A formalised implementation of OpenCL+RSP
C11: flat thread structure
T0
T1
T2
T3
T4
T5
OpenCL: thread groupings
device
device
workgroup
T0
workgroup
T1
T2
workgroup
T3
T4
T5
GPUs: hierarchical memory
device
device
workgroup
T0
workgroup
T1
T2
L1 CACHE
workgroup
T3
L1 CACHE
L2 CACHE
T4
T5
L1 CACHE
L2 CACHE
GLOBAL MEMORY
Memory scopes
Memory scopes
store(x,42)
load(x)
workgroup
T0
workgroup
T1
T2
T3
Memory scopes
store(x,42,WG)
load(x,WG)
workgroup
T0
workgroup
T1
T2
T3
Memory scopes
store(x,42,WG)
load(x,WG)
workgroup
T0
workgroup
T1
T2
T3
f
!
y
t
aul
Memory scopes
store(x,42,DV)
load(x,DV)
workgroup
T0
workgroup
T1
T2
T3
ok!
Memory scopes
store(x,42,DV)
load(x,WG)
workgroup
T0
fa
*
!
y
ult
workgroup
T1
T2
T3
*...but the
OpenCL
standard
could easily
be extended
to allow this
Work-stealing
store(headA,_,WG) //push
store(headA,_,WG) //pop
workgroup A
T0
T1
tailA headA
workgroup B
T2
T3
tailB headB
Work-stealing
store(headA,_,WG) //push
store(headA,_,???) //steal
store(headA,_,WG) //pop
workgroup A
T0
T1
tailA headA
workgroup B
T2
headB
tailB
T3
o
t
y
a
w
no
s
i
h
t
g
plu
n
i
hole L!
C
n
e
Op
Remote-scope promotion
store(x,42,DV)
T0
load(x,DV)
T1
T2
T3
Remote-scope promotion
store(x,42,WG)
T0
load(x,DV)
T1
T2
T3
Remote-scope promotion
store(x,42,WG)
T0
load(x,DV,remote)
T1
T2
T3
Remote-scope promotion
store(x,42,DV,remote)
T0
T1
load(x,WG)
T2
T3
Work-stealing
store(headA,_,WG) //push
store(headA,_,DV,remote) //steal
store(headA,_,WG) //pop
workgroup A
T0
T1
tailA headA
workgroup B
T2
headB
tailB
T3
This talk
1. What is RSP?
2. Adding RSP to the OpenCL memory model
3. A formalised implementation of OpenCL+RSP
version
We present
in Fig. 2 the full OpenCL model (including exe
Wepresent
presentininFig.
Fig.22the
thefull
fullOpenCL
OpenCLmodel
model(including
(includingexe
ex
We
We present
Fig. 2language.
the full OpenCL
model (including
e
barriers)
in thein.cat
In the following,
we discu
barriers)ininthe
the.cat
.catlanguage.
language.InInthe
thefollowing,
following,we
wediscu
disc
barriers)
we di
changes
thatinmust
be
made
to the model,
which
are highlig
barriers)
the
.cat
language.
In
the
following,
changesthat
thatmust
mustbebemade
madetotothe
themodel,
model,which
whichare
arehighlig
highli
changes
thechanges
figure. that must be made to the model, which are highl
thefigure.
figure.
the
the figure.
2
n
sw
=
rf
\
(6
=
)
\
(¬na)
2
2
thd
on
n
sw
=
rf
\
(6
=
)
\
(¬na)
sw
=
rf
\
(6
=
)
\
(¬na)
2
thd
thd
ion
• In C11ra:
sw = rf \ (6=thd ) \ (¬na)
equential
sequential
equential
e sequential
uggestion
sw
=
rf
\
(6
=
thd ) \ (=incl )
suggestion
uggestion
sw
=
rf
\
(6
=
)
\
(=
)
sw
=
rf
\
(6
=
)
\
(=
)
thd
incl
thd
incl
• In OpenCLra:
rccommosuggestion
sw
=
rf
\
(6
=
)
\
(=
)
thd
incl
accommoccommoo
accommohroughout
0
0
0
throughout
hroughout
ee==incl ee0 0==(e(eincl e0e0^^ee incl e0e)0 )
e =incl
incle =
incle ^0 e
incle ) 0
d throughout
0 (e incl
incl
e =incl e = (e incl e ^ e incl e )
removed
eremoved
removed
0
0
0
eeincl ee0 ==(e(e22na
^^ee==thd e0e)0 )__
be removed
na
e incl
incle =
thde0 ) _
0
0 (e 2 na ^ e =thd
(e(e(e
22WG
^
e
=
e
e incl e =
2
na
^
e
=
0 )0e_) _
wg
thd
WG^^ee==wg
_
(e
2
WG
wgee) )_
0
fore
a
(e
2
WG
^
e
=
e
)
_
eee2
DV
wg
efore
fore aa
2
DV
2
DV
before
a
ations
e 2 DV0
rations
ations
0
0
0
e
=
e
e
0 = (e incl e0 0^ e
0 )0 _
incl
incl
ehe
side
erations
•
e
=
e
=
(e

e
^
e
incle =
incle0 ^0 e
inclee) )_
side In OpenCL+RSP:
e =incl
0 (e incl
incl
e side
0_
e =incl e =
(e

e
^
e
e
)
_
0^
(e

e
e
2
rem)
_
0
incl
incl
the side
incl
rem)
(e(e0incl
inclee ^0^ee22rem)
0 __
02
(e
rem
^
e
e
e
^
e
2
rem)
0(e
0e)0 ) _
incl
incl
(e
2
rem
^
e
(e
2
rem
^
e
incle ) 0
ore
a
incl
0
efore
ore aa
(e 2 rem ^ e incl e )
Execution
operabefore
Execution barriers
barriers and
and relaxed
relaxed atomics
atomics When
When orches
orche
opera-a
pecification
nt
the full
entthe
thefull
full
nt
esent
er)
inthe
§B.
sent
full
der)
in§B.
§B.
er)
in
order)
in §B.
including
,including
including
el, including
Scope inclusion
Testing OpenCL+RSP
programs
•
We extended Herd to support the new memory model.
•
We simulated the 12 litmus tests designed by the AMD
developers to define their expectations of RSP.
•
We found 8 were good, but: 2 had unintentional races, 1 enforced broken behaviour, and 1 forbade reasonable behaviour.
•
We also found (and fixed) bugs in their work-stealing
queue implementation
This talk
1. What is RSP?
2. Adding RSP to the OpenCL memory model
3. A formalised implementation of OpenCL+RSP
Implementing RSP
•
Model of GPU hardware
•
Assembly-like language, with instructions modelled
as state transformers
•
Mapping from OpenCL+RSP operations to
sequences of assembly instructions
•
Can then prove that all behaviours of the compiled
program are allowed by the OpenCL+RSP MM.
New compilation scheme
na or WG
r=load(x)
store(x,r)
r=fetch_inc(x)
DV (not remote)
DV (remote)
New compilation scheme
na or WG
r=load(x)
LD r x
store(x,r)
ST r x
r=fetch_inc(x)
INCL1 r x
DV (not remote)
DV (remote)
New compilation scheme
r=load(x)
store(x,r)
r=fetch_inc(x)
na or WG
DV (not remote)
LD r x
LD r x
INVL1 WG
ST r x
FLUL1 WG
ST r x
INCL1 r x
FLUL1 WG
INCL2 r x INVL1 WG
DV (remote)
New compilation scheme
r=load(x)
store(x,r)
r=fetch_inc(x)
na or WG
DV (not remote)
DV (remote)
LD r x
LD r x
INVL1 WG
LD r x FLUL1 DV
INVL1 WG
ST r x
FLUL1 WG
ST r x
FLUL1 WG
INVL1 DV
ST r x
}
FLUL1 WG
INCL2 r x INVL1 WG
FLUL1
INVL1
INCL2
FLUL1
INVL1
}
INCL1 r x
WG
DV
r x
DV
WG
LKrmw
LKrmw
Old compilation scheme
r=load(x)
store(x,r)
r=fetch_inc(x)
na or WG
DV (not remote)
DV (remote)
LD r x
INVL1 WG
LD r x
FLUL1 DV
INVL1 WG
LD r x
}
LK x
ST r x
FLUL1 WG
ST r x
FLUL1 WG
ST r x
INVL1 DV
}
LK x
INCL1 r x
FLUL1 WG
INVL1 WG
INCL2 r x
FLUL1
INVL1
INCL2
INVL1
}
DV
WG
r x
DV
LK x
LKrmw
New compilation scheme
r=load(x)
store(x,r)
r=fetch_inc(x)
na or WG
DV (not remote)
DV (remote)
LD r x
LD r x
INVL1 WG
LD r x FLUL1 DV
INVL1 WG
ST r x
FLUL1 WG
ST r x
FLUL1 WG
INVL1 DV
ST r x
}
FLUL1 WG
INCL2 r x INVL1 WG
FLUL1
INVL1
INCL2
FLUL1
INVL1
}
INCL1 r x
WG
DV
r x
DV
WG
LKrmw
LKrmw
Contributions
•
Extended OpenCL mm to include RSP
•
•
Extended Herd with new mm, and used it to find and
fix bugs in RSP litmus tests and programs
Formalised implementation of RSP (model of GPU
hardware + assembly language)
•
Found and fixed bugs in original compilation
scheme
•
Proved new improved scheme correct
cla
d
e
i
rif
Remote-scoperectified
promotion verified
John Wickerson (Imperial), Mark Batty (Kent), Ally Donaldson (Imperial), Brad Beckmann (AMD)
REMS workshop, Cambridge
22 April 2015
Spare slides
A corrupt MP
*x=42;
store(y,1,DV);
T0
T1
if(load(y,DV)==1)
print(*x);
T2
T3
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
T0
INVL1 WG;
LD y; //1
LD x;
T1
T2
T3
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
T0
INVL1 WG;
LD y; //1
LD x;
T1
T2
T3
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
T0
INVL1 WG;
LD y; //1
LD x;
T1
T2
T3
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
T0
INVL1 WG;
LD y; //1
LD x;
T1
T2
T3
x=0
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
INVL1 WG;
LD y; //1
LD x;
T0
T1
x=42
T2
T3
x=0
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
INVL1 WG;
LD y; //1
LD x;
T0
T1
T2
x=42
T3
x=0
x=42
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
INVL1 WG;
LD y; //1
LD x;
T0
T1
T2
x=42
y=1
T3
x=0
x=42
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
INVL1 WG;
LD y; //1
LD x;
T0
T1
T2
x=42
y=1
T3
x=0
x=42
y=1
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
INVL1 WG;
LD y; //1
LD x;
T0
T1
T2
x=42
y=1
T3
x=0
y=1
x=42
y=1
A corrupt MP
ST x 42;
FLUL1 WG;
ST y 1;
INVL1 WG;
LD y; //1
LD x;
T0
T1
T2
x=42
y=1
T3
x=0
y=1
x=42
y=1