cla d e i rif Remote-scoperectified promotion verified John Wickerson (Imperial), Mark Batty (Kent), Ally Donaldson (Imperial), Brad Beckmann (AMD) REMS workshop, Cambridge 22 April 2015 In brief • RSP is a GPGPU language extension from AMD that enables efficient work-stealing • We worked with AMD to formalise their design (at language and HW level). This led to a corrected and improved implementation. • Formalise early in the design process! This talk 1. What is RSP? 2. Adding RSP to the OpenCL memory model 3. A formalised implementation of OpenCL+RSP This talk 1. What is RSP? 2. Adding RSP to the OpenCL memory model 3. A formalised implementation of OpenCL+RSP C11: flat thread structure T0 T1 T2 T3 T4 T5 OpenCL: thread groupings device device workgroup T0 workgroup T1 T2 workgroup T3 T4 T5 GPUs: hierarchical memory device device workgroup T0 workgroup T1 T2 L1 CACHE workgroup T3 L1 CACHE L2 CACHE T4 T5 L1 CACHE L2 CACHE GLOBAL MEMORY Memory scopes Memory scopes store(x,42) load(x) workgroup T0 workgroup T1 T2 T3 Memory scopes store(x,42,WG) load(x,WG) workgroup T0 workgroup T1 T2 T3 Memory scopes store(x,42,WG) load(x,WG) workgroup T0 workgroup T1 T2 T3 f ! y t aul Memory scopes store(x,42,DV) load(x,DV) workgroup T0 workgroup T1 T2 T3 ok! Memory scopes store(x,42,DV) load(x,WG) workgroup T0 fa * ! y ult workgroup T1 T2 T3 *...but the OpenCL standard could easily be extended to allow this Work-stealing store(headA,_,WG) //push store(headA,_,WG) //pop workgroup A T0 T1 tailA headA workgroup B T2 T3 tailB headB Work-stealing store(headA,_,WG) //push store(headA,_,???) //steal store(headA,_,WG) //pop workgroup A T0 T1 tailA headA workgroup B T2 headB tailB T3 o t y a w no s i h t g plu n i hole L! C n e Op Remote-scope promotion store(x,42,DV) T0 load(x,DV) T1 T2 T3 Remote-scope promotion store(x,42,WG) T0 load(x,DV) T1 T2 T3 Remote-scope promotion store(x,42,WG) T0 load(x,DV,remote) T1 T2 T3 Remote-scope promotion store(x,42,DV,remote) T0 T1 load(x,WG) T2 T3 Work-stealing store(headA,_,WG) //push store(headA,_,DV,remote) //steal store(headA,_,WG) //pop workgroup A T0 T1 tailA headA workgroup B T2 headB tailB T3 This talk 1. What is RSP? 2. Adding RSP to the OpenCL memory model 3. A formalised implementation of OpenCL+RSP version We present in Fig. 2 the full OpenCL model (including exe Wepresent presentininFig. Fig.22the thefull fullOpenCL OpenCLmodel model(including (includingexe ex We We present Fig. 2language. the full OpenCL model (including e barriers) in thein.cat In the following, we discu barriers)ininthe the.cat .catlanguage. language.InInthe thefollowing, following,we wediscu disc barriers) we di changes thatinmust be made to the model, which are highlig barriers) the .cat language. In the following, changesthat thatmust mustbebemade madetotothe themodel, model,which whichare arehighlig highli changes thechanges figure. that must be made to the model, which are highl thefigure. figure. the the figure. 2 n sw = rf \ (6 = ) \ (¬na) 2 2 thd on n sw = rf \ (6 = ) \ (¬na) sw = rf \ (6 = ) \ (¬na) 2 thd thd ion • In C11ra: sw = rf \ (6=thd ) \ (¬na) equential sequential equential e sequential uggestion sw = rf \ (6 = thd ) \ (=incl ) suggestion uggestion sw = rf \ (6 = ) \ (= ) sw = rf \ (6 = ) \ (= ) thd incl thd incl • In OpenCLra: rccommosuggestion sw = rf \ (6 = ) \ (= ) thd incl accommoccommoo accommohroughout 0 0 0 throughout hroughout ee==incl ee0 0==(e(eincl e0e0^^ee incl e0e)0 ) e =incl incle = incle ^0 e incle ) 0 d throughout 0 (e incl incl e =incl e = (e incl e ^ e incl e ) removed eremoved removed 0 0 0 eeincl ee0 ==(e(e22na ^^ee==thd e0e)0 )__ be removed na e incl incle = thde0 ) _ 0 0 (e 2 na ^ e =thd (e(e(e 22WG ^ e = e e incl e = 2 na ^ e = 0 )0e_) _ wg thd WG^^ee==wg _ (e 2 WG wgee) )_ 0 fore a (e 2 WG ^ e = e ) _ eee2 DV wg efore fore aa 2 DV 2 DV before a ations e 2 DV0 rations ations 0 0 0 e = e e 0 = (e incl e0 0^ e 0 )0 _ incl incl ehe side erations • e = e = (e e ^ e incle = incle0 ^0 e inclee) )_ side In OpenCL+RSP: e =incl 0 (e incl incl e side 0_ e =incl e = (e e ^ e e ) _ 0^ (e e e 2 rem) _ 0 incl incl the side incl rem) (e(e0incl inclee ^0^ee22rem) 0 __ 02 (e rem ^ e e e ^ e 2 rem) 0(e 0e)0 ) _ incl incl (e 2 rem ^ e (e 2 rem ^ e incle ) 0 ore a incl 0 efore ore aa (e 2 rem ^ e incl e ) Execution operabefore Execution barriers barriers and and relaxed relaxed atomics atomics When When orches orche opera-a pecification nt the full entthe thefull full nt esent er) inthe §B. sent full der) in§B. §B. er) in order) in §B. including ,including including el, including Scope inclusion Testing OpenCL+RSP programs • We extended Herd to support the new memory model. • We simulated the 12 litmus tests designed by the AMD developers to define their expectations of RSP. • We found 8 were good, but: 2 had unintentional races, 1 enforced broken behaviour, and 1 forbade reasonable behaviour. • We also found (and fixed) bugs in their work-stealing queue implementation This talk 1. What is RSP? 2. Adding RSP to the OpenCL memory model 3. A formalised implementation of OpenCL+RSP Implementing RSP • Model of GPU hardware • Assembly-like language, with instructions modelled as state transformers • Mapping from OpenCL+RSP operations to sequences of assembly instructions • Can then prove that all behaviours of the compiled program are allowed by the OpenCL+RSP MM. New compilation scheme na or WG r=load(x) store(x,r) r=fetch_inc(x) DV (not remote) DV (remote) New compilation scheme na or WG r=load(x) LD r x store(x,r) ST r x r=fetch_inc(x) INCL1 r x DV (not remote) DV (remote) New compilation scheme r=load(x) store(x,r) r=fetch_inc(x) na or WG DV (not remote) LD r x LD r x INVL1 WG ST r x FLUL1 WG ST r x INCL1 r x FLUL1 WG INCL2 r x INVL1 WG DV (remote) New compilation scheme r=load(x) store(x,r) r=fetch_inc(x) na or WG DV (not remote) DV (remote) LD r x LD r x INVL1 WG LD r x FLUL1 DV INVL1 WG ST r x FLUL1 WG ST r x FLUL1 WG INVL1 DV ST r x } FLUL1 WG INCL2 r x INVL1 WG FLUL1 INVL1 INCL2 FLUL1 INVL1 } INCL1 r x WG DV r x DV WG LKrmw LKrmw Old compilation scheme r=load(x) store(x,r) r=fetch_inc(x) na or WG DV (not remote) DV (remote) LD r x INVL1 WG LD r x FLUL1 DV INVL1 WG LD r x } LK x ST r x FLUL1 WG ST r x FLUL1 WG ST r x INVL1 DV } LK x INCL1 r x FLUL1 WG INVL1 WG INCL2 r x FLUL1 INVL1 INCL2 INVL1 } DV WG r x DV LK x LKrmw New compilation scheme r=load(x) store(x,r) r=fetch_inc(x) na or WG DV (not remote) DV (remote) LD r x LD r x INVL1 WG LD r x FLUL1 DV INVL1 WG ST r x FLUL1 WG ST r x FLUL1 WG INVL1 DV ST r x } FLUL1 WG INCL2 r x INVL1 WG FLUL1 INVL1 INCL2 FLUL1 INVL1 } INCL1 r x WG DV r x DV WG LKrmw LKrmw Contributions • Extended OpenCL mm to include RSP • • Extended Herd with new mm, and used it to find and fix bugs in RSP litmus tests and programs Formalised implementation of RSP (model of GPU hardware + assembly language) • Found and fixed bugs in original compilation scheme • Proved new improved scheme correct cla d e i rif Remote-scoperectified promotion verified John Wickerson (Imperial), Mark Batty (Kent), Ally Donaldson (Imperial), Brad Beckmann (AMD) REMS workshop, Cambridge 22 April 2015 Spare slides A corrupt MP *x=42; store(y,1,DV); T0 T1 if(load(y,DV)==1) print(*x); T2 T3 A corrupt MP ST x 42; FLUL1 WG; ST y 1; T0 INVL1 WG; LD y; //1 LD x; T1 T2 T3 A corrupt MP ST x 42; FLUL1 WG; ST y 1; T0 INVL1 WG; LD y; //1 LD x; T1 T2 T3 A corrupt MP ST x 42; FLUL1 WG; ST y 1; T0 INVL1 WG; LD y; //1 LD x; T1 T2 T3 A corrupt MP ST x 42; FLUL1 WG; ST y 1; T0 INVL1 WG; LD y; //1 LD x; T1 T2 T3 x=0 A corrupt MP ST x 42; FLUL1 WG; ST y 1; INVL1 WG; LD y; //1 LD x; T0 T1 x=42 T2 T3 x=0 A corrupt MP ST x 42; FLUL1 WG; ST y 1; INVL1 WG; LD y; //1 LD x; T0 T1 T2 x=42 T3 x=0 x=42 A corrupt MP ST x 42; FLUL1 WG; ST y 1; INVL1 WG; LD y; //1 LD x; T0 T1 T2 x=42 y=1 T3 x=0 x=42 A corrupt MP ST x 42; FLUL1 WG; ST y 1; INVL1 WG; LD y; //1 LD x; T0 T1 T2 x=42 y=1 T3 x=0 x=42 y=1 A corrupt MP ST x 42; FLUL1 WG; ST y 1; INVL1 WG; LD y; //1 LD x; T0 T1 T2 x=42 y=1 T3 x=0 y=1 x=42 y=1 A corrupt MP ST x 42; FLUL1 WG; ST y 1; INVL1 WG; LD y; //1 LD x; T0 T1 T2 x=42 y=1 T3 x=0 y=1 x=42 y=1
© Copyright 2024