Blade – A Timing Violation Resilient Asynchronous Template Dylan Hand∗, Matheus Trevisan Moreira∗†, Hsin-Ho Huang∗, Danlei Chen∗, Frederico Butzke‡, Zhichao Li∗, Matheus Gibiluka†, Melvin Breuer∗, Ney Laert Vilar Calazans†, and Peter A. Beerel∗ May 4th, 2015 * University of Southern California, Los Angeles, CA † Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil ‡ Universidade de Santa Cruz do Sul, Santa Cruz do Sul, Brazil Delay Overheads Cycle time of clocked logic Logic Time PVT margin Clock margin Logic gates Flip-flop alignment [Dreslinksi et al., IEEE Proc. 2010] Traditional synchronous design suffers from increased margins • Worse at low and near-threshold regions MOTIVATION | 2 Data Dependent Delays Cycle time of clocked logic Logic gates Worst-Case Data Number of Operations Logic Time Data delay in Plasma CPU Delay Slower Delay variation due to data is rarely exploited in traditional designs MOTIVATION | 3 Data Dependent Delays Worst-Case Delay Cycle time of clocked logic Logic gates Worst-Case Data Number of Operations Logic Time Data delay in Plasma CPU Delay Slower Delay variation due to data is rarely exploited in traditional designs MOTIVATION | 3 Data Dependent Delays Average Delay Cycle time of clocked logic Logic gates Worst-Case Data Data delay in Plasma CPU Number of Operations Logic Time Worst-Case Delay Delay Slower Delay variation due to data is rarely exploited in traditional designs MOTIVATION | 3 Resilient Design CLK = 0 Sequential Gate Shadow Sequential Logic Error Detection Next Pipeline Stage To Control Logic Allow and detect timing errors in datapath • Correct via architectural replay or gating/pausing clock RESILIENT DESIGN | 4 Resilient Design CLK = 0 Sequential Gate Shadow Sequential Logic Error Detection Next Pipeline Stage To Control Logic Allow and detect timing errors in datapath • Correct via architectural replay or gating/pausing clock RESILIENT DESIGN | 4 Resilient Design CLK = 1 Sequential Gate Shadow Sequential Next Pipeline Stage Logic Error Detection 0 To Control Logic Allow and detect timing errors in datapath • Correct via architectural replay or gating/pausing clock RESILIENT DESIGN | 4 Resilient Design CLK = 1 Sequential Gate Shadow Sequential Next Pipeline Stage Logic Error Detection 1 To Control Logic Allow and detect timing errors in datapath • Correct via architectural replay or gating/pausing clock Effective approaches have been elusive! RESILIENT DESIGN | 4 Metastability Issue Shadow FF Latch CLK Logic Next Pipeline Stage To Control Logic [Bubble Razor, Fojtik, 2013] The Problem [Beer et al, 2014] • Flop metastability can propagate to ERR signal • Metastability in control path can cause system failure RESILIENT DESIGN | 5 Metastability Issue Shadow FF Latch CLK Logic Next Pipeline Stage To Control Logic [Bubble Razor, Fojtik, 2013] The Problem [Beer et al, 2014] • Flop metastability can propagate to ERR signal • Metastability in control path can cause system failure RESILIENT DESIGN | 5 Metastability Issue Shadow FF Latch CLK Logic Next Pipeline Stage X To Control Logic X [Bubble Razor, Fojtik, 2013] The Problem [Beer et al, 2014] • Flop metastability can propagate to ERR signal • Metastability in control path can cause system failure RESILIENT DESIGN | 5 Handling Metastability CLK Latch Transition Detector Logic Next Pipeline Stage To Control Logic [RazorII, Das, 2009] Transition detector may exhibit metastability • Error signal must go through synchronizer, increasing delay • Uses architectural replay to recover from errors RESILIENT DESIGN | 6 Handling Metastability CLK Logic Latch Transition Detector Next Pipeline Stage X To Control Logic [RazorII, Das, 2009] Transition detector may exhibit metastability • Error signal must go through synchronizer, increasing delay • Uses architectural replay to recover from errors RESILIENT DESIGN | 6 Handling Metastability CLK Transition Detector Next Pipeline Stage Logic Latch X FF FF To Control Logic Synchronizer [RazorII, Das, 2009] Transition detector may exhibit metastability • Error signal must go through synchronizer, increasing delay • Uses architectural replay to recover from errors RESILIENT DESIGN | 6 Handling Metastability CLK Latch Transition Detector Next Pipeline Stage Logic FF FF 1 or 0 To Control Logic Synchronizer [RazorII, Das, 2009] Transition detector may exhibit metastability • Error signal must go through synchronizer, increasing delay • Uses architectural replay to recover from errors RESILIENT DESIGN | 6 Hold Time Concerns Ring Oscillator Clock Generator CLK Data In STOP MS Detector Comb. Logic In FF Internal Data Latch Out [SafeRazor, Cannizzaro, 2014] Relies on latch for error correction Hold times are problematic RESILIENT DESIGN | 7 Hold Time Concerns Ring Oscillator Clock Generator STOP Data In MS Detector Comb. Logic In FF CLK = 1 Internal Data Latch Out [SafeRazor, Cannizzaro, 2014] Relies on latch for error correction Hold times are problematic RESILIENT DESIGN | 7 Resiliency Landscape Design Template Sync / Async MTBF Safe Avoids Replay Logic Hold Low Error Time Penalty Robust Bubble Razor Sync No Yes Yes Yes Razor II Sync Yes No No No SafeRazor Async Yes* Yes No Yes RESILIENT DESIGN | 8 Resiliency Landscape Design Template Sync / Async MTBF Safe Avoids Replay Logic Hold Low Error Time Penalty Robust Bubble Razor Sync No Yes Yes Yes Razor II Sync Yes No No No SafeRazor Async Yes* Yes No Yes Blade Async Yes Yes Yes Yes Blade combines the best features of past resiliency schemes RESILIENT DESIGN | 8 Outline Our proposed resilient solution – Blade Case study – 3-stage Plasma CPU • Automated flow • Area efficiency features • Results and comparisons Conclusions and future work OUTLINE | 21 Blade Template Asynchronous Controller δ Reconfigurable Delay Lines Combinational Logic L.data CLK L.ack L.req Blade Controller RE.req RE.ack Δ R.ack R.req Sample LE.req LE.ack Err 2 EDL Error Detection Logic Blade Stage R.data Error Detecting Latch Single Rail Datapath BLADE TEMPLATE | 10 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err = 0 δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err = 0 δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err = 1 δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err = 1 δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 11 Δ Δ EDL Error Detection Logic Err δ Combinational Logic Sample Controller B CLK Controller A Sample CLK Positive Hold Margins Err EDL Error Detection Logic Handshaking delays create positive hold margin! BLADE TEMPLATE | 12 Error Detection Logic From other Q-Flops tTD delay X In D Latch +C CE = 0 { Err0 = 0 { CLK = 0 Err1 = 0 Sample = 0 Blade Controller Q-Flop Out Q EDL Error Detection Logic C-element stores error signal, which is sampled by Q-Flop BLADE TEMPLATE | 13 Error Detection Logic From other Q-Flops tTD delay X In D Latch +C CE = 1 { Err0 = 0 { CLK = 1 Err1 = 0 Sample = 0 Blade Controller Q-Flop Out Q EDL Error Detection Logic C-element stores error signal, which is sampled by Q-Flop BLADE TEMPLATE | 13 Error Detection Logic From other Q-Flops tTD delay X In D Latch +C CE = 1 { Err0 = 0 { CLK = 1 Err1 = 0 Sample = 1 Blade Controller Q-Flop Out Q EDL Error Detection Logic C-element stores error signal, which is sampled by Q-Flop BLADE TEMPLATE | 13 Error Detection Logic From other Q-Flops tTD delay X In D Latch +C CE = 1 { Err0 = 0 { CLK = 1 Err1 = 1 Sample = 1 Blade Controller Q-Flop Out Q EDL Error Detection Logic C-element stores error signal, which is sampled by Q-Flop BLADE TEMPLATE | 13 Metastable-Safe Operation From other Q-Flops tTD delay X In D Latch +C CE = 0 { Err0 = 0 { CLK = 0 Err1 = 0 Sample = 0 Blade Controller Q-Flop Out Q EDL Error Detection Logic Q-Flop prevents metastability propagation to control path BLADE TEMPLATE | 14 Metastable-Safe Operation From other Q-Flops tTD delay X In D Latch +C CE = X { Err0 = 0 { CLK = 1 Err1 = 0 Sample = 0 Blade Controller Q-Flop Out Q EDL Error Detection Logic Q-Flop prevents metastability propagation to control path BLADE TEMPLATE | 14 Metastable-Safe Operation From other Q-Flops tTD delay X In D Latch +C CE = X { Err0 = 0 { CLK = 1 Err1 = 0 Sample = 1 Blade Controller Q-Flop Out Q EDL Error Detection Logic Q-Flop prevents metastability propagation to control path BLADE TEMPLATE | 14 Metastable-Safe Operation Err0 = 1 From other Q-Flops tTD delay X In D Latch +C CE = 0 Some time later… { { CLK = 1 Err1 = 0 Sample = 1 Blade Controller Q-Flop Out Q EDL Error Detection Logic Q-Flop prevents metastability propagation to control path BLADE TEMPLATE | 14 Overhead Reduction tcomp tTD From other Q-Flops delay delay X + C { From other C-elements In D Latch Err0 { Sample CLK { Err1 Blade Controller Q-Flop Out Q EDL Error Detection Logic C-element and OR gates amortize overhead over many EDLs BLADE TEMPLATE | 15 Overhead Reduction tcomp tTD From other Q-Flops delay delay X + C { From other C-elements In D Latch Err0 { Sample CLK { Err1 Blade Controller 4-Input C-element collects output from 3 EDLs Q-Flop Out Q EDL Error Detection Logic C-element and OR gates amortize overhead over many EDLs BLADE TEMPLATE | 15 Overhead Reduction tcomp tTD From other Q-Flops delay delay X + C { From other C-elements In D Latch Err0 { Sample CLK { Err1 Blade Controller Q-Flop Out Q EDL Error Detection Logic 4-Input OR gate collects output from 4 C-elements C-element and OR gates amortize overhead over many EDLs BLADE TEMPLATE | 15 Overhead Reduction tcomp tTD From other Q-Flops delay delay X + C { From other C-elements In D Latch Err0 { Sample CLK { Err1 Blade Controller Q-Flop Each Q-Flop covers 12 EDLs Out Q EDL Error Detection Logic C-element and OR gates amortize overhead over many EDLs BLADE TEMPLATE | 15 Controller Implementation L.req L.ack LE.req LE.ack goR δ Left R.req R.ack RE.req RE.ack Right edi goL goD Bottom edo Δ delay Blade Controller Sample Δ CLK Err[1] Err[0] Three part burst-mode state machine • Implemented using 3D [Yun, 1992] BLADE TEMPLATE | 16 Controller Implementation L.req+ / LE.req+ δ L.req L.ack LE.ack+ / goR+ LE.req LE.ack goL- / L.ack- goR Left Right LE.ack- / goR- edi goL L.req- / LE.req- Blade Controller Δ goL+ / Lack+ R.req R.ack RE.req RE.ack goD Bottom edo Δ delay Sample CLK Err[1] Err[0] Three part burst-mode state machine • Implemented using 3D [Yun, 1992] BLADE TEMPLATE | 16 Controller Implementation goR+ R.ack- / goD+ R.req+ L.req L.ack LE.req LE.ack RE.req+ Err[0]+ / RE.ack+ goD- RE.req+ Err[1]+ / edo+ goR δ Left Err[1]- edi- / Err[0]- / Right edi+ / edi+ / RE.ack+ goD- edo- RE.ack- goD- edo- R.req R.ack RE.req RE.ack edi RE.req- Err[0]+ / Err[0]- / goL RE.req- Err[1]+ / edo+ goD Err[1]- edi- / edo Bottom Δ RE.ack- goD- goR- R.ack+ / goD+ R.req- delay Blade Controller Sample Δ CLK Err[1] Err[0] Three part burst-mode state machine • Implemented using 3D [Yun, 1992] BLADE TEMPLATE | 16 Controller Implementation L.req L.ack LE.req LE.ack δ Left goR Right goD- delay- / Sample- goD+ / CLK+ delay+ / Bottom goL+ Sample+ CLK- Sample goD- delay/ Sample- edi goD goL Blade Controller R.req R.ack RE.req RE.ack edo Δ delay+ / goL- Sample+ delay CLKΔ CLK goD+ /Err[1] CLK+ Err[0] Three part burst-mode state machine • Implemented using 3D [Yun, 1992] BLADE TEMPLATE | 16 Case Study: Plasma MIPS OpenCore 3-stage pipeline 28nm FDSOI @ 666MHz (w/ ideal clock and Vdd) Type Count Combinational 11,740 Buf/Inv 1,683 Seq. (Non-RF) 531 RF 2,048 Total 14,319 [1] http://opencores.org/project,plasma CASE STUDY | 17 Automatic Conversion Flow Convert single-clock sync RTL design to Blade Re-uses synchronous EDA tools and libraries Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation Async Netlist RTL Spec Seamless integration into existing flows Blade Design Flow CASE STUDY | 18 Synthesis Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK FF FF Synthesize sync RTL design using standard EDA tools CASE STUDY | 19 Latches Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK2 CLK1 M S M S Replace flip flops with master-slave latches Two-phase non-overlapping clocking CASE STUDY | 20 Latch Retiming Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK2 CLK1 M S S M Retime latches to spread logic delay across stages Allow time borrowing to reduce area overhead CASE STUDY | 21 EDL Insertion Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK2 CLK1 M EDL TB M EDL TB Replace non-TB latches with error detecting latches CASE STUDY | 22 EDL Insertion Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK2 CLK1 EDL M TB EDL M TB Replace non-TB latches with error detecting latches Not all latches need be error detecting CASE STUDY | 22 Async Control Blade Controller EDL Sync Synthesis FF to Latch Conversion Blade Controller TB Latch Retiming Blade Controller EDL Add EDL + Async Control Simulation Blade Controller TB Remove synchronous clock trees Add Blade controllers and delay lines CASE STUDY | 23 Simulation Blade Controller EDL Sync Synthesis Blade Controller TB FF to Latch Conversion Latch Retiming Add EDL + Async Control Blade Controller Simulation Blade Controller EDL TB Back annotated SDF simulation using final netlist CASE STUDY | 24 Resynthesis EDL EDL A D EDL EDL Logic B E EDL EDL C F CASE STUDY | 25 Resynthesis EDL EDL A D Critical Path EDL EDL Logic B E EDL C EDL Delay > δ F CASE STUDY | 25 Resynthesis EDL EDL A D Critical Path EDL EDL Logic B E EDL C EDL Delay > δ F Add set_max_delay from A to F = δ CASE STUDY | 25 Resynthesis EDL EDL A D EDL EDL Logic B E EDL C Latch Delay < δ F Add set_max_delay from A to F = δ CASE STUDY | 25 Resynthesis EDL EDL A D EDL EDL Logic B Delay > δ EDL C E Correlated EDL Latch Delay < δ F Add set_max_delay from A to F = δ CASE STUDY | 25 Resynthesis EDL EDL A D EDL Latch Logic B Delay < δ EDL C E Latch Delay < δ F Add set_max_delay from A to F = δ CASE STUDY | 25 Brute Force Resynthesis Best run reduced error rate by 39% and EDLs by 27% (-1.9% area) Best Run 122 Correlated EDLs! Evaluated hundreds of resynthesis runs • Each run sets a max delay constraint to a single latch Chose result that led to largest reduction in area and error rate CASE STUDY | 26 Area and Performance 2% 24% AND/OR Trees C-elements Q-Flops 24% Plasma running CoreMark Delay Elements Controllers 6% 12% FF to Latch + EDL 32% Overall area overhead is 8.4% Performance increases • Average: 19% • Peak: 42% CASE STUDY | 27 Performance Comparison with Margins Must add margins for PVT variation, clock skew / jitter, and/or aging • Synchronous frequency degraded to accommodate Margin in Blade design is only imposed when an error occurs • A 30% error rate reduces impact of margin by ~70% Margin Synchronous Blade (30% ER) % Advantage 0% 666MHz 800MHz 20% 15% 566MHz 764MHz 35% 30% 466Mhz 728MHz 56% CASE STUDY | 28 Other Related Work Canary Circuits [Sato, 2007] • Removes some PVT margins • But cannot take advantage of data dependency Bundled Data Designs [Sutherland’89, Nowick’97] • Speculative completion sensing exploits some data dependency • Margins impact performance on every cycle • No observability of errors Soft Mousetrap [Liu, 2013] • Hold time constraints remain difficult to meet CONCLUSIONS | 29 Conclusions Blade Template • Achieves higher performance by exploiting data dependency • Benefits from average vs worst-case MS resolution times • Reduces impact of margins for PVT variations • Enables voltage scaling for power savings Plasma Case Study • Highlights design and CAD techniques for area efficiency • Achieves 19% increase in performance with 8.4% area overhead CONCLUSIONS | 30 Questions?
© Copyright 2024