Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing
Presenters: Abraham Addisie, Vaibhav Gogte
Authors: Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
*Electrical and Computer Engineering, University of Texas at Austin
‡Institute of Microelectronics, Tsinghua University

Outline
• Introduction
• Motivation
• Related work
• GPU Overview
• Hermes Architecture
• Adaptive warp scheduling
• Hardware Implementation
• Experimental Analysis
• Conclusion

Introduction
Processing of an IP packet at a router. On receiving a packet (MAC header: source MAC mx, dest MAC my; IP header: source IP x, dest IP y; data), the router performs:
1. Checking the IP header
2. Packet classification
3. Routing table lookup
4. Decrementing the Time to Live (TTL) value
5. IP fragmentation (if the packet exceeds the Maximum Transmission Unit)
The forwarded packet carries a rewritten MAC header (new source and destination MAC addresses), while the IP header (source IP x, dest IP y) is unchanged.
New processing requirements keep being added to this list, e.g. deep packet inspection.

Motivation
Internet traffic is increasing exponentially (multimedia applications, social networks, the Internet of Things), and new processing-intensive tasks such as deep packet inspection are being added → routers need high throughput.
Network protocols are being added and modified, e.g. the transition from IPv4 (32-bit addresses) to IPv6 (128-bit) → routers need high programmability.

Related Work
• ASIC-based routers: long design turnaround; high non-recurring engineering cost
• Network-processor (NP)-based routers: no effective programming model; Intel discontinued its NP router business
• GPP (software)-based routers: low performance
• GPU-based routers: high performance plus high programmability

Related Work – CPU vs. GPU Throughput
GPP (software) routers run on low-throughput processors; GPU-based software routers run on high-throughput processors (PacketShader: Han et al. [2010]).

Exploiting High-Throughput GPUs for IP Routing
Processing of one packet is independent of the others, so data-level parallelism equals packet-level parallelism: packets are batched in a packet queue and processed in parallel by the GPU. A GPU-based router has been shown to outperform a software-based router by 30x in throughput (PacketShader: Han et al. [2010]).

Limitations of Existing GPU-Based Routers
Packets must be mapped from the CPU's main memory to the GPU's device memory over a PCIe bus with a peak bandwidth of 8 GB/s:
• GPU throughput is 30x the CPU's without the memory mapping
• reduced to 5x the CPU's with the memory mapping overhead
Moreover, such routers cannot guarantee a minimum latency for an individual packet. (Reference architecture: NVIDIA GTX480.)
Solution: Hermes.

Hermes: Integrated CPU/GPU IP Routing
• Lower packet-transfer overhead through a shared memory hierarchy
• Lower per-packet latency through adaptive warp scheduling

Adaptive Warp Issue
The CPU writes the number of packets to be processed into a task FIFO; the GPU monitors the FIFO and fetches work for its streaming multiprocessors at a minimum granularity of one warp, with packet data transferred through the shared memory. The FIFO update granularity is a tradeoff:
• too large – average packet delay increases
• too small – GPU fetch scheduling becomes complicated
The scheduler therefore adapts to the arrival pattern of packets and the resources available in the GPU.

In-Order Commit
UDP protocol users expect packets to arrive in order. A lookup table (LUT) maps each Delay Commit Queue (DCQ) entry to a warp ID; the warp allocator in the shader core records the warp IDs in flight, and the write-back stage commits warps in order.

Hardware and Area Overhead
• Task FIFO: 32-bit, 1028 entries; area = 0.053 mm²
• Delay Commit Queue: size depends on the maximally allowed concurrent warps (MCWs) and the number of shader cores; 8-bit, 1028 entries; area = 0.013 mm²
Hardware overhead: negligible!
• DCQ-warp LUT: size depends on the number of MCWs; 16-bit, 32 entries; area = 0.006 mm²

Experimental Setup
• Cycle-accurate GPGPU-Sim used to evaluate performance
• Benchmarks: checking the IP header, packet classification, routing table lookup, decrementing the TTL, IP fragmentation, and deep packet inspection
• Both burst and sparse traffic patterns
• QoS parameters: throughput, delay, delay variance

Throughput Evaluation
Computing rates of the benchmark applications improve with increasing MCW (better resource utilization). For both burst and sparse traffic without the DCQ:
• there is no packet queueing
• a discrete CPU/GPU is still unable to deliver at the input rate
• Hermes outperforms the discrete CPU/GPU by a factor of 5

Delay Analysis
For burst traffic without the DCQ, divergent branches take longer to process, starving the remaining packets; with simple processing on the GPU, CPU-side waiting overlaps with GPU processing. Comparing delay with the DCQ against without it, packet delay is reduced by 81.2%!

Conclusion
• Lack of QoS support and CPU-GPU communication overhead are the major bottlenecks
• Hermes is a closely coupled CPU-GPU solution that:
  • meets stringent delay requirements
  • enables QoS through an optimized configuration
  • requires only a minimal hardware extension
• A novel, high-quality packet-processing engine for future software routers

Discussion Points
• Are GPUs really easy to program for processing packets?
• How do the performance and area overhead compare with ASIC-based routers?
• Is router programmability really a crucial concern?
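To make the In-Order Commit mechanism concrete, the following Python sketch (all names are illustrative, not from the paper) simulates the role of the Delay Commit Queue and the warp-to-entry LUT: warps may complete out of order, but their results are released in issue order.

```python
def inorder_commit(issue_order, completion_order):
    """Simulate the Delay Commit Queue (DCQ): warps complete in
    completion_order but are committed in issue_order."""
    # LUT maps each warp id to its DCQ entry (its position in issue order)
    lut = {wid: entry for entry, wid in enumerate(issue_order)}
    dcq = {}        # DCQ entry id -> completed-but-uncommitted warp
    head = 0        # next DCQ entry eligible to commit
    committed = []
    for wid in completion_order:
        dcq[lut[wid]] = wid            # record completion in its DCQ slot
        while head in dcq:             # drain contiguous completed entries
            committed.append(dcq.pop(head))
            head += 1
    return committed

# Warps 0..3 finish out of order, yet commit in issue order:
print(inorder_commit([0, 1, 2, 3], [2, 0, 3, 1]))  # [0, 1, 2, 3]
```

The write-back stage in the real hardware plays the role of the drain loop: it only commits the entry at the head of the queue, which is how UDP users observe packets in arrival order.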
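For the routing-table-lookup benchmark, a minimal longest-prefix-match sketch may help fix the semantics of the lookup step. This is a linear scan for clarity only (real routers use tries or TCAMs), and the names are illustrative assumptions, not the paper's kernel.

```python
def longest_prefix_match(table, addr, width=32):
    """table maps (prefix_int, prefix_len) -> next_hop.
    Returns the next hop of the longest matching prefix, or None."""
    best_len, best_hop = -1, None
    for (prefix, plen), hop in table.items():
        # Build a mask with the top plen bits set, e.g. /8 -> 0xFF000000
        mask = ((1 << plen) - 1) << (width - plen)
        if (addr & mask) == prefix and plen > best_len:
            best_len, best_hop = plen, hop
    return best_hop

# 10.0.0.0/8 -> "A", 10.1.0.0/16 -> "B"; 10.1.2.3 takes the /16 route
routes = {(0x0A000000, 8): "A", (0x0A010000, 16): "B"}
print(longest_prefix_match(routes, 0x0A010203))  # B
```

Each lookup is independent of every other packet's, which is exactly the packet-level parallelism the deck exploits by mapping one packet (or a batch of packets) per GPU thread.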