Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

Presenters: Abraham Addisie, Vaibhav Gogte
Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
*Electrical and Computer Engineering, University of Texas at Austin
‡Institute of Microelectronics, Tsinghua University
Outline
• Introduction
• Motivation
• Related work
• GPU Overview
• Hermes Architecture
• Adaptive warp scheduling
• Hardware Implementation
• Experimental Analysis
• Conclusion
Introduction
Processing of an IP packet at a router

Received IP packet:
MAC Header: Source MAC: mx, Dest MAC: my
IP Header: Source IP: x, Dest IP: y
Data

IP Packet Processing:
1. Checking IP Header
2. Packet Classification
3. Routing Table Lookup
4. Decrementing Time to Live (TTL) value
5. IP Fragmentation (if > Max Transmission Unit)

Packet after processing:
MAC Header: Source MAC: new, Dest MAC: new
IP Header: Source IP: x, Dest IP: y
Data

New processing requirements are being added to the list
• Deep packet inspection
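
The five processing steps can be illustrated with a small Python sketch. This is a toy model, not the router code used in Hermes: the `Packet` fields, the `ROUTES` table, `ROUTER_MAC`, and every function name are hypothetical, the header check and classification are placeholders, and the MTU is applied to the payload length only.

```python
import ipaddress
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    ttl: int
    payload: bytes

# Hypothetical routing table: prefix -> (next-hop MAC, MTU in payload bytes).
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): ("aa:aa", 1500),
    ipaddress.ip_network("10.1.0.0/16"): ("bb:bb", 4),
}
ROUTER_MAC = "rr:rr"  # hypothetical MAC of the outgoing interface

def check_header(pkt):
    # Step 1 (simplified): both addresses must parse and the TTL must
    # still be positive after one more hop.
    try:
        ipaddress.ip_address(pkt.src_ip)
        ipaddress.ip_address(pkt.dst_ip)
    except ValueError:
        return False
    return pkt.ttl > 1

def classify(pkt):
    # Step 2 (placeholder): a real classifier matches header fields
    # against a rule set; here every packet falls into one class.
    return "default"

def lookup(dst_ip):
    # Step 3: longest-prefix match over the routing table.
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]

def fragment(pkt, mtu):
    # Step 5: split the payload into chunks of at most mtu bytes.
    if len(pkt.payload) <= mtu:
        return [pkt]
    return [replace(pkt, payload=pkt.payload[i:i + mtu])
            for i in range(0, len(pkt.payload), mtu)]

def process(pkt):
    if not check_header(pkt):
        return []                      # drop malformed or expired packets
    classify(pkt)
    route = lookup(pkt.dst_ip)
    if route is None:
        return []                      # no route: drop
    next_hop_mac, mtu = route
    # Step 4 plus MAC rewrite: IP addresses stay the same, MACs change.
    out = replace(pkt, src_mac=ROUTER_MAC, dst_mac=next_hop_mac,
                  ttl=pkt.ttl - 1)
    return fragment(out, mtu)
```

As on the slide, the source and destination IP addresses pass through unchanged while both MAC addresses are rewritten.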
Motivation
Internet traffic is increasing exponentially
• Multimedia applications, social networks, Internet of Things
→ High-throughput router
New processing-intensive tasks are being added
• Deep packet inspection
Network protocols are being added and modified
• Transition from IPv4 (32-bit) to IPv6 (128-bit)
→ Highly programmable router
Related Work
ASIC-based router:
• Long design turnaround
• High non-recurring engineering cost
Network processor (NP) based router:
• No effective programming model
• Intel discontinued its NP router business
GPP (software) based router:
• Low performance
GPU-based router:
• High performance + high programmability
Related Work – CPU vs GPU Throughput
GPP (software) based router: low-throughput processor
GPU-based software router: high-throughput processor
PacketShader: Han et al. [2010]
Exploiting High Throughput GPU for IP routing
Each packet is processed independently of the others
• Data-level parallelism = packet-level parallelism
Packet queue → batching → parallel processing by GPU
A GPU-based router is shown to outperform a software-based router by 30x (in terms of throughput)
PacketShader: Han et al. [2010]
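
A minimal sketch of the batching idea, assuming a plain Python loop as a stand-in for the GPU: packets are drained from the queue in fixed-size batches, and the same per-packet function is applied to every element of a batch. The function names and the TTL-only "packets" are hypothetical; on a GPU each element of the batch would map to one thread.

```python
from collections import deque

def take_batch(queue, batch_size):
    # Remove up to batch_size packets from the front of the queue.
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch

def process_batch(batch, per_packet):
    # Packets are independent, so per_packet could run on all elements
    # at once; this serial loop only stands in for that parallel step.
    return [per_packet(p) for p in batch]

def decrement_ttl(ttl):
    # Example per-packet work: each "packet" here is just its TTL value.
    return ttl - 1

def drain(queue, batch_size, per_packet):
    # Process the whole queue batch by batch, preserving arrival order.
    results = []
    while queue:
        results.extend(process_batch(take_batch(queue, batch_size),
                                     per_packet))
    return results
```

A larger batch exposes more parallelism per launch, but packets at the front of the batch wait longer for the batch to fill, which is the latency concern raised on the next slide.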
Limitation of existing GPU based router
Memory mapping from the CPU's main memory to the GPU's device memory goes through the PCIe bus, which has a peak bandwidth of 8 GB/s
• GPU throughput = 30x the CPU's, without memory mapping
• Reduced to 5x the CPU's, with memory mapping overhead
Cannot guarantee a minimum latency for an individual packet
Architecture of NVIDIA GTX480
Solution: Hermes
Hermes, integrated CPU/GPU IP routing
Lower packet transfer overhead
• Shared memory
Lower per packet latency
• Adaptive warp scheduling
Shared Memory Hierarchy
Adaptive Warp Issue
[Figure: CPU, Task FIFO, SMPs, and Shared Memory]
• CPU: monitors the packets
• Task FIFO: no. of packets to be processed
• SMPs: minimum fetch granularity of 1 warp
• Shared Memory: data transfer
Tradeoff in updating the FIFO:
• Too large: average packet delay increases
• Too low: complicated GPU fetch scheduling
Factors:
• Arrival pattern of packets
• Available resources in GPU
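
A rough sketch of warp-granularity fetching from a task FIFO. It is not the Hermes issue policy: it assumes a warp of 32 threads (the NVIDIA warp size), assumes each FIFO entry holds a packet count, and uses the hypothetical name `warps_to_issue` and a hypothetical `free_warp_slots` parameter to model the available GPU resources.

```python
from collections import deque

WARP_SIZE = 32  # threads per warp; assumed, matches NVIDIA GPUs

def warps_to_issue(task_fifo, free_warp_slots):
    """Turn pending packet counts into warps.

    task_fifo holds one packet count per entry. The return value lists
    the number of active threads in each issued warp. A warp is issued
    even when fewer than WARP_SIZE packets are pending, so a lone packet
    does not wait for a full warp to accumulate.
    """
    issued = []
    while task_fifo and len(issued) < free_warp_slots:
        pending = task_fifo[0]
        take = min(pending, WARP_SIZE)
        issued.append(take)
        if pending > take:
            task_fifo[0] = pending - take   # rest of the entry waits
        else:
            task_fifo.popleft()             # entry fully consumed
    return issued
```

With `deque([40, 5])` and two free warp slots, the first entry is split into a full warp of 32 and a partial warp of 8, and the second entry stays queued until a slot frees up.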
In Order Commit
Users of UDP expect packets to arrive in order
[Figure: Shader Core with Warp Allocator, Warp Scheduler, Write Back Stage, DCQ, and Lookup Table (LUT)]
• Lookup Table (LUT): maps DCQ entry ID to warp ID
• DCQ: records warp IDs in flight
• Warps are committed in order
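
The in-order commit idea can be sketched as a small reorder queue. This is a software analogy under stated assumptions, not the hardware design: the class name `DelayCommitQueue` and its methods are hypothetical, and a Python set stands in for the lookup between warp IDs and queue entries.

```python
from collections import deque

class DelayCommitQueue:
    """Warps may finish out of order but commit in issue order."""

    def __init__(self):
        self._in_flight = deque()   # warp IDs in the order they were issued
        self._finished = set()      # finished warps still waiting to commit

    def issue(self, warp_id):
        # Record the warp when it is issued to the shader core.
        self._in_flight.append(warp_id)

    def finish(self, warp_id):
        """Mark a warp as finished.

        Returns the warp IDs that can commit now, in issue order. A warp
        commits only when every warp issued before it has committed.
        """
        self._finished.add(warp_id)
        committed = []
        while self._in_flight and self._in_flight[0] in self._finished:
            head = self._in_flight.popleft()
            self._finished.remove(head)
            committed.append(head)
        return committed
```

If warps 0, 1, 2 are issued and warp 2 finishes first, nothing commits until warp 0 finishes; once warp 1 also finishes, warps 1 and 2 commit together.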
Hardware and Area Overhead
Task FIFO
• 32 bit, 1028 entries
• Area = 0.053 mm²
Delay Commit Queue (DCQ)
• Size depends on maximally allowed concurrent warps (MCWs) and shader cores
• 8 bit, 1028 entries
• Area = 0.013 mm²
DCQ-Warp LUT
• Size depends on number of MCWs
• 16 bit, 32 entries
• Area = 0.006 mm²
Hardware overhead negligible!
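
As a quick check of the figures on this slide, the snippet below adds up the storage and area of the three structures. The widths, entry counts, and areas are copied from the slide; the dictionary layout and the helper name `storage_bytes` are just for illustration.

```python
# (entry width in bits, number of entries, area in mm^2), from the slide
STRUCTURES = {
    "Task FIFO": (32, 1028, 0.053),
    "Delay Commit Queue": (8, 1028, 0.013),
    "DCQ-Warp LUT": (16, 32, 0.006),
}

def storage_bytes(width_bits, entries):
    # Raw storage of one structure, ignoring any control logic.
    return width_bits * entries // 8

per_structure = {name: storage_bytes(w, n)
                 for name, (w, n, _) in STRUCTURES.items()}
total_bytes = sum(per_structure.values())
total_area_mm2 = round(sum(a for _, _, a in STRUCTURES.values()), 3)
```

Under these numbers the three structures hold roughly 5 KB in total and occupy 0.072 mm² combined.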
Experimental Setup
Cycle-accurate GPGPU-Sim used to evaluate performance
Benchmarks
• Checking IP header, packet classification, routing table lookup, decrementing TTL, IP fragmentation, and deep packet inspection
• Both burst and sparse traffic patterns
QoS parameters: throughput, delay, delay variance
Throughput evaluation
[Figure: Computing rates of benchmark applications]
• Better resource utilization with increasing MCW
[Figure panels: Burst traffic without DCQ; Sparse traffic without DCQ]
• No packet queueing
• CPU/GPU still unable to deliver at input rate
• Outperforms CPU/GPU by a factor of 5
Delay analysis
[Figure: Burst traffic without DCQ]
• Divergent branches take longer to process, starving the packets
• Simple processing in GPU; CPU-side waiting overlaps with GPU processing
• Packet delay reduction by 81.2%!
[Figure: Delay with DCQ vs. without DCQ]
Conclusion
• Lack of QoS and CPU-GPU communication overhead are the major bottlenecks
• Hermes: a closely coupled CPU-GPU solution
• Meets stringent delay requirements
• Enables QoS through optimized configuration
• Requires minimal hardware extension
• A novel, high-quality packet processing engine for future software routers
Discussion points
• Are GPUs really easy to program for processing packets?
• How do the performance and area overhead compare with ASIC-based routers?
• Is router programmability really a crucial concern?