Comparative Evaluation of FPGA and ASIC
Implementations of Bufferless and Buffered
Routing Algorithms for On-Chip Networks
Yu Cai Ken Mai Onur Mutlu
Carnegie Mellon University
Motivation
- In many-core chips, on-chip interconnect (NoC) consumes significant power.
  - Intel Terascale: ~30%; MIT RAW: ~40% of system power
- Must devise low power routing mechanisms
[Figure labels: Core, L1, L2 Slice, Router]
Bufferless Deflection Routing
- Key idea: Packets are never buffered in the network. When two packets contend for the same link, one is deflected. (BLESS [Moscibroda, ISCA 2009])
- New traffic can be injected whenever there is a free output link.
- No buffers → lower power, smaller area
- Conceptually simple
[Figure label: Destination]
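The key idea above can be sketched in a few lines of Python. This is a minimal illustration, not the BLESS implementation: the port names, the function names `route_with_deflection` and `can_inject`, and the rule "earlier in the list wins contention" are all assumptions made for the example.

```python
from typing import List

# Hypothetical port names for a 4-port mesh router (assumption).
PORTS = ["N", "E", "S", "W"]


def route_with_deflection(desired: List[str]) -> List[str]:
    """Give every incoming flit a distinct output port, buffering none.

    desired[i] is the port flit i would like to take. Flits earlier in
    the list win contention; that priority order is an assumption here.
    """
    assert len(desired) <= len(PORTS), "at most one flit per input port"
    free = list(PORTS)
    assigned = []
    for port in desired:
        if port in free:
            choice = port      # productive port is available
        else:
            choice = free[0]   # contention lost: deflect to a free port
        free.remove(choice)
        assigned.append(choice)
    return assigned


def can_inject(num_incoming: int) -> bool:
    """New traffic may enter only if an output link is left unclaimed."""
    return num_incoming < len(PORTS)
```

For example, when two flits both want port N, `route_with_deflection(["N", "N"])` sends the first to N and deflects the second to another free port. Because the number of incoming flits never exceeds the number of output ports, a free port always exists and no flit has to wait in a buffer.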
Motivation
- Previous work has evaluated bufferless deflection routing in a software simulator (BLESS [Moscibroda, ISCA 2009])
  - Energy savings
  - Area reduction
  - Minimal performance loss
  - Unfortunately: No real silicon implementation comparison
    → Silicon power, area, latency, and performance?
- Goal: Implement BLESS in FPGA for an apples-to-apples comparison with buffered routing algorithms.
Bufferless Router
- Pipeline the oldest-first algorithm into two stages
  - Rank priority and oldest-first arbiter
Oldest-First Arbiter
- Latency of the arbiter cells is linearly correlated with port number
- Latency of each arbiter cell is also correlated with port number
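A software sketch of the two stages (rank priorities, then oldest-first arbitration) may help show where the port-count dependence comes from. This is an illustrative model under stated assumptions, not the hardware design from the slides: the `Flit` fields, the tie-break on `flit_id`, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Flit:
    flit_id: int
    age: int                  # cycles since injection; larger means older
    wanted: Tuple[str, ...]   # productive output ports, in preference order


def rank_oldest_first(flits: List[Flit]) -> List[Flit]:
    """Stage 1: order flits from highest to lowest priority (oldest first).

    Ties are broken by flit_id, which is an assumption of this sketch.
    """
    return sorted(flits, key=lambda f: (-f.age, f.flit_id))


def arbitrate(ranked: List[Flit], ports: List[str]) -> Dict[int, str]:
    """Stage 2: a chain of arbiter cells, one per ranked flit.

    Each cell sees only the ports left over by the cells before it, so
    the chain is serial and its length grows with the number of ports.
    Assumes len(ranked) <= len(ports).
    """
    free = list(ports)
    grants: Dict[int, str] = {}
    for flit in ranked:
        # Take the first wanted port that is still free, else deflect.
        choice = next((p for p in flit.wanted if p in free), free[0])
        free.remove(choice)
        grants[flit.flit_id] = choice
    return grants
```

In this model the oldest flit always receives a port it wants, which is the property oldest-first arbitration relies on. Each cell also searches the remaining free ports, so the work per cell grows with port count as well as the length of the chain.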
Bufferless Router with Buffers
[Figure: router block diagram. Labels: Controller (FSM); Buffer; Route Computing; ports W, N, E, S; Ages; Rank Priorities (Highest Priority to Lowest Priority); Force Schedule; Pipeline Regs; Oldest-first Arbiter; Route Configuration; Smart Crossbar; Feedback]
- Force schedule can force the flits to be assigned an output port, even if it is not the desired port
Baseline Buffered Router
[Figure: router block diagram. Labels: Flits and Credits on the North, West, East, South, and Side (Resource) ports; Router Logic; Input Controller with Route Compute, VC0, VC1; VC Req/Rel and VC Response; Virtual Channel Allocator; Switch Req and Switch Res; Switch Allocator; GROP; CrossBar]
Emulation Platform Setup
- FPGA platform (Berkeley Emulation Engine II)
  - Virtex-II Pro FPGA
- Network topology
  - 3x3 and 4x4 Mesh network
  - 3x3 and 4x4 Torus network
[Figure: (a) 4x4 Torus Network; (b) 4x4 Mesh Network. Each panel shows 16 Router nodes with 4-bit IDs (0000 to 1111), each paired with a Res block, laid out along x and y axes]
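The difference between the two topologies can be sketched as a neighbor function. This is an illustrative model only: it addresses routers by (x, y) coordinates rather than the 4-bit IDs in the figure, and the function names are assumptions.

```python
from typing import List, Tuple


def neighbors(x: int, y: int, k: int, torus: bool) -> List[Tuple[int, int]]:
    """Return the routers directly linked to (x, y) in a k x k network.

    In a mesh, edge routers have fewer links; in a torus, links wrap
    around so every router has four neighbors.
    """
    result = []
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if torus:
            result.append((nx % k, ny % k))       # wrap-around link
        elif 0 <= nx < k and 0 <= ny < k:
            result.append((nx, ny))               # link exists only inside the grid
    return result


def link_count(k: int, torus: bool) -> int:
    """Count bidirectional router-to-router links (each seen from both ends)."""
    total = sum(len(neighbors(x, y, k, torus)) for x in range(k) for y in range(k))
    return total // 2
```

Under this model a 4x4 mesh has 24 router-to-router links and a 4x4 torus has 32, since the torus adds one wrap-around link per row and per column.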
Emulation Configuration
- Router
  - Buffered router with 1, 2, and 4 virtual channels
  - Bufferless router
  - Bufferless router with one flit buffer in the input port
- Traffic patterns
FPGA Area Comparison
FPGA Frequency Analysis
FPGA Power Analysis
Network Power Analysis
Packet Latency (4x4 Torus)
Packet Latency (4x4 Mesh)
ASIC Implementation Results
Conclusion
- First realistic and detailed implementation of deflection-based bufferless routing in on-chip networks
- Bufferless routing is efficiently implementable in mesh and torus based on-chip networks, leading to significant area (38%) and power consumption (30%) reductions over the best buffered router implementation on a 65nm ASIC design
- We identify oldest-first arbitration, which is used to guarantee livelock freedom of bufferless routing algorithms, as a key design challenge of bufferless router design
  - With careful design, oldest-first arbitration can be pipelined and can outperform virtual channel arbitration
Thank You.