
Supporting
Differentiated Services in Computers
via Programmable Architecture for
Resourcing-on-Demand
(PARD)
Jiuyue Ma, Xiufeng Sui, Ninghui Sun, Yupeng Li, Zihao Yu,
Bowen Huang, Tianni Xu, Zhicheng Yao, Yun Chen,
Haibin Wang, Lixin Zhang, Yungang Bao
ICT, CAS
Huawei
2015.03.16
Data Center Era
• Data center as an infrastructure
  • Internet services
  • Cloud computing
• Sharing in data centers
  • Google: millions of jobs over 12,000 servers in a month
Diverse Workloads
• Latency-critical
  • Search engines
  • Online shopping
• Throughput-oriented
  • Data analytics
  • Indexing (~6 min)
• Others
  • Test-and-debug
C. Reiss et al. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. SoCC, 2012.
Sharing -> Interference
• Noisy neighbors
Christine Wang. Intel® Xeon® Processor E5-2600 v3 Product Family Performance & Platform Solutions. 2014.
[Yang et al., ISCA'13] Bubble-Flux: Precise Online QoS Management for Increased Utilization in Warehouse Scale Computers
[Kambadur et al., SC'12] Measuring Interference Between Live Datacenter Applications
Prior Works
• Covering the whole system stack
  • From hardware to user applications
• However,
  • Scenario-specific
  • Contention-varying
  • Time-consuming
  • Ad-hoc solutions

#  | Layer         | Contention Point
1  | Datacenter    | maintenance activities [Dean, Commun. ACM 2013][Dean, 2012] or backup jobs [Yu, NSDI'2011]
2  | App           | global file system [Dean, Commun. ACM 2013]
3  | App           | background daemon [Dean, Commun. ACM 2013]
4  | Network stack | small packets triggering Nagle's algorithm [Yu, NSDI'2011]
5  | Network stack | limited buffers [Yu, NSDI'2011]
6  | Network stack | delayed ACK resulting in RTO [Yu, NSDI'2011]
7  | Network stack | TCP congestion control [Alizadeh, SIGCOMM'2010][Alizadeh, NSDI'2012]
8  | Network stack | packet scheduling [Vamanan, SIGCOMM'2012][Wilson, SIGCOMM'2011][Zats, SIGCOMM Comput. Commun. 2012][Hong, SIGCOMM Comput. Commun. 2012]
9  | OS (kernel)   | kernel sockets [Leverich, EuroSys'14]
10 | OS (kernel)   | lock contention [Kapoor, SoCC'2012]
11 | OS (kernel)   | context switches [Leverich, EuroSys'14]
12 | OS (kernel)   | kernel scheduling overhead [Leverich, EuroSys'14]
13 | OS (kernel)   | SMT load imbalance [Leverich, EuroSys'14]
14 | OS (kernel)   | IRQ imbalance [Leverich, EuroSys'14]
15 | Power         | C-states [Leverich, EuroSys'14]
16 | Power         | DVFS [Leverich, EuroSys'14]
17 | Hypervisor    | virtual machine scheduling [Xu, NSDI'2013][Wang, INFOCOM'2010][Xu, SoCC'2013]
18 | Hypervisor    | network isolation [Wang, INFOCOM'2010][Shieh, NSDI'2011][Xu, SoCC'2013][Jeyakumar, NSDI'2013]
19 | Hardware      | SMT [Zhang, MICRO'2014]
20 | Hardware      | shared caches [Leverich, EuroSys'14][Tang, ISCA'2011][Kasture, ASPLOS'2014][Sanchez, ISCA'2011][Sanchez, MICRO'2010][Qureshi, MICRO'2006][Thereska, SOSP'2013]
21 | Hardware      | memory [Tang, ISCA'2011][Yang, ISCA'2013][Muralidhara, MICRO'2011][Delimitrou, IISWC'2013]
22 | Hardware      | NIC [Radhakrishnan, NSDI'2014]
23 | Hardware      | I/O [Mesnier, SIGOPS'2011][Delimitrou, IISWC'2013]
Integrated Solutions
• Online-service data center: over-provisioning
  +
• Batch-workload data center: highly shared
• Software optimization: cgroup, backup request, LXC, priority, sync-backup-tasks
Google datacenter utilization (Jan-Mar, 2013) [1]:
  online service ~30% vs. batch workload ~75%
[1] L. Barroso, J. Clidaras, U. Hölzle. The Datacenter as a Computer (2nd Edition). July 2013.
Data Center Era
• Internet Era, 1990s
  • Applications sharing infrastructure: HTTP, FTP, VoIP, streaming media, games, …
  • Different requirements: VoIP, games: latency-critical; FTP, VoD: bandwidth-sensitive; email: best effort
  • QoS problem -> QoS: 1994, Integrated Services; 1998, Differentiated Services
• Data Center Era, 2010s
  • Applications sharing infrastructure: search, online shopping, cloud computing, …
  • Different requirements: priority, throughput, latency, …
  • QoS problem: QoS vs. utilization -> separate online/offline services
Software-Defined Networking (SDN)
• Each packet tagged with a flow id
• Control plane
  • Packet filters
  • Tag-based rules
• Programming interface
  • OpenFlow => access the control plane
  • API => business applications
Rationale of QoS Technologies
• IntServ, DiffServ:
  • Propagate network applications' QoS requirements to the network hardware
• Also recognized by the architecture community:
  "New, high-level interfaces are required to convey programmer and compiler knowledge to the hardware."
  (21st Century Computer Architecture)
Observation
A computer is inherently a network.
Apply networking QoS technologies to computer architecture?
Yes!
PARD
Programmable Architecture for Resourcing-on-Demand
w/o loss of QoS
Three Challenges
Programmable Architecture for Resourcing-on-Demand
• How to support QoS?
• How to design?
• How to deploy?
Challenge #1
How to enable computer hardware to distinguish different applications?
[Figure: what hardware expects, a single application per machine, vs. reality: APP0…APPn on a hypervisor sharing the cores, the shared last-level cache, the memory controller, the I/O chipset, disks, and the NIC]
Tagging Each Application
• Tagging grain
  • VM-level tagging
  • Container-level tagging
  • Process-level tagging
  • Thread-level tagging
  • Object-level tagging (fine-grain tagging)
• Local tagging
• Cross-server tagging
  • Connect to network tagging mechanisms
In Response to This Morning's Keynote
Timing information (e.g. deadlines) can be integrated into tags to convey software's timing requirements to the hardware.
Tagging Source
[Figure: add DS-id tag registers to each core (one per VM), the shared last-level cache, the memory controller, the I/O chipset, disks, and the NIC]
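To make the tag registers concrete, here is a hypothetical sketch of assigning a DS-id, assuming the PRM firmware (introduced later) exposed the per-core tag registers through its file interface; the path and the 'dsid' entry are invented for illustration and are not shown in the talk:

    # Hypothetical: tag all requests issued by the core running VM2 with
    # DS-id 2, assuming a PRM-exported parameter file for the tag register.
    echo 2 > /sys/cpa/cpa_core0/ldoms/ldom2/parameter/dsid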
Tagging Datapath
[Figure: a request issued by a core carries its DS-id tag as it travels through the shared last-level cache, the memory controller, the I/O chipset, and the devices (tagged request: Core -> …)]
Tagging Datapath
[Figure: responses and DMA traffic coming back from devices are tagged as well (tagged response & DMA: Dev -> …), so the DS-id follows the full datapath in both directions]
How to Use Tag?
[Figure: a control plane (CP) attached to each shared resource acts on DS-id tags: cache partitioning at the shared last-level cache, rate limiting at the memory controller, and priority-based scheduling at the I/O chipset]
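In the file interface introduced later in the talk, each of these mechanisms would surface as a CP parameter; a hypothetical sketch (the cpa assignments and the 'rate_limit' / 'priority' entry names are invented for illustration):

    # Hypothetical: cap LDom1's memory bandwidth at the memory-controller CP
    # and raise LDom0's scheduling priority at the I/O-chipset CP.
    echo 1024 > /sys/cpa/cpa2/ldoms/ldom1/parameter/rate_limit
    echo 7 > /sys/cpa/cpa1/ldoms/ldom0/parameter/priority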
Challenge #2
How to design control planes (CPs) for a diversity of hardware?
CP Design Choices
• Table-based vs. processor-based
• Table-based
  • Simple to implement, fast
  • Inflexible
• Processor-based
  • Supports advanced functionalities
  • Complicated, slow

Processor-based firmware example:

    loop:
        rbld
        rbrd r1, <req.type offset>
        cmp  r1, REQUEST
        be   .request
        cmp  r1, RESPONSE
        be   .response
    .dispatch:
        rbst
        b    .loop
    .request:
        call encrypt
        b    .dispatch
    .response:
        call decrypt
        b    .dispatch
Advanced Functionality: Trigger => Action
• Statistics Table: per-DS-id statistics (Stat1, Stat2, …)
• Trigger Table: per-DS-id (condition, action) entries; a condition compares a statistic against a threshold (op: >, <, =), e.g. miss_rate > 30%
• When a condition is met, an action signal invokes a firmware action script, e.g. adjust the way mask
• Parameter Table: per-DS-id parameters (Param1, Param2, …) that the action updates
Final CP Design
Three tables + programming interface + interrupt line
• Parameter Table: DS-id -> (Param1, Param2, …)
• Statistics Table: DS-id -> (Stat1, Stat2, …)
• Trigger Table: DS-id -> (Cond, Action), compared against the Statistics Table
Control plane:
• Three control tables: Parameter / Statistics / Trigger
• A programming interface: control-table reads/writes
• An interrupt logic: sends an interrupt when a trigger condition is met
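A minimal sketch of how the three tables surface to software through the file-based programming interface shown later in the talk; 'miss_rate' and 'waymask' are example entry names taken from the evaluation slides:

    # Read a Statistics Table entry for one logic domain (tag):
    cat /sys/cpa/cpa0/ldoms/ldom0/statistics/miss_rate
    # Write a Parameter Table entry (here: the LLC way mask):
    echo 0xFF00 > /sys/cpa/cpa0/ldoms/ldom0/parameter/waymask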
Integrate into HW Components
• The same common control plane structure is integrated into the cache controller and the memory controller
Challenge #3
How to define/program a resourcing-on-demand policy into hardware?
[Recap: the control plane with its Parameter / Statistics / Trigger tables and programming interface. Where does the policy come from?]
Platform Resource Manager (PRM)
• Augmented IPMI
• Connects all control planes (CPs)
• Runs Linux-based firmware
• Abstracts CPs as files:

    /sys/cpa
      cpa0
        ident
        type
        ldoms
          ldom0
            parameter (param1, param2, …)
            statistics
            trigger
          ldom1
          ldom2
      cpa1
      cpa2

[Figure: the centralized PRM programs, monitors, and receives interrupts from the CPs attached to the shared last-level cache, memory controller, I/O chipset, disks, and NIC]
Control Plane File Structure
• Each cpaX directory points to one CP
• Each ldomN directory corresponds to one tag (DS-id)
• The parameter / statistics / trigger entries map to the CP's Parameter, Statistics, and Trigger tables for that DS-id
Access Control Planes
Query control plane info:
    cat /sys/cpa/cpa0/ident
    cat /sys/cpa/cpa0/type
Query parameters:
    cat /sys/cpa/cpa0/…/parameter/param1
Set parameters:
    echo 10 > /sys/cpa/cpa0/…/parameter/param2
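A small usage sketch built from the commands above: enumerate every control plane the PRM exposes and print its identity and type, assuming the /sys/cpa layout shown earlier:

    #!/bin/sh
    # Walk all control planes under /sys/cpa and print ident/type for each.
    for cp in /sys/cpa/cpa*; do
        echo "$cp: ident=$(cat $cp/ident), type=$(cat $cp/type)"
    done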
Program "Trigger->Action"
[Recap: the Trigger Table holds per-DS-id (condition, action) pairs that are compared against the Statistics Table entries]
Program "Trigger->Action"
Control plane address space: a 64-bit IDENT (IDENT_HIGH encodes the CP type), then a 16-bit tag, a table-selection field, and an offset; commands are issued as (addr, cmd, data) with 32-bit data.

[Figure: Control Plane Programming Methodology]

1. Register the trigger:

    pardtrigger /dev/cpa0 -ldom=0 -action=0 -stats=miss_rate -cond=gt,30

   This installs the Trigger Table entry (tag=0, stats=miss_rate, op='>', val=30%).

2. Prepare the action script, e.g. /cpa0_ldom0_t0.sh:

    #!/bin/sh
    echo "<log message>" > /log/triggers.log
    cur_mask=`cat /sys/cpa/.../waymask`
    miss_rate=`cat /sys/cpa/.../miss_rate`
    capacity=`cat /sys/cpa/.../capacity`
    target=`update_mask $cur_mask $miss_rate $capacity`
    echo $target > /sys/cpa/.../waymask

   (update_mask stands for a helper that computes the new way mask from the current mask, miss rate, and capacity.)

3. Install the trigger action script:

    echo "/cpa0_ldom0_t0.sh" > /sys/cpa/cpa0/ldoms/ldom0/triggers/0
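The same three steps should generalize to other resources; a hypothetical sketch for a memory-bandwidth trigger (the 'bandwidth' statistic, the cpa2 assignment, and the threshold are illustrative assumptions, not shown in the talk):

    # Hypothetical: fire an action when LDom1's memory bandwidth exceeds a
    # threshold, assuming the memory-controller CP is cpa2 and it exports a
    # 'bandwidth' statistic.
    pardtrigger /dev/cpa2 -ldom=1 -action=0 -stats=bandwidth -cond=gt,1024
    echo "/cpa2_ldom1_t0.sh" > /sys/cpa/cpa2/ldoms/ldom1/triggers/0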
Implementation and Evaluation
• Full-system cycle-accurate simulator, open-sourced*
  [Figure: simulated PARD server (based on gem5, 4 cores, 8GB) partitioned into four logic domains: the PRM running Linux-based firmware, plus domains running unmodified Linux with an unmodified app and an interference micro-benchmark]
• Ongoing work: FPGA prototype on a Xilinx VC709 development board
*available at http://github.com/fsg-ict/PARD-gem5
Full System Simulator of a Server
• Based on gem5, supporting OoO models
• 4 cores -> 4 LDoms
• Cache: MissRate -> WayMask
• Memory: address mapping, priority, bandwidth
• I/O
Fully HW-supported Virtualization
• Partition a single PARD server into 4 logic domains (LDoms)
• Boot 3 of the 4 LDoms with unmodified linux-2.6.28.4
• Manually adjust the cache partition after the system is up

[Figure 7: Dynamically partition a PARD server into four LDoms and launch three LDoms in turn. Memory bandwidth (GB/s) and occupied last-level cache (MB) over simulated time (500-2500 ms): LDom0 boots to bash and runs 437.leslie3d, LDom1 runs 470.lbm, and LDom2 runs a CacheFlush micro-benchmark.]

The goal is to verify the new functionalities enabled by the PARD architecture and the overhead of the current control plane design; without partitioning, the LDoms contend for hardware resources such as LLC capacity. To guarantee reasonable LLC capacity for LDom0, three echo commands adjust the partition. The simulated LLC is 16-way, so way mask 0xFF00 allocates eight ways to LDom0, while mask 0x00FF makes LDom1 and LDom2 share the other eight ways:

    echo 0xFF00 > /sys/cpa/cpa0/ldom0/parameters/waymask
    echo 0x00FF > /sys/cpa/cpa0/ldom1/parameters/waymask
    echo 0x00FF > /sys/cpa/cpa0/ldom2/parameters/waymask

Methodology: gem5's SimpleTiming mode boots Linux, launches and warms up workloads, and makes checkpoints; the run then switches to the cycle-accurate out-of-order model.
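Since the way mask is just a bit vector over the 16 ways, the manual repartition reduces to a short script; a minimal sketch using only the commands from the figure, plus a readback to confirm (assuming waymask is readable, as the parameter files above are):

    #!/bin/sh
    # Repartition the 16-way LLC: high 8 ways to LDom0, low 8 ways shared
    # by LDom1 and LDom2, then read the masks back to confirm.
    echo 0xFF00 > /sys/cpa/cpa0/ldom0/parameters/waymask
    for d in ldom1 ldom2; do
        echo 0x00FF > /sys/cpa/cpa0/$d/parameters/waymask
    done
    for d in ldom0 ldom1 ldom2; do
        echo "$d: $(cat /sys/cpa/cpa0/$d/parameters/waymask)"
    done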
Trigger => Action
• Utilization: 25% ==> 100%
• One PARD server: memcached on LDom#0; 3x CacheFlush micro-benchmark on LDom#1, LDom#2, and LDom#3
Trigger => Action
• Change of memcached's cache miss rate

[Figure: left, memcached cache miss rate (%) over simulated time (60-180 ms) at 20KRPS, annotated with T0-T3; right, memcached response time (ms) vs. load (10-25 KRPS) for solo, shared, and w/ LLC Trigger runs ("memcached alone" vs. "co-run with interference"), annotated with "utilization 25%->100%" and 17.5KRPS]

• T0: memcached alone (monopolizes the cache)
• T1: start 3x CacheFlush (shared cache, increased miss rate)
• T2: trigger condition met (MissRate > 30%), apply the way-partition mechanism
• T3: MissRate restored (~10%)
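The way-partition applied at T2 can be sketched as an action script in the style of /cpa0_ldom0_t0.sh; the mask values mirror the Figure 7 partition, and the exact paths for this four-domain setup are assumptions:

    #!/bin/sh
    # Hypothetical T2 action: give LDom#0 (memcached) a private half of the
    # 16-way LLC and make the three CacheFlush domains share the other half.
    echo 0xFF00 > /sys/cpa/cpa0/ldom0/parameters/waymask
    for d in ldom1 ldom2 ldom3; do
        echo 0x00FF > /sys/cpa/cpa0/$d/parameters/waymask
    done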
CP Overhead
• Preliminary FPGA prototype (Xilinx VC709, xc7vx690t)
  • Cache controller: OpenSPARC T1 L2 cache
  • Memory controller: Xilinx MIG 7 Series
• Results
  • The pipelined structure of the LLC/MC hides the latency introduced by the control plane logic
  • The control plane logic introduces only modest resource overhead (3.5% for the LLC, 10.1% for the MC)
Summary
• Data centers face a tough trade-off between utilization and applications' QoS
• A computer is inherently a network, so networking QoS technologies can be applied to computer architecture
• We propose PARD, which provides a new interface for software to convey QoS requirements to the hardware
Q&A
Supporting Differentiated Services in Computers via Programmable Architecture for Resourcing-on-Demand (PARD)
Get updates of the PARD simulator at http://github.com/fsg-ict/PARD-gem5
Backup: Pipelined Cache
• Pipeline of a write request:
  Receive Write Request -> Access TagArray -> Access DataArray -> Access LRUHistory -> Lookup Parameter Table -> Access MSHR -> Update TagArray -> Update Statistics Table -> Check Trigger Table -> Send Memory Request
• LRU replacement is enhanced with way-partition support