
GPUvm: Why Not Virtualizing GPUs at the Hypervisor?
Yusuke Suzuki*, in collaboration with Shinpei Kato**, Hiroshi Yamada***, Kenji Kono*
* Keio University  ** Nagoya University  *** Tokyo University of Agriculture and Technology

Graphics Processing Unit (GPU)
•  GPUs are used for data-parallel computations
–  Composed of thousands of cores
–  Peak double-precision performance exceeds 1 TFLOPS
–  Performance per watt of GPUs exceeds that of CPUs
•  GPGPU is widely accepted for various uses
–  Network systems [Jang et al. ’11], file systems [Silberstein et al. ’13] [Sun et al. ’12], DBMS [He et al. ’08], etc.
[Figure: an NVIDIA GPU with per-core L1 caches, a shared L2 cache, and video memory, alongside a CPU with main memory]
Motivation
•  The GPU is not a first-class citizen of cloud computing environments
–  Cannot multiplex GPGPU workloads among virtual machines (VMs)
–  Cannot consolidate VMs that run GPGPU applications
•  GPU virtualization is necessary
–  Virtualization is the norm in the clouds
[Figure: multiple VMs on a hypervisor share a single GPU on one physical machine]
Virtualization Approaches
•  Categorized into three approaches
1.  I/O pass-through
2.  API remoting
3.  Para-virtualization

I/O Pass-Through
•  Amazon EC2 GPU instances, Intel VT-d
–  Assign physical GPUs to VMs directly
–  Multiplexing is impossible
[Figure: each GPU is assigned directly to one VM through the hypervisor]
API Remoting
•  GViM [Gupta et al. ’09], rCUDA [Duato et al. ’10], VMGL [Lagar-Cavilla et al. ’07], etc.
–  Forward API calls from VMs to the host’s GPUs
–  API and library-version compatibility problems
–  Enlarges the trusted computing base (TCB)
[Figure: wrapper libraries (v4, v5) in the VMs forward API calls to the host library (v4) and driver, which drive the GPU]
Para-Virtualization
•  VMware SVGA2 [Dowty ’09], LoGV [Gottschalk et al. ’10]
–  Expose an idealized GPU device model to VMs
–  Guest device driver must be modified or rewritten
[Figure: para-virtualized (PV) drivers in the VMs issue hypercalls to the hypervisor, which accesses the GPU through the host driver]
Goals
•  Fully virtualize GPUs
–  Allow multiple VMs to share a single GPU
–  Without any driver modification
•  The vanilla driver can be used “as is” in VMs
•  The GPU runtime can be used “as is” in VMs
•  Identify the performance bottlenecks of full virtualization
–  GPU details are not open…
[Figure: each VM runs an unmodified library and driver against a virtual GPU multiplexed onto the physical GPU]
Outline
•  Motivation & Goals
•  GPU Internals
•  Proposal: GPUvm
•  Experiments
•  Related Work
•  Conclusion

GPU Internals
•  PCIe-connected discrete GPUs (NVIDIA, AMD)
•  The driver accesses the GPU with MMIO through PCIe BARs (a minimal access sketch follows)
•  Three major components
–  GPU computing cores, GPU channels, and GPU memory
[Figure: the driver and applications on the CPU issue MMIO through PCIe BARs to the GPU channels, computing cores, and memory]
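As a concrete illustration of this access path, the following minimal user-space C sketch maps a GPU's BAR0 on Linux and performs one MMIO read. The sysfs path, BAR size, and register offset are placeholders; real NVIDIA register layouts are undocumented and differ per GPU.

/* A minimal user-space sketch of MMIO through a PCIe BAR on Linux.
 * The sysfs path, BAR size, and register offset are placeholders;
 * real NVIDIA register layouts are undocumented and differ per GPU. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR0_SIZE   0x1000000u   /* assumed size of BAR0 (16 MiB) */
#define REG_EXAMPLE 0x0u         /* hypothetical register offset */

int main(void)
{
    /* BAR0 of the GPU as exposed by the Linux PCI subsystem (device address is an example). */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar0 = mmap(NULL, BAR0_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (bar0 == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* An MMIO read: the load travels over PCIe to the device, not to RAM. */
    uint32_t val = bar0[REG_EXAMPLE / 4];
    printf("reg 0x%x = 0x%08x\n", REG_EXAMPLE, val);

    munmap((void *)bar0, BAR0_SIZE);
    close(fd);
    return 0;
}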
GPU Channel & Computing Cores
•  A GPU channel is a hardware unit for submitting commands to the GPU computing cores
•  The number of GPU channels is fixed
•  Multiple channels can be active at a time
[Figure: applications submit GPU commands through GPU channels; the commands are executed on the computing cores]
GPU Memory
•  Memory accesses from the computing cores are confined by GPU page tables
[Figure: each GPU channel holds a pointer to a GPU page table; GPU virtual addresses used by the computing cores are translated to GPU physical addresses]

Unified Address Space
•  GPU and CPU memory spaces are unified
–  A GPU virtual address (GVA) can be translated to a CPU physical address as well as to a GPU physical address (GPA)
[Figure: the GPU page table maps GVAs either to GPAs in GPU memory or to CPU physical addresses in CPU memory]
DMA Handling in the GPU
•  DMAs from the computing cores are issued with GVAs
–  Confined by GPU page tables
•  DMAs must be isolated between VMs
[Figure: for Memcpy(GVA1, GVA2), GVA1 is translated to a GPA in GPU memory and GVA2 to a CPU physical address, so the DMA crosses between GPU and CPU memory]
Outline
•  Motivation & Goals
•  GPU Internals
•  Proposal: GPUvm
•  Experiments
•  Related Work
•  Conclusion

GPUvm Overview
•  Isolate GPU channels, computing cores & memory among VMs
[Figure: each VM sees a virtual GPU; the GPU channels and GPU memory regions are partitioned between VM1 and VM2, while the computing cores are time-shared]
GPUvm Architecture
•  Expose a virtual GPU to each VM and intercept & aggregate MMIO accesses to the virtual GPUs (sketched below)
•  Maintain the virtual GPU views and arbitrate accesses to the physical GPU
[Figure: GPUvm runs in the host, intercepts MMIOs from the unmodified library/driver stack in each VM, and maps the per-VM virtual GPUs onto the physical GPU]
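The sketch below illustrates this interception flow under assumed types and register ranges; it is not GPUvm's actual code. Every trapped BAR access from a guest is first applied to that VM's virtual-GPU state and only then issued to the physical BAR.

/* Hypothetical sketch of the interception flow (not GPUvm's actual code):
 * every trapped BAR access from a guest is first applied to that VM's
 * virtual-GPU state, and only then issued to the physical BAR. Types,
 * register ranges, and the sensitivity test are illustrative. */
#include <stdbool.h>
#include <stdint.h>

struct vgpu {
    uint32_t regs[0x1000];     /* shadowed register file of the virtual GPU */
    int      vmid;
};

/* Decide whether a register is virtualization-sensitive (assumed range). */
static bool is_sensitive(uint32_t offset)
{
    return offset < 0x4000;    /* e.g., channel and page-table control registers */
}

/* Called by GPUvm when a guest MMIO write traps. */
void vgpu_mmio_write(struct vgpu *v, volatile uint32_t *phys_bar,
                     uint32_t offset, uint32_t data)
{
    v->regs[(offset / 4) % 0x1000] = data;    /* update the virtual view */

    if (is_sensitive(offset)) {
        /* Here guest channel IDs and GPU addresses would be translated to
         * their shadow counterparts before issuing (see the later slides). */
    }
    phys_bar[offset / 4] = data;              /* aggregate onto the real GPU */
}

/* Guest MMIO reads are answered from the virtual view for sensitive areas. */
uint32_t vgpu_mmio_read(struct vgpu *v, volatile uint32_t *phys_bar,
                        uint32_t offset)
{
    if (is_sensitive(offset))
        return v->regs[(offset / 4) % 0x1000];
    return phys_bar[offset / 4];
}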
GPUvm Components
1.  GPU shadow page table
–  Isolates GPU memory
2.  GPU shadow channel
–  Isolates GPU channels
3.  GPU fair-share scheduler
–  Isolates the GPU time consumed on the GPU computing cores

GPU Shadow Page Table
•  Create GPU shadow page tables (see the sketch below)
–  Memory accesses from the GPU computing cores are confined by the GPU shadow page tables
[Figure: GPUvm keeps each shadow page table consistent with the guest GPU page table; accesses to the VM’s own GPU memory region are allowed, while accesses to another VM’s region are not]
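A minimal sketch of the shadowing step, assuming a flat (single-level) page table and one contiguous GPU memory region per VM; GPUvm's real tables are multi-level and also remap CPU-side addresses, which is elided here.

/* Guest entries are copied into the shadow table only if they point inside
 * the GPU memory region assigned to that VM, so the computing cores can
 * never reach another VM's memory. Structures are illustrative. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u

struct gpu_region { uint64_t base, size; };   /* GPU memory slice of one VM */

static bool in_region(const struct gpu_region *r, uint64_t gpa)
{
    return gpa >= r->base && gpa < r->base + r->size;
}

/* guest_pt: page table written by the guest driver
 * shadow_pt: page table actually walked by the GPU
 * Entries hold flags in the low bits and a 4 KiB-aligned frame address above. */
void shadow_sync(const uint64_t *guest_pt, uint64_t *shadow_pt, size_t n,
                 const struct gpu_region *vm_region)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t pte = guest_pt[i];
        uint64_t gpa = pte & ~0xFFFull;       /* assumed frame-address mask */

        if ((pte & PTE_PRESENT) && in_region(vm_region, gpa))
            shadow_pt[i] = pte;               /* inside the VM's region: allow */
        else
            shadow_pt[i] = 0;                 /* not present or out of region: block */
    }
}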
GPU Shadow Page Table & DMA
•  DMAs are also confined by the GPU shadow page tables
–  Since DMAs are issued with GVAs
•  Other DMAs can be intercepted by MMIO handling
[Figure: a DMA issued with GVAs by VM1 is allowed only to VM1’s GPU memory region and to VM1’s portion of CPU memory; DMAs targeting VM2’s memory are not allowed]
GPU Shadow Channel
•  Channels are logically partitioned among VMs
•  Maintain mappings between virtual and shadow channels (see the sketch below)
[Figure: each virtual GPU exposes virtual channels 0, 1, 2, …, which are mapped onto disjoint GPU shadow channels, e.g., VM1’s channels onto 0, 1, 2, … and VM2’s onto 64, 65, 66, …]
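A toy illustration of the mapping, assuming each VM owns a fixed block of 64 hardware channels as suggested by the figure; the block size and layout are assumptions, not hardware constants.

/* Map (vmid, virtual channel) onto a physical (shadow) channel by giving
 * each VM a disjoint, fixed-size block of the hardware channels. */
#include <stdio.h>

#define CHANNELS_PER_VM 64

/* Returns the shadow channel, or -1 if the virtual channel is out of range. */
static int shadow_channel(int vmid, int vchan)
{
    if (vchan < 0 || vchan >= CHANNELS_PER_VM)
        return -1;
    return vmid * CHANNELS_PER_VM + vchan;
}

int main(void)
{
    /* VM1's channel 2 and VM2's channel 2 land on different hardware channels. */
    printf("VM1 chan 2 -> %d\n", shadow_channel(0, 2));   /* 2  */
    printf("VM2 chan 2 -> %d\n", shadow_channel(1, 2));   /* 66 */
    return 0;
}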
GPU Fair-Share Scheduler
•  Schedules non-preemptive command executions (a simplified dispatch sketch follows)
•  Employs the BAND scheduling algorithm [Kato et al. ’12]
•  GPUvm can employ existing algorithms
–  VGRIS [Yu et al. ’13], Pegasus [Gupta et al. ’12], TimeGraph [Kato et al. ’11], Disengaged Scheduling [Menychtas et al. ’14]
[Figure: commands from each VM’s virtual GPU channels are queued per VM; the GPU fair-share scheduler dispatches them onto the GPU shadow channels, time-sharing the GPU between VM1 and VM2]
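The sketch below shows a simplified fair-share dispatch loop in this spirit; it is not the BAND algorithm itself, and all structures and the dispatch stub are illustrative.

/* Simplified fair-share dispatch, NOT the BAND algorithm itself: commands stay
 * queued per VM, the backlogged VM with the least accumulated GPU time runs
 * next, and each command group runs to completion (non-preemptive) before its
 * runtime is charged back. */
#include <stdint.h>

#define NVM 2

struct vm_queue {
    int      pending;        /* number of queued command groups */
    uint64_t gpu_time_us;    /* GPU time consumed so far */
};

/* Stub standing in for real command submission and the completion interrupt;
 * it would return the measured execution time in microseconds. */
static uint64_t dispatch_and_wait(int vmid)
{
    (void)vmid;
    return 1000;
}

void schedule_once(struct vm_queue q[NVM])
{
    int pick = -1;

    /* Choose the backlogged VM that has received the least GPU time so far. */
    for (int i = 0; i < NVM; i++) {
        if (q[i].pending == 0)
            continue;
        if (pick < 0 || q[i].gpu_time_us < q[pick].gpu_time_us)
            pick = i;
    }
    if (pick < 0)
        return;                                      /* nothing to run */

    q[pick].pending--;
    q[pick].gpu_time_us += dispatch_and_wait(pick);  /* non-preemptive run */
}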
Optimization Techniques
•  Introduce several optimization techniques to reduce the overhead caused by GPUvm
1.  BAR Remap
2.  Lazy Shadowing
3.  Para-virtualization

BAR Remap
•  MMIO through the PCIe BARs is intercepted by GPUvm
•  Allow direct BAR accesses to the non-virtualization-sensitive areas (see the sketch below)
[Figure: in the naive scheme every guest MMIO to the virtual BAR is intercepted, aggregated by GPUvm, and re-issued to the physical BAR; with BAR Remap only the sensitive areas are intercepted, while non-sensitive areas are accessed directly]
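A sketch of how the split might be set up, assuming page-granularity control over which guest BAR pages trap; the sensitive ranges and the mapping helpers are invented for illustration.

/* Pages covering virtualization-sensitive register ranges stay unmapped so
 * guest accesses fault into GPUvm; the remaining pages are mapped straight
 * through to the physical BAR. */
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u

struct range { uint32_t start, end; };

/* Hypothetical virtualization-sensitive areas of BAR0. */
static const struct range sensitive[] = {
    { 0x000000u, 0x004000u },   /* e.g., channel control registers */
    { 0x100000u, 0x104000u },   /* e.g., page-table / TLB-flush registers */
};

static bool page_is_sensitive(uint32_t off)
{
    for (unsigned i = 0; i < sizeof(sensitive) / sizeof(sensitive[0]); i++)
        if (off < sensitive[i].end && off + PAGE_SIZE > sensitive[i].start)
            return true;
    return false;
}

/* Stand-ins for the hypervisor's second-level page-table operations
 * (e.g., installing or withholding EPT mappings). */
static void map_direct(uint32_t guest_off, uint32_t phys_off)
{
    (void)guest_off; (void)phys_off;   /* would install a direct mapping */
}
static void leave_trapped(uint32_t guest_off)
{
    (void)guest_off;                   /* no mapping: the access traps into GPUvm */
}

void setup_virtual_bar(uint32_t bar_size, uint32_t phys_base)
{
    for (uint32_t off = 0; off < bar_size; off += PAGE_SIZE) {
        if (page_is_sensitive(off))
            leave_trapped(off);                /* intercepted and emulated */
        else
            map_direct(off, phys_base + off);  /* direct guest access */
    }
}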
Lazy Shadowing
•  Page-fault-driven shadowing cannot be applied
–  When a fault occurs, the computation cannot be resumed
•  Scanning the entire page tables incurs high overhead
•  Delay reflecting guest updates into the shadow page tables until the channel is used (see the sketch below)
[Figure: the naive scheme scans the guest page tables on every TLB flush (three scans in the example); lazy shadowing ignores the flushes and scans only once, when the channel becomes active]
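A minimal sketch of the lazy scheme, reusing the flat-page-table assumption from the shadow page table sketch above; per-entry validation is elided so only the deferral logic is visible.

/* A guest TLB flush only marks the channel's shadow page table as stale; the
 * expensive full rescan of the guest page table happens once, when the channel
 * is next made active. Structures are illustrative. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct channel {
    bool            shadow_stale;
    const uint64_t *guest_pt;     /* page table written by the guest driver */
    uint64_t       *shadow_pt;    /* page table actually walked by the GPU */
    size_t          pt_entries;
};

/* Full scan of the guest page table: the costly step that lazy shadowing
 * avoids repeating on every flush. */
static void rebuild_shadow(struct channel *ch)
{
    for (size_t i = 0; i < ch->pt_entries; i++)
        ch->shadow_pt[i] = ch->guest_pt[i];   /* per-entry validation elided */
}

/* Naive shadowing would rescan here; the lazy scheme only sets a flag. */
void on_guest_tlb_flush(struct channel *ch)
{
    ch->shadow_stale = true;
}

/* Called when GPUvm is about to schedule this channel onto the GPU. */
void on_channel_activate(struct channel *ch)
{
    if (ch->shadow_stale) {
        rebuild_shadow(ch);       /* one scan, however many flushes occurred */
        ch->shadow_stale = false;
    }
}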
Para-Virtualization
•  Shadowing is still a major source of overhead
•  Provide a para-virtualized driver (see the sketch below)
–  Manipulate page table entries through hypercalls (similar to Xen’s direct paging)
–  Provide a multicall interface that batches several hypercalls into one (borrowed from Xen)
•  Eliminates the cost of scanning entire page tables
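A sketch of how a guest PV driver could batch page-table updates, modeled loosely on Xen's multicall idea; the hypercall numbers, structures, and function names are invented for illustration and are not GPUvm's real interface.

/* The guest driver updates GPU page-table entries through hypercalls and
 * batches them with a multicall so a burst of updates costs one VM exit. */
#include <stddef.h>
#include <stdint.h>

enum { HCALL_SET_PTE = 1 };

struct hcall {
    uint32_t op;
    uint64_t arg1, arg2;      /* for HCALL_SET_PTE: index, new PTE value */
};

/* Stand-in for the trap into the hypervisor (one VM exit per call). */
static long gpuvm_hypercall(uint32_t op, uint64_t a1, uint64_t a2)
{
    (void)op; (void)a1; (void)a2;
    return 0;
}

/* Stand-in for a Xen-style multicall: one VM exit for the whole batch. */
static long gpuvm_multicall(const struct hcall *calls, size_t n)
{
    long rc = 0;
    for (size_t i = 0; i < n && rc == 0; i++)
        rc = gpuvm_hypercall(calls[i].op, calls[i].arg1, calls[i].arg2);
    return rc;                /* in a real hypervisor this loop runs host-side */
}

/* Guest-side: map a 16-page buffer with a single batched hypercall. */
long pv_map_buffer(uint64_t first_index, const uint64_t *ptes)
{
    struct hcall batch[16];
    for (size_t i = 0; i < 16; i++) {
        batch[i].op   = HCALL_SET_PTE;
        batch[i].arg1 = first_index + i;
        batch[i].arg2 = ptes[i];
    }
    return gpuvm_multicall(batch, 16);
}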
Outline
•  Motivation & Goals
•  GPU Internals
•  Proposal: GPUvm
•  Experiments
•  Related Work
•  Conclusion

Evaluation Setup
•  Implementation
–  Xen 4.2.0, Linux 3.6.5
–  Nouveau [http://nouveau.freedesktop.org/]
•  Open-source device driver for NVIDIA GPUs
–  Gdev [Kato et al. ’12]
•  Open-source CUDA runtime
•  Xeon E5-24700, NVIDIA Quadro 6000 GPU
•  Schemes
–  Native: non-virtualized
–  FV Naive: full virtualization w/o optimizations
–  FV Optimized: FV w/ optimizations
–  PV: para-virtualization

Overhead
•  Significant overhead over Native
–  Can be mitigated by the optimization techniques
–  PV is faster than FV since it eliminates shadowing
–  PV is still 2-3x slower than Native
•  Hypercalls, MMIO interception
[Figure: relative time (log-scaled) of madd (a short-term workload): FV Naive 275.39x, FV Optimized 40.10x (shadowing & MMIO handling reduced), PV 2.08x (shadowing eliminated), Native 1.00x]

Performance at Scale
•  FV incurs large overhead in the 4- and 8-VM cases
–  Since page shadowing locks GPU resources
[Figure: time (seconds), split into kernel execution time and virtualization overhead, for Native, PV, and FV Optimized with 1, 2, 4, and 8 GPU contexts]

Performance Isolation
•  Under FIFO and CREDIT, a long-running task occupies the GPU
•  BAND achieves fair sharing
[Figure: GPU utilization (%) over time (seconds) of a short-running and a long-running task under the FIFO, CREDIT, and BAND schedulers]
Outline
•  Motivation & Goals
•  GPU Internals
•  Proposal: GPUvm
•  Experiments
•  Related Work
•  Conclusion

Related Work
•  I/O pass-through
–  Amazon EC2
•  API remoting
–  GViM [Gupta et al. ’09], vCUDA [Shi et al. ’12], rCUDA [Duato et al. ’10], VMGL [Lagar-Cavilla et al. ’07], gVirtuS [Giunta et al. ’10]
•  Para-virtualization
–  VMware SVGA2 [Dowty ’09], LoGV [Gottschalk et al. ’10]
•  Full virtualization
–  XenGT [Tian et al. ’14]
–  The GPU architecture is different (integrated Intel GPU)

Conclusion
•  GPUvm shows the design of full GPU virtualization
–  GPU shadow page table
–  GPU shadow channel
–  GPU fair-share scheduler
•  Full virtualization exhibits non-trivial overhead
–  MMIO handling
•  Intercepting TLB flushes and scanning page tables
–  Optimizations and para-virtualization reduce this overhead
–  However, it is still 2-3 times slower than native