
Task Based Execution of GPU Applications with Dynamic Data Dependencies
Mehmet E. Belviranli, Chih H. Chou, Laxmi N. Bhuyan, Rajiv Gupta
GP-GPU Computing
• GPUs enable high-throughput, data- & compute-intensive computations
• Data is partitioned into a grid of “Thread Blocks” (TBs)
• Thousands of TBs in a grid can be executed in any order
• No HW support for efficient inter-TB communication
• High scalability & throughput for independent data
• Challenging & inefficient for inter-TB dependent data
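As a minimal illustration of this execution model, the CUDA sketch below launches a grid of independent thread blocks; the kernel name and sizes are illustrative, not from the talk.

// Minimal sketch of the TB-based model: data is split across a grid of
// thread blocks, and the hardware may run those blocks in any order.
__global__ void scale(float *data, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)
        data[i] *= alpha;   // independent work: no inter-TB communication needed
}

// Launch with thousands of TBs; their execution order is up to the HW scheduler:
//   scale<<<(n + 255) / 256, 256>>>(d_data, n, 0.5f);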
The Problem
• Data-dependent & irregular applications
  • Simulations (n-body, heat)
  • Graph algorithms (BFS, SSSP)
• Inter-TB synchronization
  • Sync through global memory (sketched below)
• Irregular task graphs
  • Static partitioning fails
• Heterogeneous execution
  • Unbalanced distribution
[Figure: data dependency graph]
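A hedged sketch of the global-memory synchronization mentioned above: a producer block publishes data and sets a flag, and consumer blocks spin on it. The names are illustrative, and the pattern only works if the blocks are co-resident, which is part of why it is inefficient.

// Sketch of inter-TB synchronization through global memory.
// Assumes producer and consumer TBs are resident at the same time;
// otherwise the spin loop can deadlock.
__device__ volatile int ready_flag = 0;

__global__ void produce_consume(float *buf) {
    if (blockIdx.x == 0) {                      // producer TB
        if (threadIdx.x == 0) {
            buf[0] = 42.0f;
            __threadfence();                    // make the write visible GPU-wide
            ready_flag = 1;                     // signal through global memory
        }
    } else {                                    // consumer TBs
        if (threadIdx.x == 0)
            while (ready_flag == 0) { }         // busy-wait on the flag
        __syncthreads();                        // release the rest of the block
        float v = buf[0];                       // now safe to read
        (void)v;
    }
}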
The Solution
“Task based execution”
• Transition from SIMD -> MIMD
Challenges
• Breaking applications into tasks
• Task to SM assignment
• Dependency tracking
• Inter-SM communication
• Load balancing
Proposed Task Based Execution Framework
• Persistent Worker TBs (per SM)
• Distributed task queues (per SM)
• In-GPU dependency tracking & scheduling
• Load balancing via different queue insertion policies
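A hedged sketch of how these components could fit together: one persistent worker TB per SM, each polling its own task queue in global memory. The Task/TaskQueue layouts and field names are assumptions for illustration, not the framework's actual data structures.

// Hedged sketch: persistent worker TBs, one per SM, each draining its own
// distributed task queue in global memory.
struct Task { int type; int data_idx; };        // placeholder task descriptor

struct TaskQueue {                              // one per SM
    Task *slots;
    volatile int head;                          // next task to execute
    volatile int tail;                          // next free slot (filled by the scheduler)
    volatile int done;                          // no more tasks will arrive
};

__global__ void worker_kernel(TaskQueue *queues) {
    // Launched with exactly one TB per SM, so each TB "owns" one queue.
    TaskQueue *q = &queues[blockIdx.x];
    __shared__ Task cur;
    while (true) {
        if (threadIdx.x == 0) {
            while (q->head == q->tail && !q->done) { }      // wait for work
            if (q->head < q->tail) { cur = q->slots[q->head]; q->head = q->head + 1; }
            else cur.type = -1;                              // queue drained & done
        }
        __syncthreads();
        if (cur.type == -1) return;                          // terminate this worker
        // ... all threads of the TB cooperatively execute `cur` ...
        __syncthreads();                                     // before fetching the next task
    }
}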
Overview
[Figure: framework overview]
(1) Grab a ready task
(2) Queue
(3) Retrieve & execute
(4) Output
(5) Resolve dependencies
(6) Grab new task
Concurrent Worker & Scheduler
[Figure: worker and scheduler components running concurrently]
Queue Access & Dependency Tracking
• IQS and OQS
  • Queues store pointers to tasks
• Parallel task pointer retrieval
  • Parallel dependency check
• Efficient signaling mechanism via global memory
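A hedged sketch of the parallel dependency check: when a task finishes, its children's pending-dependency counters are decremented in parallel, one child per thread, and a counter reaching zero marks that child ready. The array layout and names are illustrative, not the framework's own.

// Parallel dependency resolution with global-memory signaling.
__global__ void resolve_dependencies(const int *children, int num_children,
                                     int *dep_count, int *ready_flags) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < num_children) {
        int child = children[c];
        // atomicSub returns the previous value; 1 means we removed the last dependency
        if (atomicSub(&dep_count[child], 1) == 1)
            ready_flags[child] = 1;     // signal readiness through global memory
    }
}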
Queue Insertion Policy
Round robin:
• Better load balancing
• Poor cache locality
Tail submit [J. Hoogerbrugge et al.]:
• The first child task is always processed by the same SM as the parent
• Increased locality
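The two policies can be contrasted with a small sketch; the queue layout, cursor, and helper names below are illustrative only.

// Hedged sketch of the two insertion policies. Each SM owns one queue; new
// child tasks go either round robin (load balance) or onto the parent's own
// queue (locality).
struct InsertQueue { int *task_ids; int tail; };

__device__ int rr_cursor = 0;            // global round-robin cursor

__device__ void insert_round_robin(InsertQueue *queues, int num_queues, int task_id) {
    int q    = atomicAdd(&rr_cursor, 1) % num_queues;    // spread across SMs
    int slot = atomicAdd(&queues[q].tail, 1);
    queues[q].task_ids[slot] = task_id;
}

__device__ void insert_tail_submit(InsertQueue *queues, int num_queues,
                                   int my_sm, int task_id, bool first_child) {
    if (first_child) {
        // Tail submit: the first child stays on the parent's SM, so it can
        // reuse the data the parent left in that SM's cache.
        int slot = atomicAdd(&queues[my_sm].tail, 1);
        queues[my_sm].task_ids[slot] = task_id;
    } else {
        insert_round_robin(queues, num_queues, task_id);  // remaining children balance load
    }
}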
Time = 1t+1
TX
SM 1: TU
SM 1: TU
Round
Robin
SM 2:
TX
SM 2:
SM 3:
SM 3:
SM 4:
SM 4:
TX
Tail
Submit
Time = 2t+2
TY
TX
TY
Tx
SM 1: TU TX
SM 1: TV TX
SM 2:
SM 2:
SM 3:
SM 3:
SM 4:
SM 4:
TY
API
• user_task is called by worker_kernel
• Application-specific data is added under WorkerContext and Task
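A hedged sketch of what this API might look like: worker_kernel drives execution and calls a user-supplied user_task for each ready task, and the application extends WorkerContext and Task with its own fields. The field names are assumptions, not the exact interface.

// Illustrative declarations for the user-facing API.
struct Task {
    int id;
    int num_deps;            // pending dependencies, managed by the framework
    int tile_x, tile_y;      // example application-specific payload
};

struct WorkerContext {
    int sm_id;               // which per-SM queue this persistent worker owns
    float *grid_in;          // example application-specific state
    float *grid_out;
};

// Provided by the application; invoked by worker_kernel for every ready task.
__device__ void user_task(WorkerContext *ctx, Task *task) {
    // ... threads of the persistent worker TB cooperate on this task ...
}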
Experimental Results
• NVIDIA Tesla 2050
  • 14 SMs, 3 GB memory
• Applications:
  • Heat 2D: simulation of heat dissipation over a 2D surface
  • BFS: breadth-first search
• Comparison:
  • Central queue vs. distributed queues
Applications
Heat 2D:
• Regular dependencies, wavefront parallelism
• Each tile is a task; intra-tile and inter-tile parallelism
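A hedged sketch of mapping Heat 2D tiles to tasks: each tile is one task that becomes ready once the neighbor tiles it depends on are done, giving the wavefront pattern above. The simple left/upper-neighbor dependency model and names are used purely for illustration.

// Initialize per-tile dependency counters for a tiled 2D grid.
__global__ void init_heat_deps(int tiles_x, int tiles_y, int *dep_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // tile column
    int j = blockIdx.y * blockDim.y + threadIdx.y;    // tile row
    if (i < tiles_x && j < tiles_y) {
        int deps = 0;
        if (i > 0) deps++;                            // left neighbor tile
        if (j > 0) deps++;                            // upper neighbor tile
        dep_count[j * tiles_x + i] = deps;            // tile (0,0) starts ready
    }
}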
Applications
BFS:
• Irregular dependencies
• The unreached neighbors of a node form a task
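A hedged sketch of a BFS task body implied by the bullet above: one task covers the unreached neighbors of a node, with each thread claiming one neighbor via an atomic compare-and-swap on its level. The CSR arrays (row_ptr, col_idx) and names are illustrative.

// Visit the unreached neighbors of `node`; claimed neighbors would then
// spawn new tasks for the next level.
__device__ void bfs_task(const int *row_ptr, const int *col_idx,
                         int *level, int node, int cur_level) {
    int first = row_ptr[node];
    int deg   = row_ptr[node + 1] - first;
    for (int e = threadIdx.x; e < deg; e += blockDim.x) {
        int nbr = col_idx[first + e];
        // Unvisited nodes hold level == -1; the winning thread claims nbr.
        if (atomicCAS(&level[nbr], -1, cur_level + 1) == -1) {
            // nbr was unreached: its own neighbors form a new task
        }
    }
}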
Runtime
Scalability
Future Work
S/W support for:
• Better task representation
• More task insertion policies
• Automated task graph partitioning for higher SM utilization
Future Work
H/W support for:
• Fast inter-TB sync
• Support for TB-to-SM affinity
• “Sleep” support for TBs
Conclusion
Transition from SIMD -> MIMD
• Task-based execution model
  • Per-SM task assignment
  • In-GPU dependency tracking
  • Locality-aware queue management
• Room for improvement with added HW and SW support