
Task Based Execution of GPU Applications with Dynamic Data Dependencies
Mehmet E. Belviranli, Chih H. Chou, Laxmi N. Bhuyan, Rajiv Gupta
GP-GPU Computing
• GPUs enable high-throughput, data- & compute-intensive computations
• Data is partitioned into a grid of “Thread Blocks” (TBs)
• Thousands of TBs in a grid can be executed in any order
• No HW support for efficient inter-TB communication
• High scalability & throughput for independent data
• Challenging & inefficient for inter-TB dependent data
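As a minimal illustration of this execution model, the CUDA sketch below launches a grid of independent thread blocks; the kernel name and sizes are illustrative, not from the talk.

// Minimal sketch of the TB-based model: data is split across a grid of
// thread blocks, and the hardware may run those blocks in any order.
__global__ void scale(float *data, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)
        data[i] *= alpha;   // independent work: no inter-TB communication needed
}

// Launch with thousands of TBs; their execution order is up to the HW scheduler:
//   scale<<<(n + 255) / 256, 256>>>(d_data, n, 0.5f);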
The Problem
• Data-dependent & irregular applications
  • Simulations (n-body, heat)
  • Graph algorithms (BFS, SSSP)
• Inter-TB synchronization
  • Sync through global memory (sketched below)
• Irregular task graphs
  • Static partitioning fails
• Heterogeneous execution
  • Unbalanced distribution
[Figure: data dependency graph]
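A hedged sketch of the global-memory synchronization mentioned above: a producer block publishes data and sets a flag, and consumer blocks spin on it. The names are illustrative, and the pattern only works if the blocks are co-resident, which is part of why it is inefficient.

// Sketch of inter-TB synchronization through global memory.
// Assumes producer and consumer TBs are resident at the same time;
// otherwise the spin loop can deadlock.
__device__ volatile int ready_flag = 0;

__global__ void produce_consume(float *buf) {
    if (blockIdx.x == 0) {                      // producer TB
        if (threadIdx.x == 0) {
            buf[0] = 42.0f;
            __threadfence();                    // make the write visible GPU-wide
            ready_flag = 1;                     // signal through global memory
        }
    } else {                                    // consumer TBs
        if (threadIdx.x == 0)
            while (ready_flag == 0) { }         // busy-wait on the flag
        __syncthreads();                        // release the rest of the block
        float v = buf[0];                       // now safe to read
        (void)v;
    }
}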
The Solution
“Task based execution”
• Transition from SIMD -> MIMD
Challenges
• Breaking applications into tasks
• Task to SM assignment
• Dependency tracking
• Inter-SM communication
• Load balancing
Proposed Task Based Execution Framework
• Persistent Worker TBs (per SM)
• Distributed task queues (per SM)
• In-GPU dependency tracking & scheduling
• Load balancing via different queue insertion policies
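A hedged sketch of how these components could fit together: one persistent worker TB per SM, each polling its own task queue in global memory. The Task/TaskQueue layouts and field names are assumptions for illustration, not the framework's actual data structures.

// Hedged sketch: persistent worker TBs, one per SM, each draining its own
// distributed task queue in global memory.
struct Task { int type; int data_idx; };        // placeholder task descriptor

struct TaskQueue {                              // one per SM
    Task *slots;
    volatile int head;                          // next task to execute
    volatile int tail;                          // next free slot (filled by the scheduler)
    volatile int done;                          // no more tasks will arrive
};

__global__ void worker_kernel(TaskQueue *queues) {
    // Launched with exactly one TB per SM, so each TB "owns" one queue.
    TaskQueue *q = &queues[blockIdx.x];
    __shared__ Task cur;
    while (true) {
        if (threadIdx.x == 0) {
            while (q->head == q->tail && !q->done) { }      // wait for work
            if (q->head < q->tail) { cur = q->slots[q->head]; q->head = q->head + 1; }
            else cur.type = -1;                              // queue drained & done
        }
        __syncthreads();
        if (cur.type == -1) return;                          // terminate this worker
        // ... all threads of the TB cooperatively execute `cur` ...
        __syncthreads();                                     // before fetching the next task
    }
}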
Overview
[Figure: framework overview]
(1) Grab a ready task
(2) Queue
(3) Retrieve & execute
(4) Output
(5) Resolve dependencies
(6) Grab new task
Concurrent Worker & Scheduler
[Figure: worker and scheduler components running concurrently]
Queue Access & Dependency Tracking
• IQS and OQS
  • Queues store pointers to tasks
• Parallel task pointer retrieval
  • Parallel dependency check
• Efficient signaling mechanism via global memory
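A hedged sketch of the parallel dependency check: when a task finishes, its children's pending-dependency counters are decremented in parallel, one child per thread, and a counter reaching zero marks that child ready. The array layout and names are illustrative, not the framework's own.

// Parallel dependency resolution with global-memory signaling.
__global__ void resolve_dependencies(const int *children, int num_children,
                                     int *dep_count, int *ready_flags) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < num_children) {
        int child = children[c];
        // atomicSub returns the previous value; 1 means we removed the last dependency
        if (atomicSub(&dep_count[child], 1) == 1)
            ready_flags[child] = 1;     // signal readiness through global memory
    }
}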
Queue Insertion Policy
Round robin:
• Better load balancing
• Poor cache locality
Tail submit [J. Hoogerbrugge et al.]:
• The first child task is always processed by the same SM as the parent
• Increased locality
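The two policies can be contrasted with a small sketch; the queue layout, cursor, and helper names below are illustrative only.

// Hedged sketch of the two insertion policies. Each SM owns one queue; new
// child tasks go either round robin (load balance) or onto the parent's own
// queue (locality).
struct InsertQueue { int *task_ids; int tail; };

__device__ int rr_cursor = 0;            // global round-robin cursor

__device__ void insert_round_robin(InsertQueue *queues, int num_queues, int task_id) {
    int q    = atomicAdd(&rr_cursor, 1) % num_queues;    // spread across SMs
    int slot = atomicAdd(&queues[q].tail, 1);
    queues[q].task_ids[slot] = task_id;
}

__device__ void insert_tail_submit(InsertQueue *queues, int num_queues,
                                   int my_sm, int task_id, bool first_child) {
    if (first_child) {
        // Tail submit: the first child stays on the parent's SM, so it can
        // reuse the data the parent left in that SM's cache.
        int slot = atomicAdd(&queues[my_sm].tail, 1);
        queues[my_sm].task_ids[slot] = task_id;
    } else {
        insert_round_robin(queues, num_queues, task_id);  // remaining children balance load
    }
}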
Time = 1t+1
TX
SM 1: TU
SM 1: TU
Round
Robin
SM 2:
TX
SM 2:
SM 3:
SM 3:
SM 4:
SM 4:
TX
Tail
Submit
Time = 2t+2
TY
TX
TY
Tx
SM 1: TU TX
SM 1: TV TX
SM 2:
SM 2:
SM 3:
SM 3:
SM 4:
SM 4:
TY
API
• user_task is called by worker_kernel
• Application-specific data is added under WorkerContext and Task
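A hedged sketch of what this API might look like: worker_kernel drives execution and calls a user-supplied user_task for each ready task, and the application extends WorkerContext and Task with its own fields. The field names are assumptions, not the exact interface.

// Illustrative declarations for the user-facing API.
struct Task {
    int id;
    int num_deps;            // pending dependencies, managed by the framework
    int tile_x, tile_y;      // example application-specific payload
};

struct WorkerContext {
    int sm_id;               // which per-SM queue this persistent worker owns
    float *grid_in;          // example application-specific state
    float *grid_out;
};

// Provided by the application; invoked by worker_kernel for every ready task.
__device__ void user_task(WorkerContext *ctx, Task *task) {
    // ... threads of the persistent worker TB cooperate on this task ...
}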
Experimental Results
• NVIDIA Tesla 2050
  • 14 SMs, 3 GB memory
• Applications:
  • Heat 2D: simulation of heat dissipation over a 2D surface
  • BFS: breadth-first search
• Comparison:
  • Central queue vs. distributed queues
Applications
Heat 2D:
• Regular dependencies, wavefront parallelism
• Each tile is a task; intra-tile and inter-tile parallelism
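A hedged sketch of mapping Heat 2D tiles to tasks: each tile is one task that becomes ready once the neighbor tiles it depends on are done, giving the wavefront pattern above. The simple left/upper-neighbor dependency model and names are used purely for illustration.

// Initialize per-tile dependency counters for a tiled 2D grid.
__global__ void init_heat_deps(int tiles_x, int tiles_y, int *dep_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // tile column
    int j = blockIdx.y * blockDim.y + threadIdx.y;    // tile row
    if (i < tiles_x && j < tiles_y) {
        int deps = 0;
        if (i > 0) deps++;                            // left neighbor tile
        if (j > 0) deps++;                            // upper neighbor tile
        dep_count[j * tiles_x + i] = deps;            // tile (0,0) starts ready
    }
}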
Applications
BFS:
• Irregular dependencies
• The unreached neighbors of a node form a task
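A hedged sketch of a BFS task body implied by the bullet above: one task covers the unreached neighbors of a node, with each thread claiming one neighbor via an atomic compare-and-swap on its level. The CSR arrays (row_ptr, col_idx) and names are illustrative.

// Visit the unreached neighbors of `node`; claimed neighbors would then
// spawn new tasks for the next level.
__device__ void bfs_task(const int *row_ptr, const int *col_idx,
                         int *level, int node, int cur_level) {
    int first = row_ptr[node];
    int deg   = row_ptr[node + 1] - first;
    for (int e = threadIdx.x; e < deg; e += blockDim.x) {
        int nbr = col_idx[first + e];
        // Unvisited nodes hold level == -1; the winning thread claims nbr.
        if (atomicCAS(&level[nbr], -1, cur_level + 1) == -1) {
            // nbr was unreached: its own neighbors form a new task
        }
    }
}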
Runtime
Scalability
Future Work
S/W support for:
• Better task representation
• More task insertion policies
• Automated task graph partitioning for higher SM utilization
Future Work
H/W support for:
• Fast inter-TB sync
• Support for TB-to-SM affinity
• “Sleep” support for TBs
Conclusion
Transition from SIMD -> MIMD
• Task-based execution model
  • Per-SM task assignment
  • In-GPU dependency tracking
  • Locality-aware queue management
• Room for improvement with added HW and SW support