Task Based Execution of GPU Applications with Dynamic Data Dependencies
Mehmet E. Belviranli, Chih H. Chou, Laxmi N. Bhuyan, Rajiv Gupta

GP-GPU Computing
• GPUs enable high-throughput, data- and compute-intensive computations
• Data is partitioned into a grid of "Thread Blocks" (TBs)
• Thousands of TBs in a grid can be executed in any order
• No HW support for efficient inter-TB communication
• High scalability & throughput for independent data
• Challenging & inefficient for inter-TB dependent data

The Problem
• Data-dependent & irregular applications: simulations (n-body, heat), graph algorithms (BFS, SSSP)
• Inter-TB synchronization: sync through global memory
• Irregular task graphs: static partitioning fails
• Heterogeneous execution: unbalanced distribution
[Figure: data dependency graph]

The Solution
• "Task based execution": transition from SIMD -> MIMD
• Challenges:
  - Breaking applications into tasks
  - Task-to-SM assignment
  - Dependency tracking
  - Inter-SM communication
  - Load balancing

Proposed Task Based Execution Framework
• Persistent worker TBs (one per SM)
• Distributed task queues (one per SM)
• In-GPU dependency tracking & scheduling
• Load balancing via different queue insertion policies
(A minimal sketch of this worker loop appears after the slides.)

Overview
[Figure: execution flow: (1) grab a ready task, (2) queue it, (3) retrieve & execute, (4) output, (5) resolve dependencies, (6) grab a new task]

Concurrent Worker & Scheduler
[Figure: worker and scheduler running concurrently]

Queue Access & Dependency Tracking
• IQS and OQS
• Queues store pointers to tasks
• Parallel task pointer retrieval
• Parallel dependency check
• Efficient signaling mechanism via global memory

Queue Insertion Policy
• Round robin: better load balancing, poor cache locality
• Tail submit [J. Hoogerbrugge et al.]: the first child task is always processed by the same SM as its parent; increased locality
(The two policies are contrasted in a sketch after the slides.)
[Figure: example schedules on SMs 1-4 under round robin vs. tail submit]

API
• user_task is called by worker_kernel
• Application-specific data is added under WorkerContext and Task
(A sketch of this API shape appears after the slides.)

Experimental Results
• NVIDIA Tesla C2050: 14 SMs, 3 GB memory
• Applications:
  - Heat 2D: simulation of heat dissipation over a 2D surface
  - BFS: breadth-first search
• Comparison: central queue vs. distributed queues

Applications: Heat 2D
• Regular dependencies, wavefront parallelism
• Each tile is a task; intra-tile and inter-tile parallelism
(Building the wavefront task graph is sketched after the slides.)

Applications: BFS
• Irregular dependencies
• Unreached neighbors of a node form a task (see the sketch after the slides)

Runtime
[Figure: runtime results]

Scalability
[Figure: scalability results]

Future Work
• S/W support for:
  - Better task representation
  - More task insertion policies
  - Automated task-graph partitioning for higher SM utilization
• H/W support for:
  - Fast inter-TB synchronization
  - TB-to-SM affinity
  - "Sleep" support for TBs

Conclusion
• Transition from SIMD -> MIMD
• Task-based execution model
• Per-SM task assignment
• In-GPU dependency tracking
• Locality-aware queue management
• Room for improvement with added HW and SW support
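To make the framework and overview slides concrete, below is a minimal CUDA sketch of a persistent-worker loop with one task queue per SM, in-GPU dependency counters, and signaling through global memory. It is an illustration of the idea under stated assumptions, not the authors' implementation: the names (workerKernel, Queue, pushTask, popTask, g_depCount, g_remaining), the ring-buffer layout, and the per-task dependency counters are all assumptions.

```cuda
// Minimal sketch of a persistent-worker loop with per-SM task queues, in-GPU
// dependency counters, and signaling through global memory. Everything here
// (names, queue layout, counters) is an assumption for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_QUEUES 14    // one persistent worker TB per SM (Tesla C2050 has 14 SMs)
#define QUEUE_CAP  256   // per-queue ring-buffer capacity (assumed; overflow ignored)
#define MAX_TASKS  64

struct Task {            // application-specific payload would be added here
    int numChildren;
    int children[2];
};

struct Queue {           // multi-producer / single-consumer ring buffer of task IDs
    int items[QUEUE_CAP];
    int ready[QUEUE_CAP];     // set by a producer once items[slot] is written
    unsigned int head;        // consumer index, advanced only by the owning TB
    unsigned int tail;        // producer reservation counter
};

__device__ int g_remaining;            // tasks not yet executed (all queues combined)
__device__ int g_depCount[MAX_TASKS];  // unresolved parents per task

__device__ void pushTask(Queue* q, int id) {
    unsigned int slot = atomicAdd(&q->tail, 1u) % QUEUE_CAP;
    q->items[slot] = id;
    __threadfence();                   // publish the item before signaling the flag
    atomicExch(&q->ready[slot], 1);
}

__device__ int popTask(Queue* q) {
    unsigned int slot = q->head % QUEUE_CAP;
    if (atomicExch(&q->ready[slot], 0) == 0) return -1;  // nothing committed here yet
    __threadfence();
    q->head++;                         // single consumer: plain increment is enough
    return q->items[slot];
}

__device__ void runTask(int id, Task* t) { /* per-task computation goes here */ }

__global__ void workerKernel(Queue* queues, Task* tasks) {
    __shared__ int id;
    Queue* myQ = &queues[blockIdx.x];  // this TB's own queue
    while (true) {
        __syncthreads();               // everyone is done with the previous task ID
        if (threadIdx.x == 0) {
            id = popTask(myQ);                              // (1) grab a ready task
            if (id < 0 && atomicAdd(&g_remaining, 0) == 0)
                id = -2;                                    // no work left anywhere
        }
        __syncthreads();
        if (id == -2) return;          // uniform exit for the whole TB
        if (id < 0) continue;          // queue empty for now, retry
        runTask(id, &tasks[id]);                            // (3) execute, (4) output
        __syncthreads();
        if (threadIdx.x == 0) {                             // (5) resolve dependencies
            for (int c = 0; c < tasks[id].numChildren; ++c) {
                int child = tasks[id].children[c];
                if (atomicSub(&g_depCount[child], 1) == 1)        // last parent finished
                    pushTask(&queues[child % NUM_QUEUES], child); // (2) round-robin insert
            }
            atomicSub(&g_remaining, 1);
        }
    }
}

int main() {
    // Tiny diamond task graph: 0 -> {1, 2} -> 3.
    Task h_tasks[MAX_TASKS] = {};
    h_tasks[0] = {2, {1, 2}};
    h_tasks[1] = {1, {3, 0}};
    h_tasks[2] = {1, {3, 0}};
    int h_dep[MAX_TASKS] = {};
    h_dep[1] = 1; h_dep[2] = 1; h_dep[3] = 2;
    int h_remaining = 4;

    Task*  d_tasks;  cudaMalloc(&d_tasks, sizeof(h_tasks));
    Queue* d_queues; cudaMalloc(&d_queues, NUM_QUEUES * sizeof(Queue));
    cudaMemcpy(d_tasks, h_tasks, sizeof(h_tasks), cudaMemcpyHostToDevice);
    cudaMemset(d_queues, 0, NUM_QUEUES * sizeof(Queue));
    cudaMemcpyToSymbol(g_depCount, h_dep, sizeof(h_dep));
    cudaMemcpyToSymbol(g_remaining, &h_remaining, sizeof(int));

    Queue h_seed = {};                 // seed queue 0 with the single root task
    h_seed.items[0] = 0; h_seed.ready[0] = 1; h_seed.tail = 1;
    cudaMemcpy(d_queues, &h_seed, sizeof(Queue), cudaMemcpyHostToDevice);

    workerKernel<<<NUM_QUEUES, 128>>>(d_queues, d_tasks);  // all TBs must be co-resident
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

The __threadfence() before the per-slot ready flag is one common way to realize the "signaling mechanism via global memory" mentioned on the queue-access slide. The sketch also assumes every worker TB is co-resident (one per SM), the usual persistent-threads requirement; launching more TBs than can be resident would deadlock this loop.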
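The insertion-policy slide contrasts round robin with tail submit. Reusing the names of the sketch above, the fragment below shows one hedged reading of tail submit: when a finishing task makes children ready, the first ready child is pushed to the finishing SM's own queue to reuse the parent's cached data, while any further children are spread across queues for load balance. resolveAndQueue and myQueue are illustrative names, not the authors' API.

```cuda
// Variation on the dependency-resolution step of the sketch above.
// Tail submit keeps the first ready child on the parent's SM for locality;
// the remaining ready children are distributed round robin for balance.
__device__ void resolveAndQueue(Queue* queues, Task* tasks, int parent, int myQueue) {
    bool keptOne = false;
    for (int c = 0; c < tasks[parent].numChildren; ++c) {
        int child = tasks[parent].children[c];
        if (atomicSub(&g_depCount[child], 1) != 1) continue;  // still has pending parents
        int target;
        if (!keptOne) {              // tail submit: first child stays local
            target = myQueue;
            keptOne = true;
        } else {                     // remaining children are spread across SMs
            target = child % NUM_QUEUES;
        }
        pushTask(&queues[target], child);
    }
}
```

This captures the trade-off on the slide: round robin (child % NUM_QUEUES for every child) balances load but loses cache locality, while tail submit trades some balance for locality between parent and first child.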
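The API slide only names user_task, worker_kernel, WorkerContext, and Task. The sketch below is a guess at the shape such an API could take, with application-specific fields added under WorkerContext and Task as the slide describes; every member and signature shown is an assumption.

```cuda
// Hypothetical shape of the user-facing API: the framework's worker_kernel
// owns the scheduling loop and invokes a user-supplied user_task for each
// ready task. Only the four names come from the talk; the rest is assumed.
struct Task {
    int id;
    int numParents;          // framework bookkeeping
    // --- application-specific fields (example: a Heat 2D tile) ---
    int tileRow, tileCol;
};

struct WorkerContext {
    int smId;                // which SM / queue this persistent TB serves
    // --- application-specific state shared by all tasks on this worker ---
    float* grid;             // e.g. the simulation surface
    int gridWidth;
};

// Written by the application programmer; executed by all threads of the TB.
__device__ void user_task(WorkerContext* ctx, Task* task) {
    // e.g. update the tile of ctx->grid identified by task->tileRow/tileCol
}

// Provided by the framework: persistent loop that pops ready tasks and calls
// user_task (queue and dependency handling omitted here).
__global__ void worker_kernel(WorkerContext* contexts, Task* tasks /*, queues... */);
```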
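For Heat 2D, each tile is a task and the dependencies are regular: assuming a wavefront sweep in which a tile can be updated only after the tiles above it and to its left, ready tasks form anti-diagonal waves. The host-side helper below is a hypothetical way to build such a task graph in the child-list-plus-dependency-count format used by the first sketch; buildHeatTaskGraph and TileTask are illustrative names.

```cpp
// Hypothetical construction of a Heat 2D wavefront task graph: tile (r, c)
// has the tiles above and to the left as parents, and the tiles to the right
// and below as children.
#include <vector>

struct TileTask { int row, col, numChildren, children[2]; };

void buildHeatTaskGraph(int tilesX, int tilesY,
                        std::vector<TileTask>& tasks, std::vector<int>& depCount) {
    tasks.resize(tilesX * tilesY);
    depCount.resize(tilesX * tilesY);
    for (int r = 0; r < tilesY; ++r) {
        for (int c = 0; c < tilesX; ++c) {
            int id = r * tilesX + c;
            TileTask& t = tasks[id];
            t.row = r; t.col = c; t.numChildren = 0;
            if (c + 1 < tilesX) t.children[t.numChildren++] = id + 1;       // right neighbor
            if (r + 1 < tilesY) t.children[t.numChildren++] = id + tilesX;  // tile below
            depCount[id] = (r > 0) + (c > 0);  // parents: tile above and tile to the left
        }
    }
}
```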
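The BFS slide says that the unreached neighbors of a node form a task. One hedged reading, sketched below for a graph in CSR form, is that the threads of the TB executing a node's task scan its adjacency list, claim unreached neighbors with an atomic visited flag, and collect the claimed vertices so they can be packaged into a new task. The names (bfsNodeTask, g_visited, g_newFrontier), the CSR layout, and the fixed array sizes are assumptions.

```cuda
// Hypothetical per-node BFS task body: claim unreached neighbors of u and
// gather them so the scheduler can wrap them into new tasks.
__device__ int g_visited[1024];      // 0 = unreached, 1 = reached (size is a placeholder)
__device__ int g_newFrontier[1024];  // vertices claimed during this wave of tasks
__device__ int g_newCount;

// Executed by all threads of the TB that owns the task for node u.
__device__ void bfsNodeTask(int u, const int* rowPtr, const int* colIdx,
                            int level, int* dist) {
    for (int e = rowPtr[u] + threadIdx.x; e < rowPtr[u + 1]; e += blockDim.x) {
        int v = colIdx[e];
        if (atomicExch(&g_visited[v], 1) == 0) {   // first task to reach v claims it
            dist[v] = level + 1;
            int slot = atomicAdd(&g_newCount, 1);
            g_newFrontier[slot] = v;               // v becomes part of a newly created task
        }
    }
}
```

Because the frontier is data dependent, the resulting task graph is irregular, which is exactly the case where static partitioning fails and the distributed queues are meant to help.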