How to Combine OpenMP, Streams, and ArrayFire for Maximum Multi-GPU Throughput
Shehzan Mohammed
@shehzanm @arrayfire

Outline
● Introduction to ArrayFire
● Case Study 1: glasses.com
● Case Study 2: Accelerate Diagnostics

ArrayFire
● World’s leading GPU experts
  ○ In the industry since 2007
  ○ NVIDIA Partner
● Deep experience working with thousands of customers
  ○ Analysis
  ○ Acceleration
  ○ Algorithm development
● GPU training
  ○ Hands-on course with a CUDA engineer
  ○ Customized to meet your needs

ArrayFire
● Hundreds of parallel functions
  ○ Targeting image processing, machine learning, etc.
● Support for multiple languages
  ○ C/C++, Fortran, Java and R
● Linux, Windows, Mac OS X
● OpenGL-based graphics
● Based around one data structure
● Just-in-Time (JIT) compilation
  ○ Combines multiple operations into one kernel
● GFOR, the only data-parallel loop

ArrayFire Functions
● Hundreds of parallel functions
  ○ Building blocks (non-exhaustive)
    ■ Reductions, scan, sort
    ■ Set operations
    ■ Statistics
    ■ Matrix operations
    ■ Image processing
    ■ Signal processing
    ■ Sparse matrices
    ■ Visualizations

Case 1: Glasses.com
● 3D face reconstruction from images
● Image and coordinate-geometry processing
● Came to us with a slow application
  ○ Made use of OpenCV and OpenMP
  ○ One thread per PC, 8 threads: 30+ seconds
  ○ Developed on OS X
● Required a significant hardware investment
  ○ Increased maintenance
  ○ Financially not viable in production
  ○ Had Windows infrastructure

The challenge: Speed, speed, and much more

Challenge 1: Multithreading
● Multithreading benefits CPU code
  ○ Calling CUDA from multiple threads may not offer much benefit
    ■ Overheads of memory management and kernel launches
    ■ Streams, pinned memory, etc. required to harness the full potential
  ○ GPU parallelism is faster than the CPU for these operations
● Goal:
  ○ Make host code single threaded
  ○ Move all multithreaded sections to the GPU

Challenge 1: Multithreading
● Most multithreaded sections easily ported
  ○ Images are easy to combine and operate on at once
  ○ ArrayFire gfor
● Some sections more difficult
  ○ Require serial access and/or complex operations
  ○ Need to be run on the host - more memory transfers
  ○ Need a combination of OpenMP and ArrayFire

Challenge 2: Multithreading ArrayFire
● ArrayFire was not thread safe
  ○ Designed for GPU performance
  ○ Required substantial work
    ■ Iterative process
  ○ Trade-offs
    ■ Cost of adding critical sections vs.
    ■ Cost of adding multithreading support
  ○ Limiting access to data for each thread

Challenge 2: Multithreading ArrayFire
● Required adding critical sections around key operations
● Constant memory and textures
  ○ No way to make these thread safe
    ■ Except a critical section
  ○ Add critical sections vs. use global memory
    ■ Analyze and customize for the specific operation

Challenge 3: Batching
● Image operations can be easily batched
  ○ Most operations work at the pixel level or neighborhood level
● Problem when operations are more complex
● Batching does not always map
  ○ e.g. affine matrix multiplications
  ○ Indexing needs to be changed - expensive memcpys

Challenge 3: Batching
● Used OpenMP for parallelism
  ○ One frame per thread
  ○ Optimized for CPU
● One CPU thread + GPU
  ○ Parallelism on GPU vs. parallelism on CPU
● Combined OpenMP threads

Challenge 3: Batching
● Many small operations
  ○ Individually it didn’t make sense to port them to the GPU
● Increase the dimensionality of the data
  ○ 2D -> 3D
  ○ GFOR and strided access
● Moved to single-threaded host code
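As an illustration of the 2D -> 3D batching with GFOR described above, here is a minimal sketch in ArrayFire C++ (not the glasses.com code; the sizes, random data, and identity transform are placeholders). Each frame's coordinate set is multiplied by the same affine matrix in one batched GPU operation instead of one frame per CPU thread:

    #include <arrayfire.h>
    using namespace af;

    int main() {
        const int npoints = 1000, nframes = 8;

        // One set of homogeneous 2D coordinates (npoints x 3) per frame,
        // stacked along the third dimension: 2D -> 3D
        array coords    = randu(npoints, 3, nframes);
        array transform = identity(3, 3);   // placeholder affine matrix
        array result    = constant(0, npoints, 3, nframes);

        // gfor runs the loop body for all frames at once on the GPU,
        // replacing an OpenMP loop that handled one frame per CPU thread
        gfor (seq i, nframes) {
            result(span, span, i) = matmul(coords(span, span, i), transform);
        }

        result.eval();   // force JIT evaluation
        sync();          // wait for the GPU to finish
        return 0;
    }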
Challenge 3: Batching
● Call custom CUDA kernels
  ○ Special indexing

    // Get the raw device pointer out of the ArrayFire array
    float *bound = boundary.device<float>();
    kernel<<< blocks, threads >>>(bound, boundary.elements());

● Specialized matrix multiply
  ○ ssyrk vs. gemm
  ○ 2x faster
  ○ Concurrent execution using streams

Challenge 3: Batching
● Results
  ○ 90 ms -> 28 ms on a GTX 690
● Other improvements
  ○ Overlapped pinned memory transfers
  ○ Generic to specialized matrix multiply
  ○ Streams

Concurrent Computation
● Overlap CPU and GPU computation
  ○ CPU handles variable-length data sets, one frame at a time
  ○ GPU handles fixed-length data sets, all frames concurrently

    #pragma omp sections
    {
        #pragma omp section
        {
            // GPU code
        }
        #pragma omp section
        {
            // CPU code
        }
    }

Results
● 1 process (5 threads): 8 seconds
● 6 processes (2 threads each): 22 seconds
● Demo

Case 2: Accelerate Diagnostics
● Multithreaded Java code with CUDA integration
● Image processing of large images (4k x 4k each)
● Port to C++
● Hard time constraint
● Hard reliability constraint

The challenge: Maximize PCIe throughput
● Image processing is very parallel
● Memory transfer is the majority of application run time

Case 2: Accelerate Diagnostics
● Target hardware:
  ○ Intel Xeon CPU
  ○ 2 GTX Titans per system
  ○ 64 GB RAM
● Required speedup: ~5x
● Required reliability: 48-hour stress test

The Framework
● Master thread - scheduling and management
● Slave threads - each handling one “pipeline”
● Each pipeline handles one “site” at a time
● Continuous execution
● Pipeline - serial flow of execution for one site (the “rabbit hole”)
● Site - independent data set of images (the “rabbit”)

The Framework - Initial
[Diagram: the master thread reads the site database and feeds one site to each of Pipe 0 and Pipe 1, which run on GPU 0 and GPU 1]

Master Thread
● Minimalist
● Initializes and controls the pipelines
● Feeds sites to the pipelines
[Diagram: the master thread owns Thread 0 and Thread 1, one per pipeline, each processing a site on its own GPU]

Pipeline
● Serial execution within a pipeline
● Processes one site at a time

Challenge 1: CPU Parallelism
How to parallelize pipelines independently?
● Each thread processes one pipeline
  ○ At the pipeline level, the application is single threaded
● Allot one GPU to each pipeline
● Pipelines initialized once per run
● Perpetual execution

Parallelism: Results
● On the CPU side, worked fine
● GPU - not so much
  ○ Too many blocking syncs to allocate and deallocate memory
  ○ Copy/kernel execution collisions between threads
  ○ No concurrency
  ○ Extremely slow memory transfer speeds
    ■ Each image is 16 MB, with multiple transfers per kernel call
  ○ Although programmatically parallel, execution was almost serial, and probably slower
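The thread-per-pipeline structure from Challenge 1 can be sketched as below (a minimal outline, not the production code; the site count, the fake image staging, and the empty kernel section are placeholders). Note that this first version uses the default stream, pageable host memory, and per-site cudaMalloc/cudaFree, which is exactly what produced the near-serial behavior described above:

    #include <cuda_runtime.h>
    #include <omp.h>
    #include <vector>

    // Placeholder for the per-site work of one pipeline; in the real
    // application this was a sequence of image-processing kernels.
    static void process_site(int /*site*/) {
        const size_t bytes = size_t(16) << 20;        // one ~16 MB image
        std::vector<unsigned char> image(bytes, 0);   // pageable host buffer

        unsigned char *d_image = nullptr;
        cudaMalloc((void **)&d_image, bytes);         // blocking allocation
        cudaMemcpy(d_image, image.data(), bytes,      // pageable copy
                   cudaMemcpyHostToDevice);
        // ... kernels on the default stream would run here ...
        cudaFree(d_image);                            // blocking free
    }

    int main() {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);
        if (ngpus == 0) return 1;

        const int nsites = 16;                        // placeholder site count

        // One CPU thread per pipeline, one GPU per pipeline
        #pragma omp parallel num_threads(ngpus)
        {
            int pipe = omp_get_thread_num();
            cudaSetDevice(pipe);                      // bind this pipeline to its GPU

            // Each pipeline pulls sites and processes them one at a time
            #pragma omp for schedule(dynamic)
            for (int site = 0; site < nsites; ++site)
                process_site(site);
        }
        return 0;
    }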
Challenge 2A: Pinned Memory
Transfer speeds can double with pinned memory
● For pageable memory, CUDA first transfers to a pinned buffer and then to the GPU
● Non-pageable (pinned) memory cannot be paged out by the OS
● CUDA skips the pageable -> pinned transfer
[Diagram: a pageable copy goes pageable host DRAM -> pinned staging buffer -> device; a pinned copy goes straight from pinned host DRAM to the device]

Challenge 2B: Streams
Concurrency, concurrency, concurrency
● Increases PCIe throughput
  ○ Streams allow simultaneous copy and execution
  ○ Together with pinned memory, they allow asynchronous copies
● Each pipeline has one stream allotted to it
  ○ The stream remains active through the lifetime of the pipeline
  ○ All CUDA operations and kernel launches are issued asynchronously on that stream
  ○ Use of cudaStreamSynchronize (vs. cudaDeviceSynchronize)

Pinned Memory and Streams: Results
● Memory transfer speeds increased ~2x
● Problems:
  ○ Allocating/freeing pinned memory is a full system block
    ■ All threads on the CPU and all streams on the GPU are blocked
    ■ Very, very bad
    ■ The benefit of using streams is negated
  ○ Device memory alloc/free is also a blocking sync
  ○ Too many memory API calls, negating the benefits of parallelism
  ○ Possible memory leaks - very bad for reliability
    ■ Will reveal themselves in stress testing

Challenge 3: Better Memory Management
Minimize the number of memory allocations and deletions
● On both CPU and GPU
● The memory used in the processing of each site is deterministic and constant
● Solution: create a memory manager

Memory Manager
● Goals:
  ○ Manage host and device memory for each pipeline
  ○ Allocate and free memory
  ○ Assign and retract memory
  ○ Manage transfers between host and device
  ○ Ensure consistency between host and device memory
  ○ Free memory only at the end of the application

Memory Manager
[Diagram: the memory manager hands out “Mirrored Array” objects; each holds a type, stream, flags, size, a device pointer and a host pointer backed by device and host memory, and exposes Create(), Free(), Push(), Pull(), Update() and Release()]

Memory Manager
● Memory usage is deterministic, so memory, once allocated, can be reused as needed
  ○ After the 1st site run, no new pinned or device memory needs to be created
  ○ Most of the memory required can be created at initialization
  ○ The same chunk of memory can be reused through pointers
  ○ Pointers can release the memory back to the manager when processing is completed

Better Memory Management: Results
● Drastic reduction in alloc/free calls
● Much better parallelism
  ○ Streams are much more concurrent as blocking is reduced
  ○ CPU threads do not need to be synced
● Stable across multiple site runs
● Memory leaks are easily discovered
  ○ An increase in usage after the 1st run indicates a leak
  ○ The manager can be used to make sure all memory is released at the end of each site

The Framework
[Diagram: the master thread feeds sites to Pipe 0 and Pipe 1; each pipeline now has its own stream and managed memory and runs on its own GPU]

Results
● Significant performance improvements
● Excellent PCIe throughput
● Highly parallel
● GPU kernel execution time is low compared to memcpy times

The Framework - Final
● Increase pipelines to 4
  ○ 2 per GPU
● 4 pipelines good for the CPU
  ○ 4 heavy processing threads
  ○ 1 light master thread
  ○ 4 threads = optimal usage
[Diagram: the master thread feeds sites to Pipe 0 through Pipe 3, each with its own stream and managed memory, two pipelines per GPU]
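Putting the pieces together, the final framework - one stream per pipeline, two pipelines per GPU, and pinned/device buffers allocated once and reused for every site - might be sketched as follows (a minimal outline under those assumptions, not the production memory manager; the single per-pipeline image buffer, the site count, and the memset staging are placeholders):

    #include <cuda_runtime.h>
    #include <omp.h>
    #include <cstring>

    // Fixed per-pipeline resources, created once at initialization and reused
    // for every site (the memory-manager idea in miniature)
    struct Pipeline {
        int            gpu;      // device this pipeline is bound to
        cudaStream_t   stream;   // one stream per pipeline, alive for the whole run
        unsigned char *h_image;  // pinned host staging buffer
        unsigned char *d_image;  // matching device buffer
    };

    static const size_t kImageBytes = size_t(16) << 20;   // one ~16 MB image

    int main() {
        const int npipes = 4;       // 2 pipelines per GPU on a 2-GPU system
        const int nsites = 16;      // placeholder site count
        Pipeline  pipes[npipes];

        // Initialization: allocate pinned and device memory once, up front
        for (int p = 0; p < npipes; ++p) {
            pipes[p].gpu = p % 2;
            cudaSetDevice(pipes[p].gpu);
            cudaStreamCreate(&pipes[p].stream);
            cudaMallocHost((void **)&pipes[p].h_image, kImageBytes);  // pinned
            cudaMalloc((void **)&pipes[p].d_image, kImageBytes);
        }

        // 4 heavy processing threads, one per pipeline
        #pragma omp parallel num_threads(npipes)
        {
            Pipeline &pipe = pipes[omp_get_thread_num()];
            cudaSetDevice(pipe.gpu);

            #pragma omp for schedule(dynamic)
            for (int site = 0; site < nsites; ++site) {
                // Stage the site's image in the reused pinned buffer
                std::memset(pipe.h_image, site & 0xFF, kImageBytes);

                // Asynchronous copy on this pipeline's stream: overlaps with
                // the other pipelines' copies and kernels
                cudaMemcpyAsync(pipe.d_image, pipe.h_image, kImageBytes,
                                cudaMemcpyHostToDevice, pipe.stream);
                // ... kernels launched on pipe.stream would go here ...

                // Wait only for this pipeline's work, not the whole device
                cudaStreamSynchronize(pipe.stream);
            }
        }

        // Free everything only at the end of the application
        for (int p = 0; p < npipes; ++p) {
            cudaSetDevice(pipes[p].gpu);
            cudaStreamDestroy(pipes[p].stream);
            cudaFreeHost(pipes[p].h_image);
            cudaFree(pipes[p].d_image);
        }
        return 0;
    }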
Results
● Improvement in times
  ○ Almost 2x better than required
● Stable memory usage
● GPU usage optimal
● Problem: OVERHEATING!
● Solution:
  ○ Use software tools to lower GPU clock speeds
  ○ Control the fan speeds on the GPUs
  ○ Set target power and temperature limits
● No major reduction in performance
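As one hedged illustration of such "software tools" (not necessarily what was used in this project), NVIDIA's NVML library can read GPU temperatures and request a lower board power limit; the 200 W target below is a placeholder, and setting limits generally requires administrator privileges and a board/driver that allows it:

    #include <nvml.h>
    #include <cstdio>

    int main() {
        if (nvmlInit() != NVML_SUCCESS) return 1;

        unsigned int count = 0;
        nvmlDeviceGetCount(&count);

        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlDeviceGetHandleByIndex(i, &dev);

            // Read the current core temperature
            unsigned int tempC = 0;
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);

            // Query the allowed power-limit range, then cap the board at a
            // placeholder 200 W target, clamped to that range
            unsigned int minMw = 0, maxMw = 0;
            nvmlDeviceGetPowerManagementLimitConstraints(dev, &minMw, &maxMw);

            unsigned int targetMw = 200000;   // 200 W, in milliwatts
            if (targetMw < minMw) targetMw = minMw;
            if (targetMw > maxMw) targetMw = maxMw;
            nvmlReturn_t r = nvmlDeviceSetPowerManagementLimit(dev, targetMw);

            printf("GPU %u: %u C, power-limit request: %s\n",
                   i, tempC, nvmlErrorString(r));
        }

        nvmlShutdown();
        return 0;
    }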
Case 2: Takeaways
● An application is only as fast as its slowest part
● True multithreading is awesome
  ○ Not easy - but it can be done
● Memory management is crucial to parallelism
● Be ready to tackle any problem
  ○ Overheating? Really?

Q&A

© Copyright 2024