How to Combine OpenMP, Streams, and
ArrayFire for Maximum Multi-GPU
Throughput
Shehzan Mohammed
@shehzanm
@arrayfire
Outline
● Introduction to ArrayFire
● Case Study 1: glasses.com
● Case Study 2: Accelerate Diagnostics
ArrayFire
● World’s leading GPU experts
○ In the industry since 2007
○ NVIDIA Partner
● Deep experience working with thousands of customers
○ Analysis
○ Acceleration
○ Algorithm development
● GPU Training
○ Hands-on course with a CUDA engineer
○ Customized to meet your needs
ArrayFire
● Hundreds of parallel functions
○ Targeting image processing, machine learning, etc.
● Support for multiple languages
○ C/C++, Fortran, Java, and R
● Linux, Windows, Mac OS X
● OpenGL-based graphics
● Based around one data structure
● Just-in-Time (JIT)
○ Combines multiple operations into one kernel (see the sketch below)
● GFOR, the only data-parallel loop
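A minimal sketch of what the JIT looks like in practice, assuming the standard ArrayFire C++ API (the arrays and sizes here are illustrative):

#include <arrayfire.h>

int main() {
    af::array a = af::randu(1024, 1024);
    af::array b = af::randu(1024, 1024);
    af::array c = af::randu(1024, 1024);

    af::array r = a * b + c - 0.5f; // elementwise ops are recorded, not yet run
    af::eval(r);                    // the JIT fuses the whole expression into one kernel
    return 0;
}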
ArrayFire Functions
● Hundreds of parallel functions
○ Building blocks (non-exhaustive):
■ Reductions, scan, sort
■ Set operations
■ Statistics
■ Matrix operations
■ Image processing
■ Signal processing
■ Sparse matrix
■ Visualizations
Case 1:
glasses.com
Case 1: Glasses.com
● 3D face reconstruction from images
● Image and coordinate-geometry processing
● Came to us with a slow application
○ Made use of OpenCV and OpenMP
○ One thread per PC; 8 threads: 30+ seconds
○ Developed on OS X
Case 1: Glasses.com
● Required a significant hardware investment
○ Increased maintenance
○ Financially not viable in production
○ Had Windows infrastructure
The challenge: Speed, Speed, and much more
Challenge 1: Multithreading
● Multithreading benefits CPU code
○ Calling CUDA from multiple threads may not offer much benefit
■ Overheads of memory management, kernel launches
■ Streams, pinned memory, etc. required to harness the full potential
○ GPU parallelism is faster than the CPU for these operations
● Goal:
○ Make the host code single threaded
○ Move all multithreaded sections to the GPU
Challenge 1: Multithreading
● Most multithreaded sections were easily ported
○ Images are easy to combine and operate on at once
○ ArrayFire gfor
● Some sections were more difficult
○ They require serial access and/or complex operations
○ They need to run on the host - more memory transfers
○ They need a combination of OpenMP and ArrayFire
Challenge 2: Multithreading ArrayFire
● ArrayFire was not thread safe
○ Designed for GPU performance
○ Required substantial work
■ An iterative process
○ Trade-offs
■ Cost of adding critical sections vs.
■ Cost of adding multithreading support
○ Limiting access to data for each thread
Challenge 2: Multithreading ArrayFire
● Required adding critical sections around key operations (see the sketch below)
● Constant memory and textures
○ No way to make these thread safe
■ Except with a critical section
○ Add a critical section vs. use global memory
■ Analyze and customize for each specific operation
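A hedged sketch of that trade-off: constant memory is global to the device, so a named OpenMP critical section serializes the upload-and-launch. The coeffs symbol and filterKernel are illustrative, not ArrayFire internals.

__constant__ float coeffs[64];

__global__ void filterKernel(float* img, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] *= coeffs[i % 64];
}

void runFilter(const float* h_coeffs, float* d_img, int n)
{
    #pragma omp critical(constant_upload)
    {
        cudaMemcpyToSymbol(coeffs, h_coeffs, 64 * sizeof(float));
        filterKernel<<<(n + 255) / 256, 256>>>(d_img, n);
        cudaDeviceSynchronize(); // finish before another thread overwrites coeffs
    }
}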
Challenge 3: Batching
● Image operations can be easily batched
○ Most operations work at the pixel or neighborhood level
● Problems arise when operations are more complex
● Batching does not always map cleanly
○ e.g., affine matrix multiplications
○ Indexing needs to be changed - expensive memcpys
Challenge 3: Batching
● Used OpenMP for parallelism
○ One frame per thread
○ Optimized for the CPU
● One CPU thread + GPU
○ Parallelism on the GPU vs. parallelism on the CPU
● Combined OpenMP threads
Challenge 3: Batching
● Many small operations
○ Individually, it didn’t make sense to port them to the GPU
● Increase the dimensionality of the data (see the sketch below)
○ 2D -> 3D
○ GFOR and strided access
● Moved to single-threaded code
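A minimal sketch of that 2D -> 3D idea, assuming the ArrayFire 3 C++ API; the frames and the arithmetic are illustrative stand-ins for the real per-frame work:

#include <arrayfire.h>

void batchedExample()
{
    // three same-sized 2D frames (illustrative)
    af::array frame0 = af::randu(512, 512);
    af::array frame1 = af::randu(512, 512);
    af::array frame2 = af::randu(512, 512);

    af::array stack  = af::join(2, frame0, frame1, frame2); // 512 x 512 x 3
    af::array result = af::constant(0, stack.dims());

    gfor (af::seq i, stack.dims(2)) {
        // all slices execute as one batched launch instead of a CPU loop
        result(af::span, af::span, i) = stack(af::span, af::span, i) * 0.5f + 1.0f;
    }
}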
Challenge 3: Batching
● Call custom CUDA kernels
○ Special indexing
// get the raw device pointer from ArrayFire, then launch (grid, block)
float * bound = boundary.device<float>();
kernel<<< blocks, threads >>>(bound, boundary.elements());
● Specialized matrix multiply (see the sketch below)
○ ssyrk vs. gemm
○ 2x faster
○ Concurrent execution using streams
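The slides do not show the cuBLAS calls; a hedged sketch of the gemm -> syrk specialization for a symmetric product C = A * A^T, shown side by side for comparison (in practice you would pick one path):

#include <cublas_v2.h>

// dA is n x k and dC is n x n, both column-major device buffers
void symmetricProduct(cublasHandle_t handle, cudaStream_t stream,
                      const float* dA, float* dC, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream); // run on the pipeline's stream for concurrency

    // Generic path: C = A * A^T via gemm
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                n, n, k, &alpha, dA, n, dA, n, &beta, dC, n);

    // Specialized path: ssyrk fills only one triangle of the symmetric result,
    // roughly halving the work (the ~2x cited above)
    cublasSsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n, k, &alpha, dA, n, &beta, dC, n);
}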
Challenge 3: Batching
● Results
○ 90 ms -> 28 ms on a GTX 690
● Other improvements
○ Overlapped pinned-memory transfers
○ Generic-to-specialized matrix multiply
○ Streams
Concurrent Computation
● Overlap CPU and GPU computation
○ The CPU handles variable-length data sets, one frame at a time
○ The GPU handles fixed-length data sets, all frames concurrently (a concrete sketch follows the skeleton below)
#pragma omp sections
{
    #pragma omp section
    {
        // GPU Code
    }
    #pragma omp section
    {
        // CPU Code
    }
}
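A hedged, concrete version of that skeleton; Frame, process_gpu_frames(), and process_cpu_frame() are illustrative stand-ins for the real work:

#include <arrayfire.h>
#include <vector>

struct Frame;                          // illustrative types and helpers
void process_gpu_frames(af::array&);   // batched ArrayFire work
void process_cpu_frame(Frame&);

void processAll(af::array& allFrames, std::vector<Frame>& cpuFrames)
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            // GPU: fixed-length data, all frames batched through ArrayFire
            process_gpu_frames(allFrames);
        }
        #pragma omp section
        {
            // CPU: variable-length data, one frame at a time
            for (Frame& f : cpuFrames) process_cpu_frame(f);
        }
    }
    // implicit barrier: both halves are done here
}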
Results
● 1 process (5 threads): 8 seconds
● 6 processes (2 threads each): 22 seconds
● Demo
Demo
Case 2:
Accelerate Diagnostics
Case 2: Accelerate Diagnostics
● Multithreaded Java code with CUDA integration
● Image processing of large images (4k x 4k each)
● Port to C++
● Hard time constraint
● Hard reliability constraint
The challenge: Maximize PCIe throughput
○ Image processing is very parallel
○ Memory transfer is the majority of application run time
Case 2: Accelerate Diagnostics
● Target hardware:
○ Intel Xeon CPU
○ 2 GTX Titans per system
○ 64 GB RAM
● Required speedup: ~5x
● Required reliability: 48-hour stress test
The Framework
● Master thread - scheduling and management
● Slave threads - each handling one ‘pipeline’
● Each pipeline handles one ‘site’ at a time
● Continuous execution
● Pipeline - the serial flow of execution for one site (the “rabbit hole”)
● Site - an independent data set of images (the “rabbit”)
The Framework - Initial
[Diagram: the master thread reads sites from a site database and feeds Pipe 0 and Pipe 1; each pipe processes one site on its own GPU (GPU 0, GPU 1)]
Master Thread
● Minimalist
● Initializes and controls the pipelines
● Feeds sites to the pipelines
[Diagram: the master thread spawns Thread 0 and Thread 1, which run Pipe 0 and Pipe 1; each pipe processes one site on GPU 0 or GPU 1]
Pipeline
● Serial execution within a pipeline
● Processes one site at a time
[Diagram: Pipe 0 processing a single site]
Challenge 1: CPU Parallelism
How do we parallelize pipelines independently? (See the sketch below.)
● Each thread processes one pipeline
○ At the pipeline level, the application is single threaded
● Allot one GPU to each pipeline
● Pipelines are initialized once per run
● Perpetual execution
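A minimal sketch of that scheme; Site, nextSite(), and processSite() are hypothetical stand-ins for the master thread's feeding logic:

#include <omp.h>
#include <cuda_runtime.h>

struct Site;                          // illustrative
Site* nextSite();                     // fed by the master thread (hypothetical)
void processSite(int pipe, Site* s);  // one site's serial pipeline (hypothetical)

const int NUM_PIPELINES = 2;          // one per GPU at this stage

void runPipelines()
{
    #pragma omp parallel num_threads(NUM_PIPELINES)
    {
        int pipe = omp_get_thread_num();
        cudaSetDevice(pipe);                // each pipeline is pinned to its GPU
        while (Site* site = nextSite())     // perpetual execution
            processSite(pipe, site);
    }
}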
Parallelism: Results
● On the CPU side, this worked fine
● On the GPU - not so much
○ Too many blocking syncs to allocate and deallocate memory
○ Copy/kernel-execution collisions between threads
○ No concurrency
○ Extremely slow memory transfer speeds
■ Each image is 16 MB, with multiple transfers per kernel call
○ Although programmatically parallel, execution was almost serial, and probably slower
Challenge 2A: Pinned Memory
Transfer speeds can double with pinned memory (see the sketch below)
● For pageable memory, CUDA first copies to a pinned staging buffer and then to the GPU
● Non-pageable (pinned) memory cannot be paged out by the OS
● With pinned memory, CUDA skips the pageable->pinned staging copy
[Diagram: a pageable memory copy goes host DRAM (pageable) -> pinned staging buffer -> device; a pinned memory copy goes host DRAM (pinned) -> device directly]
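A minimal sketch of the difference (buffer names and the 16 MB size are illustrative):

#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    size_t bytes = 16 << 20;                   // one 16 MB image
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* h_pageable = (float*)malloc(bytes); // copies pass through a hidden staging buffer
    float* h_pinned;
    cudaMallocHost((void**)&h_pinned, bytes);  // page-locked: DMA straight to the device
    float* d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);            // slower
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream); // faster, truly async
    cudaStreamSynchronize(stream);

    cudaFree(d_buf); cudaFreeHost(h_pinned); free(h_pageable);
    cudaStreamDestroy(stream);
    return 0;
}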
Challenge 2B: Streams
Concurrency, concurrency, concurrency
● Increases PCIe throughput
○ Streams allow simultaneous copy and execution
○ Together with pinned memory, they allow asynchronous copies
● Each pipeline has one stream allotted to it (see the sketch below)
○ The stream remains active through the lifetime of the pipeline
○ All CUDA operations and kernel launches are issued asynchronously on that stream
○ Use cudaStreamSynchronize (vs. cudaDeviceSynchronize)
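A sketch of that per-pipeline stream lifetime; Site, nextSite(), processKernel, and the launch configuration are hypothetical stand-ins:

#include <cuda_runtime.h>

struct Site { float *pixels, *result; };      // illustrative
Site* nextSite();                             // hypothetical feeder
__global__ void processKernel(float* img);    // hypothetical kernel

void pipelineLoop(float* d_img, size_t bytes, dim3 blocks, dim3 threads)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);                // once, at pipeline startup

    while (Site* site = nextSite()) {
        cudaMemcpyAsync(d_img, site->pixels, bytes, cudaMemcpyHostToDevice, stream);
        processKernel<<<blocks, threads, 0, stream>>>(d_img);
        cudaMemcpyAsync(site->result, d_img, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);        // waits on THIS stream only;
                                              // other pipelines keep running
    }

    cudaStreamDestroy(stream);                // at pipeline shutdown
}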
Pinned Memory and Streams: Results
● Memory transfer speeds increased ~2x
● Problems:
○ Allocating/freeing pinned memory is a full system block
■ All threads on the CPU and all streams on the GPU are blocked
■ Very, very bad
■ The benefits of using streams are negated
○ Device memory alloc/free is also a blocking sync
○ Too many memory API calls, negating the benefits of parallelism
○ Possible memory leaks - very bad for reliability
■ These will be revealed in stress testing
Challenge 3: Better Memory Management
Minimize the number of memory allocations and deletions
● On both the CPU and the GPU
● The memory used in the processing of each site is deterministic and constant
● Solution: create a memory manager
Memory Manager
● Goals:
○ Manage host and device memory for each pipeline
○ Allocate, free memory
○ Assign, retract memory
○ Manage transfers between host and device
○ Ensure consistency between host and device memory
○ Free memory only at the end of the application
Memory Manager
[Diagram: the memory manager hands out “mirrored arrays”. Each mirrored array records a Type, Stream, Flags, and Size, and holds a Device Pointer into device memory and a Host Pointer into host memory. API: Create(), Free(), Push(), Pull(), Update(), Release()]
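The diagram suggests a structure along these lines; a minimal sketch, with all names and the float element type as illustrative assumptions:

#include <cuda_runtime.h>

struct MirroredArray {
    size_t       bytes;
    cudaStream_t stream;   // the owning pipeline's stream
    float       *hostPtr;  // pinned host mirror
    float       *devPtr;   // device mirror

    void create(size_t n, cudaStream_t s) {
        bytes  = n * sizeof(float);
        stream = s;
        cudaMallocHost((void**)&hostPtr, bytes); // pinned, allocated once
        cudaMalloc((void**)&devPtr, bytes);
    }
    // push: host -> device, asynchronously on the pipeline's stream
    void push() { cudaMemcpyAsync(devPtr, hostPtr, bytes, cudaMemcpyHostToDevice, stream); }
    // pull: device -> host
    void pull() { cudaMemcpyAsync(hostPtr, devPtr, bytes, cudaMemcpyDeviceToHost, stream); }
    // free: only at application shutdown, never per site
    void destroy() { cudaFreeHost(hostPtr); cudaFree(devPtr); }
};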
Memory Manager
● Memory usage is deterministic; memory, once allocated, can be reused as needed
○ After the 1st site run, no new pinned or device memory needs to be created
○ Most of the required memory can be created at initialization
○ The same chunk of memory can be reused through pointers
○ Pointers can release the memory back to the manager when processing is completed
Better Memory Management: Results
● Drastic reduction in alloc/free calls
● Much better parallelism
○ Streams are much more concurrent as blocking is reduced
○ CPU threads do not need to be synced
● Stable across multiple site runs
● Memory leaks are easily discovered
○ An increase in usage after the 1st run indicates a leak
○ The manager can be used to make sure all memory is released at the end of each site
The Framework
[Diagram: the master thread feeds Pipe 0 and Pipe 1; each pipe owns a stream (Stream 0, Stream 1) and a memory manager, and processes one site on its own GPU (GPU 0, GPU 1)]
Results
● Significant performance improvements
● Excellent PCIe throughput
● Highly parallel
● GPU kernel execution time is low compared to memcpy times
The Framework - Final
● Increase pipelines to 4
○ 2 per GPU
● 4 pipelines are good for the CPU
○ 4 heavy processing threads
○ 1 light master thread
○ 4 threads = optimal usage
[Diagram: the master thread feeds Pipes 0-3, each with its own stream (Streams 0-3) and memory manager; Pipes 0-1 share GPU 0 and Pipes 2-3 share GPU 1]
Results
● Improvement in times
○ Almost 2x better than required
● Stable memory usage
● GPU usage optimal
● Problems?
Results
● Improvement in times
○ Almost 2x better than required
● Stable memory usage
● GPU usage optimal
● Problem: OVERHEATING!
Results
● Problem: OVERHEATING!
● Solution (a sketch follows below):
○ Use software tools to lower GPU clock speeds
○ Control the fan speeds on the GPUs
○ Set target power and temperature limits
● No major reduction in performance
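The slides name no specific tool; one hedged possibility is NVML, CUDA's management library. The 180 W cap is illustrative, and changing it requires root privileges and a GPU that supports power-limit adjustment:

#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);             // first GPU
    nvmlDeviceSetPowerManagementLimit(dev, 180000);  // cap at 180 W (value in milliwatts)
    nvmlShutdown();
    return 0;
}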
Case 2: Takeaways
● An application is only as fast as its slowest part
● True multithreading is awesome
○ Not easy - but it can be done
● Memory management is crucial to parallelism
● Be ready to tackle any problem
○ Overheating? Really?
Q&A