 
        07.04.2011 GPU optimizations at RRZE GPU--Computing Workshop GPU Why GPUs? J. Habich(a), Prof. Dr. G. Wellein(a,b) , C. Feichtinger(b) (a)HPC Services – Regional ComputingCenter Erlangen of Computer Science (b)Department April 6th 2011 University Erlangen-Nürnberg Peak Performance of CPU vs. GPU [email protected] 2 Peak Memory Bandwidth of CPU vs. GPU Consumer Cards Consumer Cards Professional HP Cards 6x 4x Single Chip Single Chip April 6th 2011 [email protected] 3 April 6th 2011 [email protected] 4 1 07.04.2011 Outline  GPGPU CUDA Hardware  Memory performance by massive parallelism  Iterative Solvers  Lattice Boltzmann on GPUs April 6th 2011 [email protected] GPGPU CUDA Hardware 5 Specifications of the NVIDIA Fermi GPU April 6th 2011  Up to 6 GB of global memory (DRAM)  32 processors SP driven by : Single Instruction Multiple Data (SIMD) Single Instruction Multiple Thread (SIMT)  Explicit in-order architecture  32K Registers  48 KB of local on-chip memory      (shared memory) 1500 MHz DDR 384 bit bus Global gather/scatter possible 144 GB/s bandwidth 16 GB/s PCIe 2.0 x16 interface to CPU  1st and 2nd level Cache hierarchy  clock rate of 1.15 GHz Memory Memory Clock Peak Memory Memory Interface Bandwidth (GB) Clock (MHz) (MHz) (GFLOPs) (bit) (GB/sec) 1030 GFLOP/s (single precision) 515 GFLOP/s (double precision) 1150 1030 6 1500 384 144 GeForce GTX 280 1400 Tesla C2070 1000 1 1160 512 148.6 1350 345 0.768 900 384 86 Host ( Westmere) 2.66 255 24 1333 3*64 63 GeForce 8800 GTX [email protected] 6 Specifications of the NVIDIA Fermi GPU  14 Multiprocessors (MP); each with: April 6th 2011 [email protected] 7 April 6th 2011 [email protected] 8 2 07.04.2011 Features and paradigm of the CUDA toolkit Remember: OpenMP 0 Thread 0 Typical coarse-grained scheduling using contiguous chunks Thread 1 Divide domain to huge chunks 1 2 3 Memory Bandwidth on GPUs April 6th 2011 Thread 2 5 [email protected] 9 Features and paradigm of the CUDA toolkit Now: CUDA Block 1 Cyclic mapping between workload and threads is often helpful Block 2 April 6th 2011 Parallelize the most outer loop to minimize overhead [email protected] 10 Features and paradigm of the CUDA toolkit Example: Streamcopy-benchmark (GPU) 32 Blocks, 32 Threads each  Fine-grained workload scheduling!  No caches  no consecutive data access within a thread required Divide domain/loop into small chunks Equally distribute to threads 4 Thread 0 0 Thread 1 1 Thread 2 2 Thread 0 3 Thread 1 4 Thread 2 5 __global__ void vectorCopyGPUOuterIter 1st element  Block 0 Thread 0 33rd element  Block 1 Thread 0 (float *d_C, float *d_A, int DATA_N){ 1025th element  Block 0 Thread 0 for ( int pos = (blockIdx.x * blockDim.x) + threadIdx.x; pos < DATA_N ; pos += blockDim.x*gridDim.x bl kDi * idDi ) { ID of current thread d_C[pos] = d_A[pos]; ID of current block Number of threads per } block __syncthreads(); } Jump across all Elements April 6th 2011 [email protected] 11 April 6th 2011 [email protected] 12 3 07.04.2011 Memory bandwidth case study CUDA Scheduling impacted by Resource limits How much parallelism is needed? Example: Streamcopy C=A  Resources per Streaming Multiprocessor (MP)  32 000 (32bit) Register  16 KB to 48 KB of Shared Memory 2010 Tesla 2008 Tesla  Paralellism is limited by resource usage per Block/Thread 2007 Consumer AMD Magny Cours  1536 threads can be executed/scheduled parallel per MP 0 Registers eg s e s pe per Thread ead  20  10 to 35 byte shared memory per Thread (GT300 allows switching)  Using more registers will decrease parallel threads per MP Threads 1536 1024 512 Registers • Not enough blocks • GPUs not used perfectly April 6th 2011 20 30 62 384 256 128 64 83 125 250 500 • Not enough work per blocks • Threads run empty [email protected] 13 April 6th 2011 [email protected] 14 PCI Express bandwidth measurements PCIe Gen 2.0:  Theoretically 8 GB/s per direction  Pinned host memory is mandatory!  Blocked data data-copy copy only improves unpinned memory transfer Iterative Jacobi Solver on GPUs  Cuda Version 2.3  nVIDIA GTX 280 float * h_A; float MAXMEMORY = 1024*1024*512 cudaHostAlloc( (void**)&h_A, (sizeof (float) * MAXMEMORY), 0) ; April 6th 2011 [email protected] 15 April 6th 2011 [email protected] 16 4 07.04.2011 Seminar work: Moritz Kreutzer Evaluation of real SP to DP ratio  Jacobi solver on a 3-dimensional domain Case Study: (Ω = [0; 1]³ ) using CUDA  Jacobi iterative solver (6 point stencil) Loads 6 elements / Stores 1 element  Underlying PDE: Poisson’s equation with right hand side 0 Does 6 FLOPs (implementation dependent) ∆u = f ≡ 0 Algorithmic balance between 4.6 (sp) byte/flop and 9.3(dp) byte/flop  Sinusoidal boundary y conditions with a maximum at each face A kernel running at 1 Gflop/s needs 9.3 Gbyte/s of data  6-point stencil: gdst (x; y ; z) = a · ( gsrc (x+1; y ; z) + gsrc (x - 1; y ; z) + gsrc (x; y+1; z) + gsrc (x; y - 1; z) + gsrc (x; y ; z+1) + gsrc (x; y ; z - 1)) Tesla Top Speed 9/18 GFLOPs (dp/sp) Far away from any floating point barrier Watch all occuring flops and integer calculations  6 Flops, 6 reads, 1 write per lattice site update (LUP) April 6th 2011 [email protected] 17 April 6th 2011 [email protected] 18 Jacobi on GPUs  Shared memory access  One GPU-block treats 3 dim. Block of domain  Threads within a block can share data through a 16 KB shared memory  One thread per node of GPU-Block  Use of shared memory, e.g. for simple blocking Threads load central point for neighbors (central stripe)  Element gsrc (x; y ; z) is stored into shared memory for all threads ( = 1;…; (x 1 di X – 1) dimX  Access to elements gsrc (x + 1; y ; z) and gsrc (x - 1; y ; z) via shared memory  Extension to multiple stripes:  load complete stencil data to shared memory in advance April 6th 2011 [email protected] 19 April 6th 2011 [email protected] 20 5 07.04.2011 Jacobi on GPUs Jacobi on GPUs  Texture Cache  Padding  GPUs do not comprise ordinary caches known from CPUs  needed to obtain coalesced memory access  sensible only in x-direction (adjacent elements in x-direction are adjacent in memory)  However, it is possible to exploit the Texture Cache  Source grid is mapped to a texture  Repetitive read operations can be satisfied by this fast cache with much higher performance  Consider boundary/ghost layer when calculating optimal padding  Swap grids  re-map source grid April 6th 2011 [email protected] 21 April 6th 2011 [email protected] 22 Jacobi on GPUs: Results  Speedup of 2 for cached algorithms  Double precision leads to 50 % reduction as known from CPUs  Memory bound execution Lattice Boltzmann on GPUs April 6th 2011 [email protected] 23 April 6th 2011 [email protected] 24 6 07.04.2011 The lattice Boltzmann method The lattice Boltzmann method f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)  Incompressible flow solver x = threadIdx.x ; // set i index of current cell y = blockIdx.x+1; // set j index of current cell z = blockIdx.y+1; // set k index of current cell  Explicit, fully discrete Boltzmann equation with BGK collision operator 0, t) 1, t) 2, t) 3, t) y-1, z-1,18, t)  1st order accurate in time SAVE f(x,  Pull/collision optimized layout y, n io Relaxation (complex computations) s en f( x, z, z, z, z, m Di  2nd order accurate in space f( x, y, f( x+1, y+1, f( x, y+1, f( x-1, y+1, X- LOAD LOAD LOAD LOAD … LOAD  Physical discretization: D3Q19 Z-Dimension if( fluidcell(x,y,z) ) then z, 0:18, t+1) endif  Halfway bounce-back for obstacle treatment April 6th 2011 [email protected] 25 April 6th 2011 [email protected] Pure kernel (SP), no PCIe/IB transfer Kernel with boundary transfer (SP), no IB  Maximum performance starting at 50x50x50  Maximum performance starting at 200x200x200 (64 times more than pure kernel (50x50x50)!!)  Fluctuations due to different thread numbers and influence of alignment  Blocks influence kernel with any domainsize  Blocks influence kernel Domains < 200x200x200  Comparison: Xeon Node ~100 MLUPS  LUPS: Lattice updates per second April 6th 2011 [email protected] 26  28% is lost for 64 blocks Why? 27 April 6th 2011 [email protected] 28 7 07.04.2011 Time measurements of kernel with 1 and 64 blocks Performance on GPU  Domains > 250^3  about 50% of execution time is spent in nonkernel parts  C2050 speedup of 2 compared to full 12 core Intel Westmere in dp  75% of attainable memory bandwidth  Kernel execution time is constant no matter how much blocks are used SP  Domains < 150^3 non-kernel non kernel part becomes dominant ECC=1 DP  Difference of sp and dp vanishes for 64 blocks  Algorithm becomes communication bound April 6th 2011 [email protected] 29 April 6th 2011 [email protected] 30 [email protected] 32 WaLBerla Heterogeneous GPU CPU computing April 6th 2011 [email protected] 31 April 6th 2011 8 07.04.2011 WaLBerla Heterogeneous LBM Copy to Buffers on CPU and GPU  Patch and Block based domain decomposition After each iteration, boundary data is copied to Communication Buffers  Block contains Simulation Data and Metadata e.g. for parallelization, advanced models  Block can be algorithm or architecture specific  All Blocks are equal in spatial dimensions  Processes can have one or multiple blocks April 6th 2011 [email protected] 33 Buffer swap on GPU April 6th 2011 34 Transfer of buffers to the host Local Communication Buffers are only swapped. No Copy is done! April 6th 2011 [email protected] [email protected] Data of GPU processes is transferred to the Host 35 April 6th 2011 [email protected] 36 9 07.04.2011 Transfer of buffers to the host Disparity of Scales: The tale of the load balance In Theory: 2 MPI processes for CPU domain 1 MPI process for GPU domain Buffers are transferred/received to/from other hosts  Time per iteration Sample block of 75x75x75  GPU: 3ms (@130 MLUPS)  CPU: 9ms (@ 50 MLUPS) Take 8 blocks  6 for GPU and 2 for CPUs What about 4 socket machines? April 6th 2011 [email protected] 37 Disparity of Scales: The tale of the load balance 2 April 6th 2011 [email protected] 38 Disparity of Scales: The tale of the load balance 3 Problems involved: Theory estimate: (130 + 50 +50 ) MLUPS = 230 MLUPS  GPU as well as CPU get better with large domains and only one block Reality: = 167 MLUPS  However, a factor of 7 to 2 restricts largest blocksize and gives you multiple blocks Tuning of parameters leads to balance of 7 to 2 (GPU to CPU) blocks.  Both architectures are not operated under best conditions (N t that (Note th t a CPU bl block k is i computed t d by b three th CPU cores via i OpenMP) O MP) Results of tuned setup : 189 MLUPS  Overheads (PCIe transfers, Thread spawning, onGPU copys) are not covered by simple performance model  Still behind lightspeed estimate April 6th 2011 [email protected] 39 April 6th 2011 [email protected] 40 10 07.04.2011 Weak scaling GPU per Node performance Strong Scaling GPU per Node performance  Weak scaling works as expected  Initial performance drop from one to two cards per node  Loss of 64% in SP on 30 Nodes (60 GPUs GT200)  Loss of 75% in DP on 30 Nodes (60 GPUs GT200)  Up to 16 GFLUPS max. performance (GT200) Huge performance drop due to domain size 66% Up to 137 Intel Xeon nodes necessary! Up to 7 GFLUPS in SP 46% Up to 1275 BlueGene/P nodes necessary! 67% Up to 70 Intel Xeon nodes necessary! 35% About 3 GFLUPS in DP Up to 750 BlueGene/P nodes necessary! April 6th 2011 [email protected] 41 April 6th 2011 [email protected] 42 Summary Thank you very much for your attention  High attainable performance on GPUS is achievable by sophisticated optimization  Sweetspots of different architectures are quite the same (i.e . large domains)  Data locality plays important role  PCIexpress is still a major bottleneck All trademarks are the property of their respective owners. April 6th 2011 [email protected] 43 April 6th 2011 [email protected] 44 11
© Copyright 2025