
GPU Tutorial: How To Program for GPUs
High Performance Computing with GPUs: An Introduction
Krešimir Ćosić <[email protected]>, University of Split, Croatia
LSST All Hands Meeting 2010, Tucson, AZ, Thursday, August 12th, 2010

Overview
• CUDA
• Hardware architecture
• Programming model
• Convolution on GPU

CUDA
• 'Compute Unified Device Architecture'
  – hardware and software architecture for issuing and managing computations on the GPU
• Massively parallel architecture
  – over 8000 threads is common
• C for CUDA (C++ for CUDA)
  – C/C++ language with some additions and restrictions
• Enables GPGPU – 'General Purpose Computing on GPUs'
GPU: a multithreaded coprocessor
• SP: scalar processor ('CUDA core')
  – executes one thread
• SM: streaming multiprocessor
  – 32 SPs (or 16, 48 or more)
  – fast local 'shared memory', shared between the SPs: 16 KiB (or 64 KiB)
• Global memory (on device)

GPU: SMs
• 30 SMs on GT200, 14 SMs on Fermi
• For example, GTX 470: 14 SMs x 32 cores = 448 cores on a GPU
• GDDR memory: 512 MiB to 6 GiB
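These per-device figures can also be queried at run time. A minimal sketch (our addition, not from the slides) using the CUDA runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // properties of device 0

    printf("Device:               %s\n",      prop.name);
    printf("Multiprocessors (SMs):%d\n",      prop.multiProcessorCount);
    printf("Shared mem per block: %zu KiB\n", prop.sharedMemPerBlock / 1024);
    printf("Global memory:        %zu MiB\n", prop.totalGlobalMem / (1024 * 1024));
    return 0;
}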
How To Program For GPUs
• Parallelization
  – decomposition to threads
• Memory
  – shared memory, global memory

Important Things To Keep In Mind
• Avoid divergent branches
  – threads of a single SM must be executing the same code
  – code that branches heavily and unpredictably will execute slowly (see the sketch below)
• Threads should be as independent as possible
  – synchronization and communication can be done efficiently only between threads of a single multiprocessor
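To make the divergence point concrete, here is a small illustrative kernel (our sketch, not from the slides): the first branch splits even and odd threads of the same warp, so the warp executes both paths serially; the second branches uniformly per warp, so each warp takes only one path.

// Illustrative sketch of warp divergence (hypothetical kernel).
__global__ void DivergenceExample(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd threads of one warp take different paths.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;

    // Not divergent: all 32 threads of a warp share the same warp index,
    // so each warp takes exactly one path.
    if ((threadIdx.x / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}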
How To Program For GPUs
• Parallelization
  – decomposition to threads
• Memory
  – shared memory, global memory
• Enormous processing power
  – avoid divergence
• Thread communication
  – synchronization, no interdependencies

Programming model

Thread blocks
• Threads are grouped in thread blocks
  – 128, 192 or 256 threads in a block
• One thread block executes on one SM
  – all its threads share the SM's 'shared memory'
  – 32 threads are executed simultaneously (a 'warp')

Thread blocks
• Blocks execute on SMs
  – they execute in parallel
  – they execute independently!
• Blocks form a grid
• Thread ID is unique within a block
• Block ID is unique within the grid

Code that executes on GPU: Kernels
• A kernel is a simple C function
  – it executes on the GPU
  – it executes in parallel, as many times as there are threads
• The keyword __global__ tells the compiler to make a function a kernel (and compile it for the GPU instead of the CPU)
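A minimal illustrative kernel (our sketch, not from the slides; AddOne and dataGPU are hypothetical names) showing the __global__ keyword, the per-thread index, and the launch syntax used in the example code later on:

// Minimal kernel sketch: each thread adds 1 to one element of the array.
__global__ void AddOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // globally unique thread ID
    if (i < n)
        data[i] += 1.0f;
}

// Host-side launch (kernel call syntax): n threads in blocks of 128.
// AddOne<<<(n + 127) / 128, 128>>>(dataGPU, n);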
Convolution
• To get one pixel of the output image:
  – multiply the mask (pixelwise) with the image at the corresponding position
  – sum the products
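Stated as a formula (our restatement of the two steps above, matching the indexing used in the kernel on the next slide):

\mathrm{out}(x,y) \;=\; \sum_{f_Y=0}^{\text{filtH}-1}\;\sum_{f_X=0}^{\text{filtW}-1} \mathrm{img}(x+f_X,\; y+f_Y)\,\cdot\,\mathrm{filt}(f_X, f_Y)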
__global__ void Convolve(
    float* img,  int imgW,  int imgH,
    float* filt, int filtW, int filtH,
    float* out)
{
    const int nThreads = blockDim.x * gridDim.x;                 // total threads in the grid
    const int idx      = blockIdx.x * blockDim.x + threadIdx.x;  // globally unique thread ID

    const int outW    = imgW - filtW + 1;
    const int outH    = imgH - filtH + 1;
    const int nPixels = outW * outH;

    // Each thread handles every nThreads-th output pixel.
    for (int curPixel = idx; curPixel < nPixels; curPixel += nThreads)
    {
        int x = curPixel % outW;
        int y = curPixel / outW;

        float sum = 0;
        for (int filtY = 0; filtY < filtH; filtY++)
            for (int filtX = 0; filtX < filtW; filtX++)
            {
                int sx = x + filtX;
                int sy = y + filtY;
                sum += img[sy * imgW + sx] * filt[filtY * filtW + filtX];
            }
        out[y * outW + x] = sum;
    }
}

Kernel - example code, pt 1
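The extracted slide also contains the plain double loop over output pixels used on the CPU; for comparison, a host-side equivalent of the kernel might look like this (our reconstruction, loop structure taken from the slide):

// CPU equivalent of the kernel above: one sequential loop over all output pixels.
void ConvolveCPU(float* img,  int imgW,  int imgH,
                 float* filt, int filtW, int filtH,
                 float* out)
{
    const int outW = imgW - filtW + 1;
    const int outH = imgH - filtH + 1;

    for (int y = 0; y < outH; y++)
        for (int x = 0; x < outW; x++)
        {
            float sum = 0;
            for (int filtY = 0; filtY < filtH; filtY++)
                for (int filtX = 0; filtX < filtW; filtX++)
                    sum += img[(y + filtY) * imgW + (x + filtX)]
                         * filt[filtY * filtW + filtX];
            out[y * outW + x] = sum;
        }
}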
Setup and data transfer
• cudaMalloc
  – allocate memory on the GPU (global memory)
• cudaMemcpy
  – transfer data to and from the GPU (global memory)
• GPU is the 'device', CPU is the 'host'
• Kernel call syntax

int main()
{
    ...
    float* img ...
    int imgW, imgH ...

    float* imgGPU;
    cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
    cudaMemcpy(
        imgGPU,                         // Destination
        img,                            // Source
        imgW * imgH * sizeof(float),    // Size in bytes
        cudaMemcpyHostToDevice          // Direction
    );

    float* filter ...
    int filterW, filterH ...

    float* filterGPU;
    cudaMalloc((void**)&filterGPU, filterW * filterH * sizeof(float));
    cudaMemcpy(
        filterGPU,                          // Destination
        filter,                             // Source
        filterW * filterH * sizeof(float),  // Size in bytes
        cudaMemcpyHostToDevice              // Direction
    );
Example: setup and data transfer, pt 1
    int resultW = imgW - filterW + 1;
    int resultH = imgH - filterH + 1;
    float* result = (float*) malloc(resultW * resultH * sizeof(float));

    float* resultGPU;
    cudaMalloc((void**)&resultGPU, resultW * resultH * sizeof(float));

    /* Call the GPU kernel */
    dim3 block(128);
    dim3 grid(30);
    Convolve<<<grid, block>>>(
        imgGPU, imgW, imgH,
        filterGPU, filterW, filterH,
        resultGPU
    );

    cudaMemcpy(
        result,                             // Destination
        resultGPU,                          // Source
        resultW * resultH * sizeof(float),  // Size in bytes
        cudaMemcpyDeviceToHost              // Direction
    );

    cudaThreadExit();
    ...
}
Example: setup and data transfer, pt 2
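The example above omits error handling for brevity. A minimal sketch (our addition, not part of the original example; checkCuda is a hypothetical helper) of how the runtime calls might be checked:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call failed.
static void checkCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Usage, e.g.:
//   checkCuda(cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float)), "cudaMalloc imgGPU");
//   Convolve<<<grid, block>>>(...);
//   checkCuda(cudaGetLastError(), "Convolve launch");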
Speedup
• Linear combination of 3 filters sized 15x15
• Image size: 2k x 2k
• CPU: Core 2 @ 2.0 GHz (1 core)
• GPU: Tesla S1070 (GT200)
  – 30 SMs, 240 CUDA cores, 1.3 GHz
• CPU: 6.58 s (0.89 Mpixels/s)
• GPU: 0.21 s (27.99 Mpixels/s)
• 31 times faster!

CUDA capabilities
• 1.0: GeForce 8800 Ultra/GTX/GTS
• 1.1: GeForce 9800 GT, GTX, GTS 250
  – + atomic instructions, …
• 1.2: GeForce GT 220
• 1.3: Tesla S1070, C1060, GeForce GTX 275, 285
  – + double precision (slow), …
• 2.0: Tesla C2050, GeForce GTX 480, 470
  – + ECC, L1 and L2 cache, faster IMUL, faster atomics, faster double precision on Tesla cards, …

CUDA essentials
• developer.nvidia.com/object/cuda_3_1_downloads.html
• Download:
  – Driver
  – Toolkit (the nvcc compiler)
  – SDK (examples) (recommended)
  – CUDA Programming Guide

Other tools
• 'Emulator'
  – executes on the CPU
  – slow
• Simple profiler
• cuda-gdb (Linux)
  – on-device debugger
• Parallel Nsight (Vista)
  – simple profiler
  – on-device debugger

Logical thread hierarchy
• Thread ID: unique within a block
• Block ID: unique within the grid
• To get a globally unique thread ID: combine the block ID and the thread ID
• Threads can access both shared and global memory
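As a small sketch (our addition) of the "combine block ID and thread ID" rule for a one-dimensional launch, mirroring the idx computation in the Convolve kernel above:

// Globally unique thread ID for a 1D grid of 1D blocks (illustrative kernel).
__global__ void WhoAmI(int* out)
{
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // block ID combined with thread ID
    out[globalId] = globalId;
}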