Lesson #3: Nvidia HW and CUDA – Introduction
Dr. Avi Mendelson – Intel
Dr. Evgeny Bolotin – Intel
Technion, GPGPU course, Winter 2008
The material is courtesy of: D. Kirk (Nvidia), Prof. W. M. Hwu (UIUC), M. Behar (Intel), and the NVISION 08 / Hot Chips 08 CUDA tutorials.

Agenda
– Intro to Nvidia GPGPU
– Nvidia GPGPU programming model
– Nvidia GPGPU HW
– What is GPGPU? – a reminder ☺
– Introduction to CUDA
– We will start with G80 (towards G200 in the next lesson)

GPU vs CPU: Peak GFLOPS
[Figure: peak GFLOPS and memory bandwidth of NVIDIA GPUs vs Intel CPUs over time]

What is GPGPU?
– General Purpose computing on Graphics Processing Units
– The GPU is well suited to data-parallel problems
– Common apps: physics, image filtering, video and audio encoding, matrix algebra, sorting, scientific workloads…

What is Behind such an Evolution?
– The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about)
  – So, more transistors can be devoted to data processing rather than data caching and flow control
[Figure: CPU die dominated by control logic and cache vs GPU die dominated by ALUs; both backed by DRAM]

GPGPU Challenge in the Old Days
– Mapping the algorithm to a graphics API (OpenGL or DX)
  – Graphics terminology (shader, pixel, triangle…)
– Limited shader capabilities
  – Limited instruction set: no integer, no bit operations
  – No communication between threads
  – Number of outputs was limited
  – No scatter operation

CUDA (and G80), since Nov. 2006
– "Compute Unified Device Architecture"
– Extended C: no graphics API overhead, low learning curve
  – No knowledge of graphics is required
– Unlimited, random and efficient access to memory
  – A thread can read/write as many locations as needed
  – Gather/scatter capabilities
– Software-managed cache (Shared Memory)
– Significantly extends the GPU beyond graphics

GFx Pipeline – Unified Design
[Figure: the stages of the graphics pipeline mapped onto a unified pool of programmable processors]

Why Unify?
[Figures: fixed vertex/pixel shader partitions leave units idle when the workload is vertex- or pixel-heavy; a unified shader pool balances the load]

G80 Thread Computing Pipeline
– The future of GPUs is programmable processing
– Build the architecture around the processor
[Figure: Host → Input Assembler → Vtx/Geom/Pixel thread issue → array of SPs with texture-fetch (TF) units and L1 caches → load/store → L2 → FB (frame buffer) / global memory, managed by a Thread Execution Manager]

CUDA Design Goals
– Scale to 100s of cores, 1000s of parallel threads
– Let programmers focus on parallel algorithms … and not the mechanics of a parallel programming language
CUDA = SPMD Model
– SPMD (Single Program, Multiple Data)
  – All PEs execute the same program in parallel, but each has its own data
  – Each PE uses a unique ID to access its portion of the data
  – Different PEs can follow different paths through the same code

Streaming architecture lexicon wars: SPMD vs. Scalar/Vector/SIMD – WARPs?
– Traditionally, multiprocessors provided a set of vector instructions:
  – MMX – early days of Intel's vectors
  – SSE – Intel's second generation: 128-bit vectors
  – AVX – Intel's next generation: 256-bit vectors
  – LRBNI – Intel's future graphics, with 512-bit vector operations
  – AltiVec – IBM's definition of vector operations
– NVIDIA approach: build vector operations out of scalar operations
  – The user writes scalar code (a sequence of operations running on a single element)
  – At instantiation time the user tells the runtime the problem size and partitioning hints
  – HW breaks the problem into thousands of threads (one thread per element), organized in several hierarchies and "vectors" – HW SIMDification?
  – Each thread runs the same code on a different data element
  – Threads get scheduled in groups of 32 ("warps") – just an implementation choice
  – WARP – a special case of SIMD?

HW Overview (G80) – CUDA Scalable Computing
[Figure: Stream Processor Array (SPA) built from Texture Processor Clusters (TPC); each TPC = 2 SMs + TEX; each SM = Instruction L1, Data L1, instruction fetch/dispatch, shared memory, 8 SPs and 2 SFUs]
– The same building blocks scale from NV Tegra (mobile) through GeForce 8800 (desktop) to Tesla D870 (server-desktop) and Tesla S870 (server)

CUDA System – Physical Reality
– CPU (host)
– GPU with its own local DRAM (the CUDA device)

CUDA Processor Terminology
– SPA – Streaming Processor Array
– TPC – Texture Processor Cluster (2 SM + TEX)
– SM – Streaming Multiprocessor (8 SP)
  – Multi-threaded processor core
  – Fundamental processing unit for a CUDA thread block
– SP – Streaming Processor (FMAD operations)
  – Scalar ALU for a single CUDA thread
– SFU – Super Function Unit (transcendentals)

CUDA Programming Model: A Highly Multithreaded Coprocessor
– The GPU is viewed as a compute device that:
  – Is a coprocessor to the CPU (host)
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
– Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
– Differences between GPU and CPU threads
  – GPU threads are extremely lightweight – very little creation overhead
  – The GPU needs 1000s of threads for full efficiency; up to 12,288 threads are alive simultaneously (768 threads x 16 SMs on G80)
  – A multi-core CPU needs only a few
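The SPMD style above can be made concrete with a short kernel. This is a minimal sketch, not taken from the slides: the same scalar code runs on every thread, and each thread uses its unique ID to pick the element it owns (the kernel name, sizes and launch configuration are illustrative).

    // Minimal SPMD example: every thread runs the same scalar code and uses its
    // unique global ID to select the element it works on.
    __global__ void scale_kernel(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique thread ID
        if (i < n)                                       // the grid may be larger than n
            data[i] *= factor;
    }

    // Host side: one thread per element, e.g. 256 threads per block.
    // scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);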
Host & Device
– Serial code runs on the Host
– The Device is an ultra-parallel coprocessor for the Host
– Parallel CUDA threads run on the Device
– Host memory and Device memory are separate

CUDA Device Memory Space
[Figure: the device runs a grid of blocks; each block has its own shared memory; each thread has registers and local memory; all threads access global, constant and texture memory, which the host can also read/write]

Memory Space
– Registers: 32 KB per SM (8K registers)
– Global and local memory: DRAM, large access latency (NOT cached!)
– Constant memory (cached)
  – Fast if all threads read the same address (broadcast)
– Texture memory (cached)
  – The texture cache is optimized for 2D spatial locality
– Shared memory
  – SW managed, very low latency
  – 16 KB of on-chip memory per SM, organized in 16 banks
  – All threads of a block can share/exchange data using the shared memory
  – Allows reuse of data, reducing bandwidth – a substantial performance improvement

Memory Summary
Memory    | Location | Cached         | Access     | Who
Registers | On-chip  | N/A            | Read/write | One thread
Local     | Off-chip | No             | Read/write | One thread
Shared    | On-chip  | N/A (resident) | Read/write | All threads in a block
Global    | Off-chip | No             | Read/write | All threads + host
Constant  | Off-chip | Yes            | Read       | All threads + host
Texture   | Off-chip | Yes            | Read       | All threads + host

Thread Hierarchy
– A kernel launch creates a Grid
  – Grid of blocks, blocks of threads
  – Threads are organized and scheduled in warps (microarchitecture)
  – Threads and blocks have unique IDs
– A thread block is a batch of threads that cooperate:
  – Sharing through a low-latency shared memory
  – HW synchronization primitives
– Threads from different blocks cannot cooperate
  – Blocks are required to execute independently

Block and Thread IDs
– Block ID: 1D or 2D; Thread ID: 1D, 2D, or 3D
– This way a thread can decide what data to process
– Simplifies memory addressing when processing multidimensional data
  – Image processing, matrix manipulation
[Figure: Device → Grid 1 of blocks (0,0)…(2,1); Block (1,1) expanded into a 5x3 array of threads (0,0)…(4,2)]

Thread Hierarchy example
– A kernel is defined with the __global__ specifier
– The number of CUDA threads for each call is specified using a new <<<…>>> syntax
– The thread ID is accessible within the kernel through the built-in threadIdx variable
– In the example: Grid = 1 block, Block = NxN threads (see the sketch after this slide)

CUDA Common Programming Style
– Each thread works (computes) on one element
  – Define what to do for one element
  – Launch the kernel on the whole data set
– Store blocks of data in fast shared memory (tiling)
– Use synchronization primitives among threads of the same block

Barrier – Synchronization Function
– void __syncthreads();
– Synchronizes all threads in a block
– Once all threads have reached this point, execution resumes normally
– Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
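A minimal sketch of the kind of kernel the "Thread Hierarchy example" slide describes; the slide's own code is not present in the extracted text, so the names N and matAdd are illustrative. One block of NxN threads is launched with the <<<…>>> syntax, and each thread adds a single matrix element.

    #define N 16   // illustrative matrix dimension

    // Each of the N x N threads in the single block computes one element of C.
    __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;              // 2D thread ID within the block
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    // Launch: Grid = 1 block, Block = N x N threads.
    // dim3 threadsPerBlock(N, N);
    // matAdd<<<1, threadsPerBlock>>>(d_A, d_B, d_C);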
A typical programming pattern using the shared memory
– Load data from device memory to shared memory
– Synchronize with all the other threads of the block, so that each thread can safely read shared memory locations that were written by different threads
– Process the data in shared memory
– Synchronize again if necessary, to make sure that shared memory has been updated with the results
– Write the results back to device memory

Example: CUDA kernel (per-block sum reduction)

    __global__ void sum_kernel(int *g_input, int *g_output)
    {
        extern __shared__ int s_data[];        // allocated during kernel launch

        // Read phase: read input into shared memory from global memory
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        s_data[threadIdx.x] = g_input[idx];
        __syncthreads();

        // Compute phase: compute the sum for the thread block
        for (int dist = blockDim.x / 2; dist > 0; dist /= 2) {
            if (threadIdx.x < dist)
                s_data[threadIdx.x] += s_data[threadIdx.x + dist];
            __syncthreads();
        }

        // Write phase: write the block's sum to global memory
        if (threadIdx.x == 0)
            g_output[blockIdx.x] = s_data[0];
    }

CUDA Host Runtime Component
– The host runtime component can only be used by host functions
– It provides functions to handle:
  – Device management
  – Memory management
  – Code module management
  – Execution control
  – Texture reference management
  – Interoperability with OpenGL and Direct3D

Example for Host Runtime code
[The code shown on these two slides is not recoverable from the text; a sketch of the same allocate/copy/launch/copy-back pattern is given below.]

CUDA = C Language Extensions
– Declaration specifiers: __global__, __device__, __local__, __shared__
  – e.g. __global__ void convolve(float *image)
– Keywords: threadIdx, blockIdx, blockDim (of type dim3)
  – e.g. int i = blockIdx.x * blockDim.x + threadIdx.x;
– Intrinsics: __syncthreads(); __sin(); __sqrt()
– Runtime API: cudaMalloc(), cudaMemcpy()
– Function launch: convolve<<<100, 10>>>(myimage);   // 100 blocks, 10 threads per block

Language Extensions: Variable Type Qualifiers
Declaration                               | Memory   | Scope  | Lifetime
__device__ __local__    int LocalVar;     | local    | thread | thread
__device__ __shared__   int SharedVar;    | shared   | block  | block
__device__              int GlobalVar;    | global   | grid   | application
__device__ __constant__ int ConstantVar;  | constant | grid   | application
– Automatic variables without any qualifier reside in a register
– Pointers can only point to memory allocated or declared in global memory
  – Allocated on the host and passed to the kernel: __global__ void KernelFunc(float* ptr)

Compiling CUDA: PTX as an intermediate
[Figure: C/C++ application → NVCC compiler → PTX code → Translator → target code for G80 … future GPUs; ASM-level library programmers can work directly at the PTX level]
– Parallel Thread eXecution (PTX)
  – Virtual machine and ISA
  – Programming model
  – Execution resources and state
– ISA – Instruction Set Architecture
  – Variable declarations
  – Instructions and operands
– The Translator is an optimizing compiler (PTX to target)
  – Translates PTX to target code
  – At program install time
– The driver implements the VM runtime
  – Coupled with the Translator

CUDA HW Architecture
– We will describe G80, and discuss later what changed in G200
[Figure: the CUDA device memory space diagram, repeated from above]
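Since the "Example for Host Runtime code" slides did not survive extraction, here is a hedged sketch under the assumption that they showed the usual allocate/copy/launch/copy-back sequence for a kernel like sum_kernel above. The function name sum_on_device and the sizes are illustrative; only standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree) are used.

    #include <cuda_runtime.h>

    void sum_on_device(const int *h_input, int *h_block_sums, int n)
    {
        const int threads = 256;
        const int blocks  = n / threads;          // assumes n is a multiple of 256
        int *d_input, *d_output;

        cudaMalloc((void**)&d_input,  n * sizeof(int));
        cudaMalloc((void**)&d_output, blocks * sizeof(int));
        cudaMemcpy(d_input, h_input, n * sizeof(int), cudaMemcpyHostToDevice);

        // Third launch parameter = bytes of dynamic shared memory (s_data[] in the kernel).
        sum_kernel<<<blocks, threads, threads * sizeof(int)>>>(d_input, d_output);

        cudaMemcpy(h_block_sums, d_output, blocks * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_input);
        cudaFree(d_output);
    }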
Access Times
– Register – dedicated HW – single cycle
– Shared Memory – dedicated HW – single cycle
– Local Memory – DRAM, no cache – *slow*
– Global Memory – DRAM, no cache – *slow*
– Constant Memory – DRAM, cached: 1…10s…100s of cycles, depending on cache locality
– Texture Memory – DRAM, cached: 1…10s…100s of cycles, depending on cache locality
– Instruction Memory (invisible) – DRAM, cached

HW Overview (G80): Streaming Multiprocessor (SM)
– 8 Streaming Processors (SP)
– 2 Super Function Units (SFU)
– Multi-threaded instruction dispatch
  – 1 to 768 threads active
  – Covers the latency of texture/memory loads
– Hot clock 1.35 GHz (20+ GFLOPS)
– Register file (32 KB)
– Shared memory (16 KB)
– ALL 8 SPs execute the same instruction from the same warp (over 4 cycles); each instruction manipulates different data
[Figure: SM internals – instruction fetch from the instruction L1 cache, thread/instruction dispatch, SP0–SP7 with register files RF0–RF7, two SFUs, constant L1 cache, texture path, and load/store to memory; SMs are grouped into TPCs within the SPA]

Thread Scheduling
– A kernel is launched on the Host; the Thread Execution Manager distributes the grid's blocks to the SMs
– Every 32 consecutive threads of a block form a warp; up to 24 warps per SM
[Figure: Grid 1 of blocks → Thread Execution Manager → per-SM warp pool, e.g. Block(1,1) W1, Block(1,1) W2, …, Block(2,1) W7, Block(2,1) W8]

Thread Life Cycle in HW
– The Grid is launched on the SPA
– Thread Blocks are serially distributed to all the SMs
  – Potentially more than one Thread Block per SM
  – Up to 8 blocks per SM
– Each SM launches Warps of Threads
  – Every 32 consecutive threads are grouped into a Warp
  – 2 levels of parallelism
– The SM schedules and executes Warps that are ready to run
– As Warps and Thread Blocks complete, resources are freed
  – The SPA can then distribute more Thread Blocks

SM Executes Blocks
– Threads are assigned to SMs at Block granularity
  – Up to 8 Blocks per SM, as resources allow
  – An SM in G80 can take up to 768 threads
    – Could be 256 (threads/block) x 3 blocks, or 128 (threads/block) x 6 blocks, etc.
  – Threads run concurrently
– The SM assigns/maintains thread IDs
– The SM manages/schedules thread execution
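A small illustrative kernel, not from the slides (all names are made up), that touches the memory types in the access-time list above using the qualifiers introduced earlier: registers, software-managed shared memory, cached constant memory, and uncached global memory.

    __constant__ float c_coeff[16];   // constant memory: off-chip but cached, broadcast-friendly

    // Assumes blockDim.x == 256 so that s_buf has one slot per thread.
    __global__ void access_demo(const float *g_in, float *g_out, int n)
    {
        __shared__ float s_buf[256];                     // shared memory: on-chip, ~single cycle
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a register

        s_buf[threadIdx.x] = (i < n) ? g_in[i] : 0.0f;   // global memory: DRAM, not cached, slow
        __syncthreads();                                 // every thread reaches the barrier

        if (i < n)
            g_out[i] = s_buf[threadIdx.x] * c_coeff[0];  // all threads read the same constant address (broadcast)
    }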
Thread Scheduling/Execution
– Each Thread Block is divided into 32-thread Warps
  – This is an implementation decision, not part of the CUDA programming model!!!
– Warps are THE scheduling units of the SM
– If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?
  – Each Block is divided into 256/32 = 8 Warps
  – There are 8 x 3 = 24 Warps
  – 24 warps x 32 threads = 768 threads
– At any point in time, only one of the 24 Warps will be selected for instruction fetch and execution

SM Warp Scheduling – Using Warps to Hide Memory Latency
– The SM multithreaded Warp scheduler provides zero-overhead Warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible Warps are selected for execution on a prioritized scheduling policy
  – All threads in a Warp execute the same instruction when selected
– 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G80
[Figure: issue timeline, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
– Study case:
  – If one global memory access is needed for every 4 instructions, each warp keeps the SM busy for 4 instructions x 4 cycles = 16 execution cycles per access
  – A minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency (16 execution cycles x 13 warps = 208)

SM Instruction Buffer – Warp Scheduling
– Fetch one warp instruction/cycle
  – From the instruction L1 cache into any instruction buffer slot
– Issue one "ready-to-go" warp instruction/cycle
  – From any warp / instruction buffer slot
  – Operand scoreboarding is used to prevent hazards
  – Issue selection is based on round-robin/age of warp
– The SM broadcasts the same instruction to the 32 threads of a Warp
[Figure: I$ L1 → multithreaded instruction buffer → operand select (register file, constant L1, shared memory) → MAD units / SFU]
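The study case above can be written as a small worked calculation. This helper is only an illustration of the arithmetic; the 4-cycle dispatch and 4-instruction spacing are the slide's assumptions, and the function name is made up.

    #include <stdio.h>

    // Minimum warps needed to hide a memory latency, per the study case above:
    // each warp instruction occupies the SM for 4 cycles, and there are 4
    // independent instructions between global loads, so one warp covers 16 cycles.
    int min_warps_to_hide(int latency_cycles, int cycles_per_warp_instr, int instrs_between_loads)
    {
        int cycles_covered_per_warp = cycles_per_warp_instr * instrs_between_loads;      // 16 on G80
        return (latency_cycles + cycles_covered_per_warp - 1) / cycles_covered_per_warp; // ceiling
    }

    int main(void)
    {
        // 200-cycle latency -> ceil(200 / 16) = 13 warps (13 x 16 = 208 >= 200).
        printf("%d warps\n", min_warps_to_hide(200, 4, 4));
        return 0;
    }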
HW Summary
– The device processes only one grid at a time
– Blocks are distributed to the SMs by the Thread Execution Manager
– Each thread block of a grid is split into warps
– Warps are the basic SM scheduling unit
– An SM can execute multiple blocks concurrently

G80 is a Scoreboarding Machine
– All register operands of all instructions in the Instruction Buffer are scoreboarded
  – Status becomes ready after the needed values are deposited
  – Prevents hazards
  – Cleared instructions are eligible for issue
– Decoupled Memory/Processor pipelines
  – Any thread can continue to issue instructions until scoreboarding prevents issue
  – Allows Memory/Processor ops to proceed in the shadow of other Memory/Processor ops

HW Summary: Threads, Warps, Blocks
– There are (up to) 32 threads in a Warp
  – Only <32 when there are fewer than 32 total threads
– There are (up to) 16 Warps in a Block
– Each Block (and thus each Warp) executes on a single SM
– G80 has 16 SMs
  – At least 16 Blocks are required to "fill" the device; more is better
– If resources (registers, thread space, shared memory) allow, more than one Block can occupy each SM
  – Shared memory and registers are partitioned among the threads of all concurrent blocks
  – So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

Exec. Summary SW/HW
– The GPU is used as an accelerator for massively data-parallel tasks
– CUDA – a non-graphics programming interface
  – CUDA extends C
  – SPMD is a natural way to express data-level parallelism
  – Shared Memory, barriers
  – Easy learning…? A scalable programming model; no need to think in vectors
– Warps and scoreboarding are used to hide memory latency
– G80 is a revolutionary GPU architecture
  – Unified, scalar
  – Scalable (SPs per SM, SMs per TPC, TPC count)
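As a closing illustration of the per-SM resource limits that the summary ties to occupancy, here is a hedged host-side sketch that queries them through the standard CUDA runtime call cudaGetDeviceProperties; the G80 values in the comments are the ones quoted in this lecture.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                              // device 0

        printf("SM count:            %d\n",  prop.multiProcessorCount); // 16 on G80
        printf("Warp size:           %d\n",  prop.warpSize);            // 32
        printf("Shared mem / block:  %zu bytes\n", prop.sharedMemPerBlock); // 16 KB on G80
        printf("Registers / block:   %d\n",  prop.regsPerBlock);        // 8K on G80
        printf("Max threads / block: %d\n",  prop.maxThreadsPerBlock);  // 512 on G80 (16 warps)
        return 0;
    }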