CUDA
Grid, Block, and Thread
Each block has its own shared memory, accessible to every thread in that block.
Each thread has its own private memory (registers and per-thread local memory).
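A minimal sketch of the two memory scopes above. The kernel name `reverseWithinBlock`, the host helper `reverseTiles`, and the block size of 256 are illustrative assumptions, and the input length is assumed to be a non-zero multiple of the block size. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // threads per block (assumed value)

// Each block reverses its own 256-element tile of the input.
__global__ void reverseWithinBlock(const int* in, int* out) {
    // Shared memory: one copy per block, visible to all threads of that block.
    __shared__ int tile[kBlockSize];

    // Private variables: every thread holds its own copy (normally in registers).
    int local = threadIdx.x;                             // index inside the block
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // index inside the grid

    tile[local] = in[global];
    __syncthreads();  // wait until the whole block has filled the tile

    // Read an element that a different thread of the same block wrote.
    out[global] = tile[blockDim.x - 1 - local];
}

// Hypothetical host helper; assumes h_in.size() is a non-zero multiple of kBlockSize.
std::vector<int> reverseTiles(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    std::vector<int> h_out(n);
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Grid of n / kBlockSize blocks, each with kBlockSize threads.
    reverseWithinBlock<<<n / kBlockSize, kBlockSize>>>(d_in, d_out);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The reversal only works because `tile` is shared by the block; `local` and `global` differ per thread even though every thread runs the same code.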
CUDA Thread Block Scheduling
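- Each block is assigned to exactly one streaming multiprocessor (SM; called SMM on Maxwell GPUs) and runs there until it finishes. One SM can hold several resident blocks at the same time.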
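How many blocks fit on one SM depends on the kernel and the device. The sketch below asks the CUDA runtime for that number; `dummyKernel` and `maxResidentBlocks` are hypothetical names, and device 0 is assumed to be the GPU of interest.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel used only so the occupancy query has something to inspect.
__global__ void dummyKernel(float* data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Returns how many blocks of dummyKernel can be resident on the whole
// device at once for the given block size, or -1 if a query fails.
int maxResidentBlocks(int blockSize) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, dummyKernel, blockSize, 0) != cudaSuccess) {
        return -1;
    }

    printf("SMs: %d, resident blocks per SM: %d\n",
           prop.multiProcessorCount, blocksPerSM);
    return blocksPerSM * prop.multiProcessorCount;
}
```

A grid with more blocks than this limit still runs; the remaining blocks wait until a resident block finishes and frees its slot on an SM.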
Warp
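- A warp is the unit in which an SM schedules and executes CUDA threads. The execution context of each resident warp (program counter, registers) stays on the SM.
- A warp consists of 32 threads. Each thread occupies one lane of the warp, and the threads of a warp share one instruction stream (SIMT).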
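The 32-thread grouping can be made visible with a warp shuffle. This is a sketch under stated assumptions: the names `warpSums` and `sumPerWarp` are made up for illustration, the block size of 128 is arbitrary, and the input length must be a non-zero multiple of 128 so that every warp is full.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreadsPerBlock = 128;          // assumed; must be a multiple of 32
constexpr unsigned kFullMask = 0xffffffffu;    // all 32 lanes participate

// Writes one sum per warp: out[w] = sum of the 32 inputs handled by warp w.
__global__ void warpSums(const int* in, int* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;   // position of this thread inside its warp
    int warpInGrid = tid / warpSize;     // valid because blockDim.x is a multiple of 32

    int value = in[tid];
    // Tree reduction inside the warp: each step adds the value held by the
    // lane `offset` positions higher, without touching shared memory.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        value += __shfl_down_sync(kFullMask, value, offset);
    }
    if (lane == 0) out[warpInGrid] = value;  // lane 0 ends up with the warp total
}

// Hypothetical host helper; assumes h_in.size() is a non-zero multiple of kThreadsPerBlock.
std::vector<int> sumPerWarp(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    int warps = n / 32;
    std::vector<int> h_out(warps);
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, warps * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    warpSums<<<n / kThreadsPerBlock, kThreadsPerBlock>>>(d_in, d_out);

    cudaMemcpy(h_out.data(), d_out, warps * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With 128 threads per block, each block contains four warps, so an input of 256 elements yields eight per-warp sums.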
Matrix Multiplication in CUDA
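The notes leave this section empty. One common approach is a tiled kernel that stages sub-blocks of the inputs in shared memory; the sketch below assumes square, row-major `n x n` matrices and a 16 x 16 tile. The names `matMulTiled` and `matMul` are illustrative, not taken from the notes, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 16;  // tile edge length (assumed value)

// C = A * B for square row-major n x n matrices.
__global__ void matMulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float tileA[kTile][kTile];
    __shared__ float tileB[kTile][kTile];

    int row = blockIdx.y * kTile + threadIdx.y;
    int col = blockIdx.x * kTile + threadIdx.x;
    float acc = 0.0f;

    // Walk along the shared dimension one tile at a time.
    for (int t = 0; t < (n + kTile - 1) / kTile; ++t) {
        int aCol = t * kTile + threadIdx.x;
        int bRow = t * kTile + threadIdx.y;
        // Out-of-range elements are loaded as 0 so they do not affect the sum.
        tileA[threadIdx.y][threadIdx.x] =
            (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles are fully loaded

        for (int k = 0; k < kTile; ++k) {
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();  // finish reading before the next load overwrites the tiles
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper; A and B each hold n * n floats, n > 0.
std::vector<float> matMul(const std::vector<float>& A,
                          const std::vector<float>& B, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> C(static_cast<size_t>(n) * n);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(kTile, kTile);
    dim3 grid((n + kTile - 1) / kTile, (n + kTile - 1) / kTile);
    matMulTiled<<<grid, block>>>(dA, dB, dC, n);

    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Each element of `A` and `B` is read from global memory once per tile rather than once per output element, which is the reason for going through shared memory.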
Parallel Reduction in CUDA
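Sequential addressing performs better than interleaved addressing: the active threads stay contiguous, which avoids the warp divergence and shared-memory bank conflicts that the interleaved variants suffer from.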
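A minimal sketch of a block-level sum with sequential addressing. The names `reduceSequential` and `sumOnGpu` are illustrative, the block size of 256 is an assumption (it must be a power of two for this loop), and the per-block partial sums are added on the host to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block; assumed, must be a power of two

// Each block writes the sum of its slice of `in` to partial[blockIdx.x].
__global__ void reduceSequential(const int* in, int* partial, int n) {
    __shared__ int sdata[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0;  // pad the last block with zeros
    __syncthreads();

    // Sequential addressing: the stride halves each step and the active
    // threads are always 0 .. s-1, a contiguous range.
    // (The interleaved variant would instead test `tid % (2 * s) == 0`
    // with a growing stride s, scattering the active threads.)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Hypothetical host helper; assumes h_in is not empty.
int sumOnGpu(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    int blocks = (n + kThreads - 1) / kThreads;
    std::vector<int> h_partial(blocks);
    int* d_in = nullptr;
    int* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reduceSequential<<<blocks, kThreads>>>(d_in, d_partial, n);

    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    int total = 0;
    for (int p : h_partial) total += p;  // final pass over the per-block sums
    return total;
}
```

Once `s` drops below 32, all active threads sit in the first warp and the other warps of the block skip the addition entirely.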