CUDA C/C++ Programming Reference: Fundamentals and Optimization
This cheat sheet consolidates key concepts from several CUDA C/C++ programming resources, with an emphasis on clarity and conciseness.
I. Fundamentals and Setup
Why CUDA and General Purpose GPU Programming (GPGPU)
- Usage: Leverage GPUs for computationally intensive parallel tasks. GPUs, with their many cores, excel at parallel problems, in contrast to CPUs' fewer, more powerful cores.
- Explanation: CPUs handle sequential tasks, while GPUs accelerate problems broken into many independent pieces (data parallelism). CUDA is NVIDIA’s platform for this.
- Takeaway: GPUs are powerful parallel engines, essential for speeding up modern compute-heavy applications.
CUDA Development Environment
- Usage: Set up your system to compile and run CUDA programs. This involves installing the NVIDIA GPU driver and the CUDA Toolkit (which includes nvcc, the CUDA C compiler), and ensuring a compatible C/C++ host compiler for your CPU code.
- Explanation: Think of it as preparing your workbench with the right tools (nvcc, drivers) for building CUDA applications.
- Takeaway: A correctly configured CUDA development environment is the essential first step; a typical verification and compile sequence is sketched below.
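A minimal sanity-check and compile sequence might look like the following (a sketch; exact paths, flags, and versions depend on your installation):
```bash
nvcc --version           # Confirm the Toolkit's CUDA compiler is on your PATH
nvidia-smi               # Confirm the driver is installed and lists your GPU(s)
nvcc hello.cu -o hello   # Compile a CUDA source file; nvcc forwards host code to your C/C++ compiler
./hello
```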
Host and Device Basics / Program Structure
- Usage: Understand the distinct roles of the CPU (host) and GPU (device), how to define GPU-executable functions, and launch them.
- Example Code:
```c
#include <stdio.h>

// Device function (kernel)
__global__ void hello_kernel() {
    printf("Hello from GPU!\n");
}

// Host function
int main() {
    // Launch kernel with 1 block and 1 thread
    hello_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();   // Wait for the GPU to finish
    return 0;
}
```
- Explanation: The CPU (host) manages the GPU (device) by launching __global__ functions (kernels) that run in parallel on the GPU. All threads execute the same kernel code in a Single-Program, Multiple-Data (SPMD) fashion.
- Takeaway: CUDA C allows the CPU to launch parallel __global__ functions (kernels) on the GPU.
Basic CUDA Memory Management
- Usage: Allocate and free memory on the GPU, and transfer data between the CPU (host) and GPU (device).
- Example Code:
```c
// Host code snippets
float *h_data, *d_data;
size_t size = N * sizeof(float);

h_data = (float*)malloc(size);                              // Allocate host memory
cudaMalloc((void**)&d_data, size);                          // Allocate device memory
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // Host-to-device transfer
// ... kernel launch ...
cudaDeviceSynchronize();                                    // Wait for GPU tasks to complete
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // Device-to-host transfer
cudaFree(d_data);                                           // Free device memory
free(h_data);                                               // Free host memory
```
- Explanation: GPU (device) memory is separate from CPU (host) memory. You explicitly allocate (cudaMalloc), transfer (cudaMemcpy), and free (cudaFree) device memory. cudaDeviceSynchronize() makes the CPU wait for the GPU to finish.
- Takeaway: Explicitly manage device memory with cudaMalloc, cudaMemcpy, and cudaFree, and synchronize with cudaDeviceSynchronize.
II. Thread Hierarchy and Execution
Thread Hierarchy (Grids, Blocks, Threads) and Identification
- Usage: Organize your parallel tasks and uniquely identify each thread. Threads are grouped into blocks, and blocks into a grid.
- Example Code:
```c
// Vector addition kernel
__global__ void add_vectors(float *A, float *B, float *C, int N) {
    // Calculate the global unique index for the current thread (1D example)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {   // Ensure the thread doesn't go out of bounds
        C[tid] = A[tid] + B[tid];
    }
}

// Host launch example:
// int threadsPerBlock = 256;
// int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
// add_vectors<<<numBlocks, threadsPerBlock>>>(dev_A, dev_B, dev_C, N);
```
- Explanation: threadIdx provides a thread's ID within its block, and blockIdx provides a block's ID within the grid; blockDim and gridDim specify the corresponding dimensions. These are combined to compute a unique global ID for each thread, which often maps to a data element.
- Takeaway: CUDA's two-level hierarchy (grid of blocks, block of threads) uses blockIdx, threadIdx, blockDim, and gridDim to assign unique work to each thread.
Warps and Transparent Scalability
- Usage: Understand how threads are executed and how CUDA scales programs.
- Explanation: Threads are executed in groups of 32 called warps; the warp is the basic unit of scheduling on a Streaming Multiprocessor (SM). GPUs hide memory latency by switching between ready warps. Blocks execute independently, enabling the same CUDA program to scale efficiently across GPUs with varying numbers of processor cores (see the sketch below).
- Takeaway: Warps (32 threads) are the execution unit; independent blocks ensure programs scale across diverse GPU hardware and help hide latency.
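A minimal sketch (illustrative, not from the original material) of how a thread's warp and lane can be derived from its index using the built-in warpSize device variable:
```c
#include <stdio.h>

__global__ void warp_id_kernel() {
    int tid = threadIdx.x;          // Linear thread index within the block
    int warp_id = tid / warpSize;   // warpSize is a built-in device variable (32 on current NVIDIA GPUs)
    int lane_id = tid % warpSize;   // Position of the thread within its warp (0..31)
    if (lane_id == 0) {             // One printout per warp
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp_id, tid);
    }
}
```
Launched as, for example, warp_id_kernel<<<2, 128>>>(); followed by cudaDeviceSynchronize(), this prints one line per warp.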
Intra-Block Synchronization (__syncthreads())
- Usage: Coordinate threads within the same block to avoid race conditions when sharing data.
- Example Code (Block-wise Sum Reduction snippet):
```c
__global__ void reduce_sum_shared(float *g_in, float *g_out) {
    __shared__ float s_data[256];     // Shared memory for a block
    unsigned int tid = threadIdx.x;
    // ... load data into s_data[tid] ...
    __syncthreads();                  // All threads wait here before continuing

    // Perform reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) {
            s_data[tid] += s_data[tid + s];
        }
        __syncthreads();              // Synchronize after each step of the reduction
    }

    if (tid == 0) {                   // Only the first thread writes the block's final sum
        g_out[blockIdx.x] = s_data[0];
    }
}
```
- Explanation: __syncthreads() acts as a barrier within a block, ensuring all threads in that block reach a certain point before any proceed. This is critical for preventing race conditions when threads read or write shared data.
- Takeaway: __syncthreads() provides a vital barrier for coordinating threads within a block, crucial for correct shared memory access.
III. Memory Hierarchy and Management (Advanced)
CUDA Memory Hierarchy Overview
- Usage: Optimize performance by understanding different memory types and their characteristics.
- Explanation: CUDA GPUs feature a hierarchy of memories: thread-private Registers (fastest), fast on-chip Shared Memory (visible to a block), and read-only Constant Memory (off-chip but aggressively cached), contrasted with the slower, off-chip Global Memory (device-wide, largest capacity).
- Takeaway: GPU performance hinges on efficiently using the memory hierarchy, prioritizing fast on-chip memories and exploiting data locality.
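A hedged sketch (illustrative kernel; assumes a 256-thread block and in-range indices) showing where each memory space appears in code:
```c
__constant__ float c_scale;   // Constant memory: read-only on the device, set from the host via cudaMemcpyToSymbol

__global__ void memory_spaces_demo(const float *g_in, float *g_out) {
    __shared__ float s_tile[256];               // Shared memory: on-chip, shared by the whole block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float r = g_in[tid];                        // Local scalar: normally held in a register; g_in points to global memory
    s_tile[threadIdx.x] = r;
    __syncthreads();

    g_out[tid] = s_tile[threadIdx.x] * c_scale; // Result written back to global memory
}
```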
Shared Memory (__shared__) and Tiling
- Usage: Utilize fast, on-chip memory for data shared and reused by threads within the same block, reducing global memory traffic.
- Example Code:
```c
// Inside a kernel function:
__shared__ float s_A[TILE_SIZE][TILE_SIZE];   // Shared memory declaration
// ...
s_A[threadIdx.y][threadIdx.x] = Md[global_row * WIDTH + global_col];   // Load a tile into shared memory
__syncthreads();                              // Synchronize before using the shared data
```
- Explanation: Declared with __shared__, this memory is accessible by all threads in a block and is much faster than global memory. Tiling algorithms load large data chunks into shared memory for block-local processing, then write results back to global memory, greatly improving locality.
- Takeaway: __shared__ memory provides fast, on-chip storage for collaborative data reuse within a block, often combined with tiling to reduce global memory traffic.
Constant Memory (__constant__)
- Usage: Efficiently store read-only data that is accessed frequently and uniformly by all threads in the grid.
- Example Code:
```c
__constant__ float GLOBAL_SCALE_FACTOR = 2.0f;   // Global, on-device constant

__global__ void scale_kernel(float *data) {
    data[threadIdx.x] *= GLOBAL_SCALE_FACTOR;    // Access constant data
}

// Host code to copy data to constant memory:
// float host_factor = 2.0f;
// cudaMemcpyToSymbol(GLOBAL_SCALE_FACTOR, &host_factor, sizeof(float));
```
- Explanation: Declared with __constant__, this global device memory is aggressively cached. It is ideal for configuration data or lookup tables that do not change during kernel execution and are read by many threads.
- Takeaway: __constant__ memory offers aggressively cached, fast read-only access for data used uniformly by all threads in a grid.
Texture Memory
- Usage: Optimize read-only access patterns, especially for 1D or 2D data like images, where access exhibits spatial locality.
- Example Code (2D Texture Read):
```c
// File-scope texture reference declaration (legacy texture reference API)
texture<float, 2, cudaReadModeElementType> image_tex;

__global__ void process_image_kernel(float *output, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float pixel_val = tex2D(image_tex, (float)x, (float)y);   // Cached, spatially local read
        output[y * width + x] = pixel_val * 0.5f;
    }
}

// Host code to bind the texture (assuming dev_image_data is already on the GPU):
// cudaChannelFormatDesc chDesc = cudaCreateChannelDesc<float>();
// cudaBindTexture2D(0, image_tex, dev_image_data, chDesc, width, height, pitch);
```
- Explanation: Texture memory is a specialized, hardware-optimized cache for read-only data, providing fast access and optional hardware-assisted interpolation for data with spatial locality (nearby threads access nearby data).
- Takeaway: Texture memory is a read-only cache optimized for 1D/2D data with spatial locality, providing fast access and optional hardware-assisted interpolation.
Zero-Copy Host Memory
- Usage: Allow the GPU to directly access system (host) memory, avoiding explicit cudaMemcpy calls.
- Example Code:
```c
// Host side:
float *host_mapped_data;
cudaHostAlloc((void**)&host_mapped_data, size, cudaHostAllocMapped);   // Allocate pinned, mapped host memory

float *dev_ptr_to_host_data;
cudaHostGetDevicePointer((void**)&dev_ptr_to_host_data, host_mapped_data, 0);   // Get the device-side pointer

// The kernel can now use dev_ptr_to_host_data to access host_mapped_data directly
my_kernel<<<grid, block>>>(dev_ptr_to_host_data);

// Host-side cleanup:
// cudaFreeHost(host_mapped_data);
```
- Explanation: This allocates host memory that is "pinned" (non-pageable) and mapped directly into the GPU's address space. Kernels can then read and write this memory directly, which is useful for small, irregular transfers or for overlapping host-device work.
- Takeaway: Zero-copy memory allows direct GPU access to pinned host memory, eliminating explicit cudaMemcpy for certain use cases.
IV. Performance Optimization
Warp Divergence
- Usage: Write kernel code to ensure optimal parallel execution within a warp.
- Explanation: When threads within the same warp (a group of 32 threads) take different control-flow paths (e.g., different branches of an if/else statement), the hardware serializes the paths: each branch is executed in turn with non-participating threads masked off, degrading performance. This is known as warp divergence (see the sketch below).
- Takeaway: Avoid warp divergence by ensuring threads within a warp follow the same execution path to maintain parallel efficiency.
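A minimal illustration (not from the original material) of a divergent branch and a warp-aligned alternative:
```c
__global__ void divergence_examples(float *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd lanes of the same warp take different branches,
    // so the warp executes both paths one after the other.
    if (tid % 2 == 0) data[tid] *= 2.0f;
    else              data[tid] += 1.0f;

    // Non-divergent: the condition is uniform across each group of 32 consecutive
    // threads, so every thread in a given warp follows the same path.
    if ((tid / 32) % 2 == 0) data[tid] *= 2.0f;
    else                     data[tid] += 1.0f;
}
```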
Global Memory Coalescing
- Usage: Design global memory access patterns to maximize memory throughput.
- Example Code:
```c
// Coalesced global memory access (threads in a warp read adjacent elements)
data[row * width + (blockIdx.x * blockDim.x + threadIdx.x)]
```
- Explanation: When the threads of a warp access consecutive global memory locations together, the GPU can service the warp with a single, efficient memory transaction, which is critical for high performance. Uncoalesced access results in multiple, slower transactions (a fuller contrast is sketched below).
- Takeaway: Maximize coalescing by having threads in a warp access contiguous global memory to achieve optimal throughput.
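A hedged sketch (illustrative names; assumes in[] is large enough for the strided reads) contrasting a coalesced access with a strided one that typically cannot be coalesced:
```c
__global__ void access_patterns(const float *in, float *out, int n, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float a = in[tid];            // Coalesced: consecutive threads touch consecutive addresses
        float b = in[tid * stride];   // Strided: consecutive threads are 'stride' elements apart
                                      // (e.g., walking down a column of a row-major matrix)
        out[tid] = a + b;
    }
}
```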
Resource Occupancy and Latency Hiding
- Usage: Balance resource usage (registers, shared memory) to keep the GPU busy and hide memory access latencies.
- Explanation: Occupancy is the ratio of active warps on an SM to the maximum possible. High occupancy allows the SM to quickly switch to another ready warp when one is stalled (e.g., waiting for memory), thereby hiding latency. Over-allocating resources can lead to a “performance cliff” by reducing the number of active warps.
- Takeaway: High occupancy (many active warps) is crucial for hiding memory latency, but overusing resources can lead to performance cliffs.
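The CUDA runtime's occupancy helpers can guide this balancing act. A minimal sketch, assuming a kernel named my_kernel:
```c
int min_grid_size = 0, block_size = 0;
// Ask the runtime for a block size that maximizes occupancy for my_kernel
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel, 0, 0);

int active_blocks_per_sm = 0;
// Query how many blocks of that size can be resident on one SM
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&active_blocks_per_sm, my_kernel, block_size, 0);

printf("Suggested block size: %d, resident blocks per SM: %d\n", block_size, active_blocks_per_sm);
```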
Floating Point Considerations
- Usage: Design numerically stable algorithms and understand the limitations of floating-point arithmetic on GPUs.
- Explanation: Floating-point numbers (IEEE 754 standard) have finite precision, meaning not all real numbers can be represented exactly. This leads to rounding errors, especially in operations that mix widely different magnitudes, so the order of operations in parallel reductions can affect accuracy. Hardware intrinsic functions (e.g., __sinf) are faster but may be less accurate (see the sketch below).
- Takeaway: Understand IEEE 754 floating-point formats, precision, and accuracy to design numerically stable parallel algorithms, as the order of operations can change results due to rounding errors.
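A small, hedged illustration of both points (values chosen only to make the rounding visible):
```c
#include <stdio.h>

__global__ void fast_math_demo(float x, float *out) {
    out[0] = sinf(x);     // Standard single-precision sine
    out[1] = __sinf(x);   // Hardware intrinsic: faster, but with weaker accuracy guarantees
}

int main() {
    // With 32-bit floats, adding a tiny value to a huge one can be lost entirely,
    // so the same numbers summed in a different order give different results.
    float big = 1.0e8f, tiny = 1.0f;
    float sum1 = (big + tiny) - big;   // 0.0f: 'tiny' is absorbed by 'big'
    float sum2 = (big - big) + tiny;   // 1.0f: same values, different order
    printf("sum1 = %f, sum2 = %f\n", sum1, sum2);
    return 0;
}
```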
V. Advanced CUDA Features
Atomic Operations
- Usage: Perform thread-safe read-modify-write operations on shared (global or shared) memory locations, preventing race conditions.
- Example Code (Atomic Increment – Histogram):
```c
__global__ void atomic_histogram_kernel(unsigned char *input_data, unsigned int *counts_array, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        unsigned char value = input_data[tid];
        atomicAdd(&counts_array[value], 1);   // Atomically increment the count
    }
}

// Example atomic lock (conceptual):
// while (atomicCAS(&lock_variable, 0, 1) != 0);   // Acquire lock (0 = unlocked, 1 = locked)
// // ... critical section ...
// atomicExch(&lock_variable, 0);                  // Release lock
```
- Explanation: Atomic functions (e.g., atomicAdd, atomicCAS (compare-and-swap), atomicExch) guarantee that a memory operation completes without interference from other threads, even if multiple threads attempt to update the same location simultaneously. They are crucial for correct parallel updates and for building synchronization primitives such as locks.
- Takeaway: Atomic operations guarantee thread-safe read-modify-write access to shared memory locations, crucial for correct parallel updates and building synchronization primitives.
CUDA Streams (Asynchronous Execution)
- Usage: Overlap data transfers and kernel computations for improved GPU utilization and performance.
- Example Code:
```c
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Work in stream1 (for true async overlap, host buffers should be pinned, e.g. via cudaHostAlloc)
cudaMemcpyAsync(dev_A, host_A, size, cudaMemcpyHostToDevice, stream1);          // Async H2D copy
kernel_func_A<<<grid, block, 0, stream1>>>(dev_A);                              // Kernel in stream1
cudaMemcpyAsync(host_result_A, dev_A, size, cudaMemcpyDeviceToHost, stream1);   // Async D2H copy

// Independent tasks in stream2 (can overlap with stream1)
cudaMemcpyAsync(dev_B, host_B, size, cudaMemcpyHostToDevice, stream2);
kernel_func_B<<<grid, block, 0, stream2>>>(dev_B);
cudaMemcpyAsync(host_result_B, dev_B, size, cudaMemcpyDeviceToHost, stream2);

cudaStreamSynchronize(stream1);   // Wait for stream1 to complete
cudaStreamSynchronize(stream2);   // Wait for stream2 to complete
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```
- Explanation: Streams (cudaStream_t) allow the host to issue multiple operations (memory copies, kernel launches) to the GPU asynchronously. Operations within a stream execute in order, but operations in different streams can run concurrently, enabling overlap of computation and data transfer.
- Takeaway: CUDA streams enable asynchronous execution, allowing data transfers and kernel computations to overlap for improved GPU utilization and performance.
CUDA Events (Timing and Synchronization)
- Usage: Precisely measure kernel execution times and synchronize GPU operations without blocking the host CPU unnecessarily.
- Example Code:
```c
cudaEvent_t start_event, stop_event;
cudaEventCreate(&start_event);
cudaEventCreate(&stop_event);

cudaEventRecord(start_event, 0);    // Mark the start time on the default stream
my_kernel<<<num_blocks, threads_per_block>>>(dev_data);
cudaEventRecord(stop_event, 0);     // Mark the end time
cudaEventSynchronize(stop_event);   // Wait until the stop event has completed

float ms_elapsed;
cudaEventElapsedTime(&ms_elapsed, start_event, stop_event);
printf("Kernel time: %.3f ms\n", ms_elapsed);

cudaEventDestroy(start_event);
cudaEventDestroy(stop_event);
```
- Explanation: Events (cudaEvent_t) mark points in the GPU's timeline. cudaEventRecord records an event, cudaEventElapsedTime computes the time between two events, and cudaEventSynchronize makes the host wait for a specific event to complete.
- Takeaway: CUDA events provide precise timing for profiling and flexible synchronization points for GPU operations.
Multi-GPU Programming (cudaSetDevice, Peer-to-Peer)
- Usage: Scale applications to use multiple GPUs for larger problems or increased throughput.
- Example Code (Multi-GPU setup):
```c
int num_devices;
cudaGetDeviceCount(&num_devices);

for (int i = 0; i < num_devices; ++i) {
    cudaSetDevice(i);   // Select GPU 'i' for subsequent CUDA calls
    // Perform device-specific memory allocations, copies, and kernel launches
}

// For peer-to-peer access (lets the current device access peerDevice's memory):
// cudaDeviceEnablePeerAccess(peerDevice, 0);
```
- Explanation: cudaSetDevice(deviceID) selects the active GPU for subsequent CUDA operations. Peer-to-Peer (P2P) access (cudaDeviceEnablePeerAccess) allows one GPU to directly read and write memory on another GPU, avoiding copies through the host, where the hardware and OS support it.
- Takeaway: Scale applications across multiple GPUs using cudaSetDevice, and enable direct GPU-to-GPU data transfers with Peer-to-Peer access.
Graphics Interoperability
- Usage: Directly exchange data between CUDA kernels and graphics APIs (like OpenGL or DirectX) on the GPU, avoiding host-device transfers.
- Example Code (Conceptual Steps for PBO Interop):
```c
// Host-side initialization:
// 1. Create an OpenGL PBO: glGenBuffers(), glBindBuffer(), glBufferData().
// 2. Register the PBO with CUDA:
//    cudaGraphicsGLRegisterBuffer(&cuda_resource, pbo_id, cudaGraphicsMapFlagsWriteDiscard);

// Host-side, in the rendering loop:
// 1. Map the resource:      cudaGraphicsMapResources(1, &cuda_resource, 0);
// 2. Get a device pointer:  float *dev_ptr; size_t size;
//                           cudaGraphicsResourceGetMappedPointer((void**)&dev_ptr, &size, cuda_resource);
// 3. Launch the kernel:     my_compute_kernel<<<...>>>(dev_ptr);   // Kernel writes directly to PBO memory
// 4. Unmap the resource:    cudaGraphicsUnmapResources(1, &cuda_resource, 0);
// 5. Use OpenGL to render the PBO contents.
```
- Explanation: This feature allows CUDA kernels to directly read from and write to memory buffers allocated by a graphics API (e.g., Pixel Buffer Objects, Vertex Buffer Objects). It involves registering the graphics resource with CUDA, mapping it to obtain a device pointer, and unmapping it after kernel execution, which significantly reduces data transfer overhead.
- Takeaway: Graphics interoperability enables direct, high-speed data exchange between CUDA and graphics APIs on the GPU, avoiding unnecessary memory copies.
VI. Tools and Computational Thinking
CUDA Libraries, Debugging, and Profiling Tools
- Usage: Expedite development, ensure correctness, and optimize performance of CUDA applications.
- Explanation: NVIDIA provides highly optimized libraries (e.g., cuBLAS for linear algebra, cuFFT for FFTs, NPP for image processing). Debugging tools like cuda-gdb and cuda-memcheck (for memory error detection) help find bugs. Profiling tools like the NVIDIA Visual Profiler and nvprof (since succeeded by the Nsight Systems/Compute tools) identify performance bottlenecks (a library example follows below).
- Takeaway: Leverage NVIDIA's optimized libraries, powerful debugging tools (cuda-gdb, cuda-memcheck), and profilers (nvprof) for efficient CUDA development.
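As one concrete example of library use, a minimal cuBLAS SAXPY (y = alpha * x + y) sketch, assuming dev_x and dev_y already hold n floats in device memory (link with -lcublas):
```c
#include <cublas_v2.h>

void saxpy_with_cublas(float *dev_x, float *dev_y, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);                                // Initialize the cuBLAS context

    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, dev_x, 1, dev_y, 1);   // y = alpha * x + y, computed on the GPU

    cublasDestroy(handle);                                // Release the cuBLAS context
}
```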
Computational Thinking and Algorithm Design
- Usage: Strategically approach problems for parallel execution on GPUs.
- Explanation: This involves formulating domain problems in terms of parallel computational steps, efficiently decomposing problems into independent work units, and selecting algorithms that are inherently parallel and leverage GPU architecture (e.g., favoring gather over scatter memory access patterns where threads read from different locations into private registers). Always consider Amdahl’s Law, which limits parallel speedup by the sequential portion of a program.
- Takeaway: Effective parallel programming requires computational thinking: strategically decomposing problems, choosing gather-based access, and selecting algorithms that leverage GPU architecture, mindful of Amdahl’s Law.
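A hedged sketch of the gather-versus-scatter contrast (an illustrative 1D stencil, not from the original material):
```c
// Gather: each thread READS its neighbors and writes only its own output element.
// Writes never collide, so no atomics are needed.
__global__ void stencil_gather(const float *in, float *out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1) {
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}

// Scatter: each thread WRITES contributions into its neighbors' output elements.
// Different threads may target the same location, so atomics (or extra synchronization) are required.
__global__ void stencil_scatter(const float *in, float *out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1) {
        atomicAdd(&out[i - 1], 0.25f * in[i]);
        atomicAdd(&out[i],     0.5f  * in[i]);
        atomicAdd(&out[i + 1], 0.25f * in[i]);
    }
}
```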
VII. Other and Future Outlook
Introduction to OpenCL
- Usage: Understand a cross-platform alternative for heterogeneous parallel computing.
- Example Code (OpenCL kernel):
```c
// OpenCL kernel (device code)
__kernel void vectorAdd(__global float* A, __global float* B, __global float* C, int N) {
    int gid = get_global_id(0);   // Get the global work-item ID
    if (gid < N) {
        C[gid] = A[gid] + B[gid];
    }
}

// OpenCL host code snippets (conceptual)
// cl_context context; cl_command_queue queue; cl_kernel kernel;
// ... set up the context and queue, compile the program, create the kernel ...
// size_t global_work_size = N;
// size_t local_work_size = 256;
// clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);
```
- Explanation: OpenCL is a standardized API for heterogeneous computing across devices (CPUs, GPUs, and more). It uses concepts similar to CUDA's: work-items (threads), work-groups (blocks), and NDRanges (grids). However, its host-side API for device management and kernel compilation is more explicit and complex.
- Takeaway: OpenCL mirrors CUDA's data-parallel model with analogous concepts, but its host API is more explicit in order to achieve cross-platform portability.
GPU Architecture Evolution and Future Outlook
- Usage: Anticipate future trends in GPU programming and hardware capabilities.
- Explanation: GPU architectures are continuously evolving, bringing more flexible memory models (e.g., a unified virtual memory space), richer kernel execution features (e.g., function calls and printf in kernels), and enhanced performance (e.g., faster atomics, less strict coalescing requirements). Improved C++ language support and debugging tools also simplify development.
- Takeaway: GPU architectures are rapidly evolving to provide more flexible memory models, richer kernel execution features, and better C++ support, promising even easier and more powerful parallel programming.
