CUDA C/C++ Programming Reference: Fundamentals and Optimization
This cheat sheet consolidates key concepts from several CUDA C/C++ programming resources, with an emphasis on clarity and conciseness.
I. Fundamentals and Setup
Why CUDA and General Purpose GPU Programming (GPGPU)
- Usage: Leverage GPUs for computationally intensive parallel tasks. GPUs, with their many cores, excel at parallel problems, in contrast to CPUs' fewer, more powerful cores.
- Explanation: CPUs handle sequential tasks, while GPUs accelerate problems broken into many independent pieces (data parallelism). CUDA is NVIDIA’s platform for this.
- Takeaway: GPUs are powerful parallel engines, essential for speeding up modern compute-heavy applications.
CUDA Development Environment
- Usage: Set up your system to compile and run CUDA programs. This involves installing the NVIDIA GPU driver and the CUDA Toolkit (which includes nvcc, the CUDA C compiler), and ensuring a compatible C/C++ host compiler for your CPU code.
- Explanation: Think of it as preparing your workbench with the right tools (nvcc, drivers) for building CUDA applications.
- Takeaway: A correctly configured CUDA development environment is the essential first step; a typical verification and compile sequence is sketched below.
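A minimal sanity-check and compile sequence might look like the following (a sketch; exact paths, flags, and versions depend on your installation):
```bash
nvcc --version           # Confirm the Toolkit's CUDA compiler is on your PATH
nvidia-smi               # Confirm the driver is installed and lists your GPU(s)
nvcc hello.cu -o hello   # Compile a CUDA source file; nvcc forwards host code to your C/C++ compiler
./hello
```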
Host and Device Basics / Program Structure
- Usage: Understand the distinct roles of the CPU (host) and GPU (device), how to define GPU-executable functions, and launch them.
- Example Code:
```c
#include <stdio.h>

// Device function (kernel)
__global__ void hello_kernel() {
    printf("Hello from GPU!\n");
}

// Host function
int main() {
    // Launch kernel with 1 block and 1 thread
    hello_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();   // Wait for the GPU to finish
    return 0;
}
```
- Explanation: The CPU (host) manages the GPU (device) by launching __global__ functions (kernels) that run in parallel on the GPU. All threads execute the same kernel code in a Single-Program, Multiple-Data (SPMD) fashion.
- Takeaway: CUDA C allows the CPU to launch parallel __global__ functions (kernels) on the GPU.
Basic CUDA Memory Management
- Usage: Allocate and free memory on the GPU, and transfer data between the CPU (host) and GPU (device).
- Example Code:
```c
// Host code snippets
float *h_data, *d_data;
size_t size = N * sizeof(float);

h_data = (float*)malloc(size);                              // Allocate host memory
cudaMalloc((void**)&d_data, size);                          // Allocate device memory
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // Host-to-device transfer
// ... kernel launch ...
cudaDeviceSynchronize();                                    // Wait for GPU tasks to complete
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // Device-to-host transfer
cudaFree(d_data);                                           // Free device memory
free(h_data);                                               // Free host memory
```
- Explanation: GPU (device) memory is separate from CPU (host) memory. You explicitly allocate (cudaMalloc), transfer (cudaMemcpy), and free (cudaFree) device memory. cudaDeviceSynchronize() makes the CPU wait for the GPU to finish.
- Takeaway: Explicitly manage device memory with cudaMalloc, cudaMemcpy, and cudaFree, and synchronize with cudaDeviceSynchronize.
II. Thread Hierarchy and Execution
Thread Hierarchy (Grids, Blocks, Threads) and Identification
- Usage: Organize your parallel tasks and uniquely identify each thread. Threads are grouped into blocks, and blocks into a grid.
- Example Code:
```c
// Vector addition kernel
__global__ void add_vectors(float *A, float *B, float *C, int N) {
    // Calculate the global unique index for the current thread (1D example)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {   // Ensure the thread doesn't go out of bounds
        C[tid] = A[tid] + B[tid];
    }
}

// Host launch example:
// int threadsPerBlock = 256;
// int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
// add_vectors<<<numBlocks, threadsPerBlock>>>(dev_A, dev_B, dev_C, N);
```
- Explanation: threadIdx provides a thread's ID within its block, and blockIdx provides a block's ID within the grid; blockDim and gridDim specify the corresponding dimensions. These are combined to compute a unique global ID for each thread, which often maps to a data element.
- Takeaway: CUDA's two-level hierarchy (grid of blocks, block of threads) uses blockIdx, threadIdx, blockDim, and gridDim to assign unique work to each thread.
Warps and Transparent Scalability
- Usage: Understand how threads are executed and how CUDA scales programs.
- Explanation: Threads are executed in groups of 32 called warps; the warp is the basic unit of scheduling on a Streaming Multiprocessor (SM). GPUs hide memory latency by switching between ready warps. Blocks execute independently, enabling the same CUDA program to scale efficiently across GPUs with varying numbers of processor cores (see the sketch below).
- Takeaway: Warps (32 threads) are the execution unit; independent blocks ensure programs scale across diverse GPU hardware and help hide latency.
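A minimal sketch (illustrative, not from the original material) of how a thread's warp and lane can be derived from its index using the built-in warpSize device variable:
```c
#include <stdio.h>

__global__ void warp_id_kernel() {
    int tid = threadIdx.x;          // Linear thread index within the block
    int warp_id = tid / warpSize;   // warpSize is a built-in device variable (32 on current NVIDIA GPUs)
    int lane_id = tid % warpSize;   // Position of the thread within its warp (0..31)
    if (lane_id == 0) {             // One printout per warp
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp_id, tid);
    }
}
```
Launched as, for example, warp_id_kernel<<<2, 128>>>(); followed by cudaDeviceSynchronize(), this prints one line per warp.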
Intra-Block Synchronization (__syncthreads())
- Usage: Coordinate threads within the same block to avoid race conditions when sharing data.
- Example Code (Block-wise Sum Reduction snippet):
```c
__global__ void reduce_sum_shared(float *g_in, float *g_out) {
    __shared__ float s_data[256];     // Shared memory for a block
    unsigned int tid = threadIdx.x;
    // ... load data into s_data[tid] ...
    __syncthreads();                  // All threads wait here before continuing

    // Perform reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) {
            s_data[tid] += s_data[tid + s];
        }
        __syncthreads();              // Synchronize after each step of the reduction
    }

    if (tid == 0) {                   // Only the first thread writes the block's final sum
        g_out[blockIdx.x] = s_data[0];
    }
}
```
- Explanation: __syncthreads() acts as a barrier within a block, ensuring all threads in that block reach a certain point before any proceed. This is critical for preventing race conditions when threads read or write shared data.
- Takeaway: __syncthreads() provides a vital barrier for coordinating threads within a block, crucial for correct shared memory access.
III. Memory Hierarchy and Management (Advanced)
CUDA Memory Hierarchy Overview
- Usage: Optimize performance by understanding different memory types and their characteristics.
- Explanation: CUDA GPUs feature a hierarchy of memories: thread-private Registers (fastest), fast on-chip Shared Memory (visible to a block), and read-only Constant Memory (off-chip but aggressively cached), contrasted with the slower, off-chip Global Memory (device-wide, largest capacity).
- Takeaway: GPU performance hinges on efficiently using the memory hierarchy, prioritizing fast on-chip memories and exploiting data locality.
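A hedged sketch (illustrative kernel; assumes a 256-thread block and in-range indices) showing where each memory space appears in code:
```c
__constant__ float c_scale;   // Constant memory: read-only on the device, set from the host via cudaMemcpyToSymbol

__global__ void memory_spaces_demo(const float *g_in, float *g_out) {
    __shared__ float s_tile[256];               // Shared memory: on-chip, shared by the whole block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float r = g_in[tid];                        // Local scalar: normally held in a register; g_in points to global memory
    s_tile[threadIdx.x] = r;
    __syncthreads();

    g_out[tid] = s_tile[threadIdx.x] * c_scale; // Result written back to global memory
}
```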
Shared Memory (__shared__) and Tiling
- Usage: Utilize fast, on-chip memory for data shared and reused by threads within the same block, reducing global memory traffic.
- Example Code:
```c
// Inside a kernel function:
__shared__ float s_A[TILE_SIZE][TILE_SIZE];   // Shared memory declaration
// ...
s_A[threadIdx.y][threadIdx.x] = Md[global_row * WIDTH + global_col];   // Load a tile into shared memory
__syncthreads();                              // Synchronize before using the shared data
```
- Explanation: Declared with __shared__, this memory is accessible by all threads in a block and is much faster than global memory. Tiling algorithms load large data chunks into shared memory for block-local processing, then write results back to global memory, greatly improving locality.
- Takeaway: __shared__ memory provides fast, on-chip storage for collaborative data reuse within a block, often combined with tiling to reduce global memory traffic.
Constant Memory (__constant__)
- Usage: Efficiently store read-only data that is accessed frequently and uniformly by all threads in the grid.
- Example Code:
```c
__constant__ float GLOBAL_SCALE_FACTOR = 2.0f;   // Global, on-device constant

__global__ void scale_kernel(float *data) {
    data[threadIdx.x] *= GLOBAL_SCALE_FACTOR;    // Access constant data
}

// Host code to copy data to constant memory:
// float host_factor = 2.0f;
// cudaMemcpyToSymbol(GLOBAL_SCALE_FACTOR, &host_factor, sizeof(float));
```
- Explanation: Declared with __constant__, this global device memory is aggressively cached. It is ideal for configuration data or lookup tables that do not change during kernel execution and are read by many threads.
- Takeaway: __constant__ memory offers aggressively cached, fast read-only access for data used uniformly by all threads in a grid.
Texture Memory
- Usage: Optimize read-only access patterns, especially for 1D or 2D data like images, where access exhibits spatial locality.
- Example Code (2D Texture Read):
```c
// File-scope texture reference declaration (legacy texture reference API)
texture<float, 2, cudaReadModeElementType> image_tex;

__global__ void process_image_kernel(float *output, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float pixel_val = tex2D(image_tex, (float)x, (float)y);   // Cached, spatially local read
        output[y * width + x] = pixel_val * 0.5f;
    }
}

// Host code to bind the texture (assuming dev_image_data is already on the GPU):
// cudaChannelFormatDesc chDesc = cudaCreateChannelDesc<float>();
// cudaBindTexture2D(0, image_tex, dev_image_data, chDesc, width, height, pitch);
```
- Explanation: Texture memory is a specialized, hardware-optimized cache for read-only data, providing fast access and optional hardware-assisted interpolation for data with spatial locality (nearby threads access nearby data).
- Takeaway: Texture memory is a read-only cache optimized for 1D/2D data with spatial locality, providing fast access and optional hardware-assisted interpolation.
Zero-Copy Host Memory
- Usage: Allow the GPU to directly access system (host) memory, avoiding explicit cudaMemcpy calls.
- Example Code:
```c
// Host side:
float *host_mapped_data;
cudaHostAlloc((void**)&host_mapped_data, size, cudaHostAllocMapped);   // Allocate pinned, mapped host memory

float *dev_ptr_to_host_data;
cudaHostGetDevicePointer((void**)&dev_ptr_to_host_data, host_mapped_data, 0);   // Get the device-side pointer

// The kernel can now use dev_ptr_to_host_data to access host_mapped_data directly
my_kernel<<<grid, block>>>(dev_ptr_to_host_data);

// Host-side cleanup:
// cudaFreeHost(host_mapped_data);
```
- Explanation: This allocates host memory that is "pinned" (non-pageable) and mapped directly into the GPU's address space. Kernels can then read and write this memory directly, which is useful for small, irregular transfers or for overlapping host-device work.
- Takeaway: Zero-copy memory allows direct GPU access to pinned host memory, eliminating explicit cudaMemcpy for certain use cases.
IV. Performance Optimization
Warp Divergence
- Usage: Write kernel code to ensure optimal parallel execution within a warp.
- Explanation: When threads within the same warp (a group of 32 threads) take different control-flow paths (e.g., different branches of an if/else statement), the hardware serializes the paths: each branch is executed in turn with non-participating threads masked off, degrading performance. This is known as warp divergence (see the sketch below).
- Takeaway: Avoid warp divergence by ensuring threads within a warp follow the same execution path to maintain parallel efficiency.
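A minimal illustration (not from the original material) of a divergent branch and a warp-aligned alternative:
```c
__global__ void divergence_examples(float *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd lanes of the same warp take different branches,
    // so the warp executes both paths one after the other.
    if (tid % 2 == 0) data[tid] *= 2.0f;
    else              data[tid] += 1.0f;

    // Non-divergent: the condition is uniform across each group of 32 consecutive
    // threads, so every thread in a given warp follows the same path.
    if ((tid / 32) % 2 == 0) data[tid] *= 2.0f;
    else                     data[tid] += 1.0f;
}
```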
Global Memory Coalescing
- Usage: Design global memory access patterns to maximize memory throughput.
- Example Code:
```c
// Coalesced global memory access (threads in a warp read adjacent elements)
data[row * width + (blockIdx.x * blockDim.x + threadIdx.x)]
```
- Explanation: When the threads of a warp access consecutive global memory locations together, the GPU can service the warp with a single, efficient memory transaction, which is critical for high performance. Uncoalesced access results in multiple, slower transactions (a fuller contrast is sketched below).
- Takeaway: Maximize coalescing by having threads in a warp access contiguous global memory to achieve optimal throughput.
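A hedged sketch (illustrative names; assumes in[] is large enough for the strided reads) contrasting a coalesced access with a strided one that typically cannot be coalesced:
```c
__global__ void access_patterns(const float *in, float *out, int n, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float a = in[tid];            // Coalesced: consecutive threads touch consecutive addresses
        float b = in[tid * stride];   // Strided: consecutive threads are 'stride' elements apart
                                      // (e.g., walking down a column of a row-major matrix)
        out[tid] = a + b;
    }
}
```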
Resource Occupancy and Latency Hiding
- Usage: Balance resource usage (registers, shared memory) to keep the GPU busy and hide memory access latencies.
- Explanation: Occupancy is the ratio of active warps on an SM to the maximum possible. High occupancy allows the SM to quickly switch to another ready warp when one is stalled (e.g., waiting for memory), thereby hiding latency. Over-allocating resources can lead to a “performance cliff” by reducing the number of active warps.
- Takeaway: High occupancy (many active warps) is crucial for hiding memory latency, but overusing resources can lead to performance cliffs.
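The CUDA runtime's occupancy helpers can guide this balancing act. A minimal sketch, assuming a kernel named my_kernel:
```c
int min_grid_size = 0, block_size = 0;
// Ask the runtime for a block size that maximizes occupancy for my_kernel
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel, 0, 0);

int active_blocks_per_sm = 0;
// Query how many blocks of that size can be resident on one SM
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&active_blocks_per_sm, my_kernel, block_size, 0);

printf("Suggested block size: %d, resident blocks per SM: %d\n", block_size, active_blocks_per_sm);
```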
Floating Point Considerations
- Usage: Design numerically stable algorithms and understand the limitations of floating-point arithmetic on GPUs.
- Explanation: Floating-point numbers (IEEE 754 standard) have finite precision, meaning not all real numbers can be represented exactly. This leads to rounding errors, especially in operations that mix widely different magnitudes, so the order of operations in parallel reductions can affect accuracy. Hardware intrinsic functions (e.g., __sinf) are faster but may be less accurate (see the sketch below).
- Takeaway: Understand IEEE 754 floating-point formats, precision, and accuracy to design numerically stable parallel algorithms, as the order of operations can change results due to rounding errors.
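A small, hedged illustration of both points (values chosen only to make the rounding visible):
```c
#include <stdio.h>

__global__ void fast_math_demo(float x, float *out) {
    out[0] = sinf(x);     // Standard single-precision sine
    out[1] = __sinf(x);   // Hardware intrinsic: faster, but with weaker accuracy guarantees
}

int main() {
    // With 32-bit floats, adding a tiny value to a huge one can be lost entirely,
    // so the same numbers summed in a different order give different results.
    float big = 1.0e8f, tiny = 1.0f;
    float sum1 = (big + tiny) - big;   // 0.0f: 'tiny' is absorbed by 'big'
    float sum2 = (big - big) + tiny;   // 1.0f: same values, different order
    printf("sum1 = %f, sum2 = %f\n", sum1, sum2);
    return 0;
}
```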
V. Advanced CUDA Features
Atomic Operations
- Usage: Perform thread-safe read-modify-write operations on shared (global or shared) memory locations, preventing race conditions.
- Example Code (Atomic Increment – Histogram):
```c
__global__ void atomic_histogram_kernel(unsigned char *input_data, unsigned int *counts_array, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        unsigned char value = input_data[tid];
        atomicAdd(&counts_array[value], 1);   // Atomically increment the count
    }
}

// Example atomic lock (conceptual):
// while (atomicCAS(&lock_variable, 0, 1) != 0);   // Acquire lock (0 = unlocked, 1 = locked)
// // ... critical section ...
// atomicExch(&lock_variable, 0);                  // Release lock
```
- Explanation: Atomic functions (e.g., atomicAdd, atomicCAS (compare-and-swap), atomicExch) guarantee that a memory operation completes without interference from other threads, even if multiple threads attempt to update the same location simultaneously. They are crucial for correct parallel updates and for building synchronization primitives such as locks.
- Takeaway: Atomic operations guarantee thread-safe read-modify-write access to shared memory locations, crucial for correct parallel updates and building synchronization primitives.
CUDA Streams (Asynchronous Execution)
- Usage: Overlap data transfers and kernel computations for improved GPU utilization and performance.
- Example Code:
```c
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Work in stream1 (for true async overlap, host buffers should be pinned, e.g. via cudaHostAlloc)
cudaMemcpyAsync(dev_A, host_A, size, cudaMemcpyHostToDevice, stream1);          // Async H2D copy
kernel_func_A<<<grid, block, 0, stream1>>>(dev_A);                              // Kernel in stream1
cudaMemcpyAsync(host_result_A, dev_A, size, cudaMemcpyDeviceToHost, stream1);   // Async D2H copy

// Independent tasks in stream2 (can overlap with stream1)
cudaMemcpyAsync(dev_B, host_B, size, cudaMemcpyHostToDevice, stream2);
kernel_func_B<<<grid, block, 0, stream2>>>(dev_B);
cudaMemcpyAsync(host_result_B, dev_B, size, cudaMemcpyDeviceToHost, stream2);

cudaStreamSynchronize(stream1);   // Wait for stream1 to complete
cudaStreamSynchronize(stream2);   // Wait for stream2 to complete
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```
- Explanation: Streams (cudaStream_t) allow the host to issue multiple operations (memory copies, kernel launches) to the GPU asynchronously. Operations within a stream execute in order, but operations in different streams can run concurrently, enabling overlap of computation and data transfer.
- Takeaway: CUDA streams enable asynchronous execution, allowing data transfers and kernel computations to overlap for improved GPU utilization and performance.
CUDA Events (Timing and Synchronization)
- Usage: Precisely measure kernel execution times and synchronize GPU operations without blocking the host CPU unnecessarily.
- Example Code:
```c
cudaEvent_t start_event, stop_event;
cudaEventCreate(&start_event);
cudaEventCreate(&stop_event);

cudaEventRecord(start_event, 0);    // Mark the start time on the default stream
my_kernel<<<num_blocks, threads_per_block>>>(dev_data);
cudaEventRecord(stop_event, 0);     // Mark the end time
cudaEventSynchronize(stop_event);   // Wait until the stop event has completed

float ms_elapsed;
cudaEventElapsedTime(&ms_elapsed, start_event, stop_event);
printf("Kernel time: %.3f ms\n", ms_elapsed);

cudaEventDestroy(start_event);
cudaEventDestroy(stop_event);
```
- Explanation: Events (cudaEvent_t) mark points in the GPU's timeline. cudaEventRecord records an event, cudaEventElapsedTime computes the time between two events, and cudaEventSynchronize makes the host wait for a specific event to complete.
- Takeaway: CUDA events provide precise timing for profiling and flexible synchronization points for GPU operations.
Multi-GPU Programming (cudaSetDevice, Peer-to-Peer)
- Usage: Scale applications to use multiple GPUs for larger problems or increased throughput.
- Example Code (Multi-GPU setup):
```c
int num_devices;
cudaGetDeviceCount(&num_devices);

for (int i = 0; i < num_devices; ++i) {
    cudaSetDevice(i);   // Select GPU 'i' for subsequent CUDA calls
    // Perform device-specific memory allocations, copies, and kernel launches
}

// For peer-to-peer access (lets the current device access peerDevice's memory):
// cudaDeviceEnablePeerAccess(peerDevice, 0);
```
- Explanation: cudaSetDevice(deviceID) selects the active GPU for subsequent CUDA operations. Peer-to-Peer (P2P) access (cudaDeviceEnablePeerAccess) allows one GPU to directly read and write memory on another GPU, avoiding copies through the host, where the hardware and OS support it.
- Takeaway: Scale applications across multiple GPUs using cudaSetDevice, and enable direct GPU-to-GPU data transfers with Peer-to-Peer access.
Graphics Interoperability
- Usage: Directly exchange data between CUDA kernels and graphics APIs (like OpenGL or DirectX) on the GPU, avoiding host-device transfers.
- Example Code (Conceptual Steps for PBO Interop):
```c
// Host-side initialization:
// 1. Create an OpenGL PBO: glGenBuffers(), glBindBuffer(), glBufferData().
// 2. Register the PBO with CUDA:
//    cudaGraphicsGLRegisterBuffer(&cuda_resource, pbo_id, cudaGraphicsMapFlagsWriteDiscard);

// Host-side, in the rendering loop:
// 1. Map the resource:      cudaGraphicsMapResources(1, &cuda_resource, 0);
// 2. Get a device pointer:  float *dev_ptr; size_t size;
//                           cudaGraphicsResourceGetMappedPointer((void**)&dev_ptr, &size, cuda_resource);
// 3. Launch the kernel:     my_compute_kernel<<<...>>>(dev_ptr);   // Kernel writes directly to PBO memory
// 4. Unmap the resource:    cudaGraphicsUnmapResources(1, &cuda_resource, 0);
// 5. Use OpenGL to render the PBO contents.
```
- Explanation: This feature allows CUDA kernels to directly read from and write to memory buffers allocated by a graphics API (e.g., Pixel Buffer Objects, Vertex Buffer Objects). It involves registering the graphics resource with CUDA, mapping it to obtain a device pointer, and unmapping it after kernel execution, which significantly reduces data transfer overhead.
- Takeaway: Graphics interoperability enables direct, high-speed data exchange between CUDA and graphics APIs on the GPU, avoiding unnecessary memory copies.
VI. Tools and Computational Thinking
CUDA Libraries, Debugging, and Profiling Tools
- Usage: Expedite development, ensure correctness, and optimize performance of CUDA applications.
- Explanation: NVIDIA provides highly optimized libraries (e.g., cuBLAS for linear algebra, cuFFT for FFTs, NPP for image processing). Debugging tools like cuda-gdb and cuda-memcheck (for memory error detection) help find bugs. Profiling tools like the NVIDIA Visual Profiler and nvprof (since succeeded by the Nsight Systems/Compute tools) identify performance bottlenecks (a library example follows below).
- Takeaway: Leverage NVIDIA's optimized libraries, powerful debugging tools (cuda-gdb, cuda-memcheck), and profilers (nvprof) for efficient CUDA development.
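As one concrete example of library use, a minimal cuBLAS SAXPY (y = alpha * x + y) sketch, assuming dev_x and dev_y already hold n floats in device memory (link with -lcublas):
```c
#include <cublas_v2.h>

void saxpy_with_cublas(float *dev_x, float *dev_y, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);                                // Initialize the cuBLAS context

    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, dev_x, 1, dev_y, 1);   // y = alpha * x + y, computed on the GPU

    cublasDestroy(handle);                                // Release the cuBLAS context
}
```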
Computational Thinking and Algorithm Design
- Usage: Strategically approach problems for parallel execution on GPUs.
- Explanation: This involves formulating domain problems in terms of parallel computational steps, efficiently decomposing problems into independent work units, and selecting algorithms that are inherently parallel and leverage GPU architecture (e.g., favoring gather over scatter memory access patterns where threads read from different locations into private registers). Always consider Amdahl’s Law, which limits parallel speedup by the sequential portion of a program.
- Takeaway: Effective parallel programming requires computational thinking: strategically decomposing problems, choosing gather-based access, and selecting algorithms that leverage GPU architecture, mindful of Amdahl’s Law.
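A hedged sketch of the gather-versus-scatter contrast (an illustrative 1D stencil, not from the original material):
```c
// Gather: each thread READS its neighbors and writes only its own output element.
// Writes never collide, so no atomics are needed.
__global__ void stencil_gather(const float *in, float *out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1) {
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}

// Scatter: each thread WRITES contributions into its neighbors' output elements.
// Different threads may target the same location, so atomics (or extra synchronization) are required.
__global__ void stencil_scatter(const float *in, float *out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1) {
        atomicAdd(&out[i - 1], 0.25f * in[i]);
        atomicAdd(&out[i],     0.5f  * in[i]);
        atomicAdd(&out[i + 1], 0.25f * in[i]);
    }
}
```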
VII. Other and Future Outlook
Introduction to OpenCL
- Usage: Understand a cross-platform alternative for heterogeneous parallel computing.
- Example Code (OpenCL kernel):
```c
// OpenCL kernel (device code)
__kernel void vectorAdd(__global float* A, __global float* B, __global float* C, int N) {
    int gid = get_global_id(0);   // Get the global work-item ID
    if (gid < N) {
        C[gid] = A[gid] + B[gid];
    }
}

// OpenCL host code snippets (conceptual)
// cl_context context; cl_command_queue queue; cl_kernel kernel;
// ... set up the context and queue, compile the program, create the kernel ...
// size_t global_work_size = N;
// size_t local_work_size = 256;
// clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);
```
- Explanation: OpenCL is a standardized API for heterogeneous computing across devices (CPUs, GPUs, and more). It uses concepts similar to CUDA's: work-items (threads), work-groups (blocks), and NDRanges (grids). However, its host-side API for device management and kernel compilation is more explicit and complex.
- Takeaway: OpenCL mirrors CUDA's data-parallel model with analogous concepts, but its host API is more explicit in order to achieve cross-platform portability.
GPU Architecture Evolution and Future Outlook
- Usage: Anticipate future trends in GPU programming and hardware capabilities.
- Explanation: GPU architectures are continuously evolving, bringing more flexible memory models (e.g., a unified virtual memory space), richer kernel execution features (e.g., function calls and printf in kernels), and enhanced performance (e.g., faster atomics, less strict coalescing requirements). Improved C++ language support and debugging tools also simplify development.
- Takeaway: GPU architectures are rapidly evolving to provide more flexible memory models, richer kernel execution features, and better C++ support, promising even easier and more powerful parallel programming.
