Dataflow Architectures and AI Hardware Optimization

Dataflow Architectures in Machine Learning

Why use dataflow architecture for ML?

Dataflow architectures have no program counter: an instruction fires as soon as all of its input operands are available. Because results pass directly between processing elements and stay in local storage, far fewer operands are fetched from main memory (DRAM), which is the dominant energy and latency cost in traditional von Neumann architectures. This makes dataflow designs highly efficient for machine learning workloads.
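The firing rule can be illustrated with a minimal software sketch. The graph format and the run_dataflow helper below are assumptions made for illustration, not a model of any real chip: each node simply runs once all of its inputs hold a value, regardless of the order in which nodes were declared.

```python
from collections import deque

def run_dataflow(nodes, initial_tokens):
    """Fire each node as soon as all of its inputs hold a value.

    nodes: dict mapping output name -> (function, list of input names).
    initial_tokens: dict mapping name -> value for the graph inputs.
    Returns the final token store and the order in which nodes fired.
    """
    tokens = dict(initial_tokens)
    pending = deque(nodes.items())
    fired = []
    stalled = 0  # consecutive nodes inspected without firing
    while pending and stalled < len(pending):
        name, (fn, inputs) = pending.popleft()
        if all(i in tokens for i in inputs):
            tokens[name] = fn(*(tokens[i] for i in inputs))
            fired.append(name)
            stalled = 0
        else:
            # Not ready yet: put it back and look at the next node.
            pending.append((name, (fn, inputs)))
            stalled += 1
    return tokens, fired

# y = (a * b) + (c * d), declared deliberately out of order.
graph = {
    "y": (lambda p, q: p + q, ["p", "q"]),
    "p": (lambda a, b: a * b, ["a", "b"]),
    "q": (lambda c, d: c * d, ["c", "d"]),
}
tokens, fired = run_dataflow(graph, {"a": 2, "b": 3, "c": 4, "d": 5})
print(fired, tokens["y"])
```

There is no instruction pointer here: "y" is listed first but fires last, because its operands are the last to become available.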

Mapping Deep Learning to Dataflow

Multiply-and-accumulate (MAC) operations inside deep loop nests (e.g., convolutions and matrix multiplications) are the workloads best suited to dataflow hardware. They involve massive, repetitive computation with predictable data dependencies and little or no data-dependent branching.
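As a concrete picture of such a loop nest, here is a minimal sketch of a single-channel, stride-1, unpadded 2D convolution (computed as cross-correlation, as deep learning frameworks usually do). The function name and the tiny inputs are illustrative only.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D convolution as a four-deep loop nest."""
    H, W = len(image), len(image[0])
    R, S = len(kernel), len(kernel[0])
    out = [[0] * (W - S + 1) for _ in range(H - R + 1)]
    for y in range(H - R + 1):          # output rows
        for x in range(W - S + 1):      # output columns
            acc = 0
            for r in range(R):          # kernel rows
                for s in range(S):      # kernel columns
                    acc += image[y + r][x + s] * kernel[r][s]  # one MAC
            out[y][x] = acc
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d(image, kernel)
print(result)  # [[6, 8], [12, 14]]
```

Every loop bound is known before execution and the innermost statement is a single MAC, which is what makes the data movement easy to schedule statically.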

Exploiting Data Locality

Dataflow exploits locality through two main strategies: temporal reuse (storing data in local registers/buffers to be reused multiple times by the same processing element) and spatial reuse (broadcasting the same data to multiple processing elements simultaneously).
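The effect of temporal reuse can be made visible by counting memory reads. The sketch below is a toy model under stated assumptions: CountingMemory stands in for expensive off-chip memory, plain Python lists stand in for local registers, and only temporal reuse is modeled (spatial broadcast is not).

```python
class CountingMemory:
    """Stand-in for main memory that counts every read."""
    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def read(self, i):
        self.reads += 1
        return self.data[i]

def conv1d_no_reuse(x_mem, w_mem, n, k):
    """Every MAC fetches both operands from memory."""
    out = []
    for o in range(n - k + 1):
        acc = 0
        for j in range(k):
            acc += x_mem.read(o + j) * w_mem.read(j)
        out.append(acc)
    return out

def conv1d_with_reuse(x_mem, w_mem, n, k):
    """Weights and a sliding input window live in local registers."""
    weights = [w_mem.read(j) for j in range(k)]   # each weight fetched once
    window = [x_mem.read(i) for i in range(k)]    # initial input window
    out = []
    for o in range(n - k + 1):
        out.append(sum(a * b for a, b in zip(window, weights)))
        if o + k < n:
            # Shift the window: only one new input is fetched per output.
            window = window[1:] + [x_mem.read(o + k)]
    return out

x = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 0, -1]

x1, w1 = CountingMemory(x), CountingMemory(w)
baseline = conv1d_no_reuse(x1, w1, len(x), len(w))
reads_no_reuse = x1.reads + w1.reads

x2, w2 = CountingMemory(x), CountingMemory(w)
reused = conv1d_with_reuse(x2, w2, len(x), len(w))
reads_with_reuse = x2.reads + w2.reads

print(reads_no_reuse, reads_with_reuse)  # 36 11
```

Both versions produce the same outputs; only the number of trips to memory changes, and the gap widens as the kernel grows.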

Model Pruning and Sparsity

Pruning: Structured vs. Unstructured

Pruning is the process of zeroing out or removing weights in a neural network to achieve model sparsity.

  • Unstructured pruning: Zeros out individual weights anywhere in a tensor (typically those with the smallest magnitude), leaving the original matrix dimensions unchanged.
  • Structured pruning: Removes entire structural blocks (filters, channels, or neurons), physically shrinking matrix dimensions.
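The difference between the two styles is easiest to see on a small matrix. This is a minimal sketch, assuming magnitude as the unstructured criterion and the per-row L1 norm as the structured criterion (other criteria exist); both helper names are hypothetical.

```python
def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude fraction of individual weights.

    The shape is unchanged. Ties at the threshold may zero a few extra weights.
    """
    flat = sorted(abs(w) for row in matrix for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in matrix]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def structured_prune(matrix, n_remove):
    """Drop the n_remove rows (e.g. output neurons) with the smallest L1 norm."""
    by_norm = sorted(range(len(matrix)),
                     key=lambda i: sum(abs(w) for w in matrix[i]))
    keep = sorted(by_norm[n_remove:])   # preserve the original row order
    return [matrix[i][:] for i in keep]

weights = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.03],
    [-0.7, 0.6, 0.2],
]
sparse = unstructured_prune(weights, 0.5)   # still 3x3, with zeros inside
small = structured_prune(weights, 1)        # now 2x3, middle row removed
print(sparse)
print(small)
```

The unstructured result still occupies a full 3x3 matrix, while the structured result is a genuinely smaller 2x3 matrix that any dense kernel can process faster.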

Performance and Storage Benefits

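Unstructured pruning does not directly translate to speedups on conventional hardware, because the zeroed weights are still stored and multiplied unless the hardware or kernels explicitly skip them. Structured pruning reduces the physical matrix size, leading to immediate latency improvements. While unstructured pruning doesn’t by itself reduce the raw byte size of the dense tensors, sparse models are significantly easier to compress using standard techniques (e.g., gzip).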
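The storage claim can be checked with a small experiment. This sketch assumes float32 storage, random stand-in weights, an 80% magnitude-pruning target, and zlib as the "standard technique"; exact compressed sizes depend on the data and the zlib build, so none are quoted here.

```python
import random
import struct
import zlib

def raw_and_compressed_size(weights):
    """Pack weights as little-endian float32 and compress with zlib."""
    raw = struct.pack(f"<{len(weights)}f", *weights)
    return len(raw), len(zlib.compress(raw, 9))

rng = random.Random(0)
dense = [rng.uniform(-1.0, 1.0) for _ in range(4096)]

# Unstructured magnitude pruning: zero roughly the smallest 80% of weights.
threshold = sorted(abs(w) for w in dense)[int(0.8 * len(dense))]
pruned = [w if abs(w) >= threshold else 0.0 for w in dense]

dense_raw, dense_zip = raw_and_compressed_size(dense)
pruned_raw, pruned_zip = raw_and_compressed_size(pruned)

print("raw bytes:       ", dense_raw, pruned_raw)
print("compressed bytes:", dense_zip, pruned_zip)
```

The raw sizes are identical, since a zero still takes four bytes, but the long runs of zeros let the compressor shrink the pruned tensor far more than the dense one.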

Sources of Sparsity

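The main sources are activation functions (specifically ReLU, which outputs exact zeros for negative inputs) and model weights (zeros induced by pruning or L1 regularization; L2 regularization shrinks weights toward zero but rarely makes them exactly zero).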
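Activation sparsity from ReLU is easy to measure. The sketch below assumes zero-mean Gaussian pre-activations, which is only a rough stand-in for a real layer, so the roughly 50% figure is a property of that assumption rather than of any particular model.

```python
import random

def relu(values):
    """ReLU clamps every negative input to exactly zero."""
    return [max(0.0, v) for v in values]

rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10000)]
activations = relu(pre_activations)

# Fraction of outputs that are exactly zero.
sparsity = sum(1 for a in activations if a == 0.0) / len(activations)
print(f"activation sparsity: {sparsity:.2f}")
```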

Pruning in TensorFlow

TensorFlow, through its Model Optimization Toolkit, primarily implements magnitude-based unstructured pruning. The process involves:

  1. Defining a pruning schedule (e.g., PolynomialDecay).
  2. Wrapping the model using prune_low_magnitude.
  3. Recompiling and fitting using the UpdatePruningStep callback.
  4. Exporting via strip_pruning.
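To show what a polynomial-decay schedule does without requiring TensorFlow, here is a plain-Python sketch. It assumes the usual form in which sparsity ramps from an initial to a final value with a cubic default exponent; the function below is a stand-in written for illustration and is not the library's implementation, and its behavior outside the begin/end window is a simplification.

```python
def polynomial_decay_sparsity(step, initial_sparsity, final_sparsity,
                              begin_step, end_step, power=3):
    """Target sparsity at a training step under a polynomial ramp.

    Assumed form: final + (initial - final) * (1 - progress) ** power,
    where progress runs from 0 at begin_step to 1 at end_step.
    """
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power

# Ramp from 0% to 80% sparsity over 100 steps.
schedule = [polynomial_decay_sparsity(s, 0.0, 0.8, 0, 100) for s in (0, 25, 50, 75, 100)]
print([round(v, 4) for v in schedule])
```

With a cubic exponent most of the pruning happens early, while the network still has many training steps left to recover accuracy, and the ramp flattens as it approaches the final target.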

TensorFlow Lite (LiteRT) and Quantization

Quantization Techniques

Available techniques include post-training float16 quantization, post-training dynamic range quantization, post-training integer quantization, and quantization-aware training.

Inference and Dataflow

Inference benefits from dataflow execution by avoiding memory bottlenecks. By keeping intermediate activations and weights in local, low-energy memory, the hardware can continuously feed data into parallel MAC units, increasing throughput.

Output vs. Weight Stationary

In Output Stationary architectures, partial sums are kept fixed in local registers. In Weight Stationary architectures, weights are kept fixed while inputs are streamed across them.
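In software terms, the two schemes correspond to different loop orders over the same matrix multiplication. The sketch below treats B as the weight matrix and A as the streamed inputs; the local variables acc and w stand in for per-PE registers, which is an illustrative simplification of real hardware.

```python
def matmul_output_stationary(A, B):
    """Each partial sum stays in a local register until it is complete."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0                          # partial sum held locally
            for k in range(K):
                acc += A[i][k] * B[k][j]     # inputs and weights stream in
            C[i][j] = acc                    # written out exactly once
    return C

def matmul_weight_stationary(A, B):
    """Each weight is loaded once and held while inputs stream past it."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for k in range(K):
        for j in range(N):
            w = B[k][j]                      # weight held locally
            for i in range(M):
                C[i][j] += A[i][k] * w       # partial sums move instead
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
out_os = matmul_output_stationary(A, B)
out_ws = matmul_weight_stationary(A, B)
print(out_os, out_ws)  # both [[19, 22], [43, 50]]
```

The arithmetic is identical; what changes is which operand avoids repeated movement, and therefore which memory traffic the hardware saves.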

LiteRT and Delegates

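LiteRT is a runtime for running models on edge devices. Delegates are pluggable hardware backends that allow the interpreter to offload some or all of a model's operations to specialized hardware such as GPUs, DSPs, or NPUs; operations the delegate does not support fall back to the CPU.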
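The offloading idea can be sketched as a graph partitioning step. Everything below is hypothetical and written for illustration: the op names, the supported set, and the partition_for_delegate helper are not part of the LiteRT API, and a real runtime partitions a graph rather than a flat list.

```python
def partition_for_delegate(ops, supported):
    """Group consecutive ops into (target, [ops]) segments.

    Ops in `supported` go to the delegate; everything else stays on the CPU.
    """
    segments = []
    for op in ops:
        target = "delegate" if op in supported else "cpu"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)       # extend the current segment
        else:
            segments.append((target, [op]))  # start a new segment
    return segments

# Hypothetical model and hypothetical accelerator capabilities.
ops = ["CONV_2D", "RELU", "CUSTOM_NMS", "FULLY_CONNECTED", "SOFTMAX"]
accelerator_supported = {"CONV_2D", "RELU", "FULLY_CONNECTED", "SOFTMAX"}

plan = partition_for_delegate(ops, accelerator_supported)
for target, segment in plan:
    print(target, segment)
```

Each switch between segments implies a data transfer between the accelerator and the CPU, so a single unsupported op in the middle of a model can reduce the benefit of delegation.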

Integer Quantization and Datasets

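Low-end microcontrollers often lack a floating-point unit, so they require full integer quantization. A representative dataset is required to calibrate the dynamic range (minimum and maximum) of activation tensors, which cannot be determined from the weights alone. This enables integer-only math, which improves latency and, by storing 8-bit integers in place of 32-bit floats, reduces model size by up to 75%.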
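Calibration can be sketched as deriving a scale and zero point from the observed range. This is a minimal illustration assuming simple min/max calibration and asymmetric int8 affine quantization; the helper names and the tiny "dataset" are hypothetical, and real converters may use more robust range estimates.

```python
def calibrate(representative_data, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the min/max seen in sample tensors."""
    lo = min(min(sample) for sample in representative_data)
    hi = max(max(sample) for sample in representative_data)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 exactly representable
    if hi == lo:
        return 1.0, 0                     # degenerate all-zero range
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to int8, clamping to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    """Map an int8 value back to its approximate real value."""
    return (q - zero_point) * scale

# Hypothetical activation samples collected from representative inputs.
samples = [[-1.0, 0.5], [2.0, 0.0]]
scale, zero_point = calibrate(samples)

q = quantize(0.5, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))
```

Each value now fits in one byte instead of four, which is where the "up to 75%" reduction comes from, and the round-trip error is bounded by about half of one quantization step.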

AI Hardware Architectures

Modern GPU Capabilities

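Modern GPUs support structured matrix sparsity (e.g., the 2:4 sparse Tensor Cores introduced with NVIDIA Ampere) and a wide variety of floating-point formats including FP64, FP32, TF32, FP16, BF16, and, on newer generations, FP8. They also support low-precision integer operations such as INT8 (and INT4 on some generations) for accelerated inference.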
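The 2:4 pattern means that at most two of every four consecutive weights are nonzero. A minimal sketch of producing that pattern follows, assuming magnitude is used to choose which two values survive; the helper name is hypothetical, and real toolchains also store index metadata for the kept positions.

```python
def prune_2_of_4(row):
    """Keep the two largest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes within this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out

row = [0.1, -0.8, 0.3, 0.05, 0.7, 0.2, -0.6, 0.4]
pruned_row = prune_2_of_4(row)
print(pruned_row)  # [0.0, -0.8, 0.3, 0.0, 0.7, 0.0, -0.6, 0.0]
```

Because exactly half of each group is zero in a known pattern, the hardware can store only the surviving values plus small position indices and skip the zero multiplications.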

Edge Computing Markets

  • High-performance edge: Uses powerful GPUs (e.g., NVIDIA Jetson) for robotics and autonomous vehicles.
  • Mobile devices: Uses integrated SoCs with dedicated AI accelerators (e.g., Qualcomm Hexagon).
  • Ultra-low-power/IoT: Employs microcontrollers optimized for tight energy budgets.

FPGA vs. SoC

FPGAs offer fully reconfigurable hardware for custom, deterministic, ultra-low-latency architectures. SoCs instead integrate fixed processing blocks on one chip: while i.MX9 and Jetson are both SoCs, Jetson is a high-performance, GPU-based device for heavy AI workloads, whereas the i.MX9 is a power-efficient applications processor that combines application cores with a real-time microcontroller core and an NPU.

TPUs and Interconnects

A Google TPU is an ASIC built around a systolic array of MAC units for massive matrix-multiplication throughput. For multi-GPU communication, NVLink provides a high-speed, direct GPU-to-GPU interconnect that offers higher bandwidth than PCIe and avoids its protocol overhead. RDMA allows one node to read or write another node's memory directly through the network adapter, bypassing the OS kernel and the remote CPU.
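The timing of a systolic array can be sketched with a cycle-level toy model. This version assumes an output-stationary grid with skewed operand injection, chosen because it is simple to simulate; it is not a model of any particular TPU generation.

```python
def systolic_matmul(A, B):
    """Cycle-level model of an output-stationary systolic array.

    PE (i, j) owns C[i][j]. Row i of A enters from the left delayed by i
    cycles and column j of B enters from the top delayed by j cycles, so
    at cycle t PE (i, j) sees A[i][k] and B[k][j] with k = t - i - j.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    cycles = M + N + K - 2               # last PE finishes at cycle M+N+K-3
    for t in range(cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j
                if 0 <= k < K:           # operands have reached this PE
                    C[i][j] += A[i][k] * B[k][j]
    return C, cycles

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, cycles = systolic_matmul(A, B)
print(C, cycles)  # [[19, 22], [43, 50]] 4
```

Each operand is read from memory once and then handed from neighbor to neighbor, so the array performs M×N×K MACs in about M+N+K cycles once the pipeline fills.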