Hands-On: Deploying a Local Generative AI Pipeline on Raspberry Pi 5 with AI HAT+ 2


2026-03-06
10 min read

Step-by-step guide to run and benchmark a local generative AI pipeline on Raspberry Pi 5 + AI HAT+ 2 for reliable offline demos.

Hook: Stop explaining outages — run reproducible, offline demos

Developers and platform engineers: if your demos, workshops, or dev-lab exercises fail because of flaky cloud access, high latency, or unpredictable costs, you need a reliable local stack. In 2026 the Raspberry Pi 5 + AI HAT+ 2 combination makes small, usable generative AI models realistic at the edge. This hands-on guide shows how to set up, run, and benchmark a compact generative pipeline for offline demos, classroom labs, and proof-of-concept deployments.

What you'll get — at a glance

  • Hardware checklist and setup tips for Raspberry Pi 5 + AI HAT+ 2
  • Two inference paths: CPU-only (llama.cpp / ggml) and accelerated (ONNX + AI HAT+ 2 delegate)
  • Model preparation and quantization steps for small generative models
  • Benchmarking recipes and scripts (latency, tokens/sec, power)
  • Optimization and production-ready tips for dev labs and offline demos

Why this matters in 2026

Late-2025 and early-2026 trends made this approach practical: 4-bit quantization (widespread), the GGUF model packaging standard, and improved ARM64 runtimes (llama.cpp, onnxruntime ARM delegates, and community-driven optimized kernels). Edge NPUs like those on the AI HAT+ 2 now offer real throughput for small to medium models, making low-latency, private inference feasible for demos and constrained deployments.

Prerequisites (hardware and software)

Hardware

  • Raspberry Pi 5 (4–8 GB models recommended; 8 GB preferred if you plan larger models)
  • AI HAT+ 2 module (vendor-supplied M.2/PCIe AI accelerator for Pi 5)
  • Fast storage: NVMe SSD (via an M.2 HAT or USB 3 adapter) or a fast A2-rated microSD card
  • Quality 5V/5A (27 W) power supply — ideally the official Pi 5 PSU
  • Optional: active cooling (fan + heatsink) and USB power meter for power benchmarking

Software

  • 64-bit OS: Raspberry Pi OS (64-bit) or Ubuntu Server 24.04/24.10 arm64 (2025/2026 builds)
  • Developer tools: git, build-essential, python3, pip
  • llama.cpp (or equivalent ggml runtime) for CPU path
  • ONNX + onnxruntime (ARM64) for accelerated path — plus AI HAT+ 2 vendor SDK/delegate

Step 1 — Hardware setup and first-boot configuration

  1. Assemble the Pi 5 with the AI HAT+ 2: securely seat the HAT on the Pi 5 expansion header or M.2 adapter per the vendor instructions. Add an NVMe SSD to the adapter if you plan to store models on disk for speed.
  2. Attach an active cooling solution — Pi 5 under load benefits significantly from an airflow fan + heatsink. Monitor temps during first runs.
  3. Flash your 64-bit OS (Raspberry Pi OS 64-bit or Ubuntu Server 24.04 LTS arm64). For Raspberry Pi OS, use Raspberry Pi Imager 2025+ and choose 64-bit release.
  4. Initial boot: update packages and enable SSH for headless access.
    sudo apt update && sudo apt full-upgrade -y
    sudo reboot
    
  5. Install common tools:
    sudo apt install -y git build-essential python3 python3-pip cmake libopenblas-dev liblapack-dev libomp-dev

Step 2 — Decide your inference path

Two practical approaches for developer labs:

  • CPU-only (fastest to set up): Use llama.cpp / ggml backends and a 1.3B–3B quantized model. Ideal for single-user demos and offline chat bots.
  • AI HAT+ 2 accelerated: Export model to ONNX or a vendor-supported runtime and use the HAT+ 2 delegate. Improved tokens/sec and lower latency for interactive demos with multiple sessions.

Path A — CPU-only with llama.cpp (quick and reliable)

Why choose this

llama.cpp (and forks) are lightweight, easy to compile on ARM64, and support multithreading and quantized ggml models. For many labs, a 1.3B or quantized 3B model gives acceptable latency without vendor drivers.

Install and build llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j4

The Pi 5 has four Cortex-A76 cores, so -j4 matches the core count. The build produces a fast, portable binary.

Get a small model and convert

Use a small open model (1.3B or community small models) in GGUF or convert an HF checkpoint to ggml/gguf. Many community models in 2025–2026 ship already in GGUF.

# Example: convert a Hugging Face checkpoint (high-level; follow the model license)
# 1) download the model to /models/my-model
# 2) convert to GGUF with llama.cpp's conversion utilities (see the repo's convert scripts or community tools)
./quantize model-f16.gguf model-q4_0.gguf q4_0   # newer builds name this binary llama-quantize

Store the quantized model on your NVMe SSD for best I/O performance.

Run a simple inference

./main -m models/my-model.gguf -p "Write a short workshop outline for local AI demos" -n 128

This should print streaming tokens. Wrap the command in time if you want single-run wall-clock numbers. (Newer llama.cpp builds name the binary llama-cli rather than main.)
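For demos, time-to-first-token matters more than total runtime. A small Python wrapper can measure it for any CLI runtime by timing how long the process takes to emit its first byte of output. This is a generic sketch; the llama.cpp binary and model paths in the commented example are assumptions you should adjust to your build.

```python
import subprocess
import time

def time_to_first_token(cmd):
    """Spawn cmd and return seconds until the first byte appears on stdout."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    proc.stdout.read(1)           # blocks until the runtime emits output
    latency = time.monotonic() - start
    proc.stdout.read()            # drain the rest of the stream
    proc.wait()
    return latency

# Example (paths are illustrative -- adjust to your build and model):
# ttft = time_to_first_token(
#     ["./main", "-m", "models/my-model.gguf", "-p", "Hello", "-n", "16"])
# print(f"time to first token: {ttft:.2f}s")
```

Run it a few times and report the median; the first call after boot will be slower until the model pages into memory.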

Path B — AI HAT+ 2 accelerated (higher throughput)

Why choose this

When you need faster interactive demos or multiple concurrent sessions, the AI HAT+ 2 accelerator reduces CPU load and increases tokens/sec. In 2026, vendor SDKs typically provide an ONNX Runtime delegate or an NPU runtime that plugs into existing pipelines.

Install the AI HAT+ 2 SDK and ONNX Runtime

Follow the vendor download instructions for AI HAT+ 2: install kernel driver, runtime, and ONNX delegate. The steps below are representative; always consult the vendor's docs for the exact package names and signatures.

# Example (vendor placeholder):
# 1) Install vendor kernel modules and runtime (requires reboot)
sudo dpkg -i ai-hat2-kernel-*.deb ai-hat2-runtime-*.deb
sudo reboot

# 2) Install onnxruntime for ARM64
python3 -m pip install onnxruntime==1.16.0 --extra-index-url https://download.vendor/onnx

After installation, verify the delegate is registered with onnxruntime and that the device shows up in vendor tools.
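A quick sanity check, assuming onnxruntime is installed, is to list the registered execution providers. The exact name of the AI HAT+ 2 provider is vendor-specific; the only guaranteed entry is the CPU provider.

```python
def available_providers():
    """Return onnxruntime's registered execution providers, or []
    if onnxruntime is not installed yet."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return ort.get_available_providers()

providers = available_providers()
print(providers)
# On a working install you should see the vendor's execution provider
# (name is vendor-specific) alongside "CPUExecutionProvider".
```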

Export your model to ONNX

Use Hugging Face Transformers + Optimum (or the vendor’s conversion tool) to export a small model to ONNX. In many cases you’ll then apply quantization or compile to the vendor format.

python -m pip install transformers optimum[onnx]
python export_to_onnx.py --model my-small-model --output model.onnx
# Vendor-specific compilation (placeholder)
vendor-compile --input model.onnx --out model.hat2

Run onnxruntime with the HAT+ 2 delegate

python run_onnx_inference.py --model model.hat2 --prompt "Hello edge AI"

Expect lower wall-clock latency and higher tokens/sec, but check memory usage and thermal throttling under sustained load.
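The inference script itself is vendor-specific, but the generation loop is not: whatever runtime sits underneath, generative decoding repeatedly asks it for next-token logits and appends the chosen token. Here is a minimal greedy-decoding sketch abstracted over a `run_step` callable (which in a real script would wrap an onnxruntime session call); all names are hypothetical.

```python
def greedy_generate(run_step, prompt_tokens, max_new_tokens, eos_id=None):
    """Greedy decoding loop: ask the runtime (run_step) for next-token
    logits and append the argmax until eos or the token budget runs out.
    run_step(tokens) -> list of floats, one logit per vocabulary entry."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = run_step(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens

# Dummy runtime that always predicts token 2 -- stands in for a real
# session.run(...) call against the compiled model.
out = greedy_generate(lambda toks: [0.1, 0.2, 0.9], [5], 3)
print(out)  # [5, 2, 2, 2]
```

Sampling strategies (top-k, temperature) slot into the same loop by replacing the argmax line.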

Model preparation and quantization best practices

  • Prefer 4-bit (q4) quantization for best size/perf tradeoff. 3–4-bit quantization is the standard in 2025–2026.
  • Use GGUF wherever possible — it’s become the de-facto packaging standard for ggml models and simplifies metadata handling.
  • Validate model outputs after quantization — small numeric differences are expected but verify prompts used in your demos.
  • Store hot models on NVMe to prevent I/O stalls when streaming, especially on demos that repeatedly load models.
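One cheap way to validate outputs after quantization is to compare the quantized model's answers on your demo prompts against full-precision baselines. A token-overlap (Jaccard) score is a crude but serviceable drift detector; the 0.6 threshold below is an arbitrary starting point, not a standard.

```python
def token_jaccard(a, b):
    """Rough similarity between two responses: Jaccard overlap of their
    whitespace-separated token sets (1.0 = identical sets)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Flag demo prompts whose quantized answer drifts too far from the
# full-precision baseline.
baseline = "Local AI runs on device without network access"
quantized = "Local AI runs on device with no network access"
score = token_jaccard(baseline, quantized)
print(f"overlap: {score:.2f}")
assert score > 0.6, "quantized output drifted -- re-check this prompt"
```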

Benchmarking: real, repeatable measurements

Benchmarks should measure three things: latency (time to first token), throughput (tokens/sec), and power draw. Use micro-benchmarks and representative prompts.

Simple latency/throughput script (bash)

# run-benchmark.sh
PROMPT="Summarize the history of local AI in two sentences."
MODEL_PATH=models/my-model.gguf
ITER=5
for i in $(seq 1 $ITER); do
  start=$(date +%s.%N)
  ./main -m "$MODEL_PATH" -p "$PROMPT" -n 128 > /tmp/out.txt
  end=$(date +%s.%N)
  elapsed=$(awk "BEGIN {print $end - $start}")
  echo "run $i: ${elapsed}s"
done

Parse the captured output and compute tokens/sec for throughput. llama.cpp also prints its own timing summary (including eval tokens per second) at the end of each run; capture it from stderr and record it alongside your wall-clock numbers.
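A rough throughput estimate, assuming the output was captured to a file as in the script above, is the whitespace-token count divided by elapsed time. This under- or over-counts real model tokens slightly, so treat it as a proxy and prefer the runtime's own accounting when available.

```python
def tokens_per_sec(output_path, elapsed_s):
    """Approximate throughput: whitespace-token count in the captured
    output divided by wall-clock seconds."""
    with open(output_path) as f:
        n_tokens = len(f.read().split())
    return n_tokens / elapsed_s

# Example: 128 tokens generated in 6.4 s -> 20 tokens/sec
```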

Power measurement

  • Use an inline USB power meter between PSU and Pi to measure Watts during idle and during inference.
  • Record temperature with vcgencmd measure_temp (Raspberry Pi OS provides it) and watch for thermal throttling.
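With an average power reading from the meter and the elapsed time from your benchmark script, energy per prompt is simple arithmetic: joules = watts × seconds. Converting to watt-hours makes the number comparable with the meter's own readout, and it is the figure to use for the FinOps-style exercises mentioned later.

```python
def energy_per_prompt(avg_watts, elapsed_s):
    """Energy used by one prompt: joules = watts * seconds; also
    return watt-hours for comparison with meter readouts."""
    joules = avg_watts * elapsed_s
    return joules, joules / 3600.0

# Example: 7.5 W average draw over a 6 s generation
j, wh = energy_per_prompt(7.5, 6.0)
print(f"{j:.1f} J ({wh:.4f} Wh)")  # 45.0 J (0.0125 Wh)
```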

Interpreting results — practical expectations (2026)

Typical community-observed ranges (late 2025–2026):

  • CPU-only, 1.3B quantized: ~10–40 tokens/sec (depending on quantization and threads)
  • CPU-only, 3B quantized: ~3–12 tokens/sec (memory-bound on 4GB devices)
  • AI HAT+ 2 accelerated, small models: 50–300 tokens/sec (highly variable based on delegate & quantization)

These are ranges; your mileage depends on model, conversational settings, token context length, and whether you use streaming or full-batch decoding.

Performance tuning checklist

  • Enable multithreading in runtime (set OMP_NUM_THREADS to number of physical cores minus 1)
  • Pin processes to cores if mixing CPU and NPU workloads to reduce contention
  • Use context window management — shorter contexts = lower latency and memory use
  • Pre-warm the model process to avoid cold-start overheads for demos (run a dummy prompt at startup)
  • Implement output streaming to improve perceived latency for users
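The thread-count and pre-warm tips above can be wired into a startup script like this; the binary name and model path are illustrative and should match your build.

```shell
# Leave one core free for the OS/UI (the Pi 5 has 4 cores).
THREADS=$(( $(nproc) > 1 ? $(nproc) - 1 : 1 ))
export OMP_NUM_THREADS=$THREADS

# Pre-warm: run one throwaway prompt at startup so attendees never
# see cold-start latency (binary/model paths are illustrative).
# ./main -m models/my-model.gguf -t "$THREADS" -p "warmup" -n 8 > /dev/null
echo "using $THREADS threads"
```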

Security, privacy & offline best practices

Edge deployments are attractive because they reduce data exfiltration risks. For dev labs and offline demos:

  • Keep models and inference on-device; disable outbound network access for the inference container if possible
  • Run the inference process with a low-privilege user and limit accessible filesystem paths
  • Sanitize prompts in multi-user labs to avoid running code or leaking secrets in shared sessions
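On systemd-based distributions, the low-privilege and no-network points can be enforced declaratively. The sketch below uses standard systemd directives (User, PrivateNetwork, ProtectHome, ReadOnlyPaths); the unit name, account, and paths are hypothetical.

```ini
# /etc/systemd/system/pi-ai-demo.service (sketch -- names are illustrative)
[Unit]
Description=Offline generative AI demo

[Service]
User=aidemo
WorkingDirectory=/opt/pi-ai-lab
ExecStart=/opt/pi-ai-lab/serve.sh
PrivateNetwork=yes
ProtectHome=yes
ReadOnlyPaths=/opt/pi-ai-lab/models

[Install]
WantedBy=multi-user.target
```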

Real-world mini postmortem: a workshop that failed — and how we fixed it

Problem: During a 2025 university workshop, cloud-hosted demo instances hit rate limits and the network failed. Attendees experienced long stalls and the session got canceled.

Remediation: we rebuilt the demo to run locally on Raspberry Pi 5 + AI HAT+ 2. The steps that saved the day:

  1. Switched to a 1.3B gguf quantized model (fits in memory, quick to load)
  2. Pre-warmed with 3 short prompts to avoid cold-start latency for attendees
  3. Provided USB power meters and taught students to measure energy per prompt as a FinOps exercise
  4. Ran all instances offline with a local web UI on a Pi to mimic a cloud-hosted experience

Outcome: the lab ran smoothly, students got hands-on experience without cloud credits, and the organizers learned the value of local reproducible demos.

Advanced strategies & future-proofing (2026 and beyond)

  • Model distillation — produce task-specific distilled models for even lower latency on Pi-class hardware.
  • Incremental quantization & pruning — run A/B tests to tune quantization knobs for best perceived quality vs. speed.
  • Containerized reproducibility — ship a Docker image with the exact runtime, vendor SDK, and model so labs reproduce behavior reliably.
  • Edge orchestration — use lightweight orchestrators (k3s or Nomad) for multi-Pi labs to manage updates and telemetry.
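The containerized-reproducibility idea can be sketched as a Dockerfile. This is illustrative only: the base image, package list, and paths are assumptions, and the vendor SDK .deb files must come from the AI HAT+ 2 vendor's download channel.

```dockerfile
# Sketch only -- package names and paths are illustrative.
FROM arm64v8/ubuntu:24.04
RUN apt-get update && apt-get install -y \
    git build-essential cmake python3 python3-pip
COPY vendor-sdk/*.deb /tmp/
RUN dpkg -i /tmp/*.deb || apt-get -f install -y
COPY models/ /opt/models/
COPY scripts/ /opt/scripts/
WORKDIR /opt
CMD ["/opt/scripts/serve.sh"]
```

Baking the models into the image makes labs byte-for-byte reproducible at the cost of a large image; mounting a model volume is the lighter alternative.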

Common pitfalls and how to avoid them

  1. Insufficient cooling: causes thermal throttling and flaky performance. Add a fan and monitor temps.
  2. Using too-large models: pick 1.3B–3B for Pi 5 unless you have 8 GB and a strong SSD; otherwise use a smaller model.
  3. Ignoring vendor docs: vendor runtime and kernel driver versions must match; mismatches lead to delegate failures.
  4. No power measurement: you can’t optimize for energy without measuring it — bring a USB power meter.

Checklist for replicable dev-lab demos

  • Pre-download and store models locally (GGUF preferred)
  • Create a startup script that pre-warms the model and starts a local web UI
  • Document exact OS and SDK versions; capture the whole environment as a container image
  • Run benchmarks and include expected numbers in the lab README

Edge-first AI demos in 2026 are practical: with quantization and small models you can deliver private, low-latency experiences using hardware like the Raspberry Pi 5 + AI HAT+ 2.

Final thoughts and next steps

By following the paths above you can move from flaky cloud demos to deterministic, offline workshops. Start with the CPU-only llama.cpp approach for speed of setup, then iterate to the AI HAT+ 2 accelerated path for higher throughput. Measure, tune, and document — that’s how you build reliable developer labs that teach real skills without depending on cloud quotas.

Actionable takeaways

  • Use a 64-bit OS and fast storage; 8 GB Pi 5 yields the most flexibility.
  • Start with a quantized 1.3B model on llama.cpp for easiest replication.
  • Install the AI HAT+ 2 SDK and ONNX delegate to unlock higher tokens/sec for multi-user demos.
  • Benchmark latency, tokens/sec, and power; document the expected ranges for your audience.
  • Containerize the runtime to make repeatable labs for workshops and training.

Call to action

Ready to build your first offline generative AI lab on a Pi? Download the accompanying repository with example scripts, model conversion helpers, and benchmark templates at our GitHub (search "behind-cloud/pi-ai-lab"). Share your results, and if you need help architecting a reproducible workshop or scaling to multi-device labs, contact our DevOps team for a hands-on session.
