Hands-On: Deploying a Local Generative AI Pipeline on Raspberry Pi 5 with AI HAT+ 2
Step-by-step guide to run and benchmark a local generative AI pipeline on Raspberry Pi 5 + AI HAT+ 2 for reliable offline demos.
Stop explaining outages: run reproducible, offline demos
Developers and platform engineers: if your demos, workshops, or dev-lab exercises fail because of flaky cloud access, high latency, or unpredictable costs, you need a reliable local stack. In 2026 the Raspberry Pi 5 + AI HAT+ 2 combination makes small, usable generative AI models realistic at the edge. This hands-on guide shows how to set up, run, and benchmark a compact generative pipeline for offline demos, classroom labs, and proof-of-concept deployments.
What you'll get — at a glance
- Hardware checklist and setup tips for Raspberry Pi 5 + AI HAT+ 2
- Two inference paths: CPU-only (llama.cpp / ggml) and accelerated (ONNX + AI HAT+ 2 delegate)
- Model preparation and quantization steps for small generative models
- Benchmarking recipes and scripts (latency, tokens/sec, power)
- Optimization and production-ready tips for dev labs and offline demos
Why this matters in 2026
Late-2025 and early-2026 trends made this approach practical: 4-bit quantization (widespread), the GGUF model packaging standard, and improved ARM64 runtimes (llama.cpp, onnxruntime ARM delegates, and community-driven optimized kernels). Edge NPUs like those on the AI HAT+ 2 now offer real throughput for small to medium models, making low-latency, private inference feasible for demos and constrained deployments.
Prerequisites (hardware and software)
Hardware
- Raspberry Pi 5 (4–8 GB models recommended; 8 GB preferred if you plan larger models)
- AI HAT+ 2 module (vendor-supplied M.2/PCIe AI accelerator for Pi 5)
- Fast storage: NVMe SSD (via a PCIe M.2 HAT or USB 3 adapter) or a high-end microSD (the Pi 5 supports UHS-I SDR104 speeds)
- Quality 5 V/5 A (27 W) power supply, ideally the official Pi 5 PSU
- Optional: active cooling (fan + heatsink) and USB power meter for power benchmarking
Software
- 64-bit OS: Raspberry Pi OS (64-bit) or Ubuntu Server 24.04/24.10 arm64 (2025/2026 builds)
- Developer tools: git, build-essential, python3, pip
- llama.cpp (or equivalent ggml runtime) for CPU path
- ONNX + onnxruntime (ARM64) for accelerated path — plus AI HAT+ 2 vendor SDK/delegate
Step 1 — Hardware setup and first-boot configuration
- Assemble Pi 5 with AI HAT+ 2: securely seat the HAT on the Pi 5 expansion header or M.2 adapter per vendor instructions. Add an NVMe SSD to the adapter if you plan to store models on disk for speed.
- Attach an active cooling solution — Pi 5 under load benefits significantly from an airflow fan + heatsink. Monitor temps during first runs.
- Flash your 64-bit OS (Raspberry Pi OS 64-bit or Ubuntu Server 24.04 LTS arm64). For Raspberry Pi OS, use Raspberry Pi Imager 2025+ and choose 64-bit release.
- Initial boot: update packages and enable SSH for headless access.
sudo apt update && sudo apt full-upgrade -y
sudo reboot
- Install common tools:
sudo apt install -y git build-essential python3 python3-pip cmake libopenblas-dev liblapack-dev libomp-dev
Step 2 — Decide your inference path
Two practical approaches for developer labs:
- CPU-only (fastest to set up): Use llama.cpp / ggml backends and a 1.3B–3B quantized model. Ideal for single-user demos and offline chat bots.
- AI HAT+ 2 accelerated: Export model to ONNX or a vendor-supported runtime and use the HAT+ 2 delegate. Improved tokens/sec and lower latency for interactive demos with multiple sessions.
Path A — CPU-only with llama.cpp (quick and reliable)
Why choose this
llama.cpp (and forks) are lightweight, easy to compile on ARM64, and support multithreading and quantized ggml models. For many labs, a 1.3B or quantized 3B model gives acceptable latency without vendor drivers.
Install and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j4
The Pi 5 has four Cortex-A76 cores, so -j4 matches the hardware. The build produces a fast, portable binary.
Get a small model and convert
Use a small open model (1.3B or community small models) in GGUF or convert an HF checkpoint to ggml/gguf. Many community models in 2025–2026 ship already in GGUF.
# Example: convert a Hugging Face checkpoint (high-level, follow model license)
# 1) download model to /models/my-model
# 2) use conversion utilities (see llama.cpp tools/convert or community scripts)
./llama-quantize model-f16.gguf model-q4_0.gguf q4_0   # the binary was named ./quantize in older llama.cpp builds
Store the quantized model on your NVMe SSD for best I/O performance.
Run a simple inference
./main -m models/my-model.gguf -p "Write a short workshop outline for local AI demos" -n 128
This should print streaming tokens. Measure latency with time if you want single-run numbers.
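For interactive demos, time to first token matters more than total wall-clock time, and measuring it needs a streaming read rather than a plain `time` invocation. A minimal sketch; the llama.cpp path and flags in the trailing comment are illustrative, not prescriptive:

```python
import shlex
import subprocess
import time

def time_to_first_output(cmd):
    """Run cmd and return seconds until the first byte appears on stdout.
    A rough proxy for time-to-first-token when cmd streams tokens."""
    start = time.monotonic()
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE)
    proc.stdout.read(1)        # blocks until the first byte arrives
    ttft = time.monotonic() - start
    proc.stdout.read()         # drain the rest of the output
    proc.wait()
    return ttft

# Against llama.cpp (path and flags illustrative):
# print(time_to_first_output('./main -m models/my-model.gguf -p "hi" -n 32'))
```

Because it measures the first byte on stdout, this also captures model-load and prompt-processing time, which is exactly what a demo audience experiences.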
Path B — AI HAT+ 2 accelerated (higher throughput)
Why choose this
When you need faster interactive demos or multiple concurrent sessions, the AI HAT+ 2 accelerator reduces CPU load and increases tokens/sec. In 2026, vendor SDKs typically provide an ONNX Runtime delegate or an NPU runtime that plugs into existing pipelines.
Install the AI HAT+ 2 SDK and ONNX Runtime
Follow the vendor download instructions for AI HAT+ 2: install kernel driver, runtime, and ONNX delegate. The steps below are representative; always consult the vendor's docs for the exact package names and signatures.
# Example (vendor placeholder):
# 1) Install vendor kernel modules and runtime (requires reboot)
sudo dpkg -i ai-hat2-kernel-*.deb ai-hat2-runtime-*.deb
sudo reboot
# 2) Install onnxruntime for ARM64
python3 -m pip install onnxruntime==1.16.0 --extra-index-url https://download.vendor/onnx
After installation, verify the delegate is registered with onnxruntime and that the device shows up in vendor tools.
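That verification can be done from Python via onnxruntime's provider list. The provider name below is a placeholder we invented for illustration; the vendor SDK documents the real identifier:

```python
def has_provider(providers, name):
    """Return True if an onnxruntime execution provider name is registered."""
    return name in providers

# On the Pi, once onnxruntime and the vendor delegate are installed
# ("HAT2ExecutionProvider" is a placeholder, not a real identifier):
# import onnxruntime as ort
# available = ort.get_available_providers()
# print(available)
# assert has_provider(available, "HAT2ExecutionProvider")
```

If the delegate is missing from the list, onnxruntime will silently fall back to CPU, which is why checking explicitly at startup is worth the two lines.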
Export your model to ONNX
Use Hugging Face Transformers + Optimum (or the vendor’s conversion tool) to export a small model to ONNX. In many cases you’ll then apply quantization or compile to the vendor format.
python3 -m pip install transformers "optimum[onnxruntime]"
python export_to_onnx.py --model my-small-model --output model.onnx
# Vendor-specific compilation (placeholder)
vendor-compile --input model.onnx --out model.hat2
Run onnxruntime with the HAT+ 2 delegate
python run_onnx_inference.py --model model.hat2 --prompt "Hello edge AI"
Expect lower wall-clock latency and higher tokens/sec, but check memory usage and thermal throttling under sustained load.
Model preparation and quantization best practices
- Prefer 4-bit (q4) quantization for best size/perf tradeoff. 3–4-bit quantization is the standard in 2025–2026.
- Use GGUF wherever possible — it’s become the de-facto packaging standard for ggml models and simplifies metadata handling.
- Validate model outputs after quantization — small numeric differences are expected but verify prompts used in your demos.
- Store hot models on NVMe to prevent I/O stalls when streaming, especially on demos that repeatedly load models.
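One lightweight way to validate a quantized model is a coarse similarity check between full-precision and quantized answers on your actual demo prompts. The metric and any pass/fail threshold are our judgment calls, not a standard; a sketch:

```python
def token_overlap(a, b):
    """Jaccard overlap between whitespace tokens of two model outputs.
    A coarse sanity check that a quantized model still answers demo
    prompts similarly to the full-precision one; it is not a quality
    metric, just a tripwire for drastic regressions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not (ta or tb):
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Run both models over your demo prompt set and flag any prompt whose overlap drops sharply; those are the ones to eyeball before the workshop.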
Benchmarking: real, repeatable measurements
Benchmarks should measure three things: latency (time to first token), throughput (tokens/sec), and power draw. Use micro-benchmarks and representative prompts.
Simple latency/throughput script (bash)
# run-benchmark.sh
PROMPT="Summarize the history of local AI in two sentences."
MODEL_PATH=models/my-model.gguf
ITER=5
for i in $(seq 1 $ITER); do
  start=$(date +%s.%N)
  ./main -m "$MODEL_PATH" -p "$PROMPT" -n 128 > /tmp/out.txt
  end=$(date +%s.%N)
  elapsed=$(echo "$end - $start" | bc -l)
  echo "run $i: ${elapsed}s"
done
Parse the output to compute tokens/sec for throughput. llama.cpp prints a timing summary at the end of each run; capture it and average across iterations.
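llama.cpp's timing summary can be scraped for throughput figures. The exact log format varies across llama.cpp versions, so treat this parser as a sketch and adapt the regex to your build:

```python
import re

def parse_tokens_per_sec(log_text):
    """Extract 'N tokens per second' figures from llama.cpp timing output."""
    return [float(m) for m in re.findall(r"([\d.]+)\s+tokens per second", log_text)]

# Example line shape (may differ in your llama.cpp version):
# "llama_print_timings: eval time = 1234.5 ms / 128 runs
#  (9.6 ms per token, 103.7 tokens per second)"
```

Averaging the eval-phase figure across several runs gives a more honest number than a single pass, since the first run includes cache warm-up.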
Power measurement
- Use an inline USB power meter between PSU and Pi to measure Watts during idle and during inference.
- Record temperature with vcgencmd measure_temp (included with Raspberry Pi OS) and watch for thermal throttling.
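vcgencmd's output is easy to parse if you want temperature logged alongside each benchmark run. The 80 C warning threshold below is our choice; the Pi 5 begins soft-throttling as it approaches roughly 85 C:

```python
import subprocess

THROTTLE_WARN_C = 80.0  # our margin below the Pi 5's ~85 C soft-throttle point

def parse_temp(output):
    """Parse `vcgencmd measure_temp` output, e.g. "temp=54.3'C" -> 54.3."""
    return float(output.strip().split("=")[1].rstrip("'C"))

# On the Pi itself:
# out = subprocess.run(["vcgencmd", "measure_temp"],
#                      capture_output=True, text=True).stdout
# temp = parse_temp(out)
# if temp > THROTTLE_WARN_C:
#     print(f"warning: {temp} C, expect throttling")
```

Logging temperature per benchmark iteration makes throttling visible as a downward drift in tokens/sec that correlates with rising temps.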
Interpreting results — practical expectations (2026)
Typical community-observed ranges (late 2025–2026):
- CPU-only, 1.3B quantized: ~10–40 tokens/sec (depending on quantization and threads)
- CPU-only, 3B quantized: ~3–12 tokens/sec (memory-bound on 4GB devices)
- AI HAT+ 2 accelerated, small models: 50–300 tokens/sec (highly variable based on delegate & quantization)
These are ranges; your mileage depends on model, conversational settings, token context length, and whether you use streaming or full-batch decoding.
Performance tuning checklist
- Enable multithreading in runtime (set OMP_NUM_THREADS to number of physical cores minus 1)
- Pin processes to cores if mixing CPU and NPU workloads to reduce contention
- Use context window management — shorter contexts = lower latency and memory use
- Pre-warm the model process to avoid cold-start overheads for demos (run a dummy prompt at startup)
- Implement output streaming to improve perceived latency for users
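Pre-warming is independent of which runtime you chose. A minimal sketch, where run_prompt is whatever callable drives your stack (a subprocess wrapper around llama.cpp, or an HTTP call to a local server); the warm-up prompts are arbitrary:

```python
def prewarm(run_prompt, prompts=("Hello", "Quick warm-up", "One more")):
    """Run a few throwaway prompts at startup so the first attendee
    never sees cold-start latency (model load, cache population)."""
    for p in prompts:
        run_prompt(p)
```

Call it once from your startup script, before the web UI announces itself as ready.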
Security, privacy & offline best practices
Edge deployments are attractive because they reduce data exfiltration risks. For dev labs and offline demos:
- Keep models and inference on-device; disable outbound network access for the inference container if possible
- Run the inference process with a low-privilege user and limit accessible filesystem paths
- Sanitize prompts in multi-user labs to avoid running code or leaking secrets in shared sessions
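On systemd-based distros, much of the list above can be enforced declaratively rather than by convention. A hypothetical unit fragment; the user name, paths, and service name are placeholders, and the network directives require cgroup v2:

```ini
# /etc/systemd/system/llama-demo.service (illustrative fragment)
[Service]
# Dedicated low-privilege user
User=llama-demo
ExecStart=/opt/llama/main -m /opt/models/my-model.gguf --port 8080
# Read-only filesystem outside explicitly allowed paths
ProtectSystem=strict
ReadOnlyPaths=/opt/models
# Block all network access except loopback (keeps the local web UI reachable)
IPAddressDeny=any
IPAddressAllow=localhost
NoNewPrivileges=yes
```

Shipping the unit file alongside the lab materials means every Pi in the room gets the same sandbox without manual setup.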
Real-world mini postmortem: a workshop that failed — and how we fixed it
Problem: During a 2025 university workshop, cloud-hosted demo instances hit rate limits and the network failed. Attendees experienced long stalls and the session got canceled.
Remediation: we rebuilt the demo to run locally on Raspberry Pi 5 + AI HAT+ 2. The steps that saved the day:
- Switched to a 1.3B gguf quantized model (fits in memory, quick to load)
- Pre-warmed with 3 short prompts to avoid cold-start latency for attendees
- Provided USB power meters and taught students to measure energy per prompt as a FinOps exercise
- Ran all instances offline with a local web UI on a Pi to mimic a cloud-hosted experience
Outcome: the lab ran smoothly, students got hands-on experience without cloud credits, and the organizers learned the value of local reproducible demos.
Advanced strategies & future-proofing (2026 and beyond)
- Model distillation — produce task-specific distilled models for even lower latency on Pi-class hardware.
- Incremental quantization & pruning — run A/B tests to tune quantization knobs for best perceived quality vs. speed.
- Containerized reproducibility — ship a Docker image with the exact runtime, vendor SDK, and model so labs reproduce behavior reliably.
- Edge orchestration — use lightweight orchestrators (k3s or Nomad) for multi-Pi labs to manage updates and telemetry.
Common pitfalls and how to avoid them
- Insufficient cooling: causes thermal throttling and flaky performance. Add a fan and monitor temps.
- Using too-large models: pick 1.3B–3B for Pi 5 unless you have 8 GB and a strong SSD; otherwise use a smaller model.
- Ignoring vendor docs: vendor runtime and kernel driver versions must match; mismatches lead to delegate failures.
- No power measurement: you can’t optimize for energy without measuring it — bring a USB power meter.
Checklist for replicable dev-lab demos
- Pre-download and store models locally (GGUF preferred)
- Create a startup script that pre-warms the model and starts a local web UI
- Document exact OS and SDK versions; capture the whole environment as a container image
- Run benchmarks and include expected numbers in the lab README
Edge-first AI demos in 2026 are practical: with quantization and small models you can deliver private, low-latency experiences using hardware like the Raspberry Pi 5 + AI HAT+ 2.
Final thoughts and next steps
By following the paths above you can move from flaky cloud demos to deterministic, offline workshops. Start with the CPU-only llama.cpp approach for speed of setup, then iterate to the AI HAT+ 2 accelerated path for higher throughput. Measure, tune, and document — that’s how you build reliable developer labs that teach real skills without depending on cloud quotas.
Actionable takeaways
- Use a 64-bit OS and fast storage; 8 GB Pi 5 yields the most flexibility.
- Start with a quantized 1.3B model on llama.cpp for easiest replication.
- Install the AI HAT+ 2 SDK and ONNX delegate to unlock higher tokens/sec for multi-user demos.
- Benchmark latency, tokens/sec, and power; document the expected ranges for your audience.
- Containerize the runtime to make repeatable labs for workshops and training.
Call to action
Ready to build your first offline generative AI lab on a Pi? Download the accompanying repository with example scripts, model conversion helpers, and benchmark templates at our GitHub (search "behind-cloud/pi-ai-lab"). Share your results, and if you need help architecting a reproducible workshop or scaling to multi-device labs, contact our DevOps team for a hands-on session.