Hands-On: Deploying a Local Generative AI Pipeline on Raspberry Pi 5 with AI HAT+ 2
Step-by-step guide to run and benchmark a local generative AI pipeline on Raspberry Pi 5 + AI HAT+ 2 for reliable offline demos.
Stop explaining outages: run reproducible, offline demos
Developers and platform engineers: if your demos, workshops, or dev-lab exercises fail because of flaky cloud access, high latency, or unpredictable costs, you need a reliable local stack. In 2026 the Raspberry Pi 5 + AI HAT+ 2 combination makes small, usable generative AI models realistic at the edge. This hands-on guide shows how to set up, run, and benchmark a compact generative pipeline for offline demos, classroom labs, and proof-of-concept deployments.
What you'll get — at a glance
- Hardware checklist and setup tips for Raspberry Pi 5 + AI HAT+ 2
- Two inference paths: CPU-only (llama.cpp / ggml) and accelerated (ONNX + AI HAT+ 2 delegate)
- Model preparation and quantization steps for small generative models
- Benchmarking recipes and scripts (latency, tokens/sec, power)
- Optimization and production-ready tips for dev labs and offline demos
Why this matters in 2026
Late-2025 and early-2026 trends made this approach practical: 4-bit quantization (widespread), the GGUF model packaging standard, and improved ARM64 runtimes (llama.cpp, onnxruntime ARM delegates, and community-driven optimized kernels). Edge NPUs like those on the AI HAT+ 2 now offer real throughput for small to medium models, making low-latency, private inference feasible for demos and constrained deployments.
Prerequisites (hardware and software)
Hardware
- Raspberry Pi 5 (4–8 GB models recommended; 8 GB preferred if you plan larger models)
- AI HAT+ 2 module (vendor-supplied M.2/PCIe AI accelerator for Pi 5)
- Fast storage: NVMe SSD (via a PCIe M.2 HAT or USB 3 adapter) or a high-end microSD (the Pi 5 supports UHS-I SDR104 speeds)
- Quality 5 V/5 A (27 W) power supply, ideally the official Pi 5 PSU
- Optional: active cooling (fan + heatsink) and USB power meter for power benchmarking
Software
- 64-bit OS: Raspberry Pi OS (64-bit) or Ubuntu Server 24.04/24.10 arm64 (2025/2026 builds)
- Developer tools: git, build-essential, python3, pip
- llama.cpp (or equivalent ggml runtime) for CPU path
- ONNX + onnxruntime (ARM64) for accelerated path — plus AI HAT+ 2 vendor SDK/delegate
Step 1 — Hardware setup and first-boot configuration
- Assemble Pi 5 with AI HAT+ 2: securely seat the HAT on the Pi 5 expansion header or M.2 adapter per vendor instructions. Add an NVMe SSD to the adapter if you plan to store models on disk for speed.
- Attach an active cooling solution — Pi 5 under load benefits significantly from an airflow fan + heatsink. Monitor temps during first runs.
- Flash your 64-bit OS (Raspberry Pi OS 64-bit or Ubuntu Server 24.04 LTS arm64). For Raspberry Pi OS, use Raspberry Pi Imager 2025+ and choose 64-bit release.
- Initial boot: update packages and enable SSH for headless access.
sudo apt update && sudo apt full-upgrade -y
sudo reboot
- Install common tools:
sudo apt install -y git build-essential python3 python3-pip cmake libopenblas-dev liblapack-dev libomp-dev
Step 2 — Decide your inference path
Two practical approaches for developer labs:
- CPU-only (fastest to set up): Use llama.cpp / ggml backends and a 1.3B–3B quantized model. Ideal for single-user demos and offline chat bots.
- AI HAT+ 2 accelerated: Export model to ONNX or a vendor-supported runtime and use the HAT+ 2 delegate. Improved tokens/sec and lower latency for interactive demos with multiple sessions.
Path A — CPU-only with llama.cpp (quick and reliable)
Why choose this
llama.cpp (and forks) are lightweight, easy to compile on ARM64, and support multithreading and quantized ggml models. For many labs, a 1.3B or quantized 3B model gives acceptable latency without vendor drivers.
Install and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j4
The Pi 5 has four Cortex-A76 cores, so -j4 matches the hardware. The build produces a fast, portable binary.
Get a small model and convert
Use a small open model (1.3B or community small models) in GGUF or convert an HF checkpoint to ggml/gguf. Many community models in 2025–2026 ship already in GGUF.
# Example: convert a Hugging Face checkpoint (high-level, follow model license)
# 1) download model to /models/my-model
# 2) use conversion utilities (see llama.cpp tools/convert or community scripts)
./llama-quantize model-f16.gguf model-q4_0.gguf q4_0   # the binary was named ./quantize in older llama.cpp builds
Store the quantized model on your NVMe SSD for best I/O performance.
Run a simple inference
./main -m models/my-model.gguf -p "Write a short workshop outline for local AI demos" -n 128
This should print streaming tokens. Measure latency with time if you want single-run numbers.
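For interactive demos, time to first token matters more than total wall-clock time, and measuring it needs a streaming read rather than a plain `time` invocation. A minimal sketch; the llama.cpp path and flags in the trailing comment are illustrative, not prescriptive:

```python
import shlex
import subprocess
import time

def time_to_first_output(cmd):
    """Run cmd and return seconds until the first byte appears on stdout.
    A rough proxy for time-to-first-token when cmd streams tokens."""
    start = time.monotonic()
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE)
    proc.stdout.read(1)        # blocks until the first byte arrives
    ttft = time.monotonic() - start
    proc.stdout.read()         # drain the rest of the output
    proc.wait()
    return ttft

# Against llama.cpp (path and flags illustrative):
# print(time_to_first_output('./main -m models/my-model.gguf -p "hi" -n 32'))
```

Because it measures the first byte on stdout, this also captures model-load and prompt-processing time, which is exactly what a demo audience experiences.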
Path B — AI HAT+ 2 accelerated (higher throughput)
Why choose this
When you need faster interactive demos or multiple concurrent sessions, the AI HAT+ 2 accelerator reduces CPU load and increases tokens/sec. In 2026, vendor SDKs typically provide an ONNX Runtime delegate or an NPU runtime that plugs into existing pipelines.
Install the AI HAT+ 2 SDK and ONNX Runtime
Follow the vendor download instructions for AI HAT+ 2: install kernel driver, runtime, and ONNX delegate. The steps below are representative; always consult the vendor's docs for the exact package names and signatures.
# Example (vendor placeholder):
# 1) Install vendor kernel modules and runtime (requires reboot)
sudo dpkg -i ai-hat2-kernel-*.deb ai-hat2-runtime-*.deb
sudo reboot
# 2) Install onnxruntime for ARM64
python3 -m pip install onnxruntime==1.16.0 --extra-index-url https://download.vendor/onnx
After installation, verify the delegate is registered with onnxruntime and that the device shows up in vendor tools.
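That verification can be done from Python via onnxruntime's provider list. The provider name below is a placeholder we invented for illustration; the vendor SDK documents the real identifier:

```python
def has_provider(providers, name):
    """Return True if an onnxruntime execution provider name is registered."""
    return name in providers

# On the Pi, once onnxruntime and the vendor delegate are installed
# ("HAT2ExecutionProvider" is a placeholder, not a real identifier):
# import onnxruntime as ort
# available = ort.get_available_providers()
# print(available)
# assert has_provider(available, "HAT2ExecutionProvider")
```

If the delegate is missing from the list, onnxruntime will silently fall back to CPU, which is why checking explicitly at startup is worth the two lines.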
Export your model to ONNX
Use Hugging Face Transformers + Optimum (or the vendor’s conversion tool) to export a small model to ONNX. In many cases you’ll then apply quantization or compile to the vendor format.
python3 -m pip install transformers "optimum[onnxruntime]"
python export_to_onnx.py --model my-small-model --output model.onnx
# Vendor-specific compilation (placeholder)
vendor-compile --input model.onnx --out model.hat2
Run onnxruntime with the HAT+ 2 delegate
python run_onnx_inference.py --model model.hat2 --prompt "Hello edge AI"
Expect lower wall-clock latency and higher tokens/sec, but check memory usage and thermal throttling under sustained load.
Model preparation and quantization best practices
- Prefer 4-bit (q4) quantization for best size/perf tradeoff. 3–4-bit quantization is the standard in 2025–2026.
- Use GGUF wherever possible — it’s become the de-facto packaging standard for ggml models and simplifies metadata handling.
- Validate model outputs after quantization — small numeric differences are expected but verify prompts used in your demos.
- Store hot models on NVMe to prevent I/O stalls when streaming, especially on demos that repeatedly load models.
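One lightweight way to validate a quantized model is a coarse similarity check between full-precision and quantized answers on your actual demo prompts. The metric and any pass/fail threshold are our judgment calls, not a standard; a sketch:

```python
def token_overlap(a, b):
    """Jaccard overlap between whitespace tokens of two model outputs.
    A coarse sanity check that a quantized model still answers demo
    prompts similarly to the full-precision one; it is not a quality
    metric, just a tripwire for drastic regressions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not (ta or tb):
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Run both models over your demo prompt set and flag any prompt whose overlap drops sharply; those are the ones to eyeball before the workshop.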
Benchmarking: real, repeatable measurements
Benchmarks should measure three things: latency (time to first token), throughput (tokens/sec), and power draw. Use micro-benchmarks and representative prompts.
Simple latency/throughput script (bash)
# run-benchmark.sh
PROMPT="Summarize the history of local AI in two sentences."
MODEL_PATH=models/my-model.gguf
ITER=5
for i in $(seq 1 $ITER); do
  start=$(date +%s.%N)
  ./main -m "$MODEL_PATH" -p "$PROMPT" -n 128 > /tmp/out.txt
  end=$(date +%s.%N)
  elapsed=$(echo "$end - $start" | bc -l)
  echo "run $i: ${elapsed}s"
done
Parse the output to compute tokens/sec for throughput. llama.cpp prints a timing summary at the end of each run; capture it and average across iterations.
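llama.cpp's timing summary can be scraped for throughput figures. The exact log format varies across llama.cpp versions, so treat this parser as a sketch and adapt the regex to your build:

```python
import re

def parse_tokens_per_sec(log_text):
    """Extract 'N tokens per second' figures from llama.cpp timing output."""
    return [float(m) for m in re.findall(r"([\d.]+)\s+tokens per second", log_text)]

# Example line shape (may differ in your llama.cpp version):
# "llama_print_timings: eval time = 1234.5 ms / 128 runs
#  (9.6 ms per token, 103.7 tokens per second)"
```

Averaging the eval-phase figure across several runs gives a more honest number than a single pass, since the first run includes cache warm-up.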
Power measurement
- Use an inline USB power meter between PSU and Pi to measure Watts during idle and during inference.
- Record temperature with vcgencmd measure_temp (included with Raspberry Pi OS) and watch for thermal throttling.
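vcgencmd's output is easy to parse if you want temperature logged alongside each benchmark run. The 80 C warning threshold below is our choice; the Pi 5 begins soft-throttling as it approaches roughly 85 C:

```python
import subprocess

THROTTLE_WARN_C = 80.0  # our margin below the Pi 5's ~85 C soft-throttle point

def parse_temp(output):
    """Parse `vcgencmd measure_temp` output, e.g. "temp=54.3'C" -> 54.3."""
    return float(output.strip().split("=")[1].rstrip("'C"))

# On the Pi itself:
# out = subprocess.run(["vcgencmd", "measure_temp"],
#                      capture_output=True, text=True).stdout
# temp = parse_temp(out)
# if temp > THROTTLE_WARN_C:
#     print(f"warning: {temp} C, expect throttling")
```

Logging temperature per benchmark iteration makes throttling visible as a downward drift in tokens/sec that correlates with rising temps.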
Interpreting results — practical expectations (2026)
Typical community-observed ranges (late 2025–2026):
- CPU-only, 1.3B quantized: ~10–40 tokens/sec (depending on quantization and threads)
- CPU-only, 3B quantized: ~3–12 tokens/sec (memory-bound on 4GB devices)
- AI HAT+ 2 accelerated, small models: 50–300 tokens/sec (highly variable based on delegate & quantization)
These are ranges; your mileage depends on model, conversational settings, token context length, and whether you use streaming or full-batch decoding.
Performance tuning checklist
- Enable multithreading in runtime (set OMP_NUM_THREADS to number of physical cores minus 1)
- Pin processes to cores if mixing CPU and NPU workloads to reduce contention
- Use context window management — shorter contexts = lower latency and memory use
- Pre-warm the model process to avoid cold-start overheads for demos (run a dummy prompt at startup)
- Implement output streaming to improve perceived latency for users
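Pre-warming is independent of which runtime you chose. A minimal sketch, where run_prompt is whatever callable drives your stack (a subprocess wrapper around llama.cpp, or an HTTP call to a local server); the warm-up prompts are arbitrary:

```python
def prewarm(run_prompt, prompts=("Hello", "Quick warm-up", "One more")):
    """Run a few throwaway prompts at startup so the first attendee
    never sees cold-start latency (model load, cache population)."""
    for p in prompts:
        run_prompt(p)
```

Call it once from your startup script, before the web UI announces itself as ready.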
Security, privacy & offline best practices
Edge deployments are attractive because they reduce data exfiltration risks. For dev labs and offline demos:
- Keep models and inference on-device; disable outbound network access for the inference container if possible
- Run the inference process with a low-privilege user and limit accessible filesystem paths
- Sanitize prompts in multi-user labs to avoid running code or leaking secrets in shared sessions
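On systemd-based distros, much of the list above can be enforced declaratively rather than by convention. A hypothetical unit fragment; the user name, paths, and service name are placeholders, and the network directives require cgroup v2:

```ini
# /etc/systemd/system/llama-demo.service (illustrative fragment)
[Service]
# Dedicated low-privilege user
User=llama-demo
ExecStart=/opt/llama/main -m /opt/models/my-model.gguf --port 8080
# Read-only filesystem outside explicitly allowed paths
ProtectSystem=strict
ReadOnlyPaths=/opt/models
# Block all network access except loopback (keeps the local web UI reachable)
IPAddressDeny=any
IPAddressAllow=localhost
NoNewPrivileges=yes
```

Shipping the unit file alongside the lab materials means every Pi in the room gets the same sandbox without manual setup.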
Real-world mini postmortem: a workshop that failed — and how we fixed it
Problem: During a 2025 university workshop, cloud-hosted demo instances hit rate limits and the network failed. Attendees experienced long stalls and the session got canceled.
Remediation: we rebuilt the demo to run locally on Raspberry Pi 5 + AI HAT+ 2. The steps that saved the day:
- Switched to a 1.3B gguf quantized model (fits in memory, quick to load)
- Pre-warmed with 3 short prompts to avoid cold-start latency for attendees
- Provided USB power meters and taught students to measure energy per prompt as a FinOps exercise
- Ran all instances offline with a local web UI on a Pi to mimic a cloud-hosted experience
Outcome: the lab ran smoothly, students got hands-on experience without cloud credits, and the organizers learned the value of local reproducible demos.
Advanced strategies & future-proofing (2026 and beyond)
- Model distillation — produce task-specific distilled models for even lower latency on Pi-class hardware.
- Incremental quantization & pruning — run A/B tests to tune quantization knobs for best perceived quality vs. speed.
- Containerized reproducibility — ship a Docker image with the exact runtime, vendor SDK, and model so labs reproduce behavior reliably.
- Edge orchestration — use lightweight orchestrators (k3s or Nomad) for multi-Pi labs to manage updates and telemetry.
Common pitfalls and how to avoid them
- Insufficient cooling: causes thermal throttling and flaky performance. Add a fan and monitor temps.
- Using too-large models: pick 1.3B–3B for Pi 5 unless you have 8 GB and a strong SSD; otherwise use a smaller model.
- Ignoring vendor docs: vendor runtime and kernel driver versions must match; mismatches lead to delegate failures.
- No power measurement: you can’t optimize for energy without measuring it — bring a USB power meter.
Checklist for replicable dev-lab demos
- Pre-download and store models locally (GGUF preferred)
- Create a startup script that pre-warms the model and starts a local web UI
- Document exact OS and SDK versions; capture the whole environment as a container image
- Run benchmarks and include expected numbers in the lab README
Edge-first AI demos in 2026 are practical: with quantization and small models you can deliver private, low-latency experiences using hardware like the Raspberry Pi 5 + AI HAT+ 2.
Final thoughts and next steps
By following the paths above you can move from flaky cloud demos to deterministic, offline workshops. Start with the CPU-only llama.cpp approach for speed of setup, then iterate to the AI HAT+ 2 accelerated path for higher throughput. Measure, tune, and document — that’s how you build reliable developer labs that teach real skills without depending on cloud quotas.
Actionable takeaways
- Use a 64-bit OS and fast storage; 8 GB Pi 5 yields the most flexibility.
- Start with a quantized 1.3B model on llama.cpp for easiest replication.
- Install the AI HAT+ 2 SDK and ONNX delegate to unlock higher tokens/sec for multi-user demos.
- Benchmark latency, tokens/sec, and power; document the expected ranges for your audience.
- Containerize the runtime to make repeatable labs for workshops and training.
Call to action
Ready to build your first offline generative AI lab on a Pi? Download the accompanying repository with example scripts, model conversion helpers, and benchmark templates at our GitHub (search "behind-cloud/pi-ai-lab"). Share your results, and if you need help architecting a reproducible workshop or scaling to multi-device labs, contact our DevOps team for a hands-on session.