Run Local Generative AI on Raspberry Pi 5: A DevOps Quickstart with the AI HAT+ 2
Quickstart: set up AI HAT+ 2 on Raspberry Pi 5, run a small edge LLM, expose an API, and add CI to keep the model container updated.
If your team is wrestling with latency, privacy, and operations overhead when integrating cloud LLMs, running a compact generative model locally on a Raspberry Pi 5 with the $130 AI HAT+ 2 can cut inference latency, reduce costs, and keep sensitive data on-prem. This quickstart gets an edge LLM up and serving as an integration endpoint, then adds CI so the model container stays current.
Quick summary — what you'll finish in ~90 minutes
- Set up Raspberry Pi 5 with 64-bit OS and the AI HAT+ 2 drivers.
- Deploy a small, quantized LLM (gguf/ggml-compatible) inside a container using llama.cpp bindings and FastAPI.
- Expose a secure /v1/generate endpoint for integrations.
- Add a GitHub Actions CI pipeline that builds multi-arch images, pushes to registry, and deploys to your Pi via SSH.
Why this matters in 2026
Late 2025 and early 2026 saw two clear trends that make this setup practical for production prototypes:
- Hardware vendors improved compact NPUs and open drivers for ARM SBCs, and the AI HAT+ 2 brings accessible acceleration to Pi-class devices.
- Open runtimes (llama.cpp, ggml, llama-cpp-python) and quantized model formats (gguf) matured for smaller memory footprints, enabling sub-4GB models to run efficiently on edge hardware.
ZDNET and other outlets highlighted the AI HAT+ 2 in late 2025 as a milestone for making generative models viable on Raspberry Pi 5-class hardware.
Prerequisites
- Raspberry Pi 5 (recommended 8GB or 16GB RAM) and reliable power.
- AI HAT+ 2 accessory (approx $130) and its vendor SDK/drivers.
- MicroSD or NVMe boot with 64-bit Raspberry Pi OS or Ubuntu 22.04/24.04 (aarch64).
- Network access (SSH), a dev machine for building images, and a Docker registry (Docker Hub, GHCR, or private registry).
- Basic familiarity with Docker, Python, and GitHub Actions or CI of choice.
High-level architecture
We’ll package the model runtime and a small FastAPI app into an ARM container. The Pi runs the container via Docker and exposes a token-protected HTTP API. CI builds multi-arch images and pushes them to your registry; a deploy step SSHs to the Pi and restarts the container when images change.
[Client] --> HTTPS --> [Raspberry Pi 5 + AI HAT+ 2]
|-- Container: FastAPI + llama.cpp (llama-cpp-python)
|-- Model: /opt/models/edge-llm.gguf
|-- Systemd or Docker to keep container running
CI: GitHub Actions -> buildx multi-arch -> push image -> SSH deploy
Step 1 — Prepare the Pi 5
- Install a 64-bit OS image (Raspberry Pi OS 64-bit or Ubuntu aarch64). Flash and boot with SSH enabled.
- Update and install essentials:
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git curl python3 python3-venv python3-pip docker.io
- Add your user to the docker group and enable Docker:
sudo usermod -aG docker $USER
newgrp docker
sudo systemctl enable --now docker
- Install vendor drivers for the AI HAT+ 2. Replace the example URL with the manufacturer's latest SDK (the vendor released open drivers in late 2025):
curl -fsSL https://vendor.example.com/ai-hat2-sdk/latest/install.sh | sudo bash
After install, verify the board is recognized (vendor CLI or dmesg):
vendor-hatctl status
# or
dmesg | grep -i hat
Step 2 — Choose and prepare a small, quantized model
For Raspberry Pi 5, aim for models in the ~1B–3B parameter range, quantized to reduce RAM. In 2026, the preferred format for lightweight on-device models is gguf (ggml-unified), compatible with llama.cpp and its Python bindings.
Two ways to get a model:
- Download an already quantized gguf release from Hugging Face or the model vendor.
- Convert a checkpoint to gguf using community tools and quantizers (run on your build host with more RAM).
Example: using Hugging Face CLI to download (login required if model requires consent):
pip3 install huggingface_hub
huggingface-cli login
mkdir -p ~/models/edge-llm && cd ~/models/edge-llm
huggingface-cli download <org>/<model-repo> model.gguf --revision main --local-dir .
Place the final gguf file under /opt/models/edge-llm.gguf on the Pi or mount it into the container at runtime.
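Before copying the file to the Pi, a quick sanity check that the download really is a gguf file can save a failed container start. A minimal sketch (the helper name is ours; the four-byte "GGUF" magic and the little-endian uint32 version field that follows it come from the gguf file format):

```python
import struct

def is_gguf(path: str) -> bool:
    """Check the 4-byte magic ('GGUF') and read the format version."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            return False
        # The next 4 bytes are a little-endian uint32 version field
        (version,) = struct.unpack("<I", f.read(4))
        return version >= 1
```

Run it against the downloaded file before you spend time shipping a corrupt or HTML-error-page "model" to the device.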
Step 3 — Build a container runtime for inference
We'll use a lightweight FastAPI wrapper that calls into llama-cpp-python (Python bindings for llama.cpp). The container builds llama.cpp for arm64 as part of the llama-cpp-python install and exposes a /v1/generate endpoint.
Example Dockerfile (arm64)
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt update && apt install -y build-essential git cmake python3 python3-venv python3-pip libopenblas-dev
# Create a virtualenv; installing llama-cpp-python compiles llama.cpp from
# source for arm64, so a separate clone-and-make step is not needed
RUN python3 -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir llama-cpp-python
# App
WORKDIR /app
COPY app/requirements.txt ./
RUN /opt/venv/bin/pip install --no-cache-dir -r requirements.txt
COPY app /app
ENV PATH=/opt/venv/bin:$PATH
EXPOSE 8000
CMD ["/opt/venv/bin/uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Example FastAPI app (app/main.py)
from fastapi import FastAPI, HTTPException, Header
from pydantic import BaseModel
from llama_cpp import Llama
import os

MODEL_PATH = os.getenv("MODEL_PATH", "/opt/models/edge-llm.gguf")
API_TOKEN = os.getenv("API_TOKEN", "changeme")

app = FastAPI()
llm = Llama(model_path=MODEL_PATH)

class GenRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/v1/generate")
async def generate(req: GenRequest, authorization: str | None = Header(None)):
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="Unauthorized")
    out = llm.create_completion(prompt=req.prompt, max_tokens=req.max_tokens)
    # create_completion returns a dict in the OpenAI completion shape
    return {"text": out["choices"][0]["text"]}

@app.get("/health")
async def health():
    return {"status": "ok"}
requirements.txt should include:
fastapi
uvicorn[standard]
llama-cpp-python
Step 4 — Run locally and validate
- Build the image locally (or on your build host) for aarch64. On your dev machine with Docker Buildx enabled:
docker buildx create --use
docker buildx build --platform linux/arm64 -t yourname/edge-llm:latest --push .
- On the Pi, pull the image and run (mount the model if you didn't bake it in):
docker pull yourname/edge-llm:latest
sudo mkdir -p /opt/models
# copy model to /opt/models/edge-llm.gguf
docker run -d --restart unless-stopped --name edge-llm \
  -p 8000:8000 \
  -v /opt/models:/opt/models \
  -e MODEL_PATH=/opt/models/edge-llm.gguf \
  -e API_TOKEN="my-secret-token" \
  yourname/edge-llm:latest
- Test the endpoint from your dev machine (replace <pi-address> with the Pi's hostname or IP):
curl -s -H "Authorization: Bearer my-secret-token" \
  -H "Content-Type: application/json" \
  -X POST http://<pi-address>:8000/v1/generate \
  -d '{"prompt":"Hello world","max_tokens":20}' | jq
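For integration code, the same call can be made from Python. A small sketch using only the standard library (the endpoint path and auth header match the FastAPI app above; the helper name is ours):

```python
import json
import urllib.request

def build_generate_request(base_url: str, token: str, prompt: str,
                           max_tokens: int = 20) -> urllib.request.Request:
    """Build an authenticated POST request for the /v1/generate endpoint."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Against a live endpoint:
# req = build_generate_request("http://<pi-address>:8000", "my-secret-token", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["text"])
```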
Step 5 — Add CI to build and deploy model container
We use GitHub Actions to build multi-arch images and SSH into the Pi to pull and restart the container. Store these secrets in GitHub: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN, SSH_HOST, SSH_USER, SSH_KEY (private key for the deploy user), API_TOKEN (injected into the container by the deploy step), and optionally SSH_PORT.
Example .github/workflows/ci-deploy.yml
name: Build and Deploy
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to DockerHub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and push image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          platforms: linux/arm64
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/edge-llm:latest
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Copy SSH key
        uses: webfactory/ssh-agent@v0.9.0
        with:
          ssh-private-key: ${{ secrets.SSH_KEY }}
      - name: Deploy to Pi
        run: |
          ssh -o StrictHostKeyChecking=no -p "${{ secrets.SSH_PORT }}" \
            "${{ secrets.SSH_USER }}@${{ secrets.SSH_HOST }}" \
            'docker pull ${{ secrets.DOCKERHUB_USERNAME }}/edge-llm:latest && \
            (docker stop edge-llm || true) && (docker rm edge-llm || true) && \
            docker run -d --restart unless-stopped --name edge-llm -p 8000:8000 -v /opt/models:/opt/models -e MODEL_PATH=/opt/models/edge-llm.gguf -e API_TOKEN="${{ secrets.API_TOKEN }}" ${{ secrets.DOCKERHUB_USERNAME }}/edge-llm:latest'
This CI flow rebuilds and redeploys the container whenever you push to main. To update the model file (gguf), store it in a model store or on the Pi and trigger the workflow either by pushing a tag or by adding a small metadata file that lists the model version.
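The metadata-file approach can be as simple as a JSON manifest checked into the repo and a version marker on the device. A sketch of the comparison the deploy step would run (the `model-manifest.json` name and its schema are our assumptions, not an established convention):

```python
import json
from pathlib import Path

def model_needs_update(manifest_path: str, deployed_version_path: str) -> bool:
    """Compare the model version pinned in the repo manifest with the
    version recorded on the device; True means the deploy step should
    fetch the new gguf file before restarting the container."""
    manifest = json.loads(Path(manifest_path).read_text())
    deployed_file = Path(deployed_version_path)
    deployed = deployed_file.read_text().strip() if deployed_file.exists() else ""
    return manifest["model_version"] != deployed
```

After a successful swap, the deploy step writes the new version string to the marker file so subsequent runs are no-ops.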
Operational best practices
- Model updates: Avoid baking large model files into images. Keep models on a mounted volume and make your container resilient to model swaps (reload on SIGHUP or expose an endpoint that reloads the model).
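A reload endpoint is straightforward if the model object lives behind a small holder that swaps it atomically. A sketch under the assumption that model loading is expressed as a zero-argument callable (with llama-cpp-python that would be something like `lambda: Llama(model_path=MODEL_PATH)`; the `ModelHolder` class is ours):

```python
import threading

class ModelHolder:
    """Hold the loaded model behind a lock so a reload endpoint can swap
    the gguf file without dropping in-flight requests."""

    def __init__(self, loader):
        self._loader = loader        # zero-arg callable returning a model
        self._lock = threading.Lock()
        self._model = loader()

    def get(self):
        with self._lock:
            return self._model

    def reload(self):
        # Load the new model first, then swap under the lock; if loading
        # fails, the exception propagates and the old model stays in service.
        new_model = self._loader()
        with self._lock:
            self._model = new_model
```

The FastAPI app would then call `holder.get()` per request and expose a token-protected `/v1/reload` route that calls `holder.reload()`.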
- Security: Use short-lived tokens, IP restrictions, or an auth proxy. Never expose the service directly to the public internet without proper auth and rate-limiting.
- Monitoring: Add /health and /metrics endpoints and scrape them with Prometheus or push basic logs to a central log aggregator.
- Rollback: Tag images with semantic versions. CI should keep the last N successful images for quick rollback on the Pi.
- Resource limits: Use Docker resource constraints (memory, cpus) to prevent the model from destabilizing the OS.
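As a sketch of those constraints, the `docker run` from Step 4 can be pinned so the model cannot consume the whole board (the specific limits below are illustrative for an 8GB Pi 5, not tuned values; adjust to your model size):

```shell
docker run -d --restart unless-stopped --name edge-llm \
  --memory=6g --memory-swap=6g \
  --cpus=3 \
  -p 8000:8000 \
  -v /opt/models:/opt/models \
  -e MODEL_PATH=/opt/models/edge-llm.gguf \
  -e API_TOKEN="my-secret-token" \
  yourname/edge-llm:latest
```

Setting `--memory-swap` equal to `--memory` disables swap for the container, which makes out-of-memory failures fast and visible instead of letting the device thrash.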
Advanced strategies (real-world patterns)
Hybrid inference
Route small requests to the Pi for low latency, and route heavy generation or long-context requests to a cloud-hosted larger model. Use the same API contract so client code doesn't need to know which backend handled the request.
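The routing decision can start as a simple heuristic behind the shared API contract. A sketch (the function name and the limit values are ours; tune the thresholds against measured Pi latency):

```python
def pick_backend(prompt: str, max_tokens: int,
                 edge_prompt_limit: int = 512,
                 edge_token_limit: int = 256) -> str:
    """Route short, low-token requests to the on-device model and
    everything else to the cloud backend."""
    if len(prompt) <= edge_prompt_limit and max_tokens <= edge_token_limit:
        return "edge"
    return "cloud"
```

Because both backends serve the same request/response shape, the caller only needs this one branch point.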
Canary model updates
Use CI to push experimental quantizations to a single device or device group. Run automated tests that exercise generation quality metrics (perplexity proxies, embedding similarity) before promoting to the fleet.
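One concrete promotion gate is to embed the outputs of a fixed prompt suite with both the baseline and candidate quantization, and require the embeddings to stay close. A minimal sketch in pure Python (the function names and the 0.9 threshold are our assumptions; in practice you would compute the embeddings with a separate embedding model):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def canary_passes(baseline: list[list[float]], candidate: list[list[float]],
                  threshold: float = 0.9) -> bool:
    """Promote the candidate quantization only if every sample prompt's
    output embedding stays close to the baseline model's output."""
    return all(cosine_similarity(x, y) >= threshold
               for x, y in zip(baseline, candidate))
```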
Observability for models
- Log prompts (sanitized), inference latency, memory pressure, and token counts.
- Track model drift and quality regressions across releases with sample prompt suites.
Troubleshooting checklist
- No device detected: re-run the vendor SDK installer, check dmesg, and ensure the AI HAT+ 2 is seated and powered.
- Model load fails: verify the gguf file path, check file permissions, and ensure enough swap or RAM for initial load.
- Slow inference: confirm the container uses the NPU/accelerator, check CPU governor, and prefer q4/q8 quantized gguf files.
- CI build fails for arm64: enable buildx and QEMU emulation or build on an arm64 runner.
2026 trends & future-proofing
By 2026, edge-first AI architectures are mainstream for privacy-sensitive and latency-critical use cases. Expect these ongoing changes:
- More ARM-optimized model releases and quantized gguf model packs designed for micro-NPUs.
- Vendor drivers converging on common runtime APIs, making it easier to swap acceleration hardware without changing inference code.
- WASM-based model runtimes improving portability and sandboxing options for constrained devices.
Actionable takeaways
- Get the hardware: Pi 5 + AI HAT+ 2 and use a 64-bit OS image.
- Prefer gguf-quantized models and llama.cpp runtimes on-device for predictable memory behavior.
- Containerize the runtime and use CI (buildx) to create arm64 images and automate deploys to the Pi.
- Protect the endpoint (token, proxy) and add health/metrics for observability.
Resources & links
- AI HAT+ 2 vendor SDK/docs — check the manufacturer for the latest installer (released late 2025).
- llama.cpp and llama-cpp-python — lightweight, ARM-friendly runtimes.
- Hugging Face model hub — search for gguf/quantized models suitable for edge.
- GitHub Actions buildx docs — for multi-arch build pipelines.
Conclusion & next steps
Running a local generative AI endpoint on Raspberry Pi 5 with the AI HAT+ 2 is now practical for teams that need low-latency, private inference at the edge. This quickstart gives you a repeatable DevOps workflow: hardware + drivers, a containerized LLM runtime, a secure API for integrations, and CI to keep the deployment current.
Try it now: clone the starter repo (link in your project board), provision a Pi 5, and run the GitHub Actions pipeline to build and deploy your first edge LLM in under two hours. For production, add monitoring, secrets rotation, and a canary workflow to safely roll model updates.
Call to action: Want a ready-made reference repo, production templates for model rollout, and policy-ready security patterns for edge LLMs? Visit midways.cloud/edge-llm-quickstart to download the complete example (Dockerfile, FastAPI app, and CI pipeline) and join our community preview for field-tested configurations.