Run Local Generative AI on Raspberry Pi 5: A DevOps Quickstart with the AI HAT+ 2

Unknown
2026-02-25
10 min read

Quickstart: set up AI HAT+ 2 on Raspberry Pi 5, run a small edge LLM, expose an API, and add CI to keep the model container updated.

If your team is wrestling with latency, privacy, and operational overhead when integrating cloud LLMs, running a compact generative model locally on a Raspberry Pi 5 with the new $130 AI HAT+ 2 can cut inference latency, reduce costs, and keep sensitive data on-prem. This quickstart gets an edge LLM up and serving as an integration endpoint and adds CI so the model container stays current.

Quick summary — what you'll finish in ~90 minutes

  • Set up Raspberry Pi 5 with 64-bit OS and the AI HAT+ 2 drivers.
  • Deploy a small, quantized LLM (gguf/ggml-compatible) inside a container using llama.cpp bindings and FastAPI.
  • Expose a secure /v1/generate endpoint for integrations.
  • Add a GitHub Actions CI pipeline that builds multi-arch images, pushes to registry, and deploys to your Pi via SSH.

Why this matters in 2026

Late 2025 and early 2026 saw two clear trends that make this setup practical for production prototypes:

  • Hardware vendors improved compact NPUs and open drivers for ARM SBCs, and the AI HAT+ 2 brings accessible acceleration to Pi-class devices.
  • Open runtimes (llama.cpp, ggml, llama-cpp-python) and quantized model formats (gguf) matured for smaller memory footprints, enabling sub-4GB models to run efficiently on edge hardware.
ZDNET and other outlets highlighted the AI HAT+ 2 in late 2025 as a milestone for making generative models viable on Raspberry Pi 5-class hardware.

Prerequisites

  • Raspberry Pi 5 (recommended 8GB or 16GB RAM) and reliable power.
  • AI HAT+ 2 accessory (approx $130) and its vendor SDK/drivers.
  • MicroSD or NVMe boot with 64-bit Raspberry Pi OS or Ubuntu 22.04/24.04 (aarch64).
  • Network access (SSH), a dev machine for building images, and a Docker registry (Docker Hub, GHCR, or private registry).
  • Basic familiarity with Docker, Python, and GitHub Actions or CI of choice.

High-level architecture

We’ll package the model runtime and a small FastAPI app into an ARM container. The Pi runs the container via Docker and exposes a token-protected HTTP API. CI builds multi-arch images and pushes them to your registry; a deploy step SSHs to the Pi and restarts the container when images change.


  [Client] --> HTTPS --> [Raspberry Pi 5 + AI HAT+ 2]
                          |-- Container: FastAPI + llama.cpp (llama-cpp-python)
                          |-- Model: /opt/models/edge-llm.gguf
                          |-- Systemd or Docker to keep container running
  CI: GitHub Actions -> buildx multi-arch -> push image -> SSH deploy
  

Step 1 — Prepare the Pi 5

  1. Install a 64-bit OS image (Raspberry Pi OS 64-bit or Ubuntu aarch64). Flash and boot with SSH enabled.
  2. Update and install essentials:
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y build-essential git curl python3 python3-venv python3-pip docker.io
  3. Add your user to docker group and enable Docker:
    sudo usermod -aG docker $USER
    newgrp docker
    sudo systemctl enable --now docker
  4. Install vendor drivers for the AI HAT+ 2. Replace the example URL with the manufacturer's latest SDK (the vendor released open drivers in late 2025):
    curl -fsSL https://vendor.example.com/ai-hat2-sdk/latest/install.sh | sudo bash

    After install, verify the board is recognized (vendor CLI or dmesg):

    vendor-hatctl status
    # or
    dmesg | grep -i hat

Step 2 — Choose and prepare a small, quantized model

For Raspberry Pi 5, aim for models in the ~1B–3B parameter range, quantized to reduce RAM. In 2026, the preferred format for lightweight on-device models is GGUF (the successor to the older GGML format), compatible with llama.cpp and its Python bindings.

Two ways to get a model:

  • Download an already quantized gguf release from Hugging Face or the model vendor.
  • Convert a checkpoint to gguf using community tools and quantizers (run on your build host with more RAM).
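Before downloading anything, it helps to sanity-check whether a candidate model fits in RAM. The sketch below is a back-of-envelope estimate (weights at params × bits/8, plus a fixed overhead allowance for the KV cache and runtime buffers), not a vendor formula:

```python
# Rough RAM estimate for a quantized model: weight bytes plus a fixed
# overhead allowance for KV cache and runtime buffers (an assumption,
# not a measured figure).
def estimate_ram_gb(params_billion: float, quant_bits: int, overhead_gb: float = 0.5) -> float:
    weight_bytes = params_billion * 1e9 * quant_bits / 8
    return weight_bytes / 1e9 + overhead_gb

# A 3B model at 4-bit quantization fits comfortably in 8 GB of RAM:
print(round(estimate_ram_gb(3, 4), 2))  # ~2.0 GB of weights + overhead
```

Numbers like these are why q4-quantized 1B–3B models are the sweet spot for an 8 GB Pi 5.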

Example: using Hugging Face CLI to download (login required if model requires consent):

pip3 install huggingface_hub
huggingface-cli login
mkdir -p ~/models/edge-llm && cd ~/models/edge-llm
huggingface-cli download <repo-id> model.gguf --revision main --local-dir .

Place the final gguf file under /opt/models/edge-llm.gguf on the Pi or mount it into the container at runtime.
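A quick integrity check before mounting the model catches truncated or mislabeled downloads early. GGUF files begin with the 4-byte magic `GGUF`, so a minimal sketch is:

```python
# Sanity-check a downloaded model file: GGUF files start with the
# 4-byte magic b"GGUF". A failed check usually means a truncated
# download or a non-GGUF file renamed by mistake.
def looks_like_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

Run it against `/opt/models/edge-llm.gguf` on the Pi before starting the container.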

Step 3 — Build a container runtime for inference

We'll use a lightweight FastAPI wrapper that calls into llama-cpp-python (Python bindings for llama.cpp). The container builds llama.cpp for arm64 and exposes a /v1/generate endpoint.

Example Dockerfile (arm64)

FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt update && apt install -y build-essential git cmake python3 python3-venv python3-pip libopenblas-dev
# Build llama.cpp (provides CLI tools such as the quantizer); llama-cpp-python
# compiles its own copy of the library during pip install via requirements.txt.
WORKDIR /opt
RUN git clone --depth 1 https://github.com/ggerganov/llama.cpp.git && \
    cmake -S llama.cpp -B llama.cpp/build -DCMAKE_BUILD_TYPE=Release && \
    cmake --build llama.cpp/build -j
# App
WORKDIR /app
COPY app/requirements.txt ./
RUN python3 -m venv /opt/venv && /opt/venv/bin/pip install --no-cache-dir -r requirements.txt
COPY app /app
ENV PATH=/opt/venv/bin:$PATH
EXPOSE 8000
CMD ["/opt/venv/bin/uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Example FastAPI app (app/main.py)

from fastapi import FastAPI, HTTPException, Header
from pydantic import BaseModel
from llama_cpp import Llama
import os

MODEL_PATH = os.getenv("MODEL_PATH", "/opt/models/edge-llm.gguf")
API_TOKEN = os.getenv("API_TOKEN", "changeme")

app = FastAPI()

llm = Llama(model_path=MODEL_PATH)

class GenRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/v1/generate")
async def generate(req: GenRequest, authorization: str | None = Header(None)):
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="Unauthorized")
    out = llm.create_completion(prompt=req.prompt, max_tokens=req.max_tokens)
    # create_completion returns a dict shaped like the OpenAI completion response
    return {"text": out["choices"][0]["text"]}

@app.get("/health")
async def health():
    return {"status": "ok"}

requirements.txt should include:

fastapi
uvicorn[standard]
llama-cpp-python

Step 4 — Run locally and validate

  1. Build the image locally (or on your build host) for aarch64. On your dev machine with Docker Buildx enabled:
    docker buildx create --use
    docker buildx build --platform linux/arm64 -t yourname/edge-llm:latest --push .
  2. On the Pi, pull the image and run (mount the model if you didn't bake it in):
    docker pull yourname/edge-llm:latest
    sudo mkdir -p /opt/models
    # copy model to /opt/models/edge-llm.gguf
    docker run -d --restart unless-stopped --name edge-llm \
      -p 8000:8000 \
      -v /opt/models:/opt/models \
      -e MODEL_PATH=/opt/models/edge-llm.gguf \
      -e API_TOKEN="my-secret-token" \
      yourname/edge-llm:latest
  3. Test the endpoint (replace localhost with the Pi's address if testing from your dev machine):
    curl -s -H "Authorization: Bearer my-secret-token" \
      -H "Content-Type: application/json" \
      -X POST localhost:8000/v1/generate \
      -d '{"prompt":"Hello world","max_tokens":20}' | jq
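For integrations, the same call can be made from Python with only the standard library. This is a minimal client sketch; the base URL and token are placeholders for your own deployment, and the response key ("text") matches the FastAPI app above:

```python
import json
import urllib.request

# Build an authenticated POST request for the /v1/generate endpoint.
# The Content-Type header is required so FastAPI parses the JSON body.
def build_request(base_url: str, token: str, prompt: str, max_tokens: int = 20) -> urllib.request.Request:
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def generate(base_url: str, token: str, prompt: str) -> str:
    with urllib.request.urlopen(build_request(base_url, token, prompt)) as resp:
        return json.load(resp)["text"]
```

Usage: `generate("http://<pi-address>:8000", "my-secret-token", "Hello world")`.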

Step 5 — Add CI to build and deploy model container

We use GitHub Actions to build multi-arch images and SSH into the Pi to pull and restart the container. Store secrets in GitHub: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN, API_TOKEN, SSH_HOST, SSH_USER, SSH_KEY (private key for the deploy user), and optionally SSH_PORT.

Example .github/workflows/ci-deploy.yml

name: Build and Deploy
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to DockerHub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and push image
        uses: docker/build-push-action@v4
        with:
          push: true
          platforms: linux/arm64
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/edge-llm:latest

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Copy SSH key
        uses: webfactory/ssh-agent@v0.9.0
        with:
          ssh-private-key: ${{ secrets.SSH_KEY }}
      - name: Deploy to Pi
        run: |
          ssh -o StrictHostKeyChecking=no ${{ secrets.SSH_USER }}@${{ secrets.SSH_HOST }} -p ${{ secrets.SSH_PORT || 22 }} \
            'docker pull ${{ secrets.DOCKERHUB_USERNAME }}/edge-llm:latest && \
             docker stop edge-llm || true && docker rm edge-llm || true && \
             docker run -d --restart unless-stopped --name edge-llm -p 8000:8000 -v /opt/models:/opt/models -e MODEL_PATH=/opt/models/edge-llm.gguf -e API_TOKEN="${{ secrets.API_TOKEN }}" ${{ secrets.DOCKERHUB_USERNAME }}/edge-llm:latest'

This CI flow rebuilds and redeploys the container whenever you push to main. To update the model file (gguf), store it in a model store or on the Pi and trigger the workflow either by pushing a tag or by adding a small metadata file that lists the model version.
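The metadata-file approach can be sketched as a small gate in the deploy script. The file name (`model-version.json`) and its fields are assumptions for illustration; the point is that the deploy step only re-downloads the gguf when the version in the repo differs from what the Pi already has:

```python
import json
import pathlib

# Compare the model version the repo wants against what is deployed.
# Both files are assumed to contain {"model_version": "..."}.
def needs_model_update(repo_meta: str, deployed_meta: str) -> bool:
    wanted = json.loads(pathlib.Path(repo_meta).read_text())["model_version"]
    current_path = pathlib.Path(deployed_meta)
    if not current_path.exists():
        return True  # nothing deployed yet
    current = json.loads(current_path.read_text())["model_version"]
    return wanted != current
```

After a successful swap, the deploy script copies the metadata file next to the model so the next run is a no-op.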

Operational best practices

  • Model updates: Avoid baking large model files into images. Keep models on a mounted volume and make your container resilient to model swaps (reload on SIGHUP or expose an endpoint that reloads the model).
  • Security: Use short-lived tokens, IP restrictions, or an auth proxy. Never expose the service directly to the public internet without proper auth and rate-limiting.
  • Monitoring: Add /health and /metrics endpoints and scrape them with Prometheus or push basic logs to a central log aggregator.
  • Rollback: Tag images with semantic versions. CI should keep the last N successful images for quick rollback on the Pi.
  • Resource limits: Use Docker resource constraints (memory, cpus) to prevent the model from destabilizing the OS.

Advanced strategies (real-world patterns)

Hybrid inference

Route small requests to the Pi for low latency, and route heavy generation or long-context requests to a cloud-hosted larger model. Use the same API contract so client code doesn't need to know which backend handled the request.
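A routing decision like this can live in a thin gateway in front of both backends. The 512-token edge budget and the chars-per-token estimate below are illustrative assumptions, standing in for a real tokenizer and your own capacity measurements:

```python
# Hybrid-routing sketch: short requests go to the Pi ("edge"), heavy or
# long-context requests go to a larger cloud model ("cloud"). A rough
# len//4 chars-per-token estimate stands in for a real tokenizer.
def pick_backend(prompt: str, max_tokens: int, edge_budget_tokens: int = 512) -> str:
    est_prompt_tokens = len(prompt) // 4
    return "edge" if est_prompt_tokens + max_tokens <= edge_budget_tokens else "cloud"
```

Because both backends share the /v1/generate contract, the gateway only changes the host it forwards to.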

Canary model updates

Use CI to push experimental quantizations to a single device or device group. Run automated tests that exercise generation quality metrics (perplexity proxies, embedding similarity) before promoting to the fleet.
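The promotion gate can be as simple as comparing embeddings of candidate-model outputs against a trusted reference model over a fixed prompt suite. The embeddings come from whatever encoder you already use; this sketch only shows the gating arithmetic, and the 0.9 threshold is an assumption to tune per fleet:

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Promote the candidate quantization only if its outputs stay close, on
# average, to the reference model's outputs across the prompt suite.
def passes_canary(ref_embs, cand_embs, threshold: float = 0.9) -> bool:
    scores = [cosine(r, c) for r, c in zip(ref_embs, cand_embs)]
    return sum(scores) / len(scores) >= threshold
```

CI runs this after deploying to the canary device and fails the workflow if the gate does not pass.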

Observability for models

  • Log prompts (sanitized), inference latency, memory pressure, and token counts.
  • Track model drift and quality regressions across releases with sample prompt suites.

Troubleshooting checklist

  • No device detected: re-run the vendor SDK installer, check dmesg, and ensure the AI HAT+ 2 is seated and powered.
  • Model load fails: verify the gguf file path, check file permissions, and ensure enough swap or RAM for initial load.
  • Slow inference: confirm the container uses the NPU/accelerator, check CPU governor, and prefer q4/q8 quantized gguf files.
  • CI build fails for arm64: enable buildx and QEMU emulation or build on an arm64 runner.

Looking ahead

By 2026, edge-first AI architectures are mainstream for privacy-sensitive and latency-critical use cases. Expect these ongoing changes:

  • More ARM-optimized model releases and quantized gguf model packs designed for micro-NPUs.
  • Vendor drivers converging on common runtime APIs, making it easier to swap acceleration hardware without changing inference code.
  • WASM-based model runtimes improving portability and sandboxing options for constrained devices.

Actionable takeaways

  • Get the hardware: Pi 5 + AI HAT+ 2 and use a 64-bit OS image.
  • Prefer gguf-quantized models and llama.cpp runtimes on-device for predictable memory behavior.
  • Containerize the runtime and use CI (buildx) to create arm64 images and automate deploys to the Pi.
  • Protect the endpoint (token, proxy) and add health/metrics for observability.

Resources

  • AI HAT+ 2 vendor SDK/docs — check the manufacturer for the latest installer (released late 2025).
  • llama.cpp and llama-cpp-python — lightweight, ARM-friendly runtimes.
  • Hugging Face model hub — search for gguf/quantized models suitable for edge.
  • GitHub Actions buildx docs — for multi-arch build pipelines.

Conclusion & next steps

Running a local generative AI endpoint on Raspberry Pi 5 with the AI HAT+ 2 is now practical for teams that need low-latency, private inference at the edge. This quickstart gives you a repeatable DevOps workflow: hardware + drivers, a containerized LLM runtime, a secure API for integrations, and CI to keep the deployment current.

Try it now: clone the starter repo (link in your project board), provision a Pi 5, and run the GitHub Actions pipeline to build and deploy your first edge LLM in under two hours. For production, add monitoring, secrets rotation, and a canary workflow to safely roll model updates.

Call to action: Want a ready-made reference repo, production templates for model rollout, and policy-ready security patterns for edge LLMs? Visit midways.cloud/edge-llm-quickstart to download the complete example (Dockerfile, FastAPI app, and CI pipeline) and join our community preview for field-tested configurations.
