
Running a Local LLM on a Raspberry Pi 5: What Actually Works

Hands-on results running quantized LLMs on a Raspberry Pi 5. Which model sizes are usable, what tokens/sec to expect, and the accessories you actually need.

By Pedro Santos

The Raspberry Pi 5 is the cheapest realistic on-ramp to local AI. It’s not fast, but for the right job (a tiny always-on agent) it’s perfect, and it sips power. The trick is having honest expectations: knowing what it can do means you’ll be delighted instead of disappointed.

What to expect (ballpark numbers)

With llama.cpp and a 4-bit 1.7B model, the Pi 5 (8 GB) lands in the low single-digit tokens/sec range. That’s too slow for a chat UI where you wait on every word, but perfectly fine for background tasks that classify a message, pick a tool, or extract a field and move on. Push past ~3B and it drops below one token/sec, usable only for non-interactive batch jobs you don’t sit and watch.
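
To see why the same speed feels fine for a background task but painful in chat, it helps to turn tokens/sec into wall-clock seconds. The sketch below is plain arithmetic; the token counts and speeds passed in are illustrative assumptions, not measurements from the board.

```sh
#!/bin/sh
# Rough wait-time estimate: seconds = output tokens / tokens per second.
# All numbers passed in below are illustrative assumptions, not benchmarks.
gen_seconds() {
    # $1 = number of output tokens, $2 = generation speed in tokens/sec
    awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.0f\n", n / tps }'
}

gen_seconds 10 3      # a short classification label at an assumed 3 tok/s
gen_seconds 300 3     # a chat-length answer at the same speed
gen_seconds 300 0.8   # the same answer at an assumed sub-1 tok/s (larger model)
```

A ten-token label comes back in a few seconds, while a 300-token answer takes minutes, which is the whole difference between "routing" and "chat" on this hardware.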

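Power draw under inference is about 5–8 W. Running 24/7, that works out to roughly 44–70 kWh a year, which costs somewhere between a few dollars and about ten dollars at typical residential electricity rates. That low running cost is exactly what makes it a great always-on box.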
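
A quick sanity check of the running cost, as a small sketch. The electricity price of 0.15 per kWh is an assumption; substitute your own tariff.

```sh
#!/bin/sh
# Yearly energy cost of an always-on device.
# cost = watts * 24 h * 365 days / 1000 (to kWh) * price per kWh
yearly_cost() {
    # $1 = average power draw in watts, $2 = price per kWh (assumed, check your bill)
    awk -v w="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", w * 24 * 365 / 1000 * rate }'
}

yearly_cost 5 0.15   # low end of the 5-8 W range
yearly_cost 8 0.15   # high end of the 5-8 W range
```

At the assumed price, the two calls print 6.57 and 10.51, so the board costs single-digit to low double-digit currency units per year even if it runs inference around the clock.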

What you can actually build with it

This is where the Pi 5 shines. Real, useful projects that fit a 1–2B model:

  • A home-automation router that reads a spoken or typed request and decides which smart-home action to trigger.
  • An offline text classifier: tag incoming notes, emails, or sensor messages by category without sending anything to the cloud.
  • A tiny RAG assistant over a small personal knowledge base for simple Q&A.
  • A learning rig to understand quantization, GGUF, and llama.cpp flags hands-on before you invest in a real GPU.
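
The router and classifier ideas above share one pattern: constrain the model to a fixed set of labels, then refuse to act on anything that is not on the list. Here is a minimal sketch of that guard. The action names in ALLOWED and the helper names build_prompt and pick_action are hypothetical, chosen only for illustration.

```sh
#!/bin/sh
# Allow-list guard for a small-model router.
# ALLOWED, build_prompt and pick_action are hypothetical names for this sketch.
ALLOWED="lights_on lights_off thermostat_up thermostat_down"

build_prompt() {
    # $1 = the user's request; ask for exactly one label and nothing else
    printf 'Reply with exactly one of: %s, or unknown.\nRequest: %s\nAction:' "$ALLOWED" "$1"
}

pick_action() {
    # $1 = raw model reply. Lower-case it, drop everything except a-z and "_",
    # and accept the result only if it matches an allowed action exactly.
    reply=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z_')
    for a in $ALLOWED; do
        if [ "$reply" = "$a" ]; then
            echo "$a"
            return 0
        fi
    done
    echo unknown
}

# Example wiring (not executed here; needs Ollama and the model installed):
#   pick_action "$(ollama run llama3.2:1b "$(build_prompt 'turn the lights off')")"
```

Small models sometimes add punctuation or a chatty preamble. Treating anything off-list as unknown means a confused reply does nothing instead of triggering the wrong action.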

If you need fluent chat or coding help, this isn’t the tier; see the cheapest-way guide for the step up to a GPU.

The board

Recommended

Raspberry Pi 5 (8GB)

  • 8 GB RAM (shared system memory; the Pi has no separate VRAM)
  • 12 W TDP
  • 2023

~$80 street price

Get the 8 GB model: RAM caps the size of the model you can load, so the extra memory buys you headroom. An active cooler is non-negotiable for sustained inference, and a quality 27 W USB-C power supply prevents brown-outs under load.
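
To get a feel for how RAM bounds model size, you can estimate the weight footprint as parameters times bits per weight. This is a rough sketch: the 4.5 bits per weight used for a 4-bit quant is an assumption (real GGUF quant types vary), and the result ignores the context cache and the OS itself.

```sh
#!/bin/sh
# Approximate size of the model weights in GB.
# size_GB = parameters (in billions) * bits per weight / 8
model_gb() {
    # $1 = parameter count in billions, $2 = average bits per weight (assumed)
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

model_gb 1.7 4.5   # a 1.7B model at an assumed ~4.5 bits/weight
model_gb 7 4.5     # a 7B model at the same quantization
```

The weights of a 7B model still fit in 8 GB on paper, so memory is the hard ceiling and speed is the practical one.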

What else you’ll need

  • Active cooler (official or a heatsink+fan). Without it the Pi throttles within minutes of sustained inference and your tokens/sec quietly halves.
  • Official 27 W USB-C PSU. Under-powering causes random resets exactly when the CPU spikes during generation.
  • Fast microSD or, better, an NVMe SSD via the PCIe HAT. Model files are GB-sized and load far quicker from SSD.
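
Both cooling and power problems show up in the firmware's throttle flags, so you can check for them instead of guessing. The sketch below decodes the hex value that vcgencmd get_throttled prints on Raspberry Pi OS. The bit meanings follow our reading of the Raspberry Pi documentation, and the helper name decode_throttled is ours.

```sh
#!/bin/sh
# Decode the bitmask from `vcgencmd get_throttled` (e.g. throttled=0x50005).
# Bit meanings as we understand the Raspberry Pi docs:
#   bit 0  = under-voltage now      bit 16 = under-voltage has occurred
#   bit 2  = throttled now          bit 18 = throttling has occurred
decode_throttled() {
    # $1 = hex value such as 0x50005
    v=$(( $1 ))
    flags=""
    if [ $(( v & 0x1 )) -ne 0 ]; then flags="$flags under-voltage-now"; fi
    if [ $(( v & 0x4 )) -ne 0 ]; then flags="$flags throttled-now"; fi
    if [ $(( v & 0x10000 )) -ne 0 ]; then flags="$flags under-voltage-seen"; fi
    if [ $(( v & 0x40000 )) -ne 0 ]; then flags="$flags throttled-seen"; fi
    flags=${flags# }
    echo "${flags:-ok}"
}

# Only query the firmware when the tool is present (i.e. on a Pi).
if command -v vcgencmd >/dev/null 2>&1; then
    vcgencmd measure_temp
    raw=$(vcgencmd get_throttled)
    decode_throttled "${raw#throttled=}"
fi
```

Run it after a few minutes of generation. Anything other than ok points at the PSU (under-voltage) or the cooler (throttling).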

Setup in three steps

The quickest path is Ollama, which now runs on 64-bit Raspberry Pi OS and skips all the compiling. Flash Raspberry Pi OS (64-bit) with Raspberry Pi Imager (the 32-bit OS can’t address enough memory for these models), boot, then:

# one-line install, then pull and run a small 1B model
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2:1b

That downloads the model (on the order of 1 GB) and drops you into a prompt. Try qwen2.5:1.5b too, but stay at the 1–2B sizes: anything larger crawls.
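
To check your own speed, Ollama reports timing alongside each response. As far as we understand its HTTP API, a non-streaming /api/generate reply carries eval_count (generated tokens) and eval_duration (nanoseconds). The sketch below turns those two numbers into tokens/sec. The helper name tokens_per_sec is ours, and the commented-out call assumes jq is installed.

```sh
#!/bin/sh
# tokens/sec from Ollama's reported counters.
tokens_per_sec() {
    # $1 = eval_count (tokens generated), $2 = eval_duration in nanoseconds
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.2f\n", n / (ns / 1e9) }'
}

# Example wiring (not executed here; needs a running Ollama and jq):
#   resp=$(curl -s http://localhost:11434/api/generate \
#            -d '{"model":"llama3.2:1b","prompt":"Say hi","stream":false}')
#   tokens_per_sec "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"

tokens_per_sec 64 20000000000   # made-up sample: 64 tokens in 20 s
```

For a quick interactive check, ollama run with the --verbose flag prints similar timing after each reply.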

Or build llama.cpp from source (more control, more speed)

If you want to tune every flag or squeeze the most tokens/sec, compile llama.cpp directly:

sudo apt update && sudo apt install -y build-essential cmake git
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j4
# then run a 4-bit GGUF you downloaded:
./build/bin/llama-cli -m qwen2-1_5b-q4_k_m.gguf -p "Classify: 'reset the lights'" -n 64

Tips to get the most out of it

  • Keep the model at 1–2B and Q4. It’s the only combination that stays usable.
  • Use short prompts and short outputs. Every prompt token has to be processed and every output token is generated one at a time, so fewer tokens = a snappier feel.
  • Run it headless as a service and call it over your network, so the Pi does one job well in the background.
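
Calling the Pi from another machine can be as small as one curl request against Ollama's HTTP API on port 11434. This is a sketch under assumptions: the hostname raspberrypi.local and the helper names are hypothetical, and the escaping only handles single-line prompts. Ollama listens on localhost by default, so the service on the Pi needs OLLAMA_HOST=0.0.0.0 in its environment before it accepts network connections.

```sh
#!/bin/sh
# Send a prompt to an Ollama instance running on the Pi.
# PI_HOST, json_escape, build_payload and ask_pi are hypothetical names.
PI_HOST="${PI_HOST:-raspberrypi.local}"

json_escape() {
    # Escape backslashes and double quotes; assumes the input has no newlines.
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

build_payload() {
    # $1 = model tag, $2 = prompt text
    printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$(json_escape "$2")"
}

ask_pi() {
    # POST the JSON payload from stdin; give the slow board time to answer.
    build_payload "$1" "$2" |
        curl -s --max-time 120 "http://$PI_HOST:11434/api/generate" -d @-
}

# Example (not executed here; needs the Pi to be reachable):
#   ask_pi llama3.2:1b "Classify: 'reset the lights'"
```

Keeping the payload builder separate from the network call makes it easy to inspect exactly what is being sent before pointing it at the board.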

We are testing this exact setup on a real board and will publish the measured tokens/sec here once the runs are captured: real numbers, no estimates. Until then, treat the figures above as ballpark expectations.


Frequently asked questions

Can a Raspberry Pi 5 run a local LLM?

Yes. With llama.cpp and a 4-bit 1–2B model it runs at a few tokens/sec, fine for routing, classification, and small always-on agents, but too slow for a chat interface.

What size model can a Raspberry Pi 5 run?

Comfortably 1–2B quantized models on the 8 GB board. Anything above ~3B becomes painfully slow, so keep models small.

Do I need a cooler to run an LLM on a Raspberry Pi 5?

Yes. Sustained inference will thermal-throttle the bare board, so an active cooler is effectively non-negotiable for steady performance.

What is the easiest way to run an LLM on a Raspberry Pi 5?

Install Ollama with the one-line script on 64-bit Raspberry Pi OS, then run 'ollama run llama3.2:1b'. It downloads the model and starts a chat prompt with no compiling required.
