Can a Raspberry Pi 5 run a local LLM?

Yes. With llama.cpp and a 4-bit 1–2B model it runs at a few tokens/sec, fine for routing, classification, and small always-on agents, but too slow for a chat interface.

What size model can a Raspberry Pi 5 run?

Comfortably 1–2B quantized models on the 8 GB board. Anything above ~3B becomes painfully slow, so keep models small.

Do I need a cooler to run an LLM on a Raspberry Pi 5?

Yes. Sustained inference will thermal-throttle the bare board, so an active cooler is effectively non-negotiable for steady performance.

What is the easiest way to run an LLM on a Raspberry Pi 5?

Install Ollama with the one-line script on 64-bit Raspberry Pi OS, then run 'ollama run llama3.2:1b'. It downloads the model and starts a chat prompt with no compiling required.

Running a Local LLM on a Raspberry Pi 5: What Actually Works

The Raspberry Pi 5 is the cheapest realistic on-ramp to local AI. It’s not fast, but for the right job (a tiny always-on agent) it’s perfect, and it sips power. The trick is having honest expectations: knowing what it can do means you’ll be delighted instead of disappointed.

What to expect (the real numbers)

With llama.cpp and a 4-bit 1.7B model, the Pi 5 (8 GB) lands in the low single-digit tokens/sec range. That’s too slow for a chat UI where you wait on every word, but perfectly fine for background tasks that classify a message, pick a tool, or extract a field and move on. Push past ~3B and it drops below one token/sec, usable only for non-interactive batch jobs you don’t sit and watch.
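To see why that speed suits background tasks but not chat, a quick back-of-the-envelope sketch helps. The 3 tokens/sec rate and the token counts below are illustrative assumptions, not measurements, and gen_seconds is a hypothetical helper name.

```shell
# Seconds needed to generate N tokens at a given rate.
# Prompt processing time is not included.
# usage: gen_seconds TOKENS TOKENS_PER_SEC
gen_seconds() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

gen_seconds 10 3    # short classification label: prints 3
gen_seconds 200 3   # chat-length answer: prints 67
```

A three-second wait for a label is invisible in a background job; a minute-long wait for a paragraph is not something you want to watch.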

Power draw under inference is about 5–8 W, low enough to run 24/7 for a few dollars a year, which is exactly what makes it a great always-on box.
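To put a number on the running cost, the arithmetic is watts × hours per year ÷ 1000 × price per kWh. The sketch below assumes the board draws its inference-level power around the clock and uses an assumed rate of $0.15/kWh; swap in your own tariff. annual_cost is a hypothetical helper name.

```shell
# Yearly electricity cost of a constant load.
# usage: annual_cost WATTS PRICE_PER_KWH
annual_cost() {
  awk -v w="$1" -v p="$2" 'BEGIN { printf "%.2f\n", w * 24 * 365 / 1000 * p }'
}

annual_cost 5 0.15   # low end of the range: prints 6.57
annual_cost 8 0.15   # high end of the range: prints 10.51
```

Since the Pi idles well below its inference draw, a box that only wakes up for occasional requests should cost less than this worst case.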

What you can actually build with it

This is where the Pi 5 shines. Real, useful projects that fit a 1–2B model:

A home-automation router that reads a spoken or typed request and decides which smart-home action to trigger.
An offline text classifier: tag incoming notes, emails, or sensor messages by category without sending anything to the cloud.
A tiny RAG assistant over a small personal knowledge base for simple Q&A.
A learning rig to understand quantization, GGUF, and llama.cpp flags hands-on before you invest in a real GPU.

If you need fluent chat or coding help, this isn’t the tier; see The Cheapest Way to Run a Local LLM in 2026 for the step up to a GPU.

The board

Recommended

Raspberry Pi 5 (8GB)

8 GB RAM
12 W TDP
2023

~$80 street price

Get the 8 GB model: RAM directly limits the model size you can load, so the extra headroom is worth paying for. An active cooler is non-negotiable for sustained inference, and a quality 27 W USB-C power supply prevents brown-outs under load.
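A rough way to see how RAM bounds model size is to estimate the weight footprint as parameters × bits per weight ÷ 8. The sketch below assumes about 4.5 bits per weight, roughly what a 4-bit GGUF format costs once its scaling factors are included, and counts weights only: the context cache, the runtime, and the OS all need memory on top. model_gb is a hypothetical helper name.

```shell
# Approximate size of the weights in GB.
# usage: model_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
model_gb() {
  awk -v b="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", b * bits / 8 }'
}

model_gb 1.7 4.5   # prints 1.0
model_gb 3 4.5     # prints 1.7
model_gb 8 4.5     # prints 4.5
```

By this estimate, 1–3B models load with plenty of room to spare on the 8 GB board, so it is speed rather than memory that runs out first as models grow.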

What else you’ll need

Active cooler (official or a heatsink+fan). Without it the Pi throttles within minutes of sustained inference and your tokens/sec quietly halves.
Official 27 W USB-C PSU. Under-powering causes random resets exactly when the CPU spikes during generation.
Fast microSD or, better, an NVMe SSD via the PCIe HAT. Model files are GB-sized and load far quicker from SSD.
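One way to check whether cooling or power is holding the board back is the vcgencmd tool that ships with Raspberry Pi OS. Its get_throttled command prints a bitmask such as throttled=0x50005. The sketch below decodes four of the documented bits; decode_throttled is a hypothetical helper name, and the live call only runs where vcgencmd exists.

```shell
# Decode the value printed by `vcgencmd get_throttled`.
# usage: decode_throttled throttled=0x50005
decode_throttled() {
  val=$(( ${1#throttled=} ))   # strip the "throttled=" prefix, parse the hex
  [ $(( val & 0x1 )) -ne 0 ]     && echo "under-voltage now"
  [ $(( val & 0x4 )) -ne 0 ]     && echo "throttled now"
  [ $(( val & 0x10000 )) -ne 0 ] && echo "under-voltage has occurred"
  [ $(( val & 0x40000 )) -ne 0 ] && echo "throttling has occurred"
  return 0
}

# On the Pi itself, check the live state and the temperature.
if command -v vcgencmd >/dev/null 2>&1; then
  decode_throttled "$(vcgencmd get_throttled)"
  vcgencmd measure_temp
fi
```

No output from the decoder means none of these four flags is set. Under-voltage flags point at the power supply; throttling flags point at cooling.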

Setup in three steps

The quickest path is Ollama, which now runs on 64-bit Raspberry Pi OS and skips all the compiling. Flash Raspberry Pi OS (64-bit) with Raspberry Pi Imager (the 32-bit OS can’t address enough memory for these models), boot, then:

# one-line install, then pull a tiny 4-bit model
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2:1b

That downloads a model of roughly 1 GB and drops you into a prompt. Try qwen2.5:1.5b too, and stay at the 1–2B sizes: anything larger crawls.

Or build llama.cpp from source (more control, more speed)

If you want to tune every flag or squeeze the most tokens/sec, compile llama.cpp directly:

sudo apt update && sudo apt install -y build-essential cmake git
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j4
# then run a 4-bit GGUF you downloaded:
./build/bin/llama-cli -m qwen2-1_5b-q4_k_m.gguf -p "Classify: 'reset the lights'" -n 64

Tips to get the most out of it

Keep the model at 1–2B and Q4. It’s the only combination that stays usable.
Use short prompts and short outputs. Every token costs time on the Pi, both the ones it reads and the ones it writes, so fewer tokens = a snappier feel.
Run it headless as a service and call it over your network, so the Pi does one job well in the background.
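If you went the Ollama route, calling the Pi over the network can be sketched with its HTTP API. The example below builds a request body for the /api/generate endpoint and caps the output length with num_predict, in line with the short-output tip. The hostname raspberrypi.local and the helper name build_payload are assumptions for illustration. Ollama listens only on localhost by default, so the service on the Pi would need OLLAMA_HOST=0.0.0.0 set before other machines can reach it.

```shell
# Hostname is an assumption; override with PI_HOST=... if yours differs.
PI_HOST="${PI_HOST:-raspberrypi.local}"

# Build the JSON body for Ollama's /api/generate endpoint.
# Only backslashes and double quotes are escaped; prompts containing
# newlines or control characters need a real JSON tool instead.
# usage: build_payload MODEL PROMPT MAX_OUTPUT_TOKENS
build_payload() {
  escaped=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_predict":%d}}\n' \
    "$1" "$escaped" "$3"
}

# Show the request body that would be sent.
build_payload llama3.2:1b "Classify: 'reset the lights'" 16

# Send it from another machine on the network (uncomment to run):
# curl -s "http://$PI_HOST:11434/api/generate" \
#   -d "$(build_payload llama3.2:1b "Classify: 'reset the lights'" 16)"
```

With stream set to false, the reply arrives as a single JSON object whose response field holds the generated text, which is convenient for scripts that only need one short answer.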

We’re testing this exact setup on a real board and will publish the measured tokens/sec here once the runs are captured: real numbers, no estimates.

Related reading

The Cheapest Way to Run a Local LLM in 2026
