
Will Schenk
Migrating to llama.cpp
April 16, 2026
Ollama made local LLMs easy, but it comes with real downsides –
it's slower than running llama.cpp directly, obscures what
you're actually running, locks models into a hashed blob store, and
trails upstream on new model support. The good news is that
llama.cpp itself has gotten very easy to use.
If you use Ollama, you probably do three things:
- ollama run / ollama chat – download a model, chat with it interactively, have it unload when you're done
- The Ollama API – point tools like Continue, aider, or Open WebUI at localhost:11434 for an OpenAI-compatible endpoint
- The Ollama desktop app – a GUI to chat with models
Here's the direct equivalent of each, and then we'll walk through setting it all up.
| Ollama | llama.cpp equivalent |
|---|---|
| ollama run gemma4 | llama-cli -hf ...:Q4_K_M -cnv |
| ollama serve (API) | llama-server -hf ... or llama-swap |
| Ollama desktop app | llama-server web UI at localhost:8080 |
| ollama list | ls ~/.cache/huggingface/hub/ or hf cache ls |
| ollama pull model | Automatic on first run with -hf |
| Modelfile for parameters | CLI flags (--temp, --ctx-size, etc.) |
| ~/.ollama/models (hashed) | ~/.cache/huggingface (readable dirs) |
| Auto-unload after idle | llama-swap with ttl |
Install llama.cpp
On macOS with Homebrew:
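A single formula gets you the whole toolkit:

```shell
brew install llama.cpp
```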
That's it. You get llama-server, llama-cli, and the rest of
the tools. Metal GPU acceleration works out of the box on Apple
Silicon.
You can also grab a pre-built binary from the releases page, or build from source:
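The source build is the project's standard CMake workflow; binaries land in build/bin:

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```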
Choosing a model and quantization
The model
We'll use Gemma 4 26B-A4B as our example. It's a Mixture-of-Experts model – 26B total parameters but only 3.8B active per token, so it runs almost as fast as a 4B model with much better quality.
| Model | Total Params | Active Params | Type | Context |
|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | 2.3B | Dense | 128K |
| Gemma 4 E4B | 8B | 4.5B | Dense | 128K |
| Gemma 4 26B-A4B | 25.2B | 3.8B | MoE | 256K |
| Gemma 4 31B | 30.7B | 30.7B | Dense | 256K |
The quantization
The rule of thumb: your model needs to fit in memory with room left over for the KV cache (which stores the conversation context). Head to unsloth/gemma-4-26B-A4B-it-GGUF on Hugging Face to see all the available sizes.
| Quant | Size | Notes |
|---|---|---|
| UD-IQ2_XXS | ~10 GB | Tight on RAM, willing to trade quality |
| UD-Q3_K_M | ~12.5 GB | Good balance for constrained systems |
| UD-Q4_K_M | ~17 GB | Best quality-per-GB sweet spot |
| UD-Q5_K_M | ~21 GB | Noticeably better than Q4 |
| UD-Q6_K | ~23 GB | Diminishing returns vs Q5 |
| Q8_0 | ~27 GB | Near-lossless |
| BF16 | ~50.5 GB | Full precision |
On this M4 Max with 64GB, we can run Q8_0 (27 GB) or even
BF16 (50.5 GB) and still have room for the full 256K context
window. For most machines, Q4_K_M at ~17 GB is the sweet spot.
Ollama only offers Q4_K_M and Q8_0. Here you get the full range from IQ2 to BF16, quantized by Unsloth with their Dynamic 2.0 method that selectively quantizes different layers to preserve quality.
Why not always max context?
The context window isn't free – it requires a KV cache in memory on top of the model weights. The bigger the context, the more RAM the KV cache uses. For a single chat session this usually doesn't matter much, but if you're running a server handling multiple requests, or running multiple models via llama-swap, you may want to cap it to leave room.
Use --ctx-size 0 to get the model's full trained context (256K
for Gemma 4 26B-A4B). Or set a specific number if you need to
budget memory.
1. Interactive chat (ollama run replacement)
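One command, using the Unsloth Q4_K_M quant from the table above – the -hf flag pulls the GGUF straight from Hugging Face:

```shell
llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M -cnv
```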
That's the whole thing. What happens:
- Downloads the model from Hugging Face if you don't have it
- Loads it into memory with Metal GPU acceleration
- Starts an interactive chat (-cnv is conversation mode)
- Frees memory immediately when you quit (Ctrl+C)
The chat template is read from the GGUF metadata – no Modelfile, no configuration. Unlike Ollama there's no background daemon; the process runs, you chat, you quit, it's gone.
Want different parameters? Just add flags:
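For example (the values here are illustrative, not recommendations):

```shell
llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M -cnv \
  --temp 0.2 \
  --ctx-size 32768
```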
Compare this to Ollama where changing temperature means creating a
Modelfile, running ollama create, and potentially copying 20+ GB
of model data.
2. API server (ollama serve replacement)
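A sketch of the server invocation, reusing the same Q4_K_M quant (--ctx-size 0 requests the model's full trained context):

```shell
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M \
  --ctx-size 0 \
  --port 8080
```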
This starts an OpenAI-compatible API on http://localhost:8080.
Point any tool at it – Continue, aider, Open WebUI, or just curl:
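A minimal request against the OpenAI-compatible endpoint (a bare llama-server serves whatever model it loaded, so no model field is needed):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```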
Or with Python:
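A stdlib-only sketch; the openai client package works the same way if you point its base_url at the server. The model name is cosmetic for a bare llama-server:

```python
import json
import urllib.request

def chat(prompt: str, base_url: str = "http://localhost:8080/v1") -> str:
    """Send one chat turn to llama-server's OpenAI-compatible endpoint."""
    payload = {
        "model": "gemma-4-26B-A4B",  # ignored by a bare llama-server
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# print(chat("Hello!"))
```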
3. Desktop chat UI (Ollama app replacement)
llama-server includes a built-in web UI. Just start the server
and open http://localhost:8080 in your browser. You get a chat
interface, no extra app to install.
Managing models
Listing downloaded models
Since llama.cpp stores models in the standard Hugging Face cache, you can just look:
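```shell
ls ~/.cache/huggingface/hub/
```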
Which gives you readable directory names:
For more detail, install the hf CLI:
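The hf command ships with the huggingface_hub package:

```shell
pip install -U "huggingface_hub[cli]"
```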
Which shows size and last access time:
Deleting models
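Deleting a model is just removing its cache directory, the same name ls showed:

```shell
rm -rf ~/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF
```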
No hashed blob filenames to decode – just readable directory names you can also browse in Finder.
Multi-model hot-swapping with llama-swap
A bare llama-server serves one model and runs until you stop it.
If you want Ollama-style behavior where you can hit one endpoint
with different model names and have them auto-load and auto-unload,
that's what llama-swap is for.
llama-swap is a lightweight Go proxy that sits in front of
llama-server. When a request comes in, it looks at the model
field, starts the right llama-server, and proxies the request.
When a request comes in for a different model, it stops the old
one and starts the new one. After the ttl expires with no
requests, the model unloads and memory is freed.
Install
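Pre-built binaries are on the project's GitHub releases page; with a Go toolchain you can also install from source (module path assumed from the GitHub repo):

```shell
go install github.com/mostlygeek/llama-swap@latest
```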
Configure
Create a config.yaml:
- ${PORT}: llama-swap assigns a free port automatically
- ttl: seconds of idle time before auto-unloading – 120 means 2 minutes. This is the Ollama auto-unload behavior.
- aliases: friendly names for API calls. Point your tools at gpt-4o-mini and it routes to your local Gemma.
Run it
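```shell
llama-swap --config config.yaml --listen :8080
```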
Now it works just like Ollama – one endpoint, multiple models:
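The same OpenAI-style request as before, except now the model field selects which server llama-swap spins up (the gpt-4o-mini alias from the config routes to the same model):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26B-A4B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```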
First request to a model takes a few seconds while weights load.
Subsequent requests are instant. After ttl with no activity, it
unloads.
llama-swap also has a web UI at http://localhost:8080/ui for
monitoring running models, viewing token metrics, and manually
loading/unloading.
Running models concurrently
By default llama-swap runs one model at a time. If you have enough
RAM, you can define a matrix to run multiple models
simultaneously. On a 64GB machine you could comfortably run two Q4
models side by side. See the configuration docs for the matrix
DSL.
Why switch?
- Performance: community benchmarks show llama.cpp running 1.5-1.8x faster than Ollama on the same hardware.
- New models immediately: GGUFs appear on Hugging Face within hours of a model release. With Ollama you wait for someone to package it for their registry.
- Full quantization range: Ollama only offers a handful of quant levels. On Hugging Face you get IQ2 through BF16.
- No lock-in: models are plain GGUF files shared with any tool.
- Chat templates just work: llama.cpp reads Jinja templates embedded in the GGUF. No Modelfile, no Go template translation.
- No background daemon: nothing running when you're not using it.
- No VC pivot: llama.cpp is MIT-licensed, community-driven, and now part of the Hugging Face ecosystem.
References
- llama.cpp on GitHub
- Friends Don't Let Friends Use Ollama – detailed history of Ollama's issues
- llama-swap – multi-model hot-swapping proxy
- Gemma 4 26B-A4B GGUF on Hugging Face (Unsloth Dynamic quants)
- Gemma 4 31B GGUF on Hugging Face