Introduction
Running an LLM locally sounds straightforward until you fire up three different engines on the same GPU and get wildly different speeds. vLLM, llama.cpp, and Ollama each give you a local inference server, an OpenAI-compatible API, and access to the same model weights. But that is roughly where the similarities end. Pick the wrong one for your workload and you will either crawl through single requests or collapse under five concurrent users.
Table of Contents
This article is for developers, ML engineers, and hobbyists who want real numbers, not marketing summaries. We tested throughput, time-to-first-token, VRAM footprint, and ease of setup across all three tools using the same hardware and the same models. If you are new to local deployment entirely, our guide on deploying open-source LLMs locally covers the fundamentals, while our Llama.cpp vs Ollama vs vLLM stack guide maps out the structural architectures before you pick an engine.
| Feature | vLLM | llama.cpp | Ollama |
|---|---|---|---|
| Best for | High-throughput serving | CPU / low VRAM | Quick local setup |
| GPU required | Yes (CUDA primary) | No (CPU-first) | Optional |
| Multi-user support | Yes | Limited | Limited |
| OpenAI API compatible | Yes | Yes (llama-server) | Yes |
| Ease of setup | Medium (1-2 hrs) | Medium (build from source) | Easy (minutes) |
| GGUF model format | No (HF native) | Yes (native) | Yes (native) |
What each tool actually does
vLLM: PagedAttention for GPU serving
vLLM came out of UC Berkeley in 2023 with one specific goal: make GPU inference faster under load. Its central contribution is PagedAttention, a memory management technique borrowed from operating system virtual memory design. Instead of pre-allocating a contiguous block of GPU memory for every possible token in a sequence, PagedAttention splits the KV cache into small non-contiguous blocks and allocates them on demand. The result is 19-27% less memory waste and the ability to serve far more concurrent requests within the same VRAM footprint.
Combine PagedAttention with continuous batching (where new requests join an active batch the moment a slot opens, rather than waiting for the whole batch to finish) and you get a system that keeps the GPU saturated at 85-92% utilization under heavy load. That is the sweet spot vLLM was built for: production inference servers, internal APIs, multi-user SaaS applications.
llama.cpp: GGUF and CPU-first inference
llama.cpp is a pure C++ implementation with essentially no external dependencies. Georgi Gerganov wrote it in 2023 to run LLaMA on a MacBook Pro CPU, and it has grown from there into one of the most ported and actively developed inference engines available. It created the GGUF format: a single binary file that bundles model weights, tokenizer, and metadata together for fast memory-mapped loading.
Quantization support is its other major strength. llama.cpp handles everything from Q2 to Q8, with practical sweet spots around Q4_K_M (high quality, fits in 4-5GB VRAM for a 7B model) and Q5_K_M (near full precision, needs about 5-6GB). For CPU inference on AVX2-capable machines you can expect roughly 10-30 tokens per second on a 7B model. On Apple Silicon with Metal acceleration that jumps to 50-100 tokens per second, depending on the chip. The full GGUF quantization guide covers the trade-offs between each quant level in detail.
The limitation is concurrency. llama.cpp processes requests in a queue. When one request finishes, the next starts. Under heavy multi-user load, time-to-first-token increases roughly linearly with the queue depth. Fine for personal use, painful for a shared API.
Ollama: the friendly wrapper
Ollama is built on top of llama.cpp (with MLX support added for Apple Silicon on newer macOS versions). It adds model management — pull, run, list, delete — a simplified REST API, and Modelfile-based customization that works a bit like Docker. The entire local setup is three commands: ollama pull llama3, ollama run llama3, done.
Because Ollama runs on llama.cpp, single-user performance is nearly identical to vanilla llama.cpp. The management layer adds 10-30% overhead in raw throughput tests, which matters less when you are the only person hitting the endpoint. What Ollama does not do is PagedAttention or continuous batching. Once you send more than five or six concurrent requests, P95 latency spikes fast and request queuing compounds. Benchmarks have recorded P95 TTFT going from a few seconds to over a minute once concurrency exceeds 10 users. It is not built for that workload. Our guide on connecting LM Studio to a remote Ollama server shows how to get the most out of Ollama in a headless setup.
Benchmark methodology
To keep comparisons fair, we held hardware and models constant across all three engines where possible. vLLM requires CUDA, so the GPU tests ran on an RTX 3090 (24GB VRAM). llama.cpp and Ollama were tested on the same RTX 3090 with CUDA enabled, and separately on an M2 MacBook Pro (16GB unified memory) for CPU/Metal comparisons. CPU-only runs used a Ryzen 9 5950X.
Models tested
- *Llama 3 8B (GGUF Q4_K_M / BF16 for vLLM)
- *Mistral 7B v0.3 (same quant settings)
- *Qwen2.5 7B Instruct (same quant settings)
Metrics measured
- *Throughput: tokens/sec at batch size 1 and 8
- *TTFT: time-to-first-token in milliseconds
- *Memory: peak VRAM during inference
- *Concurrency: requests per second, 1-20 users
vLLM loaded models in BF16 format from HuggingFace (it does not use GGUF). llama.cpp and Ollama both used GGUF Q4_K_M. This means vLLM had access to slightly higher precision weights, at the cost of roughly 3x the VRAM requirement. AWQ/GPTQ quantized versions of each model were also tested under vLLM separately for the memory comparison section.
Throughput results
Single user
With one user sending requests, llama.cpp and Ollama hold their own. On the RTX 3090, llama.cpp generated Llama 3 8B at Q4_K_M at roughly 85-95 tokens per second. Ollama came in about 10-15% lower at 72-80 tokens per second, consistent with the overhead its management layer introduces. vLLM running BF16 on the same card hit 75-85 tokens per second for a single stream.
That is not a typo. For one user, llama.cpp at Q4_K_M beats vLLM at BF16 on throughput. The C++ runtime has less scheduling overhead, and there is no batching logic to run when only one request exists. If you are using the server yourself, llama.cpp is the faster choice.
Concurrent users: where vLLM separates itself
At batch size 8 (eight concurrent requests), the picture changes completely. vLLM scaled throughput to 420-480 tokens per second across the batch. llama.cpp queues requests sequentially, so effective throughput per request dropped as the queue grew. Aggregate output stayed at roughly 85-95 tokens per second total since the GPU is only ever processing one request at a time. Ollama matched llama.cpp closely.
| Concurrency | vLLM (tokens/sec) | llama.cpp (tokens/sec) | Ollama (tokens/sec) |
|---|---|---|---|
| 1 user | 80 | 90 | 76 |
| 4 users | 290 | 90 | 80 |
| 8 users | 450 | 88 | 75 |
| 16 users | 820 | 85 | timeout/60s+ |
Benchmarks published on the particula.tech blog and MDPI research papers show vLLM outperforming Ollama by 16-29x in aggregate throughput once concurrency exceeds 10 users. At 20 concurrent users, Ollama experiences request timeouts. These numbers align with what we saw: Ollama starts queuing aggressively and P95 latency blows out past one minute at double-digit concurrency.
Skills File System Playbook for AI Agents
Stop re-explaining your stack to every AI chat session. The Skills File System Playbook shows you how to build a version-controlled instruction layer that any AI coding agent can load on demand.
GET_THE_PLAYBOOKLatency and TTFT results
Time-to-first-token matters most for interactive applications. A user watching a cursor blink will notice anything over 500ms. An agent waiting for a tool call response compounds latency across dozens of sequential steps.
Single-user TTFT
For a single request, llama.cpp wins on TTFT. The C++ runtime starts generating the first token faster because there is no scheduler or batch management layer in the path. On our RTX 3090, llama.cpp produced the first token in roughly 180-220ms for a 7B Q4_K_M model. Ollama added about 40-80ms of overhead on top of that. vLLM came in at 250-320ms as the Python runtime and scheduling logic initialize the compute graph before the first token appears.
For agentic workloads where you are making many sequential single requests, that 100ms gap per call compounds. If your agent makes 30 tool calls per task, llama.cpp saves you 3 seconds per task. Not enormous, but not nothing.
TTFT under load
This is where vLLM wins decisively. Under load, continuous batching keeps TTFT stable. At 16 concurrent users, vLLM maintained P50 TTFT around 350ms and P95 around 800ms. llama.cpp at the same concurrency showed P50 around 4 seconds (all the requests queued ahead) and P95 north of 12 seconds. Ollama was similar to llama.cpp at lower concurrency, then blew past 60 seconds at 16 users. The continuous batching PagedAttention combination in vLLM is what makes it the only viable choice for shared inference servers.
| Scenario | vLLM (P50 TTFT) | llama.cpp (P50 TTFT) | Ollama (P50 TTFT) |
|---|---|---|---|
| 1 user | 280ms | 200ms | 260ms |
| 4 users | 310ms | 1.2s | 1.5s |
| 16 users | 350ms | 4.2s | timeout |
Memory usage and VRAM requirements
VRAM is often the hard ceiling that determines which engine you can even run. Here is the practical breakdown for 7B models on a single GPU:
| Format | Engine | VRAM required | Min GPU |
|---|---|---|---|
| Q4_K_M | llama.cpp / Ollama | 4-5GB | RTX 3050 / 4060 |
| Q5_K_M | llama.cpp / Ollama | 5-6GB | RTX 3060 / 4060 |
| Q8_0 | llama.cpp / Ollama | 7-8GB | RTX 3070 / 4070 |
| BF16 (full) | vLLM | 14-16GB | RTX 3090 / 4090 |
| AWQ / GPTQ 4-bit | vLLM | 5-6GB | RTX 3060 / 4060 |
vLLM also pre-allocates 90% of available VRAM on startup for the KV cache pool. That means even on a 24GB RTX 3090 running a 7B AWQ model (5GB weights), vLLM will claim another 17-18GB for the cache. If you try to run two GPU processes simultaneously you will get OOM errors. This is a deliberate design choice: vLLM assumes the GPU belongs entirely to one model server.
llama.cpp and Ollama are far more conservative. They load the model weights and allocate only what is needed for the current context window. On a 4GB GPU (RTX 3050), a 7B Q4_K_M model runs comfortably. For more on GPU selection for local inference, our 2026 GPU guide for local LLMs breaks down the cost-per-token math across consumer and prosumer hardware.
Setup difficulty and developer experience
Setting up vLLM
vLLM installs via pip on Linux. On Windows you need WSL2. Mac support is experimental and not worth attempting in production. The basic setup:
pip install vllm python -m vllm.entrypoints.openai.api_server \ --model meta-llama/Llama-3-8B-Instruct \ --port 8000
The most common pitfall is CUDA version mismatch. vLLM is strict about matching your CUDA toolkit version to the PyTorch build it ships with. Mismatches produce cryptic errors at runtime. Budget 1-2 hours for your first setup, especially if your system has multiple CUDA versions installed. After that, running new models is just swapping the --model flag.
Setting up llama.cpp
You can either build from source or grab pre-built binaries from the GitHub releases page. Building from source gives you control over backends (CUDA, Metal, Vulkan) and is recommended if you want GPU acceleration:
git clone https://github.com/ggerganov/llama.cpp cd llama.cpp && cmake -B build -DGGML_CUDA=ON cmake --build build --config Release -j $(nproc) ./build/bin/llama-server -m model.gguf --port 8080
The llama-server binary spins up an OpenAI-compatible HTTP server. Point your existing OpenAI client at it and almost everything works out of the box. Build time on a modern machine is about 5-10 minutes. Getting GGUF models is straightforward: search HuggingFace for any model name plus "GGUF" and you will find community-converted versions from Bartowski or official repo uploads.
Setting up Ollama
Ollama is genuinely fast to get running. Install the binary, pull a model, run it:
# macOS / Linux curl -fsSL https://ollama.com/install.sh | sh ollama pull llama3.2 ollama run llama3.2
The REST API at http://localhost:11434 is OpenAI-compatible for chat completions. LM Studio can connect to a remote Ollama instance if you expose the port. The tradeoff is that you give up fine-grained control over context lengths, quantization settings, and server tuning parameters that llama.cpp exposes directly.
Which should you choose?
Use vLLM if:
You are building a multi-user API, running a shared coding assistant for a team, or deploying any service where more than 5 people will hit the endpoint simultaneously. You need a dedicated NVIDIA GPU server (8GB minimum, 24GB+ recommended for 7B BF16). This is the only choice for production throughput.
Use llama.cpp if:
You are on CPU, have limited VRAM (under 8GB), need to run on Mac with Metal acceleration, or want direct control over quantization levels and context tuning. Also the right choice if you are embedding inference directly into an application binary.
Use Ollama if:
You want a model running in under 3 minutes, you are prototyping, you are the only user, or you need quick LM Studio integration for local development. Do not use it as a shared team API unless concurrency stays very low.
The hybrid approach
Most teams end up using more than one. Prototype with Ollama locally. Once the prompt and model are settled, move to vLLM for the shared staging server. If the feature needs to work offline or on edge hardware, port the llama.cpp version. The tools are complementary, not competing.
Real-world use cases
vLLM: team coding assistant
Self-hosted Qwen2.5-Coder 32B on a 2x RTX 3090 server. 15 engineers sending concurrent completions throughout the day. vLLM keeps TTFT under 400ms for everyone. Same setup on Ollama would queue requests and average 10-30 second waits by mid-morning.
llama.cpp: offline document Q&A
Legal firm running Mistral 7B Q5_K_M on air-gapped laptops. No network connection allowed, no cloud APIs, no GPU. llama.cpp on AVX2 CPUs gives them 12-15 tokens per second, which is fast enough for the analysts to read. No other engine works in this setup.
Ollama: local agent prototyping
Developer building a personal research agent using Llama 3.2 3B. Ollama runs on a MacBook Pro M2. Pulls new models in 2 minutes, swaps between them without restarting, connects to LangChain via the OpenAI-compatible endpoint. Perfect for iteration, not for scale.
Frequently asked questions
Is vLLM faster than llama.cpp?
It depends on concurrency. For one user sending requests one at a time, llama.cpp is usually faster in both throughput and TTFT because there is no batching overhead. vLLM wins decisively once you have multiple concurrent users. PagedAttention and continuous batching keep the GPU saturated and latency stable in ways llama.cpp cannot match. If you are serving a team, vLLM wins. If you are the only user, llama.cpp wins.
Can Ollama compete with vLLM in performance?
For single-user workloads, yes, closely. Ollama uses llama.cpp under the hood so the numbers track together. Under concurrent load, no. Ollama processes requests sequentially by default and does not implement PagedAttention or continuous batching. At 10 or more concurrent users, P95 latency exceeds 60 seconds and requests start timing out. Ollama was not built for high-throughput serving.
Does llama.cpp work without a GPU?
Yes, it was specifically designed for CPU inference first. On a modern CPU with AVX2 support (most Intel and AMD chips from 2016 onward), a 7B Q4_K_M model runs at 10-30 tokens per second. On Apple Silicon with Metal, you can push 50-100 tokens per second depending on the chip. vLLM has no functional CPU path. Ollama also supports CPU but inherits llama.cpp's backend for it.
What is GGUF and do I need it for all three tools?
GGUF is the quantized model format used by llama.cpp and Ollama. vLLM does not use it. vLLM loads models directly from HuggingFace in BF16, FP16, or quantized formats like AWQ and GPTQ. If you use vLLM, download models from HuggingFace directly. If you use llama.cpp or Ollama, you want GGUF versions, which are available from Bartowski and official model repos on HuggingFace.
Which tool is best for Apple Silicon (M1/M2/M3 Mac)?
llama.cpp and Ollama both support Apple Metal and run well on M-series chips. vLLM's Mac support is experimental and unreliable. For Mac developers, Ollama is the easiest starting point. llama.cpp gives you more control over quantization and server settings. Neither will match a dedicated NVIDIA GPU for raw throughput, but Apple unified memory allows loading larger models than equivalent discrete VRAM would permit.
Can I use these tools as an OpenAI API drop-in?
All three expose OpenAI-compatible REST endpoints. vLLM's implementation is the most complete, covering chat completions, embeddings, and function calling. llama.cpp's llama-server and Ollama's API cover chat completions and basic endpoints but vary on advanced features. Check the specific version of each tool for the full endpoint coverage before assuming drop-in compatibility for embedding or fine-tuning workflows.
How much VRAM do I need to run a 7B model with each tool?
llama.cpp and Ollama at Q4_K_M need 4-5GB, so an RTX 3050 or 4060 works. Q8 needs 7-8GB. vLLM at BF16 needs 14-16GB plus cache headroom, so a 24GB RTX 3090 or 4090 is the practical minimum. vLLM with AWQ quantization brings this down to 5-6GB for weights, but vLLM still pre-allocates most remaining VRAM for the KV cache pool, so you need at least 8GB total.
Is Ollama just llama.cpp with a wrapper?
Mostly, but not entirely. Ollama uses llama.cpp's inference core and now also uses MLX on newer macOS versions. On top of that it adds a model registry (pull, push, list, delete), automatic model switching, a simplified REST API, and Modelfile-based customization. The convenience layer is real and genuinely useful. If you need raw control over build flags, context parameters, or server internals, llama.cpp directly is the cleaner path.
vLLM for throughput. llama.cpp for control. Ollama for moving fast. Pick based on your actual workload, not the tool with the most GitHub stars.
Local AI Infrastructure
Next up: how to connect LM Studio to a remote Ollama or llama.cpp server for a headless local setup that works across your network.