In about five minutes you’ll have a real language model running entirely on your own machine — no API key, no cloud account, no per-token bill, nothing leaving your network. Three commands get you there. The tricky part isn’t installation; it’s picking a model that actually fits your hardware. Get that wrong and you’ll wait thirty seconds for each response while your laptop fan screams. Get it right and local AI feels surprisingly usable.
This guide is the fastest correct path from zero to a working local LLM, plus the one sizing mistake that trips up almost everyone who tries this for the first time.
What do you need to run a local LLM with Ollama?
RAM is the real constraint, not CPU speed or OS. A rough rule: you need slightly more available RAM than the model file size. The defaults Ollama downloads are 4-bit quantized (Q4) versions — compressed representations that cut file sizes dramatically while preserving most of the useful capability. According to the Real Python Ollama guide, the default pull is a 4-bit quant, which keeps downloads manageable for most hardware.
Matching model to machine:
| Model size | Download size (Q4) | RAM/VRAM you need | Practical use |
|---|---|---|---|
| 1–3B | ~1–2 GB | 8 GB | Summaries, quick Q&A, low-end laptops |
| 7–8B | ~4–5 GB | 16 GB | The everyday sweet spot — capable and fast |
| 13–14B | ~8 GB | 24 GB | Stronger reasoning if you have the headroom |
| 30B+ | 18 GB+ | 24 GB+ GPU VRAM | Serious work, needs dedicated hardware |
When in doubt, start one tier below your instinct. A model that fits in RAM and responds in two seconds beats a larger one that thrashes disk and takes forty. You can always step up.
Beyond RAM: you need 5–10 GB of free disk for your first model or two, and a Mac, Windows, or Linux machine. No GPU is required — models run on CPU, just slower. If you do have a supported GPU, Ollama uses it automatically (more on that below).
Step 1: Install Ollama
Go to ollama.com/download and grab the installer for your OS. On Mac and Windows, the installer sets up a background service automatically — you don’t need to think about it. On Linux, the same page offers a one-liner:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, verify it:
ollama --version
If you see a version number, you’re good. On Linux (and on Mac if you’re running headless), you may need to start the server manually:
ollama serve
This launches the local server that listens on http://localhost:11434. On Mac and Windows with the desktop app running, it’s already up.
Step 2: Pull a model sized for your hardware
Now download a model. Start with the 3B parameter version of Llama 3.2 — it’s fast, requires only 8 GB of RAM, and is a reasonable test of what local AI feels like:
ollama pull llama3.2:3b
The :3b tag selects the 3-billion-parameter size. The download is around 2 GB. While it downloads, you can browse what else is available at ollama.com/search — the library covers Llama, Mistral, Qwen, Phi, Gemma, and many others. The SitePoint 2026 setup guide confirms that Ollama’s default pull always grabs the 4-bit quantized version unless you specify otherwise, which is the right default for almost everyone.
To see what you’ve downloaded and how large each model is:
ollama list
Step 3: Start chatting
Launch an interactive session:
ollama run llama3.2:3b
Type your prompt and press Enter. The model responds in the terminal. That’s it. To exit, type /bye.
The first response after a reboot takes a few extra seconds — the model is loading into memory. After that, it stays resident and subsequent prompts are fast. You can check what’s currently loaded with ollama ps.
If you’re on a machine with a compatible NVIDIA or AMD GPU (or Apple Silicon), Ollama detects it automatically and uses it. You don’t configure anything. The difference is dramatic: an 8B model that takes 10–15 seconds per response on CPU can drop to under a second on a mid-range GPU.
Step 4: Hit the built-in API
Ollama isn’t just a terminal chatbot — it exposes a local HTTP endpoint that any script or tool can call. With the server running:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Explain MCP in one sentence."
}'
This is what makes Ollama genuinely useful as infrastructure: point your own Python scripts, a local chat UI, or a coding assistant at localhost:11434 and you have a private model backend. The API also supports an OpenAI-compatible endpoint (/v1/chat/completions), which means many tools built for the OpenAI API work against Ollama without modification — you change the base URL and model name, and that’s it.
What does “Q4” mean, and does quantization matter?
If you’ve seen model tags like llama3.2:3b-instruct-q4_K_M and wondered what that means: quantization compresses a model’s weights from high-precision floats (32-bit or 16-bit) down to 4-bit integers. That cuts memory requirements by roughly 4–8x with a relatively small quality penalty. Q4_K_M is the community’s current favorite for balancing size and quality — it’s what Ollama downloads by default when you don’t specify a tag.
If you have extra RAM and want better output, Q8 (8-bit) is a noticeable step up in quality at roughly double the memory cost. If you’re on a very constrained machine, Q2 or Q3 cuts further but quality degrades more noticeably. For most use cases, Q4_K_M is the right answer and you don’t need to touch it.
One non-obvious point: a well-quantized smaller model often outperforms a poorly-quantized larger one. A 7B at Q4 can beat a 13B at Q2 in practice. When evaluating model options, size and quant together determine the tradeoff — neither alone tells the whole story. For a deeper look at how open-weight models differ, see the open-source AI hub.
Why is Ollama slow, and how do you fix it?
This is the most common question after getting things running. Skip ahead to the fix that matches your situation.
“Connection refused” on the API. The server isn’t running. On Linux and macOS (headless), start it with ollama serve. On Mac/Windows, relaunch the Ollama desktop app.
Responses are painfully slow. Either the model is too large for your available RAM and is paging to disk, or you’re running entirely on CPU with a larger model. Drop to a smaller model (:3b instead of :8b) or confirm your GPU is being used with ollama ps — it shows whether a model is loaded on CPU or GPU.
Out of memory / it crashes. The model doesn’t fit. Go one size smaller. RAM/VRAM is almost always the cause. There’s no configuration trick to make a 13B model run on 8 GB of RAM; the math doesn’t work.
Weak, shallow answers. A 3B model is not frontier-class and shouldn’t be expected to perform like one. If your hardware supports it, pull the 8B version of the same model family — the difference is substantial. For coding tasks, a model tuned for code (like qwen2.5-coder:7b) will outperform a general-purpose model of the same size.
Model feels stale. Model authors push updated weights under the same name tag. Re-pulling fetches the latest version:
ollama pull llama3.2:3b
The commands you’ll actually use
ollama pull <model> # Download a model (e.g. ollama pull qwen3:8b)
ollama run <model> # Start an interactive chat session
ollama list # Show downloaded models and their sizes
ollama ps # Show which models are loaded in memory right now
ollama rm <model> # Delete a model (they add up — each is several GB)
ollama serve # Start the background server manually if needed
That’s the complete day-to-day toolkit. The only maintenance habit worth building: run ollama list every few weeks and ollama rm the experiments you’ve abandoned. A handful of large models accumulates quietly into tens of gigabytes.
Connecting a UI or external tool
If you’d rather not live in the terminal, several open-source front-ends connect to Ollama’s API at localhost:11434 out of the box. They give you a browser-based chat window backed entirely by your local model — no cloud dependency at any layer. The same endpoint is what scripts, coding tools, and notebook integrations use.
This is where Ollama’s architecture earns its keep. Running a model locally used to mean assembling a Python environment, wrangling CUDA drivers, converting model weights, and writing your own inference loop.
Ollama reduces all of that to a package manager abstraction: pull a named model, run it, hit an HTTP endpoint. The complexity hasn’t disappeared — but it’s now Ollama’s problem to manage rather than yours.
That shift is why it became the default tool for local LLM work across Mac, Windows, and Linux in roughly two years. If you’re choosing between open-weight models to run this way, the open-source AI hub tracks what’s worth pulling in 2026.
For what’s worth running on your hardware in 2026, see the open-source model roundup — the field moves fast and the rankings shift as new architectures land.
Before you call it done
- [ ]
ollama --versionreturns a version number - [ ] Server is running (
ollama serveor the desktop app) - [ ] A model is pulled that fits your RAM (
ollama pull llama3.2:3b) - [ ]
ollama run llama3.2:3bstarts an interactive session and responds - [ ]
/byeexits cleanly - [ ] (Optional) The API responds at
http://localhost:11434/api/generate
Bottom line
Pull llama3.2:3b, run it, ask it something you’d normally send to a cloud model, and judge the result. Most people who try it land on a 7–8B model as their daily driver once they’ve confirmed their hardware can carry it. If the quality gap versus a cloud model is too wide, step up one tier — but first check whether the model you’re running is actually sized for your RAM. That single decision determines 80% of the experience.
Local AI with Ollama is not a research project in 2026. It’s three commands and a five-minute test to find out whether it fits your workflow.
Last verified June 13, 2026 against the Ollama download page, Real Python’s Ollama guide, and the SitePoint 2026 setup guide.
