Not every AI task needs to go through a paid API. For internal tools, high-volume processing, or anything where privacy matters, running a model locally is often the better choice. Ollama makes this straightforward — a single install, a one-line command to pull a model, and you're running inference on your own machine.
What Ollama is
Ollama is a tool for managing local LLMs. It downloads models, runs them behind a local server, and exposes a simple REST API. Alongside its native endpoints (/api/generate and /api/chat), it serves an OpenAI-compatible API under /v1, which means you can swap it into most existing OpenAI-client code with minimal changes.
It runs well on Linux, macOS, and Windows. If you're on WSL2, there's one extra step to route requests correctly, covered below.
Install and pull a model
# Install Ollama (Linux install script; on macOS and Windows, use the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model (llama3.1:8b is a solid general-purpose choice)
ollama pull llama3.1:8b
# Run it interactively to verify
ollama run llama3.1:8b
The 8B model is about 5GB and runs fine on most modern hardware with 8GB+ RAM. If you need something smaller, llama3.2:3b is roughly 2GB and still capable for summarization and classification tasks.
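WSL2 note: if Ollama is installed on the Windows side, WSL2 can reach it via the Windows host IP. Run export OLLAMA_HOST=http://$(grep nameserver /etc/resolv.conf | awk '{print $2}'):11434 before your scripts, or add that line to your .bashrc. The ollama CLI reads this variable; the Python examples below hardcode localhost, so update their URLs to the same address. If the connection is refused, Ollama on Windows is probably listening only on 127.0.0.1, and setting OLLAMA_HOST=0.0.0.0 in the Windows environment should make it reachable from WSL2.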
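The Python examples in this post hardcode localhost, so on WSL2 it can help to build the URLs from OLLAMA_HOST instead. Here is a minimal sketch of that idea. The helper name ollama_base_url and the constants are my own, and it assumes OLLAMA_HOST holds either a full URL or a bare host:port.

```python
import os

DEFAULT_HOST = "http://localhost:11434"  # Ollama's default local address

def ollama_base_url(env=None):
    """Return the Ollama base URL, honoring OLLAMA_HOST when it is set."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip() or DEFAULT_HOST
    if "://" not in host:
        host = "http://" + host  # OLLAMA_HOST may be given without a scheme
    return host.rstrip("/")

# Build endpoint URLs once, instead of hardcoding localhost in every script
GENERATE_URL = ollama_base_url() + "/api/generate"
CHAT_URL = ollama_base_url() + "/api/chat"
```

With this in place, the same script runs unchanged on plain Linux (no variable set) and on WSL2 (variable exported in .bashrc).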
Calling Ollama from Python
The examples here use Ollama's native endpoints, /api/generate and /api/chat. The simplest way to call them from Python is with the requests library; no extra SDK is needed.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(prompt, model="llama3.1:8b"):
    # stream=False returns one JSON object instead of a stream of chunks
    response = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False
    }, timeout=120)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()["response"]

result = ask("Summarize this in one sentence: Ollama runs LLMs locally.")
print(result)
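The example above sets "stream" to False, so the whole answer arrives at once. With streaming on, Ollama sends newline-delimited JSON objects, each carrying a piece of text, and a final one marked "done". The following is a rough sketch of reading that stream. The function names iter_text and stream_generate are mine, and the URL assumes the default local address.

```python
import json

def iter_text(lines):
    """Yield text pieces from an iterable of NDJSON lines (bytes or str)."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        # /api/generate puts text in "response"; /api/chat uses message.content
        text = chunk.get("response") or chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            return  # final object reached

def stream_generate(prompt, model="llama3.1:8b",
                    url="http://localhost:11434/api/generate"):
    """Stream a completion from a local Ollama server, piece by piece."""
    import requests  # imported here so iter_text works without it
    body = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=body, stream=True, timeout=120) as response:
        response.raise_for_status()
        yield from iter_text(response.iter_lines())
```

Usage would look like: for piece in stream_generate("Tell me a joke"): print(piece, end="", flush=True). Streaming mostly matters for interactive tools; for batch jobs the non-streaming call is simpler.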
If you prefer the chat format (system + user messages), use /api/chat instead:
def chat(system, user, model="llama3.1:8b"):
    response = requests.post("http://localhost:11434/api/chat", json={
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user}
        ]
    }, timeout=120)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()["message"]["content"]
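The same system + user exchange can also go through Ollama's OpenAI-style route, /v1/chat/completions, which is what makes swapping it into existing OpenAI-shaped code possible. This is a sketch under a few assumptions: the default address localhost:11434, the llama3.1:8b model from this post, and helper names (build_payload, extract_text, chat_openai_style) that I made up for illustration. The post argument exists only so the function can be exercised without a running server.

```python
# Assumed default local address; /v1/chat/completions is the OpenAI-style route
OPENAI_STYLE_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(user, system=None, model="llama3.1:8b"):
    """Build an OpenAI-style chat completions request body."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    return {"model": model, "messages": messages}

def extract_text(body):
    """Pull the assistant text out of an OpenAI-style response body."""
    return body["choices"][0]["message"]["content"]

def chat_openai_style(user, system=None, model="llama3.1:8b", post=None):
    """Send one chat request; `post` defaults to requests.post."""
    if post is None:
        import requests  # imported lazily so the helpers above work without it
        post = requests.post
    response = post(OPENAI_STYLE_URL,
                    json=build_payload(user, system, model), timeout=120)
    response.raise_for_status()
    return extract_text(response.json())
```

Note the response shape differs from the native endpoint: the text lives under choices[0].message.content rather than message.content.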
A practical use: document classifier
Say you have a folder of text files (support tickets, notes, whatever) and you want to sort them into categories automatically. A local model suits this well: the task is simple, and the volume can be high enough that per-request API costs would add up.
import os

CATEGORIES = ["billing", "technical", "general", "urgent"]

def classify(text):
    prompt = f"""Classify this message into exactly one category.
Categories: {", ".join(CATEGORIES)}
Reply with only the category name, nothing else.
Message: {text}"""
    return ask(prompt).strip().lower()

folder = "inbox/"
for filename in sorted(os.listdir(folder)):
    path = os.path.join(folder, filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    with open(path, encoding="utf-8") as f:
        content = f.read()
    category = classify(content)
    print(f"{filename} → {category}")
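Small models don't always follow "reply with only the category name" to the letter; you may get "Billing." or "Category: urgent" back. One way to guard against that is to map the raw reply onto the known categories and fall back to a default when it doesn't match. This is a sketch only: normalize_category and the "general" fallback are my own choices, not part of Ollama.

```python
CATEGORIES = ["billing", "technical", "general", "urgent"]

def normalize_category(raw, categories=CATEGORIES, fallback="general"):
    """Map a free-form model reply onto one of the known categories."""
    # Trim whitespace, case, and common wrapping punctuation
    cleaned = raw.strip().lower().strip(".\"'`*: ")
    if cleaned in categories:
        return cleaned
    # Accept replies like "Category: billing" when exactly one name appears
    # (plain substring match, so this is deliberately simple)
    mentioned = [c for c in categories if c in cleaned]
    if len(mentioned) == 1:
        return mentioned[0]
    return fallback  # ambiguous or unrecognized reply
```

In the loop above, this would wrap the model call: category = normalize_category(ask(prompt)).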
When to use Ollama vs a paid API
There's no universal answer, but the tradeoffs are clear:
- Use Ollama when volume is high, privacy matters, or you're processing data that shouldn't leave your machine
- Use Claude or GPT when quality is critical, context windows need to be large, or the task needs strong reasoning
- Use both in a hybrid setup — Ollama for fast/cheap preprocessing, Claude for the final step where quality matters
The hybrid pattern is what I use in most projects. A local model handles triage and filtering; a cloud model handles the final output. Costs stay low, quality stays high.
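In code, the hybrid pattern comes down to a cheap local pass over everything, followed by a cloud call only for the items that need it. Here is a schematic version, with the two model calls passed in as plain functions. Both triage and finalize are hypothetical placeholders (for example, a local Ollama classifier and a cloud API wrapper), and the keep labels are just an illustration.

```python
def hybrid_process(items, triage, finalize, keep=("urgent", "technical")):
    """Run a local triage pass, then a cloud pass on the items that qualify.

    triage(text) -> label    hypothetical: e.g. a local Ollama classifier
    finalize(text) -> str    hypothetical: e.g. a call to a cloud model
    """
    results = []
    for text in items:
        label = triage(text)  # cheap, runs locally on every item
        if label in keep:
            # Only these items incur cloud cost
            results.append((text, label, finalize(text)))
        else:
            results.append((text, label, None))  # handled locally
    return results
```

Passing the model calls in as functions keeps the routing logic testable without either model running.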
Running Ollama as a service
On Linux, Ollama installs as a systemd service automatically. To confirm it's running:
systemctl status ollama
# Start it if not running (needs root)
sudo systemctl start ollama
# Enable on boot
sudo systemctl enable ollama
Once it's a background service, your Python scripts can call it at any time — no need to manually start Ollama before running your automations.
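For unattended automations, it can still be worth checking that the server is up and the model has been pulled before starting a long job. Ollama lists local models at GET /api/tags, so a quick readiness check might look like the following. The names installed_models and ollama_ready are my own, the address is the assumed default, and the get argument is there only to allow testing without a server.

```python
def installed_models(tags_body):
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_body.get("models", [])]

def ollama_ready(model="llama3.1:8b", base_url="http://localhost:11434", get=None):
    """Return True if the server answers and `model` has been pulled."""
    if get is None:
        import requests  # imported lazily; only needed for the real call
        get = requests.get
    try:
        response = get(base_url + "/api/tags", timeout=5)
        response.raise_for_status()
    except OSError:
        # requests' exceptions derive from OSError (connection refused,
        # timeout, HTTP error status)
        return False
    return model in installed_models(response.json())
```

A script could call ollama_ready() at startup and exit with a clear message instead of failing halfway through a batch.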
Questions about a specific use case? Drop me a message — I'm happy to talk through the tradeoffs.