Ollama Tutorial: Run LLMs Locally on Your Computer (Complete Setup Guide)
Ollama tutorial — complete guide to running LLaMA, Mistral, and Phi locally on Mac, Windows, and Linux with zero cloud costs, privacy, and OpenAI-compatible API setup.
The first time I ran Mistral 7B on my laptop — no internet connection, no API key, no cost per query — and got a coherent, useful answer in under five seconds, something clicked. Local AI isn't a compromise. For many use cases, it's genuinely better than cloud APIs.
No privacy concerns for sensitive data. No surprise bills. No latency from network requests. No vendor dependency. And with quantization making models smaller without destroying quality, an $800 GPU now runs models that were research-only two years ago.
Ollama is the tool that made all of this accessible. This guide takes you from installation to building a local AI pipeline, with the practical details that make it actually useful.
Installation
macOS (Recommended for Apple Silicon)
# Download from ollama.com or use Homebrew
brew install ollama
# Start Ollama service
ollama serve
# Or download the Mac app (installs as menubar app)
# https://ollama.com/download/mac
Linux
# One-line install
curl -fsSL https://ollama.com/install.sh | sh
# Start service
systemctl start ollama # systemd
# Or run manually
ollama serve
Windows
Download the installer from ollama.com. Once installed, Ollama runs in the background and serves the same local API. Alternatively:
# winget
winget install Ollama.Ollama
# Or download from ollama.com/download/windows
Verify Installation
ollama --version
# ollama version 0.3.x
# Test with a quick run
ollama run phi3 "What is 2+2? Answer in one sentence."
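If you would rather verify the server from code than from the CLI, a small check like the one below can help. It is a minimal sketch using only the standard library, and it assumes the server is on the default address and answers its root URL with the plain-text message "Ollama is running". The function name `is_ollama_running` is made up for this example.

```python
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server appears to answer at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers malformed URLs.
        return False
    # Assumption: the root endpoint replies with this plain-text banner.
    return status == 200 and "Ollama is running" in body
```

Calling `is_ollama_running()` before sending prompts lets a script fail early with a clear message instead of a connection traceback.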
Downloading and Running Models
# Pull a model (downloads, doesn't run yet)
ollama pull llama3.1
# Pull specific size
ollama pull llama3.1:70b # 70B (needs 40GB VRAM)
ollama pull llama3.1:8b # 8B (needs 8GB RAM)
ollama pull llama3.1:405b-fp16 # Full precision 405B (research only)
# Run models interactively
ollama run llama3.1 # Chat mode
ollama run mistral
ollama run phi3
ollama run codellama # Coding specialist
ollama run gemma2:9b
# One-shot query (not interactive)
ollama run llama3.1 "Explain transformer attention in one paragraph"
# List downloaded models
ollama list
# Remove a model (free disk space)
ollama rm mistral
# Show model details
ollama show llama3.1
Model Recommendations by Hardware
8GB RAM (CPU only):
ollama pull phi3:mini # 3.8B, fast, good quality
ollama pull gemma2:2b # Google Gemma 2, excellent for size
8-16GB RAM / 8GB VRAM:
ollama pull llama3.1:8b # Best 8B general model
ollama pull mistral # Great for commercial use (Apache 2.0)
ollama pull codellama # Coding tasks
24GB VRAM (RTX 3090/4090):
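ollama pull llama3.1:70b # Near GPT-3.5 quality; needs ~40GB, so on 24GB some layers run on CPU (slower)
ollama pull mixtral:8x7b # Mixtral MoE, very capable; may also need partial CPU offload on 24GB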
Apple Silicon (M1/M2/M3, 16GB+):
ollama pull llama3.1:8b # Very fast on Metal GPU
ollama pull llama3.1:70b # M2 Ultra 192GB can handle this
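The memory figures in these recommendations follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough rule of thumb, not Ollama's own accounting. It assumes about 4 bits per weight (typical of quantized default tags) and a hypothetical 20% allowance for the KV cache and runtime buffers.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_memory(
    params_billions: float,
    bits_per_weight: float,
    memory_gb: float,
    overhead_fraction: float = 0.2,  # assumed allowance, not a measured value
) -> bool:
    """Rough check: weights plus an overhead allowance must fit in memory_gb."""
    needed_gb = estimate_weights_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed_gb <= memory_gb
```

For example, `estimate_weights_gb(70, 4)` gives 35.0, which is why a 70B model is listed at around 40GB and does not fit entirely in a 24GB card, while `estimate_weights_gb(8, 4)` gives 4.0, comfortably inside 8GB.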
Using the Ollama REST API
Ollama serves a REST API on http://localhost:11434:
# Generate endpoint (simple completion)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# Chat endpoint (conversation)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "What is machine learning?"}
  ],
  "stream": false
}'
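Both requests above set "stream": false. If you leave streaming on, the chat endpoint sends one JSON object per line, each carrying a fragment of the reply in `message.content`, and marks the last one with `"done": true`. Here is a minimal sketch of reassembling such a stream. The helper name `collect_chat_stream` is made up for this example, and it takes any iterable of text lines so you can try it without a running server.

```python
import json
from typing import Iterable


def collect_chat_stream(lines: Iterable[str]) -> str:
    """Join the content fragments from a streamed /api/chat response."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object signals the end of the reply
    return "".join(parts)
```

In practice the lines would come from the HTTP response body, decoded to text as they arrive.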
Python Integration
Method 1: Official Ollama Python Library
# pip install ollama
import ollama
# Simple generation
response = ollama.generate(
    model="llama3.1",
    prompt="Explain neural networks in simple terms.",
)
print(response["response"])
# Chat completion
response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful Python tutor."},
        {"role": "user", "content": "Explain list comprehensions with an example."}
    ]
)
print(response["message"]["content"])
# Streaming response
stream = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a bubble sort function"}],
    stream=True
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()  # Newline at end
# List available models
models = ollama.list()
for model in models["models"]:
    # Newer library versions expose the tag as "model"; older ones used "name"
    name = model.get("model") or model.get("name")
    print(f"{name}: {model['size'] / 1e9:.1f} GB")
# Embeddings (for semantic search)
embedding = ollama.embeddings(
    model="nomic-embed-text",  # Pull first: ollama pull nomic-embed-text
    prompt="Hello world"
)
print(f"Embedding dimensions: {len(embedding['embedding'])}")
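The embedding call returns a list of floats. To use it for semantic search, you compare the query's vector with each document's vector. Below is a minimal sketch using cosine similarity in pure Python. The function names are made up for this example, and a real system would normally leave this step to a vector store.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (document name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With vectors from `ollama.embeddings(...)["embedding"]` for the query and for each document, `rank_by_similarity` gives a simple nearest-neighbour search.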
Method 2: OpenAI SDK (Drop-In Replacement)
from openai import OpenAI
# Point to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but value doesn't matter
)
# Exact same code as OpenAI — just different base_url
response = client.chat.completions.create(
    model="llama3.1",  # Use any Ollama model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the main benefits of Python?"}
    ],
    temperature=0.7
)
print(response.choices[0].message.content)
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
# Streaming
stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Write a FastAPI hello world endpoint"}],
    stream=True
)
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
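Because only the base URL, API key, and model name differ between local and cloud use, you can gather those three values in one place and switch with an environment variable. The sketch below shows one way to do that. The variable names `USE_LOCAL_LLM`, `LOCAL_MODEL`, and `CLOUD_MODEL` are hypothetical and chosen for this example; `OPENAI_API_KEY` is the conventional name for the cloud key.

```python
import os
from typing import Mapping, Optional


def openai_client_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick connection settings for local Ollama or a cloud endpoint."""
    env = os.environ if env is None else env
    # Hypothetical switch: default to local unless explicitly set to "0".
    if env.get("USE_LOCAL_LLM", "1") == "1":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # placeholder; a local server ignores the value
            "model": env.get("LOCAL_MODEL", "llama3.1"),
        }
    return {
        "base_url": None,  # None lets the SDK use its default endpoint
        "api_key": env["OPENAI_API_KEY"],
        "model": env["CLOUD_MODEL"],  # hypothetical variable naming the cloud model
    }
```

Pass `base_url` and `api_key` from the returned dict to `OpenAI(...)` and `model` to `client.chat.completions.create(...)`; the rest of the code stays the same in both modes.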
Custom Models with Modelfiles
Modelfiles let you create specialized models with custom system prompts:
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM llama3.1
# Set a custom system prompt
SYSTEM """
You are an expert Python developer who:
- Writes clean, Pythonic code following PEP 8
- Always includes type hints
- Adds brief docstrings for non-obvious functions
- Prefers standard library solutions when available
- Points out potential issues and edge cases
"""
# Lower temperature for more deterministic code
PARAMETER temperature 0.2
# Larger context for long files
PARAMETER num_ctx 8192
# Stop tokens
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
EOF
# Build the custom model
ollama create python-expert -f Modelfile
# Run your custom model
ollama run python-expert "Write a function to parse CSV files with error handling"
# List to confirm it's there
ollama list
# NAME ID SIZE MODIFIED
# python-expert abc123... 4.7 GB 2 minutes ago
# llama3.1 ... 4.7 GB ...
Pre-configured Model Examples
# Customer service assistant
cat > customer-service.Modelfile << 'EOF'
FROM mistral
SYSTEM """
You are a helpful customer service representative for TechStore.
Store Hours: Monday-Friday 9AM-6PM EST, Saturday 10AM-4PM EST
Return Policy: 30 days with receipt, items must be unused
Warranty: 1 year manufacturer warranty on all electronics
Shipping: Free standard shipping on orders over $50
Always be polite, empathetic, and solution-focused.
If you don't know the answer, say so and offer to escalate.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
EOF
ollama create customer-service -f customer-service.Modelfile
# Code reviewer
cat > code-reviewer.Modelfile << 'EOF'
FROM codellama
SYSTEM """
You are a senior code reviewer. For any code shared:
1. Identify bugs and logic errors
2. Point out security vulnerabilities
3. Suggest performance improvements
4. Note readability/maintainability issues
5. Give specific examples of improved code
Be constructive and specific. Format findings as numbered lists.
"""
PARAMETER temperature 0.1
EOF
ollama create code-reviewer -f code-reviewer.Modelfile
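The Modelfiles above all have the same shape: a FROM line, a SYSTEM block, and a few PARAMETER lines. If you maintain several assistants, you can generate the files from data. This is a minimal sketch with a hypothetical helper, `render_modelfile`, that handles only those three directives.

```python
from typing import Mapping, Optional, Union


def render_modelfile(
    base: str,
    system: str,
    parameters: Optional[Mapping[str, Union[int, float, str]]] = None,
) -> str:
    """Build Modelfile text from a base model, system prompt, and parameters."""
    if '"""' in system:
        # A triple quote inside the prompt would end the SYSTEM block early.
        raise ValueError('system prompt must not contain """')
    lines = [f"FROM {base}", 'SYSTEM """', system.strip(), '"""']
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"
```

Write the returned text to a file and build the model with `ollama create <name> -f <file>`, as in the examples above.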
Local RAG System with Ollama
Build a complete local AI assistant over your documents:
# pip install ollama chromadb langchain langchain-community
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# 1. Load documents
loader = DirectoryLoader("./my_docs/", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
# 2. Split documents
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# 3. Create local embeddings (no API key needed)
# ollama pull nomic-embed-text
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# 4. Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./local_chroma"
)
# 5. Create RAG chain with local LLM
llm = Ollama(model="llama3.1", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)
# 6. Query — fully local, no API costs
result = qa_chain.invoke({"query": "What are the main topics in these documents?"})
print(result["result"])
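To see what chunk_size=500 and chunk_overlap=50 mean in step 2, here is a simplified sliding-window splitter in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that class also tries to break on paragraph and sentence boundaries); it is a sketch showing only the size and overlap arithmetic.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk repeats the last 50 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.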
Performance Tips
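# GPU acceleration is automatic when a supported GPU is detected
# Check what's being used
ollama ps # Shows loaded models and how each is split between CPU and GPU
# Offload more layers to the GPU with the num_gpu parameter
# (set it per model in a Modelfile, or per request in the API "options")
# PARAMETER num_gpu 999 # Ask for all layers on the GPU; lower it if you run out of VRAM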
# Keep model loaded in memory (avoid reload delay between requests)
# Set OLLAMA_KEEP_ALIVE duration (default: 5m)
OLLAMA_KEEP_ALIVE=1h ollama serve
# For Apple Silicon: Metal GPU acceleration is automatic
# To confirm, load a model and then check the PROCESSOR column:
ollama run llama3.1 "hi"
ollama ps # PROCESSOR reads "100% GPU" when the model is fully on the GPU
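To measure whether a setting actually helped, you can compute generation speed from the timing fields in a non-streaming API response: `eval_count` is the number of generated tokens and `eval_duration` is the generation time in nanoseconds. This is a minimal sketch, and the helper name `tokens_per_second` is made up for this example.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's timing fields."""
    eval_count = response["eval_count"]           # generated tokens
    eval_duration_ns = response["eval_duration"]  # generation time in nanoseconds
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)
```

For instance, a response with 200 generated tokens and an `eval_duration` of 5,000,000,000 works out to 40.0 tokens per second. Run the same prompt before and after a change to compare.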
Conclusion
Ollama has removed the friction from local AI to the point where it's now the right default for development, privacy-sensitive applications, and cost-sensitive production systems. The single-command setup, OpenAI-compatible API, and Modelfile customization make it genuinely practical.
For most developers, the workflow is: use Ollama locally during development (zero cost, fast iteration), then deploy with cloud APIs in production only if local quality is insufficient. Many find that local models are sufficient for the final system too.
For choosing the right local model for your hardware, see our open-source LLM guide. For building a full local RAG system, see our RAG guide.
Frequently Asked Questions
What is Ollama and what can I do with it?
Ollama is an open-source tool for running LLMs locally with a single command. Run LLaMA 3.1, Mistral, Phi-3, Code LLaMA, and 100+ models privately with no API costs. It exposes an OpenAI-compatible REST API, making it a drop-in replacement for development. Use cases: private AI assistant, offline coding help, local RAG systems, development without API costs.
What hardware do I need to run Ollama?
Minimum: 8GB RAM for CPU-only inference of 7B models, which is slow. Good: 16GB RAM plus an RTX 3060 or an M1/M2 Mac (roughly 40 tokens/second for 7B models, depending on quantization and context length). Excellent: 24GB+ VRAM for 13B-34B models. Apple Silicon (M2 Pro 16GB+) is outstanding for local models: with unified memory, the GPU can use most of the system RAM.
Which Ollama models should I download first?
Start with ollama pull llama3.1 (best all-around) and ollama pull phi3 (fastest for quick queries). For coding: ollama pull codellama. 8GB RAM users: start with phi3:mini. The 8B LLaMA 3.1 is the sweet spot for most users.
Can I use Ollama with Python and the OpenAI SDK?
Yes. Set base_url='http://localhost:11434/v1' and api_key='ollama' in the OpenAI client. Chat completions and streaming work with the same code and a different base URL, though some OpenAI-specific features may not be available. LangChain, LlamaIndex, and most AI frameworks have native Ollama support too.
How do I create a custom Ollama model with a system prompt?
Create a Modelfile with FROM llama3.1 and SYSTEM "your prompt". Run ollama create my-model -f Modelfile. Your model appears in ollama list. Set temperature with PARAMETER temperature 0.3. Share Modelfiles for consistent team assistants.
AiTechWorlds Team