What hardware do I need to run Ollama?

Minimum for a useful experience: 8GB RAM and a modern CPU, which runs 7B models slowly (roughly 1-3 tokens/second). Good experience: 16GB RAM plus an NVIDIA RTX 3060/3080 or an Apple M1/M2, which runs 7B-13B models at roughly 15-40 tokens/second. Excellent experience: 24GB of VRAM (RTX 3090/4090), which runs 13B-34B models fast, or 7B models at 60+ tokens/second. Apple Silicon (M1/M2/M3) performs especially well for local models because the GPU shares unified memory with the CPU; a 16GB M2 Pro runs Mistral 7B at around 40 tokens/second. These throughput figures are approximate and vary with quantization and context length. For daily use, the minimum recommended setup is an Apple M1/M2 MacBook with 16GB RAM or a PC with an RTX 3060 12GB.

Which Ollama models should I download first?

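Best starter models by hardware: 8GB RAM → Phi-3 Mini (3.8B, fast, capable for its size); 16GB RAM / 8GB VRAM → Mistral 7B or LLaMA 3.1 8B (strong general purpose); 24GB VRAM → Mixtral 8x7B or LLaMA 3.1 70B at 4-bit (excellent quality, but at roughly 26GB and 40GB respectively they exceed 24GB, so Ollama runs part of the model on the CPU and generation is slower); for coding → Code LLaMA or DeepSeek Coder V2. Starting recommendations: ollama pull llama3.1 (best all-around) and ollama pull phi3 (fastest, good for quick queries). Models in the Ollama library are published pre-quantized, and the default tags are typically 4-bit. Ollama does not re-quantize a model to fit your hardware, so pick a tag whose size fits your memory. 4-bit quantization reduces a 7B model from about 14GB (16-bit weights) to about 4GB with minimal quality loss.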
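The 14GB and ~4GB figures follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate only; the function name is made up for illustration, and real model files come out somewhat larger because quantized formats also store scaling data and the runtime needs extra memory for the context.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A 7B-parameter model at 16-bit and at 4-bit precision
fp16_gb = weights_size_gb(7e9, 16)
q4_gb = weights_size_gb(7e9, 4)

print(f"7B at 16-bit: {fp16_gb:.1f} GB")  # 14.0 GB
print(f"7B at 4-bit:  {q4_gb:.1f} GB")    # 3.5 GB
```

The same function gives a quick sanity check before pulling a larger tag: a 70B model at 4 bits is about 35GB of weights alone, which is why it does not fit in 24GB of VRAM.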

Can I use Ollama with Python and the OpenAI SDK?

Yes. Ollama exposes an OpenAI-compatible REST API on localhost:11434. Point the OpenAI SDK at the local URL and the same client code works. Code change required: base_url='http://localhost:11434/v1' and api_key='ollama' (the SDK requires a key, but Ollama ignores its value). The commonly used features work: chat completions, streaming, system messages, and JSON mode. Compatibility covers a subset of the OpenAI API, so check anything beyond these before relying on it. LangChain, LlamaIndex, and most AI frameworks have native Ollama support too. This makes Ollama well suited to development: build with Ollama locally, deploy with OpenAI in production, and change only the base URL, API key, and model name.

How do I create a custom Ollama model with a system prompt?

Create a Modelfile — similar to a Dockerfile but for models. Specify the base model (FROM llama3.1), add a SYSTEM prompt (SYSTEM 'You are a Python expert...'), set temperature (PARAMETER temperature 0.3), set context length (PARAMETER num_ctx 4096). Run 'ollama create my-model -f Modelfile' to build it. Your custom model appears in 'ollama list' and can be used like any other. This enables creating specialized assistants (coding, writing, customer service) that always start with the right context without repetitive prompting. Share Modelfiles with your team for consistent AI assistants.


Ollama Tutorial: Run LLMs Locally on Your Computer (Complete Setup Guide)

⚡ Quick Answer

Ollama tutorial — complete guide to running LLaMA, Mistral, and Phi locally on Mac, Windows, and Linux with zero cloud costs, privacy, and OpenAI-compatible API setup.

AiTechWorlds Team May 27, 2026 8 min read


The first time I ran Mistral 7B on my laptop — no internet connection, no API key, no cost per query — and got a coherent, useful answer in under five seconds, something clicked. Local AI isn't a compromise. For many use cases, it's genuinely better than cloud APIs.

No privacy concerns for sensitive data. No surprise bills. No latency from network requests. No vendor dependency. And with quantization making models smaller without destroying quality, an $800 GPU now runs models that were research-only two years ago.

Ollama is the tool that made all of this accessible. This guide takes you from installation to building a local AI pipeline, with the practical details that make it actually useful.

Installation

macOS (Recommended for Apple Silicon)

# Download from ollama.com or use Homebrew
brew install ollama

# Start Ollama service
ollama serve

# Or download the Mac app (installs as menubar app)
# https://ollama.com/download/mac

Linux

# One-line install
curl -fsSL https://ollama.com/install.sh | sh

# Start service (the install script normally registers and starts it for you)
sudo systemctl start ollama  # systemd

# Or run manually
ollama serve

Windows

Download the installer from ollama.com. It installs Ollama as a background app that runs from the system tray. Alternatively:

# winget
winget install Ollama.Ollama

# Or download from ollama.com/download/windows

Verify Installation

ollama --version
# ollama version is 0.3.x

# Test with a quick run (downloads phi3 on first use)
ollama run phi3 "What is 2+2? Answer in one sentence."

Downloading and Running Models

# Pull a model (downloads, doesn't run yet)
ollama pull llama3.1

# Pull specific size
ollama pull llama3.1:70b         # 70B (needs 40GB VRAM)
ollama pull llama3.1:8b          # 8B (needs 8GB RAM)
ollama pull llama3.1:405b-fp16   # Full precision 405B (research only)

# Run models interactively
ollama run llama3.1              # Chat mode
ollama run mistral
ollama run phi3
ollama run codellama             # Coding specialist
ollama run gemma2:9b

# One-shot query (not interactive)
ollama run llama3.1 "Explain transformer attention in one paragraph"

# List downloaded models
ollama list

# Remove a model (free disk space)
ollama rm mistral

# Show model details
ollama show llama3.1

Model Recommendations by Hardware

8GB RAM (CPU only):
  ollama pull phi3:mini           # 3.8B, fast, good quality
  ollama pull gemma2:2b           # Google Gemma 2, excellent for size

8-16GB RAM / 8GB VRAM:
  ollama pull llama3.1:8b         # Best 8B general model
  ollama pull mistral             # Great for commercial use (Apache 2.0)
  ollama pull codellama           # Coding tasks
  
24GB VRAM (RTX 3090/4090):
  ollama pull llama3.1:70b        # Highest quality, but ~40GB at 4-bit: part runs on CPU, slower
  ollama pull mixtral:8x7b        # Mixtral MoE, very capable (~26GB, slightly over 24GB)

Apple Silicon (M1/M2/M3, 16GB+):
  ollama pull llama3.1:8b         # Very fast on Metal GPU
  ollama pull llama3.1:70b        # M2 Ultra 192GB can handle this

Using the Ollama REST API

Ollama serves a REST API on http://localhost:11434:

# Generate endpoint (simple completion)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat endpoint (conversation)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "What is machine learning?"}
  ],
  "stream": false
}'
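The same chat call can be made from Python with only the standard library. The sketch below builds the request and, for streaming mode, joins the newline-delimited JSON chunks that Ollama sends back. The function names are made up for illustration, and the sample lines are hand-written and simplified (real chunks carry more fields), so the parsing can be tried without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def build_chat_request(model: str, user_message: str, stream: bool = True) -> urllib.request.Request:
    """Build a POST request for the /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines) -> str:
    """Concatenate the message content from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked "done": true
    return "".join(parts)


def chat(model: str, user_message: str) -> str:
    """Send the request to a running Ollama server and return the full reply."""
    with urllib.request.urlopen(build_chat_request(model, user_message)) as resp:
        return join_stream(resp)  # the response iterates line by line


# Offline check with simplified, hand-written chunks
sample = [
    '{"message": {"role": "assistant", "content": "Hello"}, "done": false}',
    '{"message": {"role": "assistant", "content": " there"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream(sample))  # Hello there
```

With the server running, `chat("llama3.1", "What is machine learning?")` returns the complete answer as one string.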

Python Integration

Method 1: Official Ollama Python Library

# pip install ollama

import ollama

# Simple generation
response = ollama.generate(
    model="llama3.1",
    prompt="Explain neural networks in simple terms.",
)
print(response["response"])

# Chat completion
response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful Python tutor."},
        {"role": "user", "content": "Explain list comprehensions with an example."}
    ]
)
print(response["message"]["content"])

# Streaming response
stream = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a bubble sort function"}],
    stream=True
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()  # Newline at end

# List available models
models = ollama.list()
for model in models["models"]:
    # Recent versions of the library expose the tag as "model" (older ones used "name")
    print(f"{model['model']}: {model['size'] / 1e9:.1f} GB")

# Embeddings (for semantic search)
embedding = ollama.embeddings(
    model="nomic-embed-text",  # Pull first: ollama pull nomic-embed-text
    prompt="Hello world"
)
print(f"Embedding dimensions: {len(embedding['embedding'])}")

Method 2: OpenAI SDK (Drop-In Replacement)

from openai import OpenAI

# Point to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but value doesn't matter
)

# Exact same code as OpenAI — just different base_url
response = client.chat.completions.create(
    model="llama3.1",  # Use any Ollama model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the main benefits of Python?"}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")

# Streaming
stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Write a FastAPI hello world endpoint"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
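Because only the base URL, API key, and model name differ between Ollama and OpenAI, the choice of backend can live in configuration. The sketch below is one possible way to do that; the environment variable names LLM_BACKEND and LLM_MODEL and the function name are hypothetical, and the OpenAI model name is only an example default.

```python
import os


def client_settings(env=os.environ) -> dict:
    """Return base_url, api_key, and model for the selected backend.

    LLM_BACKEND and LLM_MODEL are hypothetical names chosen for this sketch.
    """
    backend = env.get("LLM_BACKEND", "ollama")
    if backend == "ollama":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # placeholder; Ollama ignores the value
            "model": env.get("LLM_MODEL", "llama3.1"),
        }
    if backend == "openai":
        return {
            "base_url": None,  # None lets the SDK use its default endpoint
            "api_key": env["OPENAI_API_KEY"],
            "model": env.get("LLM_MODEL", "gpt-4o-mini"),  # example default
        }
    raise ValueError(f"Unknown LLM_BACKEND: {backend!r}")


settings = client_settings({})  # empty environment falls back to local Ollama
print(settings["base_url"])     # http://localhost:11434/v1
```

To use it with the SDK, take the model name out of the dictionary and pass the rest to the client: `model = settings.pop("model")`, then `client = OpenAI(**settings)`, then call `client.chat.completions.create(model=model, ...)` as in the examples above.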

Custom Models with Modelfiles

Modelfiles let you create specialized models with custom system prompts:

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM llama3.1

# Set a custom system prompt
SYSTEM """
You are an expert Python developer who:
- Writes clean, Pythonic code following PEP 8
- Always includes type hints
- Adds brief docstrings for non-obvious functions
- Prefers standard library solutions when available
- Points out potential issues and edge cases
"""

# Lower temperature for more deterministic code
PARAMETER temperature 0.2

# Larger context for long files
PARAMETER num_ctx 8192

# Stop tokens
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
EOF

# Build the custom model
ollama create python-expert -f Modelfile

# Run your custom model
ollama run python-expert "Write a function to parse CSV files with error handling"

# List to confirm it's there
ollama list
# NAME            ID            SIZE    MODIFIED
# python-expert   abc123...     4.7 GB  2 minutes ago
# llama3.1        ...           4.7 GB  ...

Pre-configured Model Examples

# Customer service assistant
cat > customer-service.Modelfile << 'EOF'
FROM mistral

SYSTEM """
You are a helpful customer service representative for TechStore.
Store Hours: Monday-Friday 9AM-6PM EST, Saturday 10AM-4PM EST
Return Policy: 30 days with receipt, items must be unused
Warranty: 1 year manufacturer warranty on all electronics
Shipping: Free standard shipping on orders over $50

Always be polite, empathetic, and solution-focused.
If you don't know the answer, say so and offer to escalate.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096
EOF

ollama create customer-service -f customer-service.Modelfile

# Code reviewer
cat > code-reviewer.Modelfile << 'EOF'
FROM codellama

SYSTEM """
You are a senior code reviewer. For any code shared:
1. Identify bugs and logic errors
2. Point out security vulnerabilities
3. Suggest performance improvements
4. Note readability/maintainability issues
5. Give specific examples of improved code

Be constructive and specific. Format findings as numbered lists.
"""

PARAMETER temperature 0.1
EOF

ollama create code-reviewer -f code-reviewer.Modelfile
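When a team maintains several assistants like these, the Modelfiles can be generated from one shared definition instead of being written by hand. The sketch below is a minimal illustration: the function name and the preset dictionary are made up, and it only covers the FROM, SYSTEM, and PARAMETER instructions used above. It assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system.strip(), '"""', ""]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical team presets, mirroring the two examples above
ASSISTANTS = {
    "customer-service": {
        "base": "mistral",
        "system": "You are a helpful customer service representative for TechStore.",
        "parameters": {"temperature": 0.3, "num_ctx": 4096},
    },
    "code-reviewer": {
        "base": "codellama",
        "system": "You are a senior code reviewer.",
        "parameters": {"temperature": 0.1},
    },
}

for name, spec in ASSISTANTS.items():
    print(f"# {name}.Modelfile")
    print(render_modelfile(**spec))
```

Each rendered text can be saved as `<name>.Modelfile` and built with `ollama create <name> -f <name>.Modelfile`, exactly as in the shell examples.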

Local RAG System with Ollama

Build a complete local AI assistant over your documents:

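# pip install ollama chromadb langchain langchain-community
# Note: newer LangChain releases provide these Ollama classes in the langchain-ollama package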

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# 1. Load documents
loader = DirectoryLoader("./my_docs/", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

# 2. Split documents
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 3. Create local embeddings (no API key needed)
# ollama pull nomic-embed-text
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# 4. Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./local_chroma"
)

# 5. Create RAG chain with local LLM
llm = Ollama(model="llama3.1", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# 6. Query — fully local, no API costs
result = qa_chain.invoke({"query": "What are the main topics in these documents?"})
print(result["result"])
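To make steps 2 and 4-5 less of a black box, here is a dependency-free sketch of what splitting and retrieval do conceptually. It is a simplification, not what LangChain or Chroma run internally: the splitter here is a plain sliding window over characters (RecursiveCharacterTextSplitter also tries to break on separators), cosine similarity is only one common choice of metric, and the tiny vectors stand in for real embeddings.

```python
import math


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Split text into fixed-size character windows that overlap."""
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def cosine(a, b) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k: int = 3) -> list:
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# 1000 characters -> windows starting at 0, 450, and 900
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text)
print([len(c) for c in chunks])  # [500, 500, 100]

# Toy 2-dimensional "embeddings": documents 0 and 2 point the same way as the query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # [0, 2]
```

In the pipeline above, the retriever with `k=3` plays the role of `top_k`: the three best-matching chunks are placed into the prompt that the local model answers from.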

Performance Tips

# GPU acceleration is automatic when a GPU is detected
# Check what's being used
ollama ps  # Lists loaded models; the PROCESSOR column shows the CPU/GPU split

# Ollama chooses how many layers to offload to the GPU automatically.
# To override it, set the num_gpu option per request,
# e.g. "options": {"num_gpu": 999} in an API call

# Keep model loaded in memory (avoid reload delay between requests)
# Set OLLAMA_KEEP_ALIVE duration (default: 5m)
OLLAMA_KEEP_ALIVE=1h ollama serve

# For Apple Silicon: Metal GPU acceleration is automatic.
# To confirm, load a model and look for "100% GPU" in the PROCESSOR column:
ollama run llama3.1 "hi"
ollama ps

# --verbose prints timing statistics, including the eval rate in tokens/second
ollama run llama3.1 --verbose "hi"
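To measure throughput on your own hardware rather than rely on quoted numbers, the native API reports timing data with each completed response: `eval_count` is the number of generated tokens and `eval_duration` is the generation time in nanoseconds. The sketch below turns those two fields into tokens per second; the function name is made up, and the sample dictionary is invented to show the arithmetic.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate or /api/chat response."""
    seconds = response["eval_duration"] / 1e9  # nanoseconds to seconds
    return response["eval_count"] / seconds


# Invented sample: 120 tokens generated in 3 seconds
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/second")  # 40.0 tokens/second
```

With `"stream": false` the fields are in the single response object; when streaming, they arrive in the final chunk.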

Conclusion

Ollama has removed the friction from local AI to the point where it's now the right default for development, privacy-sensitive applications, and cost-sensitive production systems. The single-command setup, OpenAI-compatible API, and Modelfile customization make it genuinely practical.

For most developers, the workflow is: use Ollama locally during development (zero cost, fast iteration), then deploy with cloud APIs in production only if local quality is insufficient. Many find that local models are sufficient for the final system too.

For choosing the right local model for your hardware, see our open-source LLM guide. For building a full local RAG system, see our RAG guide.

Frequently Asked Questions

What is Ollama and what is it used for?

Ollama is an open-source tool that makes running large language models locally as simple as running a Docker container. You can download and run LLaMA 3.1, Mistral, Phi-3, Code LLaMA, Gemma, and 100+ other models with a single command. Use cases: private AI assistant (no data leaves your machine), offline coding assistant, local RAG system, development/testing without API costs, experimenting with model fine-tunes. Ollama handles model downloading, quantization, hardware acceleration (NVIDIA/AMD GPU, Apple Silicon), and exposes an OpenAI-compatible REST API that works as a drop-in replacement for the OpenAI SDK.


