LLM Core Concepts Explained
Transformers, attention, embeddings, tokens, context windows — all explained in plain English.
What is an LLM?
A Large Language Model (LLM) is a neural network trained on vast amounts of text to predict the next token in a sequence. GPT-4, Claude, Gemini, and LLaMA are all LLMs.
Key Concepts
Tokens
- Text is broken into tokens (words, sub-words, or characters)
- 1 token ≈ 0.75 words in English
- "Hello world" = 2 tokens
- Token limits define how much text a model can process at once
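To make the idea of splitting text into tokens concrete, here is a toy sketch. The function `toy_tokenize` is hypothetical and is not how production tokenizers work: real models use learned sub-word vocabularies (for example byte-pair encoding), so counts for rare or long words will differ. It only mimics the common convention of attaching the leading space to each word.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Split text into word-like pieces, keeping each leading space with its word.

    A toy stand-in for illustration only, not a real sub-word tokenizer.
    """
    return re.findall(r"\s?\S+", text)


tokens = toy_tokenize("Hello world")
print(tokens, len(tokens))
```

For short, common English words this rough split often matches what a real tokenizer produces, which is why simple phrases come out at about one token per word.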
"The quick brown fox" → ["The", " quick", " brown", " fox"] = 4 tokensContext Window
The context window is the maximum number of tokens a model can "see" at once — including your prompt AND the response.
| Model | Context Window |
|---|---|
| GPT-3.5 | 16,385 tokens |
| GPT-4o | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens |
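Because the prompt and the response share one window, it can help to budget tokens before sending a request. The sketch below assumes the rough 0.75 words-per-token rule for English and uses two hypothetical helpers, `estimate_tokens` and `fits_in_context`; a real application would count tokens with the model's own tokenizer.

```python
import math


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough English token estimate from the word count (1 token ~ 0.75 words)."""
    return math.ceil(len(text.split()) / words_per_token)


def fits_in_context(prompt_tokens: int, max_response_tokens: int, context_window: int) -> bool:
    """The prompt and the reserved response budget must fit in one window."""
    return prompt_tokens + max_response_tokens <= context_window


prompt = "word " * 12_000                # hypothetical 12,000-word prompt
prompt_tokens = estimate_tokens(prompt)  # 12,000 / 0.75 = 16,000 tokens

# Reserve 1,000 tokens for the response and compare two window sizes from the table.
print(fits_in_context(prompt_tokens, 1_000, 16_385))   # 17,000 tokens needed
print(fits_in_context(prompt_tokens, 1_000, 128_000))
```

The same prompt that overflows a 16,385-token window fits comfortably in a 128,000-token one.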
Transformer Architecture
LLMs are built on the Transformer architecture (introduced in 2017).
Key Components
- Embeddings — Convert tokens into numerical vectors
- Attention Mechanism — Lets the model focus on relevant parts of input
- Feed-Forward Layers — Process and transform information
- Self-Attention — Each token "attends" to all other tokens
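As a rough illustration of the self-attention component, the sketch below computes scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. It makes simplifying assumptions: Q, K, and V are passed in directly (a real Transformer derives them from the embeddings with learned weight matrices), there is a single attention head, and no causal mask is applied.

```python
import math


def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Q and K are lists of n vectors of length d_k; V is a list of n value vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


# Three hypothetical 2-dimensional token vectors used as Q, K, and V alike.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(X, X, X):
    print([round(x, 3) for x in row])
```

Each output row mixes information from all three input vectors, weighted by how strongly that token's query matches each key.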
Self-Attention Formula
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Q = Query matrix
K = Key matrix
V = Value matrix
d_k = dimension of key vectors
Key Parameters
Temperature
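Controls the randomness of outputs by scaling the model's scores (logits) before sampling: low values sharpen the probability distribution, high values flatten it.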
| Value | Effect |
|---|---|
| 0.0 | Greedy decoding; effectively deterministic |
| 0.5 | Balanced creativity |
| 1.0 | Creative, more varied |
| 2.0 | Very random, often incoherent |
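The effect in the table can be reproduced with a small sketch. It assumes the common formulation in which logits are divided by the temperature before softmax, and it treats temperature 0 as greedy selection of the top token. The function name and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the highest logit.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature rises, probability moves from the top token toward the less likely ones, which is why high values produce more varied and eventually incoherent text.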
Top-P (Nucleus Sampling)
Limits token selection to the smallest set of most likely tokens whose cumulative probability reaches p.
top_p = 0.9 means the model samples only from the top tokens that together make up 90% of the probability mass.
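A minimal sketch of the nucleus filtering step, assuming the usual description: sort tokens by probability, keep them until the running total reaches top_p, then renormalize. The helper `top_p_filter` and the example probabilities are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Return {token_index: renormalized_probability} for the nucleus."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the nucleus now covers at least top_p of the mass
    return {i: probs[i] / cumulative for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token probabilities
nucleus = top_p_filter(probs, 0.9)
print(sorted(nucleus))  # indices of the tokens that remain candidates
```

With these numbers the least likely token is dropped, and the model would sample only among the remaining three.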
Max Tokens
Maximum number of tokens in the response.
Embeddings
Embeddings are numerical representations of text in high-dimensional vector space. Similar concepts are close together.
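"Close together" is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.95]

print(round(cosine_similarity(cat, kitten), 3))  # related concepts score higher
print(round(cosine_similarity(cat, car), 3))     # unrelated concepts score lower
```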
# Example: using OpenAI embeddings
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
vector = response.data[0].embedding  # 1536-dimensional vector
Training Stages
1. Pre-training
- Train on massive text corpus (internet, books, code)
- Learn language patterns, facts, reasoning
- Very expensive: millions of dollars
2. Fine-tuning
- Train on specific domain data
- Adjust model for particular tasks
- Much cheaper than pre-training
3. RLHF (Reinforcement Learning from Human Feedback)
- Human raters score model outputs
- Model learns to prefer the answers that human raters scored highly
- Makes models more helpful, honest, harmless
Hallucination
When an LLM confidently states false information, it's called hallucination.
Causes:
- Training data had errors
- Model doesn't "know" what it doesn't know
- Extrapolates beyond actual knowledge
Mitigation:
- Use RAG (Retrieval Augmented Generation)
- Ask model to cite sources
- Use lower temperature for factual tasks
- Verify outputs with tools/search
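RAG, the first mitigation listed above, can be sketched in a few lines. This toy version makes a loud simplification: it ranks documents by shared words instead of searching a vector database, and it stops at building the prompt rather than calling a model. The helpers `score`, `retrieve`, and `build_prompt` and the sample documents are hypothetical.

```python
def score(query: str, document: str) -> int:
    """Count shared lowercase words; a crude stand-in for vector similarity."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def retrieve(query, documents, k=1):
    """Return the k documents that best match the query."""
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]


def build_prompt(query, documents, k=1):
    """Combine the retrieved context with the user's question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "The context window of GPT-4o is 128,000 tokens.",
    "Pre-training uses a massive text corpus.",
]
query = "How large is the GPT-4o context window?"
print(build_prompt(query, docs))
```

Grounding the prompt in retrieved text gives the model something concrete to quote, which is what makes the final answer easier to verify.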
RAG (Retrieval Augmented Generation)
User Query
↓
[Vector Database Search]
↓
Relevant Documents Retrieved
↓
LLM receives: Query + Retrieved Context
↓
Accurate, grounded response
Popular LLM APIs
| Provider | Models | Best For |
|---|---|---|
| OpenAI | GPT-4o, o1 | General purpose, coding |
| Anthropic | Claude 3.5, Claude 4 | Long context, analysis |
| Google | Gemini 1.5, 2.0 | Multimodal, long context |
| Meta | LLaMA 3.1, 3.3 | Open weights, self-hosting |
| Mistral | Mistral Large, Codestral | European, code |