Overview
Large Language Models (LLMs) have become one of the most widely used forms of artificial intelligence, yet most people use them without understanding the mechanism underneath. This overview explains, from first principles, what actually happens when you type a prompt into ChatGPT, Claude, or Gemini, and why that mechanism produces both remarkable results and confident mistakes.
What an LLM really does
At its core, an LLM is a next-token predictor. Given a sequence of text, it outputs a probability distribution over the next token and samples from it. The sampled token is appended to the sequence, and the process repeats, one token at a time. There is no internal "belief", "intent", or "understanding" in the human sense, only an extraordinarily rich statistical model of how language tends to continue. This single fact explains most of an LLM's strengths and weaknesses.
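The predict, sample, append loop can be sketched in a few lines. This is a toy illustration, not how any real model is implemented: the TOY_LOGITS table, the "<end>" stop token, and the generate function are all hypothetical, and the toy "model" looks only at the last token, whereas a real LLM conditions on the whole sequence.

```python
import math
import random

# Hypothetical toy "model": maps the last token to raw scores (logits)
# for each candidate next token. A real LLM computes these scores with a
# neural network that sees the entire preceding sequence.
TOY_LOGITS = {
    "the": {"cat": 2.0, "dog": 1.5, "<end>": 0.1},
    "cat": {"sat": 2.5, "<end>": 0.5},
    "dog": {"sat": 2.0, "<end>": 1.0},
    "sat": {"<end>": 3.0},
}


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def generate(prompt, max_tokens=10, seed=0):
    """Predict a distribution, sample one token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)  # the prompt must end in a token the toy model knows
    for _ in range(max_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]])
        next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["the"]))
```

Changing the seed can change the output, which mirrors why the same prompt can yield different answers from a real model.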
Tokens: the atoms of language models
Text is not processed as words or characters but as tokens: sub-word fragments produced by a tokenizer. The word "unbelievable" might split into "un", "believ", and "able". Tokenization matters because pricing and context limits are measured in tokens, not words, and even model behavior can depend on how a piece of text happens to be split. As a rough rule of thumb, one English word is about 1.3 tokens, though the exact ratio varies by tokenizer and by language.
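To make the idea of sub-word splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. It is an assumption-laden toy: the VOCAB set is hypothetical and tiny, and production tokenizers typically use learned schemes such as byte-pair encoding rather than this simple rule. The estimate_tokens helper just applies the rough 1.3 tokens-per-word figure from the text.

```python
# Hypothetical vocabulary for illustration; real tokenizers learn
# vocabularies with tens of thousands of entries from data.
VOCAB = {"un", "believ", "able", "token", "s"}


def tokenize(word, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


def estimate_tokens(text, tokens_per_word=1.3):
    """Rough token estimate for English text using the 1.3 rule of thumb."""
    return round(len(text.split()) * tokens_per_word)


print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("tokens"))        # ['token', 's']
```

For real budgeting, count tokens with the tokenizer that belongs to the model you are using rather than relying on the estimate.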
Embeddings: turning tokens into meaning
Each token is mapped to a high-dimensional vector called an embedding. These vectors are learned so that tokens with similar meaning or usage sit close together in vector space. This is why a model can treat "king" and "queen", or "Paris" and "France", as related: their embeddings encode that relationship numerically. Embeddings are also the foundation of semantic search and Retrieval-Augmented Generation (RAG).
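"Close together in vector space" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings are learned and have hundreds or thousands of dimensions. The EMBEDDINGS table and the most_similar helper are hypothetical names.

```python
import math

# Hand-written toy vectors (an assumption for illustration only).
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings=EMBEDDINGS):
    """Return the other word whose vector is closest to the given word's."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(most_similar("king"))  # queen
```

Semantic search follows the same pattern at scale: embed the query, then return the stored items whose vectors score highest against it.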
Attention and the Transformer
The breakthrough that made modern LLMs possible is the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need". Its core is the attention mechanism, an idea that predates the paper but that the Transformer made its central building block. Attention lets the model weigh how relevant every other token is to the token it is currently processing. A Transformer stacks many layers, each combining attention with a feed-forward network. This is what allows a model to connect a pronoun at the end of a paragraph with the noun it refers to near the beginning.
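The weighing step can be written down directly. Below is a minimal pure-Python sketch of scaled dot-product attention, the formulation used in the 2017 paper: each query is scored against every key, the scores are turned into weights with a softmax, and the output is the weighted average of the values. It deliberately omits things real models have, such as learned projection matrices, multiple heads, and the causal mask that stops a token from attending to later tokens. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [
            sum(w * v[dim] for w, v in zip(weights, values))
            for dim in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Two toy token vectors used as queries, keys, and values alike.
x = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = attention(x, x, x)
print(weights)
```

In this toy input each token attends most strongly to itself, because its query matches its own key best.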
Context windows and their limits
The context window is the maximum number of tokens the model can attend to at once, in effect its working memory. Everything outside the window is invisible to the model. A larger context window lets the model consider more of your document or conversation, but it does not give the model permanent memory: once the conversation grows beyond the window, something has to go, and applications typically drop or summarize the earliest content.
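One simple way an application might handle an overflowing conversation is to keep only the most recent tokens. The sketch below assumes that strategy; fit_to_window is a hypothetical helper, and real products may instead summarize older turns or pin important content such as system instructions.

```python
def fit_to_window(tokens, max_tokens):
    """Keep only the most recent tokens that fit in the context window."""
    if max_tokens <= 0:
        return []
    # Negative slicing keeps the tail; the earliest tokens fall away.
    return tokens[-max_tokens:]


conversation = ["t0", "t1", "t2", "t3", "t4", "t5"]
print(fit_to_window(conversation, 4))  # ['t2', 't3', 't4', 't5']
```

Anything trimmed this way is simply gone from the model's point of view, which is why a long chat can appear to "forget" its opening instructions.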
Why hallucinations happen
Because the model optimizes for plausible continuations rather than true ones, it can produce fluent, confident, and entirely fabricated answers — known as hallucinations. Understanding this reframes how you should use AI: as a powerful drafting and reasoning aid whose factual claims must be verified, not as an oracle.
Practical implications
If you internalize that an LLM is a pattern-completion engine, you immediately become better at using it: provide rich context (the model only knows what is in the window), ask for structured output, give examples (few-shot), and verify factual claims. These techniques are not tricks — they are direct consequences of how the architecture works.
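These habits can be folded into a small prompt-building helper. The sketch below is one possible layout, not a required format: build_prompt and the "Context / Input / Output" labels are assumptions chosen for illustration, and the resulting string would be sent to whichever model API you use.

```python
def build_prompt(context, examples, question):
    """Assemble context, few-shot examples, and the new input into one prompt."""
    parts = ["Context:", context, ""]
    for example_input, example_output in examples:
        # Each worked example shows the model the pattern to continue.
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # End on an open "Output:" so the most likely continuation is the answer.
    parts += [f"Input: {question}", "Output:"]
    return "\n".join(parts)


prompt = build_prompt(
    context="Classify each review as positive or negative.",
    examples=[("Loved it", "positive"), ("Waste of money", "negative")],
    question="Surprisingly good",
)
print(prompt)
```

Ending the prompt on an open "Output:" line works with the pattern-completion behavior rather than against it: the model's natural continuation is an answer in the same format as the examples.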
Conclusion
LLMs are not thinking machines; they are the most sophisticated text-prediction systems ever built. That distinction is not a criticism — it is the key to using them effectively and safely.
