Embeddings Explained: How AI Converts Words to Numbers That Mean Something

Q: What are embeddings in AI?

Embeddings are dense numerical vectors that represent the meaning of text, images, or other data. Instead of treating words as discrete symbols, embeddings place semantically similar items close together in a high-dimensional space: the embeddings of 'dog' and 'puppy' are near each other, while 'dog' and 'calculus' are far apart. Modern text embedding models typically produce vectors with 768–3072 dimensions. These vectors capture semantic relationships. In the classic word-vector example, the difference between 'king' and 'man' is approximately equal to the difference between 'queen' and 'woman', so relationships such as gender are encoded as directions in the space.

Q: How are embeddings created?

Text embeddings are created by passing text through a neural network (typically a transformer encoder like BERT) and extracting the hidden-state representation from the final layer. The encoder is usually pre-trained with masked language modeling and then fine-tuned to produce similar vectors for semantically related text and dissimilar vectors for unrelated text, using contrastive objectives such as MNRL (Multiple Negatives Ranking Loss). The embedding is typically the [CLS] token's vector representation, or the mean pooling of all token representations. Modern embedding models (like OpenAI's text-embedding-3 or Cohere's embed) are trained on very large collections of (query, relevant document) pairs.
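
To make the contrastive objective concrete, here is a minimal NumPy sketch of an in-batch-negatives loss in the spirit of Multiple Negatives Ranking Loss. The function name, the `scale` value of 20, and the random toy vectors are assumptions chosen for illustration; real training code runs this on model outputs with gradients.

```python
import numpy as np

def multiple_negatives_ranking_loss(queries, docs, scale=20.0):
    """Mean cross-entropy where docs[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                            # (batch, batch) scaled cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives sit on the diagonal

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
aligned_docs = queries + 0.05 * rng.normal(size=(4, 8))   # each doc is close to its own query
shuffled_docs = np.roll(aligned_docs, shift=1, axis=0)    # positives deliberately mismatched

loss_aligned = multiple_negatives_ranking_loss(queries, aligned_docs)
loss_shuffled = multiple_negatives_ranking_loss(queries, shuffled_docs)
print(f"aligned pairs: {loss_aligned:.4f}, mismatched pairs: {loss_shuffled:.4f}")
```

The loss is low when each query is closest to its own document and high when the pairing is wrong, which is the signal that pulls related texts together during fine-tuning.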

Q: What is the difference between word2vec, BERT, and modern embedding models?

Word2Vec (2013): produces static embeddings, with one vector per word regardless of context. 'bank' has the same embedding in 'river bank' and 'bank account.' Fast, small, useful for simple tasks. BERT-based embeddings (2018+): produce contextual embeddings, with different vectors for the same word in different contexts. Sentences are embedded as one vector (CLS token or mean pooling). Better quality but slower. Modern embedding models (OpenAI text-embedding-3, Cohere embed-3, E5, GTE): dedicated models optimized specifically for semantic similarity, not just MLM. Best quality for search and retrieval. They generally perform substantially better than vanilla BERT on semantic search tasks, because BERT without similarity fine-tuning produces weak sentence vectors.

Q: How do I use embeddings for semantic search?

Semantic search pipeline: 1) Embed all documents in your corpus and store them in a vector database (Pinecone, Chroma, Weaviate, pgvector). 2) When a query arrives, embed the query with the same model. 3) Find the documents with the most similar embeddings (cosine similarity or dot product). 4) Return the top-k results. This finds semantically similar documents even without exact keyword matches: 'how do I cancel my subscription' finds 'membership termination procedures.' Consistency is critical: use the same model for documents and queries, and don't mix OpenAI and HuggingFace embeddings in the same index, because vectors from different models are not comparable.

Q: Which embedding model should I use?

For most applications in 2025: OpenAI text-embedding-3-small (1536 dims, $0.02/1M tokens) offers excellent quality and speed within the OpenAI ecosystem. OpenAI text-embedding-3-large (3072 dims, $0.13/1M tokens) is the best quality in the OpenAI family at roughly 6.5× the cost. Cohere embed-v3 has strong multilingual performance and supports different task types. For open-source/local use: BAAI/bge-large-en-v1.5 or Alibaba-NLP/gte-large, both free and highly ranked on the MTEB leaderboard. For production RAG: test on your specific domain, not just benchmarks. Domain matters: a model tuned for legal text will usually outperform general models on legal text. MTEB (Massive Text Embedding Benchmark) is the standard leaderboard for comparison.

The moment that made embeddings click for me was this: type "cat" and "feline" into a similarity checker, and they score 0.87. Type "cat" and "automobile" and they score 0.12. The model has never been told these words are related — it learned the relationships from billions of texts where they appear in similar contexts.

That's what embeddings are: a learned geometric representation of meaning. And once you understand them, you realize they're the foundation of semantic search, RAG systems, recommendation engines, anomaly detection, and nearly every modern AI application that handles text.

This guide explains how embeddings work, how to create and use them, and the practical differences between embedding models that matter for production systems.

The Core Idea: Meaning as Geometry

Traditional text processing treated words as arbitrary symbols. "cat" was just a string — no relationship to "feline" or "kitten." Search required exact matches.

Embeddings encode meaning geometrically. Words and sentences become points in a high-dimensional space, where:

Similar meanings → nearby points
Different meanings → distant points
Relationships → consistent directions

import numpy as np

# Illustrative toy vectors showing the idea in 4 dimensions
# (not actual values: real embeddings have 768-3072 dimensions)
king =   np.array([0.7, 0.1, 0.9, 0.2])
man =    np.array([0.6, 0.2, 0.8, 0.1])
woman =  np.array([0.5, 0.8, 0.7, 0.3])
queen =  np.array([0.4, 0.9, 0.8, 0.4])

# The famous word analogy: king - man + woman ≈ queen
analogy = king - man + woman
similarity = np.dot(analogy, queen) / (np.linalg.norm(analogy) * np.linalg.norm(queen))
print(f"{similarity:.2f}")  # 0.98 for these toy values (very close to queen)

This geometric property means you can do arithmetic with meaning. King − Man + Woman ≈ Queen. Paris − France + Germany ≈ Berlin. The spatial structure of the embedding space reflects conceptual structure.
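
In practice an analogy is answered by searching the vocabulary for the word whose vector is closest to the result of the arithmetic, skipping the input words. The sketch below shows that lookup. The three-dimensional vectors are hand-picked toy values and `solve_analogy` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hand-picked toy vectors (hypothetical, for illustration only)
vectors = {
    "king":  np.array([0.90, 0.90, 0.10]),
    "queen": np.array([0.85, 0.15, 0.95]),
    "man":   np.array([0.10, 0.90, 0.10]),
    "woman": np.array([0.10, 0.10, 0.90]),
    "apple": np.array([0.10, 0.50, 0.50]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(solve_analogy("king", "man", "woman", vectors))  # queen
```

Excluding the input words matters: without that step the nearest vector to the result is often one of the inputs themselves.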

How Embeddings Are Created

Word2Vec (2013): Static Embeddings

from gensim.models import Word2Vec

# Training corpus
sentences = [
    ["machine", "learning", "is", "powerful"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["python", "is", "used", "for", "machine", "learning"],
]

# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,   # Embedding dimensions
    window=5,          # Context window
    min_count=1,       # Minimum word frequency
    workers=4
)

# Access word vectors
machine_vec = model.wv['machine']
print(f"Vector shape: {machine_vec.shape}")  # (100,)

# Find similar words
similar = model.wv.most_similar("learning", topn=5)
print(similar)  # List of (word, cosine similarity) pairs; on a corpus this tiny the scores are mostly noise

# Limitation: each word has ONE embedding regardless of context.
# With "bank" in the vocabulary, "river bank" and "bank account" would share the same vector.

BERT: Contextual Embeddings

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_bert_embedding(text: str) -> np.ndarray:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Mean pooling of last hidden states (better than CLS for sentence similarity)
    token_embeddings = outputs.last_hidden_state
    attention_mask = inputs["attention_mask"]
    
    # Mask padding tokens
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    mean_pooled = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    return mean_pooled.numpy()[0]  # Shape: (768,)

# Same word, different context: the sentence embeddings differ
bank_river = get_bert_embedding("She sat by the river bank")
bank_money = get_bert_embedding("I opened a new bank account")

from numpy.linalg import norm
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# The two sentences share only the word "bank", yet vanilla BERT still scores them
# as fairly similar, one reason dedicated embedding models work better for search
print(cosine_similarity(bank_river, bank_money))  # ~0.82 (still somewhat similar)
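
The two pooling strategies mentioned above can be compared without downloading a model. This sketch uses a tiny hand-written array in place of BERT's last hidden state. The values and the helper names `cls_pool` and `mean_pool` are illustrative assumptions.

```python
import numpy as np

# Toy "last hidden state": 1 sequence, 4 token positions, 3 hidden dimensions.
# Position 0 plays the role of [CLS]; the last position is padding.
hidden = np.array([[[1.0, 0.0, 2.0],
                    [3.0, 2.0, 0.0],
                    [5.0, 4.0, 4.0],
                    [9.0, 9.0, 9.0]]])   # padding row, must not affect the mean
attention_mask = np.array([[1, 1, 1, 0]])

def cls_pool(hidden):
    """Use the first token's vector as the sentence embedding."""
    return hidden[:, 0, :]

def mean_pool(hidden, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    mask = attention_mask[..., None].astype(hidden.dtype)   # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)          # avoid division by zero
    return summed / counts

cls_vec = cls_pool(hidden)
mean_vec = mean_pool(hidden, attention_mask)
print(cls_vec)   # [[1. 0. 2.]]
print(mean_vec)  # [[3. 2. 2.]]
```

Without the mask the padding row would drag the average toward 9, which is why the PyTorch version above multiplies by the expanded attention mask before summing.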

Modern Embedding Models (2024-2025)

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"
    )
    return [item.embedding for item in response.data]

# Batch embedding (more efficient)
texts = [
    "machine learning is a subset of AI",
    "deep learning uses neural networks",
    "I love cooking pasta",
    "the stock market crashed today"
]

embeddings = embed_texts(texts)
print(f"Embedding dimensions: {len(embeddings[0])}")  # 1536

# Semantic similarity matrix
import numpy as np

def cosine_similarity_matrix(embeddings):
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    return np.dot(normalized, normalized.T)

emb_array = np.array(embeddings)
sim_matrix = cosine_similarity_matrix(emb_array)

print("\nSimilarity Matrix:")
print(f"ML ↔ Deep Learning: {sim_matrix[0, 1]:.3f}")  # Related topics: should score relatively high
print(f"ML ↔ Cooking: {sim_matrix[0, 2]:.3f}")         # Unrelated topics: should score much lower

Semantic Search Pipeline

import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticSearch:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        self.documents = []
        self.embeddings = []
    
    def add_documents(self, documents: list[str]):
        """Add documents to the search index."""
        response = client.embeddings.create(
            model=self.model,
            input=documents
        )
        new_embeddings = [item.embedding for item in response.data]
        
        self.documents.extend(documents)
        self.embeddings.extend(new_embeddings)
        print(f"Indexed {len(documents)} documents. Total: {len(self.documents)}")
    
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Find most similar documents to query."""
        query_response = client.embeddings.create(
            model=self.model,
            input=[query]
        )
        query_embedding = np.array(query_response.data[0].embedding)
        
        doc_embeddings = np.array(self.embeddings)
        
        # Cosine similarity
        doc_norms = np.linalg.norm(doc_embeddings, axis=1)
        query_norm = np.linalg.norm(query_embedding)
        
        similarities = np.dot(doc_embeddings, query_embedding) / (doc_norms * query_norm)
        
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        return [
            {
                "document": self.documents[i],
                "similarity": float(similarities[i]),
                "rank": rank + 1
            }
            for rank, i in enumerate(top_indices)
        ]

# Example usage
search = SemanticSearch()

# Index documents about AI topics
docs = [
    "Transformers use self-attention mechanisms to process sequences.",
    "BERT is a bidirectional encoder trained on masked language modeling.",
    "GPT models are autoregressive — they predict the next token.",
    "The vanishing gradient problem affects deep recurrent networks.",
    "Fine-tuning adapts pre-trained models to specific tasks.",
    "RAG combines retrieval with language model generation.",
    "Vector databases store embeddings for fast similarity search.",
]

search.add_documents(docs)

# Semantic search — finds related content without exact word matches
results = search.search("how do attention-based models work?", top_k=3)

for r in results:
    print(f"Rank {r['rank']} ({r['similarity']:.3f}): {r['document']}")

# Example output (scores are illustrative; exact values depend on the model):
# Rank 1 (0.847): Transformers use self-attention mechanisms to process sequences.
# Rank 2 (0.721): BERT is a bidirectional encoder trained on masked language modeling.
# Rank 3 (0.698): GPT models are autoregressive — they predict the next token.
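
The class above recomputes document norms on every query. A common refinement is to L2-normalize the vectors once at index time so each search is a single matrix-vector product. The sketch below shows that variant on toy two-dimensional vectors. `build_index` and `top_k` are hypothetical helper names, and the code assumes a non-empty index.

```python
import numpy as np

def build_index(doc_embeddings):
    """L2-normalize once at index time so a dot product equals cosine similarity."""
    doc_embeddings = np.asarray(doc_embeddings, dtype=np.float32)
    norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    return doc_embeddings / np.clip(norms, 1e-12, None)

def top_k(index, query_embedding, k=3):
    """Return (document index, cosine similarity) pairs, best match first."""
    query = np.asarray(query_embedding, dtype=np.float32)
    if query.shape[0] != index.shape[1]:
        raise ValueError("query and index dimensions differ; were both embedded with the same model?")
    query = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ query
    k = min(k, len(scores))
    candidates = np.argpartition(-scores, k - 1)[:k]        # unordered top-k, no full sort
    ordered = candidates[np.argsort(-scores[candidates])]   # sort only the k winners
    return [(int(i), float(scores[i])) for i in ordered]

# Toy stand-ins for real embeddings
index = build_index([[1, 0], [0, 1], [1, 1], [-1, 0]])
hits = top_k(index, [2.0, 0.1], k=2)
print(hits)  # documents 0 and 2 rank highest
```

The dimension check is a cheap guard against the mixed-model mistake: vectors from a different model usually have a different length, although matching lengths do not prove the models match.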

Open-Source Embedding Models

from sentence_transformers import SentenceTransformer
import numpy as np

# BAAI/bge-large-en-v1.5: a highly ranked open English model on MTEB (free, runs locally)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",  # Semantically similar
    "Python is a programming language.",  # Unrelated
]

# BGE v1.5 models recommend prepending an instruction to short queries (not to documents)
query = "Represent this sentence for searching relevant passages: animal resting on floor covering"
docs_for_embedding = sentences

query_embedding = model.encode(query, normalize_embeddings=True)
doc_embeddings = model.encode(docs_for_embedding, normalize_embeddings=True)

# Dot product gives cosine similarity when normalized
similarities = doc_embeddings @ query_embedding

for sent, sim in zip(sentences, similarities):
    print(f"Score {sim:.3f}: {sent}")

# Multilingual embeddings
multilingual_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

mixed_languages = [
    "How do I cancel my subscription?",  # English
    "¿Cómo cancelo mi suscripción?",      # Spanish — same meaning
    "Comment annuler mon abonnement?",    # French — same meaning
    "What is the weather today?",          # Different topic
]

embeddings = multilingual_model.encode(mixed_languages, normalize_embeddings=True)
sim_matrix = embeddings @ embeddings.T

# Cross-lingual similarity: English/Spanish/French versions should score ~0.85+
print(f"EN ↔ ES: {sim_matrix[0,1]:.3f}")  # ~0.87
print(f"EN ↔ FR: {sim_matrix[0,2]:.3f}")  # ~0.85
print(f"EN ↔ different topic: {sim_matrix[0,3]:.3f}")  # ~0.12

Embedding Models Comparison

Model	Dimensions	MTEB Avg (English)	Cost	Best For
OpenAI text-embedding-3-large	3072	64.6	$0.13/1M tokens	Best quality, OpenAI ecosystem
OpenAI text-embedding-3-small	1536	62.3	$0.02/1M tokens	Cost-efficient, good quality
Cohere embed-v3.0	1024	64.5	$0.10/1M tokens	Multilingual, task-aware
BAAI/bge-large-en-v1.5	1024	64.2	Free (local)	Best free English model
E5-large-v2	1024	62.2	Free (local)	Good quality, open source
all-mpnet-base-v2	768	57.8	Free (local)	Lightweight, fast
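
Dimensions and price translate directly into storage and embedding cost. The sketch below does that arithmetic for the three hosted models, with prices and dimensions copied from the comparison table above. The corpus size (100,000 chunks of about 500 tokens) and the float32 storage assumption are hypothetical.

```python
# Prices and dimensions copied from the comparison table above
MODELS = {
    "text-embedding-3-large": {"dims": 3072, "usd_per_million_tokens": 0.13},
    "text-embedding-3-small": {"dims": 1536, "usd_per_million_tokens": 0.02},
    "embed-v3.0":             {"dims": 1024, "usd_per_million_tokens": 0.10},
}

def embedding_cost_usd(model, total_tokens):
    """One-time cost of embedding total_tokens tokens."""
    return MODELS[model]["usd_per_million_tokens"] * total_tokens / 1_000_000

def index_size_mb(model, num_vectors, bytes_per_value=4):
    """Raw vector storage in MB, assuming float32 values and no index overhead."""
    return MODELS[model]["dims"] * bytes_per_value * num_vectors / 1_000_000

# Hypothetical corpus: 100,000 chunks of about 500 tokens each
num_chunks, tokens_per_chunk = 100_000, 500
for name in MODELS:
    cost = embedding_cost_usd(name, num_chunks * tokens_per_chunk)
    size = index_size_mb(name, num_chunks)
    print(f"{name}: ${cost:.2f} to embed, about {size:.0f} MB as float32")
```

For a corpus this size the embedding cost is a few dollars at most, so storage and query latency from the larger vectors are usually the more important trade-off.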

Practical Applications

Document Clustering

import numpy as np
import matplotlib.pyplot as plt
import umap  # provided by the umap-learn package
from sklearn.cluster import KMeans

# Embed a corpus (embed_texts is the helper defined in the OpenAI example above)
texts = [...]  # Your documents
embeddings = np.array(embed_texts(texts))

# Cluster by semantic content
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Reduce to 2D with UMAP for plotting (often preferred over t-SNE for embeddings)
reducer = umap.UMAP(n_components=2, random_state=42)
reduced = reducer.fit_transform(embeddings)

plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, cmap="tab10")
plt.title("Document Clusters")
plt.show()

Anomaly Detection

def find_outliers(texts: list[str], threshold: float = 0.3) -> list[int]:
    """Find documents that don't fit the main cluster."""
    embeddings = np.array(embed_texts(texts))
    
    centroid = embeddings.mean(axis=0)
    centroid_norm = centroid / np.linalg.norm(centroid)
    emb_norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    
    similarities = emb_norms @ centroid_norm
    
    return [i for i, sim in enumerate(similarities) if sim < threshold]
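
The same centroid logic can be tried without an API call by passing vectors in directly. This sketch uses synthetic three-dimensional vectors as stand-ins for real embeddings. `find_outliers_from_embeddings` is a hypothetical variant of the function above, and the 0.3 threshold is only a starting point that needs tuning per model and corpus.

```python
import numpy as np

def find_outliers_from_embeddings(embeddings, threshold=0.3):
    """Return indices whose cosine similarity to the centroid is below threshold."""
    embeddings = np.asarray(embeddings, dtype=float)
    centroid = embeddings.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = unit @ centroid
    return [i for i, sim in enumerate(similarities) if sim < threshold]

# Synthetic stand-ins: five vectors pointing roughly the same way,
# plus one pointing in an unrelated direction.
toy = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [1.0, 0.0, 0.1],
    [0.8, 0.1, 0.0],
    [0.9, 0.0, 0.2],
    [-0.5, 0.0, 1.0],   # the odd one out
])
outliers = find_outliers_from_embeddings(toy)
print(outliers)  # [5]
```

Note that the centroid includes the outliers themselves, so a corpus with many off-topic documents will shift the centroid and weaken the signal.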

Conclusion

Embeddings are the fundamental data structure of modern AI — they're how neural networks understand meaning. Once you understand that meaning is encoded geometrically, the applications follow naturally: semantic search, clustering, anomaly detection, recommendation, and the retrieval step in every RAG system.

The practical lesson: embedding model choice matters as much as algorithm choice. A better embedding model usually helps more than a better retrieval algorithm running on worse embeddings. Test on your domain, use the MTEB leaderboard as a starting point, and benchmark before choosing.

For using embeddings in a complete retrieval system, see our RAG guide. For the underlying transformer architecture that creates these embeddings, see our transformer architecture guide.

Frequently Asked Questions

What are embeddings in AI?

Dense numerical vectors that represent the meaning of text. Semantically similar items have nearby vectors; unrelated items are far apart. For example, King − Man + Woman ≈ Queen: relationships are encoded geometrically. Modern embeddings have 768–3072 dimensions and are created by training transformers on massive text corpora.

How are embeddings created?

Text is passed through a transformer encoder (like BERT). The final layer's hidden states are pooled (mean or CLS token) to produce one fixed-length vector per text. The model is trained using contrastive learning on (query, relevant document) pairs — similar texts get similar vectors.

What is the difference between word2vec, BERT, and modern embedding models?

Word2Vec: static, one vector per word regardless of context. BERT: contextual, different vectors for the same word in different contexts. Modern embedding models (OpenAI, Cohere, BGE): optimized specifically for semantic similarity, and substantially better than vanilla BERT for retrieval tasks.

How do I use embeddings for semantic search?

Embed all documents and store them in a vector database. Embed incoming queries with the same model. Find the top-k nearest documents by cosine similarity and return them. This finds semantic matches even without exact keyword overlap.

Which embedding model should I use?

OpenAI text-embedding-3-small for OpenAI ecosystem applications. BAAI/bge-large-en-v1.5 for free local use. Cohere embed-v3 for multilingual. Always test on your specific domain — general benchmarks (MTEB) don't always predict domain performance.

AiTechWorlds Team
