Embeddings & Vector Databases: Complete Reference

What Are Embeddings?

An embedding is a dense numerical vector that represents text (or images or audio) in a high-dimensional space where semantically similar inputs lie close together.

text

"dog"    → [0.21, -0.44, 0.89, ...]  (384-3072 numbers)
"puppy"  → [0.19, -0.41, 0.91, ...]  (very close)
"rocket" → [-0.78, 0.62, -0.11, ...] (far away)

Embeddings power: semantic search, RAG, classification, deduplication, clustering, recommendation.

How Embeddings Are Created

python

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a small local model that outputs 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["How do I install Python?", "Python installation guide"]
embeddings = model.encode(sentences)  # shape: (2, 384)

# Cosine similarity between the two sentence vectors
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(sim[0][0])  # a high score is expected for these paraphrases; the exact value depends on the model

Embedding Model Comparison

Model	Dimensions	Max Tokens	Speed	Best For
`text-embedding-3-small`	1,536	8,191	Fast	Production RAG (OpenAI)
`text-embedding-3-large`	3,072	8,191	Slower	High-accuracy search
`text-embedding-ada-002`	1,536	8,191	Fast	Legacy OpenAI
`all-MiniLM-L6-v2`	384	256	Very fast	Local, edge devices
`bge-large-en-v1.5`	1,024	512	Fast	Open-source English
`nomic-embed-text-v1.5`	768	8,192	Fast	Long docs, open source
`bge-m3`	1,024	8,192	Medium	Multilingual
`e5-large-v2`	1,024	512	Medium	Passage retrieval

Similarity Metrics

Metric	Formula	When to Use
Cosine similarity	`cos θ = (A·B) / (‖A‖‖B‖)`	Most embedding tasks — direction matters, not magnitude
Dot product	`A · B`	When embeddings are normalized (same as cosine)
Euclidean distance	`‖A - B‖₂`	When absolute position matters
Manhattan distance	`Σ |aᵢ - bᵢ|`	Sparse high-dim spaces

Cosine similarity returns values from -1 to 1. As a rule of thumb, values above 0.85 indicate a strong semantic match, though the useful threshold varies from model to model.
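The four metrics in the table can be sketched in a few lines. This is a minimal illustration using only the standard library (NumPy would be the usual choice in practice); the helper names and the two sample vectors are made up for the example. It also shows why dot product and cosine agree once vectors are normalized.

```python
import math

def dot(a, b):
    """Dot product A · B."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length ‖A‖."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """cos θ = (A·B) / (‖A‖‖B‖)."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """‖A - B‖₂."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Σ |aᵢ - bᵢ|."""
    return sum(abs(x - y) for x, y in zip(a, b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

# Hypothetical sample vectors, both of length 5
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)            # 24 / (5 * 5)
dot_raw = dot(a, b)                         # depends on magnitude
dot_unit = dot(normalize(a), normalize(b))  # matches cosine after normalization

print(cos_ab, dot_raw, dot_unit)
print(euclidean_distance(a, b), manhattan_distance(a, b))
```

The raw dot product (24.0) is not comparable to the cosine score (0.96) until both vectors are scaled to unit length.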

Vector Database Comparison

Database	Type	Deployment	Max Scale	ANN Algorithm
Chroma	Open source	Local / self-hosted	Millions	HNSW
Pinecone	Managed cloud	Cloud only	Billions	Proprietary
Qdrant	Open source	Self-hosted / cloud	Billions	HNSW
Weaviate	Open source	Self-hosted / cloud	Billions	HNSW
Milvus	Open source	Self-hosted	Billions	IVF + HNSW
FAISS	Library	In-memory	Hundreds of millions	IVF / HNSW / PQ
pgvector	PostgreSQL ext.	Self-hosted	Millions	IVFFlat / HNSW
Redis VSS	Redis module	Self-hosted / cloud	Millions	HNSW

ANN (Approximate Nearest Neighbor) Algorithms

Algorithm	Recall	Speed	Memory	Notes
HNSW	Very high	Fast	High	Default for most DBs
IVF (Inverted File)	High	Fast	Medium	Good for billions of vectors
PQ (Product Quantization)	Medium	Very fast	Very low	Compress vectors
ScaNN	Very high	Very fast	Medium	Google, used in production
Annoy	Medium	Fast	Low	Read-only index (Spotify)
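To make the recall versus speed tradeoff concrete, here is a toy inverted-file (IVF) index in plain Python. It is a sketch under simplifying assumptions, not how any of the libraries above implement it: centroids are a random sample of the data (real IVF trains them with k-means), and the class and function names (`ToyIVF`, `exact_search`, `recall`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(vectors, query, k):
    """Brute-force baseline: scan every vector."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))
    return order[:k]

class ToyIVF:
    def __init__(self, vectors, n_lists, seed=0):
        rng = random.Random(seed)
        self.vectors = vectors
        # Assumption: centroids are sampled from the data instead of trained with k-means
        self.centroids = rng.sample(vectors, n_lists)
        self.lists = [[] for _ in range(n_lists)]
        # Assign every vector to the inverted list of its nearest centroid
        for i, v in enumerate(vectors):
            self.lists[self._nearest_centroids(v, 1)[0]].append(i)

    def _nearest_centroids(self, v, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: sq_dist(self.centroids[c], v))
        return order[:n]

    def search(self, query, k, nprobe):
        """Scan only the nprobe inverted lists closest to the query."""
        candidates = [i for c in self._nearest_centroids(query, nprobe)
                      for i in self.lists[c]]
        candidates.sort(key=lambda i: sq_dist(self.vectors[i], query))
        return candidates[:k]

def recall(approx, exact):
    """Fraction of the true neighbors that the approximate search found."""
    return len(set(approx) & set(exact)) / len(exact)

rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
query = [rng.gauss(0, 1) for _ in range(8)]

index = ToyIVF(data, n_lists=10)
truth = exact_search(data, query, k=5)
recall_all = recall(index.search(query, k=5, nprobe=10), truth)  # scans every list
recall_few = recall(index.search(query, k=5, nprobe=2), truth)   # scans 2 of 10 lists
print(recall_all, recall_few)
```

Probing all lists reproduces the exact result; probing fewer lists scans less data and may miss some true neighbors, which is the tradeoff the Recall and Speed columns summarize.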

Quick Start with Qdrant

python

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # or url="http://localhost:6333"

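# Create a collection; size must match the embedding model's output dimension
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

# Insert vectors ("..." stands for the remaining values; each vector needs 384 floats)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, ...], payload={"text": "Python tutorial"}),
        PointStruct(id=2, vector=[0.3, 0.1, ...], payload={"text": "JavaScript guide"}),
    ]
)

# Search for the 3 nearest points
# (newer qdrant-client releases deprecate search() in favor of query_points())
results = client.search(
    collection_name="docs",
    query_vector=[0.1, 0.2, ...],
    limit=3
)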

Metadata Filtering

Combine vector search with metadata filters when results must come from a specific subset:

python

from qdrant_client.models import Filter, FieldCondition, MatchValue

# Assumes `client` from the quick start and `query_embedding`, a 384-dimensional query vector
results = client.search(
    collection_name="docs",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="python"))]
    ),
    limit=5
)

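This restricts the search to matching points only, which on large collections can be roughly 10x faster than filtering results after the search. Post-search filtering can also return fewer than `limit` hits, because non-matching points take up slots in the top results.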
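The difference between filtering before and after the search can be shown with a brute-force sketch. This does not reflect how Qdrant applies filters internally; it only illustrates how the results differ. The sample points and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical points: three "js" documents near the query, two "python" ones further away
points = [
    {"id": 1, "vector": [0.0, 0.0], "category": "js"},
    {"id": 2, "vector": [0.1, 0.0], "category": "js"},
    {"id": 3, "vector": [0.2, 0.0], "category": "js"},
    {"id": 4, "vector": [0.9, 0.0], "category": "python"},
    {"id": 5, "vector": [1.0, 0.0], "category": "python"},
]

def top_k(candidates, query, k):
    """Return the k candidates closest to the query."""
    return sorted(candidates, key=lambda p: sq_dist(p["vector"], query))[:k]

def pre_filter_search(points, query, category, k):
    """Filter first, then rank only the matching subset."""
    subset = [p for p in points if p["category"] == category]
    return top_k(subset, query, k)

def post_filter_search(points, query, category, k):
    """Rank everything, then drop hits that fail the filter."""
    hits = top_k(points, query, k)
    return [p for p in hits if p["category"] == category]

query = [0.0, 0.0]
pre_ids = [p["id"] for p in pre_filter_search(points, query, "python", k=2)]
post_ids = [p["id"] for p in post_filter_search(points, query, "python", k=2)]
print(pre_ids)   # both python documents
print(post_ids)  # empty: the top 2 overall were all "js"
```

Filtering first always fills the result set when enough matching points exist; filtering afterwards can come back empty even though relevant documents are in the collection.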

Chunking for Embeddings

Content Type	Chunk Strategy	Chunk Size
General text	Recursive character split	512 chars
Code	Function-level split	1 function per chunk
Tables	Row-level or full table	Preserve structure
PDFs	Page or heading section	1–3 paragraphs
Conversations	Turn-level	1–3 exchanges
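As a rough illustration of chunking with overlap, the sketch below splits text into fixed-size character windows. It is a simplified stand-in for a recursive character splitter (it ignores sentence and paragraph boundaries), and the function name and the `overlap` default are assumptions chosen for the example.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(text, chunk_size=12, overlap=4)
for c in chunks:
    print(repr(c))
```

Each chunk repeats the last 4 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.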

Embedding Dimensionality Tradeoffs

Dimensions	Storage (1M vectors, float32)	Search Speed	Accuracy
384	~1.5 GB	Very fast	Good
768	~3 GB	Fast	Better
1,536	~6 GB	Moderate	High
3,072	~12 GB	Slower	Highest
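The storage column follows from simple arithmetic: vectors × dimensions × bytes per value. The sketch below assumes 4-byte float32 values and counts raw vector data only (no index structures or payloads); the function name is hypothetical.

```python
def raw_vector_storage_gb(n_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes), excluding index overhead and payloads."""
    return n_vectors * dims * bytes_per_value / 1e9

sizes = {d: raw_vector_storage_gb(1_000_000, d) for d in (384, 768, 1536, 3072)}
for dims, gb in sizes.items():
    print(f"{dims} dims: {gb:.2f} GB")

# Assumption for comparison: 1-byte values, as with int8 quantization
int8_size = raw_vector_storage_gb(1_000_000, 1536, bytes_per_value=1)
```

Doubling the dimensions doubles the raw storage, while storing each value in one byte instead of four cuts it to a quarter.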

Common Mistakes

Using a general embedding model for specialized domains (medical, legal, code): domain-specific models can outperform general ones by roughly 10–30%
Not normalizing vectors before dot-product search: scores are skewed by vector magnitude, so rankings differ from cosine similarity
Embedding queries and documents with different models: the vectors live in different spaces, so similarity scores become meaningless
Ignoring chunk overlap: splitting mid-sentence loses context at chunk boundaries
Not updating embeddings when source documents change: stale vectors return outdated results
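One way to avoid stale vectors is to store a content hash next to each embedding and re-embed only when the hash changes. This is a minimal sketch of that idea, not a feature of any particular vector database; the function names and the sample documents are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reembed(documents, stored_hashes):
    """Return ids whose text is new or has changed since the last embedding run."""
    return [doc_id for doc_id, text in documents.items()
            if stored_hashes.get(doc_id) != content_hash(text)]

# Hashes saved when the documents were last embedded
stored = {
    "a": content_hash("Install Python 3.11"),
    "b": content_hash("Use pip"),
}

# Current state of the source documents
docs = {"a": "Install Python 3.12", "b": "Use pip", "c": "New page"}

stale = docs_to_reembed(docs, stored)
print(stale)  # "a" changed and "c" is new; "b" is unchanged
```

In a real pipeline the hash could live in the point's payload, so the check and the vector stay together.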
