AiTechWorlds
Embedding models comparison, similarity metrics, Chroma/Qdrant/Pinecone setup, ANN algorithms — complete reference.
An embedding is a dense numerical vector that represents text (or images, audio) in a high-dimensional space where semantic similarity corresponds to geometric proximity.

```text
"dog"    → [0.21, -0.44, 0.89, ...]   (384-3072 numbers)
"puppy"  → [0.19, -0.41, 0.91, ...]   (very close)
"rocket" → [-0.78, 0.62, -0.11, ...]  (far away)
```

Embeddings power semantic search, RAG, classification, deduplication, clustering, and recommendation.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a small local embedding model (384-dimensional output)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["How do I install Python?", "Python installation guide"]
embeddings = model.encode(sentences)  # shape: (2, 384)

# Cosine similarity between the two sentence embeddings
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(sim[0][0])  # ~0.93, i.e. semantically similar
```

| Model | Dimensions | Max Tokens | Speed | Best For |
|---|---|---|---|---|
| text-embedding-3-small | 1,536 | 8,191 | Fast | Production RAG (OpenAI) |
| text-embedding-3-large | 3,072 | 8,191 | Slower | High-accuracy search |
| text-embedding-ada-002 | 1,536 | 8,191 | Fast | Legacy OpenAI |
| all-MiniLM-L6-v2 | 384 | 256 | Very fast | Local, edge devices |
| bge-large-en-v1.5 | 1,024 | 512 | Fast | Open-source English |
| nomic-embed-text-v1.5 | 768 | 8,192 | Fast | Long docs, open source |
| bge-m3 | 1,024 | 8,192 | Medium | Multilingual |
| e5-large-v2 | 1,024 | 512 | Medium | Passage retrieval |

| Metric | Formula | When to Use |
|---|---|---|
| Cosine similarity | cos θ = (A·B) / (‖A‖‖B‖) | Most embedding tasks: direction matters, not magnitude |
| Dot product | A · B | When embeddings are normalized (same as cosine) |
| Euclidean distance | ‖A - B‖₂ | When absolute position matters |
| Manhattan distance | Σ \|aᵢ - bᵢ\| | Sparse high-dim spaces |

Cosine similarity returns values from -1 to 1. As a rough rule of thumb, values above about 0.85 indicate a strong semantic match, though the useful threshold varies by model.
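
To make the four metrics concrete, here is a minimal pure-Python sketch. The three-component vectors reuse only the visible components of the toy "dog", "puppy", and "rocket" example and are illustrative, not real model output; the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Dot product A · B."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length ‖A‖."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """cos θ = (A·B) / (‖A‖‖B‖)."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    """‖A - B‖₂."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Σ |aᵢ - bᵢ|."""
    return sum(abs(x - y) for x, y in zip(a, b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

# Toy 3-dimensional vectors (illustrative values only)
dog = [0.21, -0.44, 0.89]
puppy = [0.19, -0.41, 0.91]
rocket = [-0.78, 0.62, -0.11]

print(cosine_similarity(dog, puppy))   # close to 1
print(cosine_similarity(dog, rocket))  # negative
print(euclidean(dog, puppy), manhattan(dog, puppy))

# On unit-length vectors the dot product equals cosine similarity
print(dot(normalize(dog), normalize(puppy)))
```

The last line shows why dot product is a safe substitute for cosine only after normalization: on unit vectors the denominator in the cosine formula is 1.
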
| Database | Type | Deployment | Max Scale | ANN Algorithm |
|---|---|---|---|---|
| Chroma | Open source | Local / self-hosted | Millions | HNSW |
| Pinecone | Managed cloud | Cloud only | Billions | Proprietary |
| Qdrant | Open source | Self-hosted / cloud | Billions | HNSW |
| Weaviate | Open source | Self-hosted / cloud | Billions | HNSW |
| Milvus | Open source | Self-hosted | Billions | IVF + HNSW |
| FAISS | Library | In-memory | Hundreds of millions | IVF / HNSW / PQ |
| pgvector | PostgreSQL ext. | Self-hosted | Millions | IVFFlat / HNSW |
| Redis VSS | Redis module | Self-hosted / cloud | Millions | HNSW |
| Algorithm | Recall | Speed | Memory | Notes |
|---|---|---|---|---|
| HNSW | Very high | Fast | High | Default for most DBs |
| IVF (Inverted File) | High | Fast | Medium | Good for billions of vectors |
| PQ (Product Quantization) | Medium | Very fast | Very low | Compress vectors |
| ScaNN | Very high | Very fast | Medium | Google, used in production |
| Annoy | Medium | Fast | Low | Read-only index (Spotify) |
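
The trade-off behind the IVF row can be sketched in a few lines of plain Python. This is a simplified illustration, not how any of the listed databases implement it: real IVF indexes usually train centroids with k-means, while this sketch just samples random vectors as centroids. The function names and parameters (`n_lists`, `n_probe`) are assumptions chosen for the example.

```python
import math
import random

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force(vectors, query, k):
    """Exact k-NN: score every vector and return the k nearest ids."""
    return sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))[:k]

def build_ivf(vectors, n_lists, seed=0):
    """Pick n_lists vectors as centroids and assign each vector to its nearest one."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for i, v in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return centroids, lists

def ivf_search(vectors, centroids, lists, query, k, n_probe):
    """Approximate k-NN: scan only the n_probe lists with the closest centroids."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:n_probe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(vectors[i], query))[:k]

# Synthetic data: 500 random 8-dimensional vectors
rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
query = [rng.gauss(0, 1) for _ in range(8)]

exact = brute_force(vectors, query, k=5)
centroids, lists = build_ivf(vectors, n_lists=10)
approx = ivf_search(vectors, centroids, lists, query, k=5, n_probe=3)

# Recall: fraction of the true nearest neighbours the approximate search found
recall = len(set(exact) & set(approx)) / len(exact)
print(f"recall@5 with n_probe=3: {recall:.2f}")
```

Raising `n_probe` scans more lists, which raises recall and lowers speed; probing every list reproduces the exact result.
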

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # or url="http://localhost:6333"

# Create a collection; size must match the embedding model (384 for all-MiniLM-L6-v2)
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Insert vectors ("..." stands for the remaining components; each vector needs 384 values)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, ...], payload={"text": "Python tutorial"}),
        PointStruct(id=2, vector=[0.3, 0.1, ...], payload={"text": "JavaScript guide"}),
    ],
)

# Search for the 3 nearest neighbours
# (newer qdrant-client releases mark search() as deprecated in favor of query_points())
results = client.search(
    collection_name="docs",
    query_vector=[0.1, 0.2, ...],
    limit=3,
)
```

Combine vector search with metadata filters for precision:
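
```python
from qdrant_client.models import Filter, FieldCondition, MatchValue

# query_embedding: the query text encoded with the same model as the stored vectors.
# The filter assumes each point carries a "category" field in its payload.
results = client.search(
    collection_name="docs",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="python"))]
    ),
    limit=5,
)
```

This restricts the search to the matching subset, which on large collections is often around 10x faster than filtering results after the search.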
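
Besides speed, filtering before the search also affects which results come back. The following plain-Python sketch, using hypothetical records and helper names, contrasts the two orders; it illustrates the idea only and does not reflect how Qdrant applies filters internally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical records: (id, vector, payload)
points = [
    (1, [1.0, 0.0], {"category": "javascript"}),
    (2, [0.9, 0.1], {"category": "javascript"}),
    (3, [0.8, 0.2], {"category": "javascript"}),
    (4, [0.5, 0.5], {"category": "python"}),
    (5, [0.1, 0.9], {"category": "python"}),
]
query = [1.0, 0.0]

def top_k(candidates, query, k):
    """Return the k candidates most similar to the query."""
    return sorted(candidates, key=lambda p: cosine(p[1], query), reverse=True)[:k]

def post_filter(points, query, k, category):
    """Search first, then drop non-matching hits: may return fewer than k."""
    hits = top_k(points, query, k)
    return [p[0] for p in hits if p[2]["category"] == category]

def pre_filter(points, query, k, category):
    """Restrict to matching points first, then search within that subset."""
    subset = [p for p in points if p[2]["category"] == category]
    return [p[0] for p in top_k(subset, query, k)]

print(post_filter(points, query, k=2, category="python"))  # []
print(pre_filter(points, query, k=2, category="python"))   # [4, 5]
```

Post-filtering returns nothing here because both of the top-2 hits are JavaScript documents, whereas filtering first still finds the two best Python matches.
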
| Content Type | Chunk Strategy | Chunk Size |
|---|---|---|
| General text | Recursive character split | 512 chars |
| Code | Function-level split | 1 function per chunk |
| Tables | Row-level or full table | Preserve structure |
| PDFs | Page or heading section | 1–3 paragraphs |
| Conversations | Turn-level | 1–3 exchanges |
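
For the "Recursive character split" row, a minimal sketch of the idea might look like this: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. It is a simplified illustration with hypothetical names and no chunk overlap, not a drop-in replacement for any library's splitter.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    preferring paragraph breaks, then line breaks, then spaces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)  # close the current chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # The piece alone is too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

text = "para one. " * 30 + "\n\n" + "para two. " * 30
chunks = recursive_split(text, chunk_size=100)
print(len(chunks), max(len(c) for c in chunks))
```

Because paragraph breaks are tried first, a chunk never spans two paragraphs unless both fit within `chunk_size` together.
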
| Dimensions | Storage (1M vectors) | Search Speed | Accuracy |
|---|---|---|---|
| 384 | ~1.5 GB | Very fast | Good |
| 768 | ~3 GB | Fast | Better |
| 1,536 | ~6 GB | Moderate | High |
| 3,072 | ~12 GB | Slower | Highest |
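
The storage column is consistent with 4-byte (float32) values and no index or payload overhead. A small sketch of that arithmetic, with a hypothetical helper name and 1 GB taken as 10^9 bytes:

```python
def raw_storage_gb(dimensions, n_vectors=1_000_000, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes), excluding index and payload overhead."""
    return dimensions * bytes_per_value * n_vectors / 1e9

for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5} dims: {raw_storage_gb(dims):.2f} GB")
# Prints 1.54, 3.07, 6.14 and 12.29 GB for the four sizes
```

Passing `bytes_per_value=1` gives the figure for vectors stored as 8-bit integers, which is a quarter of the float32 size.
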