What is semantic search and how is it different from keyword search?

Keyword search finds documents containing the exact query words. Semantic search finds documents with the same meaning, even with different words. Example: keyword search for 'car maintenance' won't find 'vehicle upkeep tips'; semantic search will. Semantic search uses embeddings — neural network representations of text meaning — to measure conceptual similarity. It works by: embedding the query and all documents into the same vector space, then finding documents whose embeddings are closest to the query embedding. This handles synonyms, paraphrases, and concept-level similarity that keyword search misses.
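
To make "closest in vector space" concrete, here is a minimal sketch of the similarity measure itself. The three-dimensional vectors are invented by hand for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption): NOT the output of any real model.
query = [0.9, 0.1, 0.0]  # stands in for "car maintenance"
documents = {
    "vehicle upkeep tips": [0.8, 0.2, 0.1],
    "chocolate cake recipe": [0.0, 0.1, 0.9],
}

# Rank documents by how closely their vector points in the query's direction.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

The query and "vehicle upkeep tips" share no words, but their vectors point in nearly the same direction, so that document ranks first.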

What embedding model should I use for semantic search?

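For most applications: OpenAI text-embedding-3-small (1536 dims, $0.02/1M tokens at the time of writing), which offers excellent quality and speed and is convenient if you already use the OpenAI ecosystem. For free/local: BAAI/bge-large-en-v1.5, a strong open model that topped the MTEB leaderboard when it was released (check the current leaderboard before committing). For multilingual: intfloat/multilingual-e5-large or paraphrase-multilingual-mpnet-base-v2. Critical rule: always use the same embedding model for documents and queries. Vectors from different models live in unrelated spaces, so mixing models makes similarity scores meaningless. Also critical: some models expect a prefix on the input. BGE v1.5 recommends an instruction prefix for queries ('Represent this sentence for searching relevant passages: ') but not for documents, and the E5 family expects 'query: ' on queries and 'passage: ' on documents.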
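
One way to avoid forgetting a prefix is to route every text through a single helper. This is a rough sketch: the prefix strings follow the BGE v1.5 and E5 model cards, while the `PREFIXES` table, the `add_prefix` name, and the substring matching on the model name are assumptions made for illustration.

```python
# Hypothetical lookup table; the prefix strings follow the model cards.
PREFIXES = {
    "bge": {
        "query": "Represent this sentence for searching relevant passages: ",
        "document": "",
    },
    "e5": {"query": "query: ", "document": "passage: "},
}


def add_prefix(text: str, model_name: str, kind: str) -> str:
    """Return text with the prefix its model family expects for a query or document."""
    if kind not in ("query", "document"):
        raise ValueError("kind must be 'query' or 'document'")
    lowered = model_name.lower()
    # Crude substring match on the model name (assumption); adjust for your models.
    for family, prefixes in PREFIXES.items():
        if family in lowered:
            return prefixes[kind] + text
    return text  # no known convention: pass the text through unchanged


print(add_prefix("wifi not working", "BAAI/bge-large-en-v1.5", "query"))
print(add_prefix("Troubleshooting network issues.", "intfloat/multilingual-e5-large", "document"))
```

Calling the same helper at index time (`kind="document"`) and at query time (`kind="query"`) keeps the two paths consistent.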

How do I improve semantic search relevance?

Common improvements in order: 1) Try a better embedding model. Quality varies significantly. MTEB leaderboard rankings are a useful starting signal, but they don't always carry over to your data, so compare candidates on a sample of your own queries. 2) Add hybrid search by combining semantic with BM25 keyword search. Hybrid often outperforms either alone, especially for queries with rare terms, names, or product codes. 3) Add reranking. Cross-encoder models rerank top results more accurately than bi-encoder similarity. 4) Improve chunking. Better chunk boundaries preserve semantic context. 5) Query expansion: generate alternative phrasings, retrieve for all, merge results. 6) Fine-tune embeddings on domain data, which can bring significant gains for specialized domains (legal, medical, technical).
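
For the "merge results" step of query expansion, one common option is reciprocal rank fusion (RRF), which needs only the rank positions from each result list. This is a sketch of that one technique rather than the only way to merge; the document ids are made up, and k=60 is the conventional default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up results for three phrasings of the same question (assumption).
results_per_phrasing = [
    ["doc_cancel", "doc_export", "doc_billing"],   # "cancel subscription"
    ["doc_cancel", "doc_billing", "doc_upgrade"],  # "end my membership"
    ["doc_export", "doc_cancel", "doc_billing"],   # "close my account"
]

merged = reciprocal_rank_fusion(results_per_phrasing)
print(merged)
```

Documents that show up in several lists rise, even if no single list ranked them first. Here `doc_billing` ends up ahead of `doc_export` because it appears in all three lists.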

What is the difference between bi-encoder and cross-encoder models?

Bi-encoder: encodes query and document separately, computes similarity of embeddings. Fast — documents can be pre-embedded. Used in vector databases. Good for initial retrieval. Cross-encoder: takes (query, document) as a pair and computes relevance score jointly. Much more accurate because it sees both texts together. But slow — must process each (query, document) pair at inference time. Can't pre-compute. Used for reranking: first retrieve 50-100 candidates with bi-encoder (fast), then rerank with cross-encoder (slow but accurate) to get top-10 results. This two-stage approach combines the speed of bi-encoders with the accuracy of cross-encoders.

How do I handle a multilingual corpus in semantic search?

Use a multilingual embedding model: paraphrase-multilingual-mpnet-base-v2 (50+ languages), intfloat/multilingual-e5-large (around 100 languages), or Cohere's embed-multilingual-v3.0. These produce embeddings in a shared multilingual space, so a query in English can find results in French, German, or Japanese with the same meaning. Store each document's language as metadata so you can filter when needed. For very high-quality cross-lingual search, translate queries into the corpus's primary language first (using the DeepL or Google Translate API), then do monolingual search. This sidesteps the quality limits of cross-lingual embeddings at the cost of translation API calls and added latency.

Semantic Search Tutorial: Build Search That Understands Meaning, Not Just Keywords

The search box on most internal tools is embarrassingly bad. You search for "cancel subscription" and get zero results because the document says "terminate membership." Keyword search fails whenever users don't know the exact terminology.

Semantic search fixes this by understanding meaning, not matching characters. I rebuilt a customer support search system from keyword to semantic and watched the "no results" rate drop from 34% to 4%. The implementation took a week. The user experience improvement was immediate.

Here's how to build it.

The Architecture

Semantic Search System:

Index Time (one-time):
  Documents → Embedding Model → Vectors
  Vectors → Vector Database (with metadata)

Query Time (real-time):
  User Query → Embedding Model → Query Vector
  Query Vector → Vector Database → Top-K Similar Vectors
  → Return Documents + Similarity Scores

Optional Improvements:
  → Hybrid: BM25 keyword + semantic fusion
  → Reranking: cross-encoder for better precision
  → Query expansion: multiple phrasings

Part 1: Basic Semantic Search

# pip install openai numpy sentence-transformers

import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticSearchEngine:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        self.documents = []
        self.embeddings = []
        self.metadata = []
    
    def embed(self, texts: list[str]) -> np.ndarray:
        response = client.embeddings.create(model=self.model, input=texts)
        return np.array([item.embedding for item in response.data])
    
    def add_documents(self, documents: list[str], metadata: list[dict] | None = None):
        print(f"Embedding {len(documents)} documents...")
        new_embeddings = self.embed(documents)
        
        self.documents.extend(documents)
        self.embeddings.extend(new_embeddings)
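        # Build a separate dict per document; [{}] * n would share one dict object
        self.metadata.extend(metadata if metadata is not None else [{} for _ in documents])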
        
        print(f"Total indexed: {len(self.documents)}")
    
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        query_emb = self.embed([query])[0]
        doc_embs = np.array(self.embeddings)
        
        # Cosine similarity
        query_norm = query_emb / np.linalg.norm(query_emb)
        doc_norms = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
        similarities = doc_norms @ query_norm
        
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        return [
            {
                "document": self.documents[i],
                "similarity": float(similarities[i]),
                "metadata": self.metadata[i],
                "rank": rank + 1
            }
            for rank, i in enumerate(top_indices)
        ]

# Example
search_engine = SemanticSearchEngine()

docs = [
    "How to cancel your subscription and get a refund.",
    "Troubleshooting network connectivity issues.",
    "Setting up two-factor authentication on your account.",
    "How to export your data before closing your account.",
    "Upgrading your plan to access premium features.",
    "Contacting customer support for billing inquiries.",
    "Password reset instructions for locked accounts.",
]

search_engine.add_documents(docs, metadata=[{"category": "support"} for _ in docs])

# Semantic matches: finds results even without exact keywords
queries = [
    "end my membership",           # → finds "cancel subscription"
    "wifi not working",            # → finds "network connectivity"
    "secure my login",             # → finds "two-factor authentication"
]

for query in queries:
    print(f"\nQuery: '{query}'")
    results = search_engine.search(query, top_k=2)
    for r in results:
        print(f"  {r['rank']}. [{r['similarity']:.3f}] {r['document']}")
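
One practical caveat: `embed` sends every text in a single request, and embedding APIs cap how many inputs and tokens one request may carry. A minimal sketch of splitting a large corpus into batches follows; the `batched` helper and the batch size of 256 are assumptions, so check the limits of the API you use.

```python
from typing import Iterator


def batched(texts: list[str], batch_size: int = 256) -> Iterator[list[str]]:
    """Yield consecutive slices of texts, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Usage sketch with the engine above (commented out so this block runs on its own):
# for batch in batched(all_documents):
#     search_engine.add_documents(batch)

print([len(batch) for batch in batched([f"doc {i}" for i in range(600)])])
```

Because `add_documents` appends to the existing lists, calling it once per batch builds the same index as one large call.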

Part 2: Free Local Embeddings with Sentence-Transformers

import numpy as np
from sentence_transformers import SentenceTransformer

# BAAI/bge-large-en-v1.5: free, open model with strong MTEB retrieval scores
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def semantic_search_local(
    query: str,
    corpus: list[str],
    top_k: int = 5
) -> list[dict]:
    # BGE models work better with a query prefix
    prefixed_query = f"Represent this sentence for searching relevant passages: {query}"
    
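    # Encode query and corpus.
    # Note: this re-encodes the whole corpus on every call, which is fine for a demo;
    # in a real system, encode the corpus once and reuse the embeddings.
    query_emb = model.encode(prefixed_query, normalize_embeddings=True)
    corpus_embs = model.encode(corpus, normalize_embeddings=True, batch_size=32)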
    
    # Cosine similarity (dot product since normalized)
    scores = corpus_embs @ query_emb
    
    # Top-k results
    top_indices = np.argsort(scores)[::-1][:top_k]
    
    return [
        {
            "document": corpus[i],
            "score": float(scores[i]),
            "rank": rank + 1
        }
        for rank, i in enumerate(top_indices)
    ]

results = semantic_search_local("how to terminate account", docs)
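
Encoding the corpus is the expensive step, so it is worth doing once instead of on every query. The sketch below shows that pattern with the encoder passed in as a function. `CachedCorpusIndex` and `toy_encode` are hypothetical names, and the toy encoder only counts a few hand-picked words so the block runs without downloading a model.

```python
from typing import Callable

Vector = list[float]


class CachedCorpusIndex:
    """Encodes the corpus once at construction and reuses the vectors per query."""

    def __init__(self, corpus: list[str], encode: Callable[[list[str]], list[Vector]]):
        self.corpus = corpus
        self.encode = encode
        self.corpus_vectors = encode(corpus)  # the only time the corpus is encoded

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        query_vector = self.encode([query])[0]
        # Dot product; this equals cosine similarity when the vectors are normalized.
        scores = [
            sum(q * d for q, d in zip(query_vector, vec))
            for vec in self.corpus_vectors
        ]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [
            {"document": self.corpus[i], "score": scores[i], "rank": rank + 1}
            for rank, i in enumerate(order)
        ]


def toy_encode(texts: list[str]) -> list[Vector]:
    """Stand-in encoder (assumption): word counts, NOT a real embedding model."""
    vocab = ["account", "network", "password"]
    return [[float(text.lower().count(word)) for word in vocab] for text in texts]


index = CachedCorpusIndex(
    ["Reset your password", "Network issues", "Close your account"], toy_encode
)
print(index.search("forgot password", top_k=1))
```

With the real model, the encoder could be `lambda texts: model.encode(texts, normalize_embeddings=True).tolist()`, with the BGE query prefix added to the query before calling `search`.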

Part 3: Hybrid Search (Semantic + BM25)

import re

import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25
from sentence_transformers import SentenceTransformer

class HybridSearchEngine:
    def __init__(self, semantic_weight: float = 0.6):
        """
        semantic_weight: 0.0 = pure keyword, 1.0 = pure semantic
        0.6 is a good starting point for most use cases
        """
        self.semantic_weight = semantic_weight
        self.bm25_weight = 1 - semantic_weight
        self.documents = []
        
        # Semantic components
        self.embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
        self.doc_embeddings = None
        
        # BM25 components
        self.bm25 = None
    
    def tokenize(self, text: str) -> list[str]:
        """Simple tokenizer for BM25."""
        return re.sub(r'[^a-z0-9\s]', '', text.lower()).split()
    
    def index(self, documents: list[str]):
        self.documents = documents
        
        # Build semantic index
        self.doc_embeddings = self.embed_model.encode(
            documents, normalize_embeddings=True, batch_size=32
        )
        
        # Build BM25 index
        tokenized = [self.tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        
        print(f"Indexed {len(documents)} documents")
    
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        n = len(self.documents)
        
        # Semantic scores
        query_prefix = f"Represent this sentence for searching relevant passages: {query}"
        query_emb = self.embed_model.encode(query_prefix, normalize_embeddings=True)
        semantic_scores = self.doc_embeddings @ query_emb
        
        # Normalize to [0, 1]; if every score is identical there is nothing to rank on,
        # so fall back to zeros rather than mixing in raw, unscaled values
        semantic_min, semantic_max = semantic_scores.min(), semantic_scores.max()
        if semantic_max > semantic_min:
            semantic_normalized = (semantic_scores - semantic_min) / (semantic_max - semantic_min)
        else:
            semantic_normalized = np.zeros_like(semantic_scores)
        
        # BM25 keyword scores
        tokenized_query = self.tokenize(query)
        bm25_scores = np.array(self.bm25.get_scores(tokenized_query))
        
        # Normalize BM25 the same way (all zeros when no query term matches)
        bm25_min, bm25_max = bm25_scores.min(), bm25_scores.max()
        if bm25_max > bm25_min:
            bm25_normalized = (bm25_scores - bm25_min) / (bm25_max - bm25_min)
        else:
            bm25_normalized = np.zeros_like(bm25_scores)
        
        # Combine scores
        combined = (
            self.semantic_weight * semantic_normalized +
            self.bm25_weight * bm25_normalized
        )
        
        top_indices = np.argsort(combined)[::-1][:top_k]
        
        return [
            {
                "document": self.documents[i],
                "combined_score": float(combined[i]),
                "semantic_score": float(semantic_normalized[i]),
                "bm25_score": float(bm25_normalized[i]),
                "rank": rank + 1
            }
            for rank, i in enumerate(top_indices)
        ]

hybrid = HybridSearchEngine(semantic_weight=0.6)
hybrid.index(docs)

# Test: hybrid finds both semantic matches AND exact keyword matches
results = hybrid.search("cancel account", top_k=3)
for r in results:
    print(f"Rank {r['rank']}: Sem={r['semantic_score']:.2f}, BM25={r['bm25_score']:.2f}")
    print(f"  {r['document']}")
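
To see what `semantic_weight` actually does, it helps to run the arithmetic on a few numbers by hand. The sketch below applies the same min-max scaling and weighted sum to made-up scores for three documents. The raw scores are invented, and returning zeros when all scores are equal is a choice made in this sketch.

```python
def min_max(scores: list[float]) -> list[float]:
    """Scale scores to [0, 1]; all-equal input maps to zeros (sketch choice)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(semantic: list[float], bm25: list[float], semantic_weight: float) -> list[float]:
    """Weighted sum of the two min-max normalized score lists."""
    sem_n, bm25_n = min_max(semantic), min_max(bm25)
    return [
        semantic_weight * s + (1 - semantic_weight) * b
        for s, b in zip(sem_n, bm25_n)
    ]


# Invented raw scores for three documents (assumption).
semantic_raw = [0.82, 0.78, 0.60]  # cosine similarities
bm25_raw = [0.0, 3.2, 1.6]         # BM25 scores live on a different scale

for weight in (1.0, 0.6):
    combined = fuse(semantic_raw, bm25_raw, weight)
    best = max(range(len(combined)), key=lambda i: combined[i])
    print(f"semantic_weight={weight}: best document index = {best}")
```

With pure semantic scoring, document 0 wins. At a weight of 0.6, document 1 overtakes it, because it is nearly as close semantically and also has the strongest keyword match.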

Part 4: Reranking for Better Precision

from sentence_transformers import CrossEncoder

class RerankedSearchEngine:
    def __init__(self):
        # Bi-encoder for fast initial retrieval
        self.bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
        # Cross-encoder for accurate reranking
        self.cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        self.documents = []
        self.doc_embeddings = None
    
    def index(self, documents: list[str]):
        self.documents = documents
        self.doc_embeddings = self.bi_encoder.encode(
            documents, normalize_embeddings=True
        )
    
    def search(self, query: str, top_k: int = 5, initial_k: int = 20) -> list[dict]:
        # Stage 1: Fast semantic retrieval (get more candidates than needed)
        query_emb = self.bi_encoder.encode(
            f"Represent this sentence for searching relevant passages: {query}",
            normalize_embeddings=True
        )
        scores = self.doc_embeddings @ query_emb
        top_initial_indices = np.argsort(scores)[::-1][:initial_k]
        
        # Stage 2: Accurate cross-encoder reranking
        candidates = [(query, self.documents[i]) for i in top_initial_indices]
        rerank_scores = self.cross_encoder.predict(candidates)
        
        # Sort candidates by rerank score (highest first) and keep the top_k.
        # Each value in sorted_positions is a position in the stage-1 candidate
        # list, so position + 1 is that document's initial rank.
        sorted_positions = np.argsort(rerank_scores)[::-1][:top_k]
        
        return [
            {
                "document": self.documents[top_initial_indices[pos]],
                "rerank_score": float(rerank_scores[pos]),
                "initial_rank": int(pos) + 1,
                "final_rank": final_rank + 1
            }
            for final_rank, pos in enumerate(sorted_positions)
        ]
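
Comparing each document's rank before and after reranking is a handy way to check whether the cross-encoder is earning its latency. Here is a minimal sketch of that bookkeeping on its own. The `rerank` helper is hypothetical, and the scores are stand-in numbers typed by hand, not the output of a cross-encoder.

```python
def rerank(candidates: list[str], scores: list[float], top_k: int) -> list[dict]:
    """Reorder stage-1 candidates by score, recording where each one started."""
    order = sorted(range(len(candidates)), key=lambda pos: scores[pos], reverse=True)
    return [
        {
            "document": candidates[pos],
            "rerank_score": scores[pos],
            "initial_rank": pos + 1,       # position in the stage-1 list
            "final_rank": final_rank + 1,  # position after reranking
        }
        for final_rank, pos in enumerate(order[:top_k])
    ]


# Candidates in stage-1 (bi-encoder) order, with made-up rerank scores (assumption).
candidates = [
    "Upgrading your plan to access premium features.",
    "How to cancel your subscription and get a refund.",
    "Contacting customer support for billing inquiries.",
]
stand_in_scores = [0.1, 0.9, 0.4]

for row in rerank(candidates, stand_in_scores, top_k=2):
    print(f"{row['initial_rank']} -> {row['final_rank']}: {row['document']}")
```

If documents rarely move between stages on your own queries, the reranker may not be worth the extra inference cost.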

Part 5: Production with Qdrant

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

class ProductionSemanticSearch:
    def __init__(self):
        self.qdrant = QdrantClient(host="localhost", port=6333)
        self.embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
        self.collection = "search_index"
    
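    def create_collection(self, dimension: int = 1024):
        # recreate_collection is deprecated in recent qdrant-client releases;
        # drop any existing collection explicitly, then create it fresh
        if self.qdrant.collection_exists(self.collection):
            self.qdrant.delete_collection(self.collection)
        self.qdrant.create_collection(
            collection_name=self.collection,
            vectors_config=VectorParams(size=dimension, distance=Distance.COSINE)
        )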
    
    def index_documents(self, documents: list[dict]):
        embeddings = self.embed_model.encode(
            [d["text"] for d in documents],
            normalize_embeddings=True,
            batch_size=32,
            show_progress_bar=True
        )
        
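        # Note: ids restart at 0 on every call, so indexing a second batch this way
        # would overwrite the first; use stable ids if you index incrementally
        points = [
            PointStruct(
                id=i,
                vector=emb.tolist(),
                # Keep the text in the payload so search results can return it
                payload=dict(doc)
            )
            for i, (doc, emb) in enumerate(zip(documents, embeddings))
        ]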
        
        self.qdrant.upsert(collection_name=self.collection, points=points)
    
    def search(self, query: str, top_k: int = 10) -> list[dict]:
        query_emb = self.embed_model.encode(
            f"Represent this sentence for searching relevant passages: {query}",
            normalize_embeddings=True
        )
        
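        # query_points replaces the deprecated search() in recent qdrant-client
        # releases; on older clients, use search(query_vector=...) instead
        response = self.qdrant.query_points(
            collection_name=self.collection,
            query=query_emb.tolist(),
            limit=top_k,
            with_payload=True
        )
        
        return [{"score": p.score, **p.payload} for p in response.points]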
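
A detail that matters once you index more than one batch: point ids should be stable, so re-indexing a document updates it instead of colliding with another one. Qdrant accepts unsigned integers or UUIDs as ids, and a deterministic UUID derived from a document key is one option. This is a sketch under assumptions: the namespace URL and the `stable_point_id` helper are hypothetical, and it presumes each document has a unique key such as a URL or database id.

```python
import uuid

# Hypothetical namespace for this index; any fixed UUID works.
INDEX_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/search_index")


def stable_point_id(source_key: str) -> str:
    """Derive the same UUID string every time for the same document key."""
    return str(uuid.uuid5(INDEX_NAMESPACE, source_key))


first = stable_point_id("kb/articles/cancel-subscription")
again = stable_point_id("kb/articles/cancel-subscription")
other = stable_point_id("kb/articles/reset-password")
print(first == again, first == other)
```

Passing `id=stable_point_id(doc["key"])` to `PointStruct` (where `key` is a field you would add to each document) makes `upsert` idempotent across indexing runs.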

Conclusion

Semantic search transforms user experience in any search-heavy application. The implementation path is clear: start with simple bi-encoder search, add hybrid BM25 fusion for better recall, add reranking for precision, and scale with Qdrant or Pinecone when document volumes grow.

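For many use cases, BAAI/bge-large-en-v1.5 (free) with hybrid search can get close to the quality of OpenAI's embedding API with no per-token API fees, though you still pay for your own compute. Measure both on a sample of your own queries before committing.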

For the vector database that stores these embeddings at scale, see our vector database guide. For building a complete RAG system on top of this search layer, see our RAG system tutorial.