How to Use LangChain with Pinecone Serverless (Cloud RAG)

⚡ Quick Answer

Deploy cloud-native RAG with LangChain and Pinecone Serverless. Complete guide covering setup, upsert, query, namespaces, metadata filtering, and cost estimates.

AiTechWorlds Team May 31, 2026 11 min read

#LangChain #Pinecone #serverless #vector database #RAG

Cloud RAG is the architecture that makes AI applications actually scalable. Running your own vector database is fine for prototypes, but the moment you hit millions of documents or need zero-maintenance infrastructure, Pinecone Serverless becomes the obvious choice.

This guide walks you through every step of building a production RAG system with LangChain and Pinecone Serverless — from account setup to namespaced multi-tenant queries to cost optimization.

Before you start, make sure you're comfortable with embedding basics from the Vector database guide and the RAG system tutorial.

What Is Pinecone Serverless?

Pinecone launched its serverless tier in early 2024. Unlike pod-based indexes (which require you to provision p1, p2, or s1 pod types), serverless indexes scale automatically. You pay only for what you use. The rates below are approximate, so check Pinecone's pricing page for current figures:

Storage: ~$0.033 per GB per month
Read units: ~$4 per million read units (treated here as 1 RU ≈ retrieving 1 vector, a simplification; actual read-unit usage depends on how much data a query scans)
Write units: ~$2 per million write units

For a RAG application with 100K documents (roughly 1M vectors at 1,536 dimensions), the raw float32 vectors take about 6 GB (1,000,000 × 1,536 × 4 bytes), so monthly storage runs about $0.20 at the rate above. Read costs at 10,000 queries/day with k=5 retrieval: 10,000 × 5 × 30 = 1.5M RUs ≈ $6/month. Compare this to a dedicated pod costing $70+/month.
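The arithmetic above is easy to rerun for your own workload. The sketch below is a rough estimator, assuming the approximate rates quoted in this section, raw float32 storage with no metadata, and the simplified "1 RU per retrieved vector" model. The function name and constants are illustrative, not part of any Pinecone SDK.

```python
# Rough monthly cost estimate for a serverless index.
# Rates are the approximate figures quoted in this article, not live pricing.
STORAGE_USD_PER_GB = 0.033
READ_USD_PER_MILLION = 4.0
BYTES_PER_FLOAT32 = 4


def estimate_monthly_cost(num_vectors, dimensions, queries_per_day, k, days=30):
    """Return (storage_usd, read_usd) under the simplified 1 RU ~ 1 retrieved vector model."""
    # Raw vector payload only; metadata and index overhead are ignored
    storage_gb = num_vectors * dimensions * BYTES_PER_FLOAT32 / 1e9
    storage_usd = storage_gb * STORAGE_USD_PER_GB
    read_units = queries_per_day * k * days
    read_usd = read_units / 1_000_000 * READ_USD_PER_MILLION
    return storage_usd, read_usd


storage, reads = estimate_monthly_cost(1_000_000, 1536, 10_000, 5)
print(f"storage ~ ${storage:.2f}/month, reads ~ ${reads:.2f}/month")
```

Swap in your own vector count, k, and traffic to see which term dominates. For most RAG workloads at this scale it is reads, not storage.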

Installation and Setup

pip install langchain langchain-community langchain-openai langchain-pinecone pinecone python-dotenv beautifulsoup4

import os
from dotenv import load_dotenv
load_dotenv()

# Required environment variables:
# PINECONE_API_KEY=your-pinecone-api-key
# OPENAI_API_KEY=your-openai-api-key
# PINECONE_INDEX_NAME=your-index-name
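
A missing key usually surfaces later as a confusing authentication error, so it can help to fail fast at startup. Here is a minimal sketch of that check. The helper name `require_env` is hypothetical, and the demo uses a plain dict in place of the real environment.

```python
import os


def require_env(names, env=None):
    """Return the requested variables as a dict, or raise listing every missing one."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Demo with a plain dict standing in for os.environ
demo_env = {"PINECONE_API_KEY": "pc-demo", "OPENAI_API_KEY": "sk-demo"}
settings = require_env(["PINECONE_API_KEY", "OPENAI_API_KEY"], env=demo_env)
```

In a real app you would call `require_env([...])` right after `load_dotenv()` and let it read from `os.environ`.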

Creating a Pinecone Serverless Index

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

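# Use PINECONE_INDEX_NAME if set, otherwise fall back to a demo name
index_name = os.environ.get("PINECONE_INDEX_NAME", "langchain-rag-demo")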

# Check if index exists, create if not
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,          # OpenAI text-embedding-ada-002 / text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",          # "aws", "gcp", or "azure"
            region="us-east-1"    # region for your cloud provider
        )
    )
    print(f"Created index: {index_name}")
else:
    print(f"Index {index_name} already exists")

# Get index stats
index = pc.Index(index_name)
print(index.describe_index_stats())
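
A freshly created index may take a few seconds to become ready, so an upsert issued immediately afterwards can fail. One common approach is to poll until the index reports ready. The sketch below is a generic polling helper (the name `wait_until` is hypothetical) demonstrated with a fake readiness check. With the Pinecone client you would likely pass something like `lambda: pc.describe_index(index_name).status["ready"]`, which is an assumption about your client version and worth checking against its docs.

```python
import time


def wait_until(is_ready, timeout_s=60.0, interval_s=1.0):
    """Poll is_ready() until it returns True or timeout_s elapses. Returns True on success."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demo: a fake readiness check that succeeds on the third poll
polls = {"count": 0}


def fake_ready():
    polls["count"] += 1
    return polls["count"] >= 3


ok = wait_until(fake_ready, timeout_s=5, interval_s=0)
```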

Dimension guide:

text-embedding-3-small: 1,536 dimensions (default), or 512 with dimensions param
text-embedding-3-large: 3,072 dimensions
text-embedding-ada-002: 1,536 dimensions
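
A dimension mismatch between the embedding model and the index is one of the most common setup errors, and it only shows up at upsert time. The sketch below encodes the guide above as a lookup and checks it before any API call. The helper names are hypothetical, and the table only covers the three models listed here.

```python
# Default output sizes from the dimension guide above
DEFAULT_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def embedding_dimension(model, requested=None):
    """Dimension an embedding call would produce: the requested size if given, else the default."""
    default = DEFAULT_DIMS[model]
    if requested is None:
        return default
    if model == "text-embedding-ada-002":
        raise ValueError("ada-002 does not accept a dimensions parameter")
    if not 0 < requested <= default:
        raise ValueError(f"{model} supports at most {default} dimensions")
    return requested


def check_index_dimension(model, index_dimension, requested=None):
    """Raise if the embedding size would not match the index dimension."""
    actual = embedding_dimension(model, requested)
    if actual != index_dimension:
        raise ValueError(f"{model} produces {actual} dims but the index expects {index_dimension}")
    return actual
```

For example, `check_index_dimension("text-embedding-3-large", 1536)` raises, which is the mistake you want to catch before ingesting a million chunks.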

Loading Documents and Creating the Vector Store

from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536
)

# Load and split documents
loader = WebBaseLoader([
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
])
raw_docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
docs = splitter.split_documents(raw_docs)
print(f"Split into {len(docs)} chunks")

# Create vector store and upsert documents
vectorstore = PineconeVectorStore.from_documents(
    documents=docs,
    embedding=embeddings,
    index_name=index_name,
    namespace="prod"  # namespace for isolation
)

print("Documents upserted to Pinecone")

About namespaces: Each Pinecone index supports multiple namespaces. Think of them as logical partitions — same index, isolated data. Use them for:

Multi-tenant SaaS (one namespace per customer)
Environment isolation (dev/staging/prod)
Document collection separation (docs/blog/support)
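
If you combine these uses, for example one namespace per customer per environment, it helps to build namespace strings in one place and validate the parts first. This is a small sketch of that idea. The naming scheme and the character rule are assumptions made for illustration, not Pinecone requirements.

```python
import re

# Hypothetical rule for this sketch: short, URL-safe identifiers only
_SAFE_PART = re.compile(r"[A-Za-z0-9_-]{1,64}")


def namespace_for(env, tenant_id):
    """Build a namespace such as 'prod-tenant_customer_123' from validated parts."""
    for part in (env, tenant_id):
        if not _SAFE_PART.fullmatch(part):
            raise ValueError(f"Unsafe namespace component: {part!r}")
    return f"{env}-tenant_{tenant_id}"


prod_ns = namespace_for("prod", "customer_123")
```

Deriving the tenant part from the authenticated session, never from request input, keeps one customer from querying another customer's namespace.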

Basic Similarity Search

# Connect to existing index (skip from_documents on second run)
vectorstore = PineconeVectorStore(
    index_name=index_name,
    embedding=embeddings,
    namespace="prod"
)

# Simple similarity search
query = "How do I reset my password?"
results = vectorstore.similarity_search(query, k=5)
for doc in results:
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}\n")

# With similarity scores
results_with_scores = vectorstore.similarity_search_with_score(query, k=5)
for doc, score in results_with_scores:
    print(f"Score: {score:.4f} | {doc.page_content[:150]}")

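With a cosine index, Pinecone returns cosine similarity scores. The theoretical range is -1 to 1, though text embeddings usually score between 0 and 1. What counts as a strong match depends on the embedding model, so treat a cutoff like 0.85 as a starting point and calibrate it on your own queries.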
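
To make the score concrete, here is cosine similarity computed by hand in plain Python. It is only an illustration of the metric, not how Pinecone computes it internally, and the toy vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # opposite direction
```

The score depends only on direction, not length, which is why the first pair scores as a perfect match even though one vector is twice as long.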

Metadata Filtering

One of Pinecone's most powerful features is metadata filtering. You can narrow vector search to specific subsets of your data:

from langchain_core.documents import Document

# Upsert with rich metadata
# "version" is stored as a number because Pinecone's range operators
# ($gt, $gte, $lt, $lte) compare numeric values only
docs_with_metadata = [
    Document(
        page_content="How to configure two-factor authentication in the admin panel.",
        metadata={
            "source": "admin-docs",
            "category": "security",
            "version": 2.0,
            "last_updated": "2026-01-15",
            "tenant_id": "customer_123"
        }
    ),
    Document(
        page_content="Setting up email notifications for billing events.",
        metadata={
            "source": "billing-docs",
            "category": "billing",
            "version": 1.5,
            "last_updated": "2025-11-20",
            "tenant_id": "customer_456"
        }
    )
]

vectorstore.add_documents(docs_with_metadata)

# Filter by category
security_results = vectorstore.similarity_search(
    query="authentication settings",
    k=5,
    filter={"category": "security"}
)

# Filter by tenant (multi-tenant RAG); customer_456 owns the billing document
tenant_results = vectorstore.similarity_search(
    query="billing notifications",
    k=5,
    filter={"tenant_id": "customer_456"}
)

# Complex filter: category AND version (numeric comparison)
filtered_results = vectorstore.similarity_search(
    query="two factor auth",
    k=5,
    filter={
        "$and": [
            {"category": {"$eq": "security"}},
            {"version": {"$gte": 2.0}}
        ]
    }
)

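Pinecone supports $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or operators in metadata filters. The range operators ($gt, $gte, $lt, $lte) apply to numeric values only, so store anything you want to compare by range as a number rather than a string.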
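
If you want to unit test filter logic without touching the index, a tiny local evaluator can mimic the operators listed above. The sketch below is an illustration of the semantics only and is not Pinecone's implementation. In particular, treating a missing field as a non-match for every operator is an assumption of this sketch.

```python
import operator

_COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches(metadata, flt):
    """Return True if metadata satisfies the filter dict (local illustration only)."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in cond):
                return False
        else:
            if key not in metadata:
                return False  # assumption: missing fields never match
            value = metadata[key]
            # A bare value is shorthand for {"$eq": value}
            conditions = cond if isinstance(cond, dict) else {"$eq": cond}
            for op, target in conditions.items():
                if op == "$in":
                    ok = value in target
                elif op == "$nin":
                    ok = value not in target
                else:
                    ok = _COMPARATORS[op](value, target)
                if not ok:
                    return False
    return True


doc = {"category": "security", "version": 2.0, "tenant_id": "customer_123"}
is_security_v2 = matches(doc, {"$and": [{"category": {"$eq": "security"}},
                                        {"version": {"$gte": 2.0}}]})
```

Note that `version` is numeric in this demo so the `$gte` comparison behaves as a range check.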

Building the RAG Chain

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI

# Create retriever from vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 6,
        "filter": {"category": "security"}  # optional global filter
    }
)

# RAG prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful documentation assistant. 
    Answer questions based only on the provided context.
    If the answer is not in the context, say "I don't have that information."
    Always cite the source document when answering."""),
    ("human", """Context:
{context}

Question: {question}""")
])

def format_docs(docs):
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        formatted.append(f"[{i}] Source: {source}\n{doc.page_content}")
    return "\n\n".join(formatted)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

rag_chain = (
    RunnableParallel({
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    })
    | prompt
    | llm
    | StrOutputParser()
)

# Run a query
answer = rag_chain.invoke("How do I enable two-factor authentication?")
print(answer)

Upsert with Explicit IDs and Batch Processing

For production ingestion pipelines, control IDs and batch size explicitly:

import hashlib
from typing import List
from langchain_core.documents import Document

def doc_to_id(doc: Document) -> str:
    """Generate deterministic ID from document content + source."""
    content = doc.page_content + doc.metadata.get("source", "")
    return hashlib.md5(content.encode()).hexdigest()

def batch_upsert(
    docs: List[Document],
    vectorstore: PineconeVectorStore,
    batch_size: int = 100,
    namespace: str = "prod"
) -> None:
    """Upsert documents in batches with progress tracking."""
    total = len(docs)
    for i in range(0, total, batch_size):
        batch = docs[i:i + batch_size]
        ids = [doc_to_id(doc) for doc in batch]
        
        vectorstore.add_documents(
            documents=batch,
            ids=ids,
            namespace=namespace
        )
        
        pct = min(100, (i + batch_size) / total * 100)
        print(f"Upserted {min(i + batch_size, total)}/{total} ({pct:.0f}%)")

# Process a large document collection
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
all_docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(all_docs)

print(f"Total chunks to upsert: {len(chunks)}")
batch_upsert(chunks, vectorstore, batch_size=100)

A batch size of around 100 vectors per upsert call is a safe default for 1,536-dimension embeddings. Pinecone caps the size of each upsert request, so much larger batches can be rejected, while very small batches add API overhead.
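
The slicing loop in batch_upsert needs the whole list in memory. If your chunks come from a generator or a stream, a small batching helper does the same job lazily. This is a generic sketch using only the standard library, and the helper name `batched` is chosen for this example.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 250 items in batches of 100: two full batches and one partial batch
sizes = [len(batch) for batch in batched(range(250), 100)]
```

Each yielded batch could then be passed to `vectorstore.add_documents(...)` exactly as in the loop above.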

Updating and Deleting Vectors

# Delete specific documents by ID
vectorstore.delete(ids=["abc123", "def456"])

# Delete all vectors in a namespace
index = pc.Index(index_name)
index.delete(delete_all=True, namespace="dev")

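# Update a document (delete + re-add).
# doc_to_id() hashes the content, so an edited document gets a NEW ID.
# Pass the ID of the stale version explicitly so it is actually removed.
def update_document(
    old_id: str,
    doc: Document,
    vectorstore: PineconeVectorStore,
    namespace: str = "prod"
) -> str:
    new_id = doc_to_id(doc)
    
    # Delete old version from the same namespace the new one is written to
    vectorstore.delete(ids=[old_id], namespace=namespace)
    
    # Add new version
    vectorstore.add_documents(
        documents=[doc],
        ids=[new_id],
        namespace=namespace
    )
    return new_id

updated_doc = Document(
    page_content="Updated: 2FA now supports hardware security keys in addition to TOTP apps.",
    metadata={"source": "admin-docs", "category": "security", "version": 2.1}  # numeric, so range filters work
)
update_document("abc123", updated_doc, vectorstore)  # "abc123" is the ID of the stale version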
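
One thing worth checking with content-derived IDs: editing a document changes its hash, so the new version lands under a different ID and the old record is not overwritten. The sketch below shows the effect and one possible alternative, an ID derived from the chunk's location. `stable_id` and its `source#index` scheme are hypothetical, and whether they fit depends on how stable your chunk boundaries are.

```python
import hashlib


def content_id(text, source):
    """ID derived from content + source, in the style of doc_to_id above."""
    return hashlib.md5((text + source).encode()).hexdigest()


def stable_id(source, chunk_index):
    """Hypothetical alternative: ID derived from where the chunk lives, not what it says."""
    return hashlib.md5(f"{source}#{chunk_index}".encode()).hexdigest()


old_text = "2FA supports TOTP apps."
new_text = "Updated: 2FA now supports hardware security keys in addition to TOTP apps."

# Editing the text changes a content-derived ID...
content_changed = content_id(old_text, "admin-docs") != content_id(new_text, "admin-docs")
# ...while a location-derived ID stays the same, so an upsert overwrites in place
stable_unchanged = stable_id("admin-docs", 0) == stable_id("admin-docs", 0)
```

Content hashes are a good fit for deduplicating identical chunks. Location-based IDs are a better fit when documents are edited in place.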

Multi-Namespace RAG for Multi-Tenant Applications

from langchain_pinecone import PineconeVectorStore
from typing import List

class MultiTenantRAG:
    def __init__(self, index_name: str, embeddings, llm):
        self.index_name = index_name
        self.embeddings = embeddings
        self.llm = llm

    def get_retriever(self, tenant_id: str, k: int = 5):
        """Get a retriever scoped to a specific tenant namespace."""
        vs = PineconeVectorStore(
            index_name=self.index_name,
            embedding=self.embeddings,
            namespace=f"tenant_{tenant_id}"
        )
        return vs.as_retriever(search_kwargs={"k": k})

    def answer(self, question: str, tenant_id: str) -> str:
        retriever = self.get_retriever(tenant_id)

        prompt = ChatPromptTemplate.from_messages([
            ("system", "Answer based on the provided context only."),
            ("human", "Context:\n{context}\n\nQuestion: {question}")
        ])

        chain = (
            RunnableParallel({
                "context": retriever | format_docs,
                "question": RunnablePassthrough()
            })
            | prompt
            | self.llm
            | StrOutputParser()
        )

        return chain.invoke(question)

    def ingest_for_tenant(self, docs: List[Document], tenant_id: str):
        """Ingest documents into a tenant-specific namespace."""
        vs = PineconeVectorStore(
            index_name=self.index_name,
            embedding=self.embeddings,
            namespace=f"tenant_{tenant_id}"
        )
        chunks = splitter.split_documents(docs)
        vs.add_documents(chunks)
        print(f"Ingested {len(chunks)} chunks for tenant {tenant_id}")

# Usage
rag = MultiTenantRAG(
    index_name=index_name,
    embeddings=embeddings,
    llm=ChatOpenAI(model="gpt-4o")
)

# Tenant A gets only their data
answer_a = rag.answer("What is my billing cycle?", tenant_id="customer_123")
# Tenant B gets only their data
answer_b = rag.answer("What is my billing cycle?", tenant_id="customer_456")

This pattern is one of the cleanest ways to build multi-tenant AI applications. Each customer's data stays logically isolated at the namespace level while sharing the same underlying index infrastructure.

For the agent side of this architecture, see Build AI agent with LangChain and the AI research agent build.

Pinecone Serverless vs Pod-Based vs Self-Hosted

Feature	Serverless	Pod-Based (p2)	Self-Hosted (Weaviate/Qdrant)
Setup time	2 minutes	5 minutes	30–120 minutes
Maintenance	Zero	Low	High
Latency (p99)	~50–200ms	~10–50ms	~5–20ms
Cost (100K docs)	~$6/month	~$70/month	Server cost (~$20–80/month)
Max dimensions	20,000	20,000	65,535+
Namespaces	Yes	Yes	Collections/tenants
Metadata filtering	Yes	Yes	Yes
Hybrid search	Beta	Yes	Yes
Data residency	AWS/GCP/Azure	AWS/GCP/Azure	Full control

Cost calculation example (1M vectors, 1,536 dims, 10K queries/day):

Serverless: ~$0.20/month storage (about 6 GB of raw vectors at ~$0.033/GB) + ~$6/month reads ≈ $6/month
Pod-based (p2.x1): $87/month (fixed)
Self-hosted on AWS t3.xlarge: $120/month (EC2 + storage + ops time)

Serverless wins below ~5M vectors. At very high query throughput (>100K queries/day), pod-based latency advantages can justify the cost difference.
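
You can estimate where that crossover sits for your own numbers. The sketch below computes the daily query volume at which serverless read costs would equal a fixed pod price. It assumes the approximate per-unit rates used in this article, the simplified "1 RU per retrieved vector" model, and raw float32 storage. The function name is illustrative.

```python
def breakeven_queries_per_day(pod_usd_per_month, storage_usd_per_month,
                              k=5, read_usd_per_million=4.0, days=30):
    """Daily query volume at which serverless read costs match a fixed pod price."""
    usd_per_query = k * read_usd_per_million / 1_000_000
    budget_for_reads = pod_usd_per_month - storage_usd_per_month
    return budget_for_reads / (usd_per_query * days)


# Pod price from the comparison above; storage is raw float32 size at ~$0.033/GB
storage_usd = 1_000_000 * 1536 * 4 / 1e9 * 0.033
threshold = breakeven_queries_per_day(pod_usd_per_month=87.0,
                                      storage_usd_per_month=storage_usd)
print(f"Break-even at roughly {threshold:,.0f} queries/day")
```

Under these assumptions the crossover lands above 100K queries/day, which matches the rule of thumb above. Latency requirements may still justify pods earlier.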

Hybrid Search (Dense + Sparse)

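Pinecone supports hybrid search combining dense vector similarity with sparse keyword matching such as BM25. Sparse-dense queries require an index created with metric="dotproduct", so the cosine index created earlier needs a dotproduct counterpart before this retriever will work: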

# pip install pinecone-text
from pinecone_text.sparse import BM25Encoder
from langchain_community.retrievers import PineconeHybridSearchRetriever

# Fit BM25 on your corpus
bm25_encoder = BM25Encoder()
bm25_encoder.fit([doc.page_content for doc in docs])
bm25_encoder.dump("bm25_params.json")

# Create hybrid retriever
hybrid_retriever = PineconeHybridSearchRetriever(
    embeddings=embeddings,
    sparse_encoder=bm25_encoder,
    index=index,
    top_k=5,
    alpha=0.5  # 0=pure sparse (BM25), 1=pure dense (embeddings), 0.5=balanced
)

# Hybrid search handles both semantic and keyword queries well
results = hybrid_retriever.invoke("2FA hardware key FIDO2 setup")

Hybrid search is particularly valuable for technical documentation, code search, and any domain with specialized terminology where exact keyword matching matters.
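
The alpha parameter is easiest to understand as a weighting applied to the two query vectors before they are sent. The sketch below shows one common interpretation, a convex combination where the dense vector is scaled by alpha and the sparse values by (1 - alpha). It is a standalone illustration with made-up vectors, and it assumes the sparse vector uses an `indices`/`values` layout.

```python
def hybrid_scale(dense, sparse, alpha):
    """Weight a dense vector by alpha and a sparse vector by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [value * (1.0 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


dense_query = [0.2, 0.4]
sparse_query = {"indices": [10, 45], "values": [0.5, 1.0]}

balanced_dense, balanced_sparse = hybrid_scale(dense_query, sparse_query, alpha=0.5)
dense_only, sparse_zeroed = hybrid_scale(dense_query, sparse_query, alpha=1.0)
```

With alpha=1.0 the sparse values are all zero, so keyword matches contribute nothing. With alpha=0.0 the dense side is zeroed instead.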

Async Operations for High Throughput

import asyncio
from langchain_pinecone import PineconeVectorStore

async def async_rag_pipeline(questions: list[str], tenant_id: str) -> list[str]:
    vs = PineconeVectorStore(
        index_name=index_name,
        embedding=embeddings,
        namespace=f"tenant_{tenant_id}"
    )
    retriever = vs.as_retriever(search_kwargs={"k": 5})

    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based on context only."),
        ("human", "Context:\n{context}\n\nQuestion: {question}")
    ])

    chain = (
        RunnableParallel({
            "context": retriever | format_docs,
            "question": RunnablePassthrough()
        })
        | prompt
        | ChatOpenAI(model="gpt-4o")
        | StrOutputParser()
    )

    # Run all questions concurrently
    tasks = [chain.ainvoke(q) for q in questions]
    answers = await asyncio.gather(*tasks)
    return answers

# Run 10 concurrent queries
questions = [f"Question {i}" for i in range(10)]
answers = asyncio.run(async_rag_pipeline(questions, "customer_123"))

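Async operations matter for production RAG applications. A synchronous pipeline handles one request at a time per worker, so 100 concurrent users end up waiting in a queue. With async, the event loop overlaps the network waits (embedding, Pinecone, and LLM calls), so many requests make progress concurrently on a single thread. This is I/O concurrency rather than CPU parallelism.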
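
Unbounded asyncio.gather can fire hundreds of requests at once and trip provider rate limits. A semaphore keeps the number of in-flight calls under a cap. The sketch below demonstrates the pattern with a fake coroutine standing in for chain.ainvoke. The helper name `run_bounded` and the limit of 3 are arbitrary choices for this example.

```python
import asyncio


async def run_bounded(questions, answer_one, max_concurrency=5):
    """Run answer_one(q) for every question with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(question):
        async with semaphore:
            return await answer_one(question)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(q) for q in questions))


# Demo: a fake chain call that records how many calls run at the same time
stats = {"in_flight": 0, "peak": 0}


async def fake_answer(question):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stands in for network latency
    stats["in_flight"] -= 1
    return f"answer to {question}"


answers = asyncio.run(
    run_bounded([f"Question {i}" for i in range(10)], fake_answer, max_concurrency=3)
)
```

In the pipeline above you would pass `chain.ainvoke` as `answer_one` and tune `max_concurrency` to your rate limits.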

Monitoring Index Health

from pinecone import Pinecone

def monitor_index(index_name: str):
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index(index_name)
    
    stats = index.describe_index_stats()
    
    print(f"Total vectors: {stats.total_vector_count:,}")
    print(f"Dimension: {stats.dimension}")
    print(f"Index fullness: {stats.index_fullness:.2%}")
    
    print("\nNamespace breakdown:")
    for namespace, ns_stats in stats.namespaces.items():
        print(f"  {namespace}: {ns_stats.vector_count:,} vectors")

monitor_index(index_name)

Watch for index_fullness approaching 1.0 on pod-based indexes (serverless scales automatically, so the value is not meaningful there). Query latency above 500ms usually indicates the need to reduce k or, on pod-based indexes, upgrade the pod type.
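
Index stats do not include latency, so you need to measure it on the client side. The sketch below summarizes a list of recorded query timings and flags when the 99th percentile crosses the 500ms mark mentioned above. The sample values are invented, and how you collect timings (for example with time.perf_counter around each query) is left to your app.

```python
import statistics


def latency_report(samples_ms, threshold_ms=500.0):
    """Summarize query latencies and flag whether p99 exceeds the threshold."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points: index 49 is p50, index 98 is p99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    return {"p50_ms": p50, "p99_ms": p99, "slow": p99 > threshold_ms}


# Invented sample: mostly fast queries with a slow tail
report = latency_report([80] * 95 + [900] * 5)
```

Looking at p99 rather than the average matters here because a handful of slow queries is exactly what users notice.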

Pinecone Serverless is the Right Default for Cloud RAG

If you're building a new RAG application and don't have a specific reason to host your own vector database, start with Pinecone Serverless. The zero-maintenance infrastructure, pay-per-use pricing, and tight LangChain integration make it the fastest path to production.

The namespace feature alone is worth it for SaaS builders — you get tenant isolation, environment separation, and collection management without managing separate databases.

Combine Pinecone Serverless with the OpenAI API integration for embeddings and the Deploy AI model to production guide for deployment patterns. If you're comparing options, the LangChain tutorial 2025 covers ChromaDB and FAISS alternatives.

Frequently Asked Questions

What is the difference between Pinecone Serverless and pod-based? Serverless Pinecone charges only for storage and queries — there are no always-on pods. Pod-based indexes provision dedicated resources with predictable latency. Serverless is cheaper for sporadic workloads; pod-based is better for sustained high-throughput applications.

Can I use multiple namespaces in LangChain with Pinecone? Yes. Pass the namespace parameter when creating a PineconeVectorStore or when calling similarity_search. Namespaces let you isolate data for different tenants, document collections, or environments within a single Pinecone index.

How do I delete vectors from Pinecone using LangChain? You can delete by IDs using vectorstore.delete(ids=['id1', 'id2']), or delete an entire namespace using the Pinecone client directly with index.delete(delete_all=True, namespace='your-namespace').


Langchain

How to Use LangChain with Pinecone Serverless (Cloud RAG)

⚡ Quick Answer

Deploy cloud-native RAG with LangChain and Pinecone Serverless. Complete guide covering setup, upsert, query, namespaces, metadata filtering, and cost estimates.

AiTechWorlds Team May 31, 2026 11 min read

#LangChain #Pinecone #serverless #vector database #RAG

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

This guide walks you through every step of building a production RAG system with LangChain and Pinecone Serverless — from account setup to namespaced multi-tenant queries to cost optimization.

Before you start, make sure you're comfortable with embedding basics from the Vector database guide and the RAG system tutorial.

What Is Pinecone Serverless?

Pinecone launched its serverless tier in early 2024. Unlike pod-based indexes (which require you to provision p1, p2, or s1 pod types), serverless indexes scale automatically. You pay only for:

Storage: ~$0.033 per GB per month
Read units: ~$4 per million read units (1 RU ≈ retrieving 1 vector)
Write units: ~$2 per million write units

Installation and Setup

pip install langchain langchain-openai langchain-pinecone pinecone-client python-dotenv

import os
from dotenv import load_dotenv
load_dotenv()

# Required environment variables:
# PINECONE_API_KEY=your-pinecone-api-key
# OPENAI_API_KEY=your-openai-api-key
# PINECONE_INDEX_NAME=your-index-name

Creating a Pinecone Serverless Index

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

index_name = "langchain-rag-demo"

# Check if index exists, create if not
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,          # OpenAI text-embedding-ada-002 / text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",          # "aws", "gcp", or "azure"
            region="us-east-1"    # region for your cloud provider
        )
    )
    print(f"Created index: {index_name}")
else:
    print(f"Index {index_name} already exists")

# Get index stats
index = pc.Index(index_name)
print(index.describe_index_stats())

Dimension guide:

text-embedding-3-small: 1,536 dimensions (default), or 512 with dimensions param
text-embedding-3-large: 3,072 dimensions
text-embedding-ada-002: 1,536 dimensions

Loading Documents and Creating the Vector Store

from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536
)

# Load and split documents
loader = WebBaseLoader([
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
])
raw_docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
docs = splitter.split_documents(raw_docs)
print(f"Split into {len(docs)} chunks")

# Create vector store and upsert documents
vectorstore = PineconeVectorStore.from_documents(
    documents=docs,
    embedding=embeddings,
    index_name=index_name,
    namespace="prod"  # namespace for isolation
)

print("Documents upserted to Pinecone")

About namespaces: Each Pinecone index supports multiple namespaces. Think of them as logical partitions — same index, isolated data. Use them for:

Multi-tenant SaaS (one namespace per customer)
Environment isolation (dev/staging/prod)
Document collection separation (docs/blog/support)

Basic Similarity Search

# Connect to existing index (skip from_documents on second run)
vectorstore = PineconeVectorStore(
    index_name=index_name,
    embedding=embeddings,
    namespace="prod"
)

# Simple similarity search
query = "How do I reset my password?"
results = vectorstore.similarity_search(query, k=5)
for doc in results:
    print(f"Score source: {doc.metadata.get('source', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}\n")

# With similarity scores
results_with_scores = vectorstore.similarity_search_with_score(query, k=5)
for doc, score in results_with_scores:
    print(f"Score: {score:.4f} | {doc.page_content[:150]}")

Pinecone returns cosine similarity scores (0–1 for normalized vectors). Scores above 0.85 are typically high-relevance matches.

Metadata Filtering

One of Pinecone's most powerful features is metadata filtering. You can narrow vector search to specific subsets of your data:

from langchain_core.documents import Document
import uuid

# Upsert with rich metadata
docs_with_metadata = [
    Document(
        page_content="How to configure two-factor authentication in the admin panel.",
        metadata={
            "source": "admin-docs",
            "category": "security",
            "version": "2.0",
            "last_updated": "2026-01-15",
            "tenant_id": "customer_123"
        }
    ),
    Document(
        page_content="Setting up email notifications for billing events.",
        metadata={
            "source": "billing-docs",
            "category": "billing",
            "version": "1.5",
            "last_updated": "2025-11-20",
            "tenant_id": "customer_456"
        }
    )
]

vectorstore.add_documents(docs_with_metadata)

# Filter by category
security_results = vectorstore.similarity_search(
    query="authentication settings",
    k=5,
    filter={"category": "security"}
)

# Filter by tenant (multi-tenant RAG)
tenant_results = vectorstore.similarity_search(
    query="billing notifications",
    k=5,
    filter={"tenant_id": "customer_123"}
)

# Complex filter: category AND version
filtered_results = vectorstore.similarity_search(
    query="two factor auth",
    k=5,
    filter={
        "$and": [
            {"category": {"$eq": "security"}},
            {"version": {"$gte": "2.0"}}
        ]
    }
)

Pinecone supports $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or operators in metadata filters.

Building the RAG Chain

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI

# Create retriever from vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 6,
        "filter": {"category": "security"}  # optional global filter
    }
)

# RAG prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful documentation assistant. 
    Answer questions based only on the provided context.
    If the answer is not in the context, say "I don't have that information."
    Always cite the source document when answering."""),
    ("human", """Context:
{context}

Question: {question}""")
])

def format_docs(docs):
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        formatted.append(f"[{i}] Source: {source}\n{doc.page_content}")
    return "\n\n".join(formatted)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

rag_chain = (
    RunnableParallel({
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    })
    | prompt
    | llm
    | StrOutputParser()
)

# Run a query
answer = rag_chain.invoke("How do I enable two-factor authentication?")
print(answer)

Upsert with Explicit IDs and Batch Processing

For production ingestion pipelines, control IDs and batch size explicitly:

import hashlib
from typing import List
from langchain_core.documents import Document

def doc_to_id(doc: Document) -> str:
    """Generate deterministic ID from document content + source."""
    content = doc.page_content + doc.metadata.get("source", "")
    return hashlib.md5(content.encode()).hexdigest()

def batch_upsert(
    docs: List[Document],
    vectorstore: PineconeVectorStore,
    batch_size: int = 100,
    namespace: str = "prod"
) -> None:
    """Upsert documents in batches with progress tracking."""
    total = len(docs)
    for i in range(0, total, batch_size):
        batch = docs[i:i + batch_size]
        ids = [doc_to_id(doc) for doc in batch]
        
        vectorstore.add_documents(
            documents=batch,
            ids=ids,
            namespace=namespace
        )
        
        pct = min(100, (i + batch_size) / total * 100)
        print(f"Upserted {min(i + batch_size, total)}/{total} ({pct:.0f}%)")

# Process a large document collection
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
all_docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(all_docs)

print(f"Total chunks to upsert: {len(chunks)}")
batch_upsert(chunks, vectorstore, batch_size=100)

Pinecone's recommended batch size is 100 vectors per upsert call. Larger batches hit size limits; smaller batches increase API overhead.

Updating and Deleting Vectors

# Delete specific documents by ID
vectorstore.delete(ids=["abc123", "def456"])

# Delete all vectors in a namespace
index = pc.Index(index_name)
index.delete(delete_all=True, namespace="dev")

# Update a document (delete + re-add)
def update_document(
    doc: Document,
    vectorstore: PineconeVectorStore,
    namespace: str = "prod"
) -> str:
    doc_id = doc_to_id(doc)
    
    # Delete old version
    vectorstore.delete(ids=[doc_id])
    
    # Add new version
    vectorstore.add_documents(
        documents=[doc],
        ids=[doc_id],
        namespace=namespace
    )
    return doc_id

updated_doc = Document(
    page_content="Updated: 2FA now supports hardware security keys in addition to TOTP apps.",
    metadata={"source": "admin-docs", "category": "security", "version": "2.1"}
)
update_document(updated_doc, vectorstore)

Multi-Namespace RAG for Multi-Tenant Applications

from langchain_pinecone import PineconeVectorStore
from typing import Optional

class MultiTenantRAG:
    def __init__(self, index_name: str, embeddings, llm):
        self.index_name = index_name
        self.embeddings = embeddings
        self.llm = llm

    def get_retriever(self, tenant_id: str, k: int = 5):
        """Get a retriever scoped to a specific tenant namespace."""
        vs = PineconeVectorStore(
            index_name=self.index_name,
            embedding=self.embeddings,
            namespace=f"tenant_{tenant_id}"
        )
        return vs.as_retriever(search_kwargs={"k": k})

    def answer(self, question: str, tenant_id: str) -> str:
        retriever = self.get_retriever(tenant_id)

        prompt = ChatPromptTemplate.from_messages([
            ("system", "Answer based on the provided context only."),
            ("human", "Context:\n{context}\n\nQuestion: {question}")
        ])

        chain = (
            RunnableParallel({
                "context": retriever | format_docs,
                "question": RunnablePassthrough()
            })
            | prompt
            | self.llm
            | StrOutputParser()
        )

        return chain.invoke(question)

    def ingest_for_tenant(self, docs: List[Document], tenant_id: str):
        """Ingest documents into a tenant-specific namespace."""
        vs = PineconeVectorStore(
            index_name=self.index_name,
            embedding=self.embeddings,
            namespace=f"tenant_{tenant_id}"
        )
        chunks = splitter.split_documents(docs)
        vs.add_documents(chunks)
        print(f"Ingested {len(chunks)} chunks for tenant {tenant_id}")

# Usage
rag = MultiTenantRAG(
    index_name=index_name,
    embeddings=embeddings,
    llm=ChatOpenAI(model="gpt-4o")
)

# Tenant A gets only their data
answer_a = rag.answer("What is my billing cycle?", tenant_id="customer_123")
# Tenant B gets only their data
answer_b = rag.answer("What is my billing cycle?", tenant_id="customer_456")
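
Because the namespace string is the only thing separating tenants here, it can help to validate tenant IDs before building it. The sketch below is one possible approach, not something the class above does: `tenant_namespace` and the allowed character set are assumptions chosen for illustration.

```python
import re

# Assumed policy: letters, digits, underscore, and hyphen, 1 to 64 characters
_TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")

def tenant_namespace(tenant_id: str) -> str:
    """Build the namespace name, rejecting IDs outside the assumed policy."""
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"tenant_{tenant_id}"

print(tenant_namespace("customer_123"))
```

Routing both get_retriever and ingest_for_tenant through a single helper like this keeps the naming rule in one place, so reads and writes cannot drift onto different namespaces.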

For the agent side of this architecture, see Build AI agent with LangChain and the AI research agent build.

Pinecone Serverless vs Pod-Based vs Self-Hosted

Feature	Serverless	Pod-Based (p2)	Self-Hosted (Weaviate/Qdrant)
Setup time	2 minutes	5 minutes	30–120 minutes
Maintenance	Zero	Low	High
Latency (p99)	~50–200ms	~10–50ms	~5–20ms
Cost (100K docs)	~$6/month	~$70/month	Server cost (~$20–80/month)
Max dimensions	20,000	20,000	65,535+
Namespaces	Yes	Yes	Collections/tenants
Metadata filtering	Yes	Yes	Yes
Hybrid search	Beta	Yes	Yes
Data residency	AWS/GCP/Azure	AWS/GCP/Azure	Full control

Cost calculation example (1M vectors, 1,536 dims, 10K queries/day):

Serverless: ~$0.33/month storage + $6/month reads ≈ $6.33/month
Pod-based (p2.x1): $87/month (fixed)
Self-hosted on AWS t3.xlarge: $120/month (EC2 + storage + ops time)

Serverless wins below ~5M vectors. At very high query throughput (>100K queries/day), pod-based latency advantages can justify the cost difference.
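
The read-cost side of this comparison is simple enough to script. The sketch below uses this guide's approximations (1 read unit per retrieved vector, about $4 per million read units) and ignores storage and write costs. The function names are hypothetical, and real billing may differ from this simplified model.

```python
def monthly_read_cost(queries_per_day: int, k: int,
                      usd_per_million_ru: float = 4.0, days: int = 30) -> float:
    """Estimated monthly read cost, assuming 1 RU per retrieved vector."""
    read_units = queries_per_day * k * days
    return read_units / 1_000_000 * usd_per_million_ru

def break_even_queries_per_day(fixed_monthly_usd: float, k: int,
                               usd_per_million_ru: float = 4.0, days: int = 30) -> float:
    """Daily query volume at which read costs equal a fixed monthly price."""
    return fixed_monthly_usd / usd_per_million_ru * 1_000_000 / (k * days)

reads = monthly_read_cost(10_000, 5)
break_even = break_even_queries_per_day(87.0, 5)
print(f"Read cost: ${reads:.2f}/month")
print(f"Break-even vs $87 pod: {break_even:,.0f} queries/day")
```

Under these assumptions, 10,000 queries/day with k=5 comes to $6.00/month, and read costs alone reach the $87 pod price at roughly 145,000 queries/day.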

Hybrid Search (Dense + Sparse)

Pinecone supports hybrid search, which combines dense vector similarity with BM25 keyword matching. Sparse-dense queries need an index created with the dotproduct metric:

# pip install pinecone-text
from pinecone_text.sparse import BM25Encoder
from langchain_community.retrievers import PineconeHybridSearchRetriever

# Fit BM25 on your corpus
bm25_encoder = BM25Encoder()
bm25_encoder.fit([doc.page_content for doc in docs])
bm25_encoder.dump("bm25_params.json")

# Create hybrid retriever
hybrid_retriever = PineconeHybridSearchRetriever(
    embeddings=embeddings,
    sparse_encoder=bm25_encoder,
    index=index,
    top_k=5,
    alpha=0.5  # 0=pure sparse (BM25), 1=pure dense (embeddings), 0.5=balanced
)

# Hybrid search handles both semantic and keyword queries well
results = hybrid_retriever.invoke("2FA hardware key FIDO2 setup")

Hybrid search is particularly valuable for technical documentation, code search, and any domain with specialized terminology where exact keyword matching matters.
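
To make the alpha setting concrete, here is a minimal sketch of a convex weighting between the two representations. It assumes the dense vector is scaled by alpha and the sparse values by 1 - alpha, which matches the comment on the alpha parameter above. `convex_scale` is a hypothetical name, and the retriever's internal implementation is not shown in this guide.

```python
from typing import Dict, List, Tuple

SparseVector = Dict[str, list]

def convex_scale(dense: List[float], sparse: SparseVector,
                 alpha: float) -> Tuple[List[float], SparseVector]:
    """Weight dense values by alpha and sparse values by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [value * (1.0 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

# Toy vectors: two dense dimensions, two sparse (token index, weight) pairs
dense_q = [1.0, 2.0]
sparse_q = {"indices": [3, 7], "values": [4.0, 8.0]}
balanced_dense, balanced_sparse = convex_scale(dense_q, sparse_q, alpha=0.5)
print(balanced_dense, balanced_sparse["values"])
```

With alpha=1.0 the sparse values all become zero, so only the embedding contributes to the score. With alpha=0.0 the dense side is zeroed and the query behaves like pure keyword matching.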

Async Operations for High Throughput

import asyncio
from langchain_pinecone import PineconeVectorStore

async def async_rag_pipeline(questions: list[str], tenant_id: str) -> list[str]:
    vs = PineconeVectorStore(
        index_name=index_name,
        embedding=embeddings,
        namespace=f"tenant_{tenant_id}"
    )
    retriever = vs.as_retriever(search_kwargs={"k": 5})

    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based on context only."),
        ("human", "Context:\n{context}\n\nQuestion: {question}")
    ])

    chain = (
        RunnableParallel({
            "context": retriever | format_docs,
            "question": RunnablePassthrough()
        })
        | prompt
        | ChatOpenAI(model="gpt-4o")
        | StrOutputParser()
    )

    # Run all questions concurrently
    tasks = [chain.ainvoke(q) for q in questions]
    answers = await asyncio.gather(*tasks)
    return answers

# Run 10 concurrent queries
questions = [f"Question {i}" for i in range(10)]
answers = asyncio.run(async_rag_pipeline(questions, "customer_123"))
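
asyncio.gather starts every task at once, which is fine for 10 questions but may run into provider rate limits at larger volumes. One common way to cap concurrency is an asyncio.Semaphore. The sketch below is standard-library only: `gather_limited` is a hypothetical helper, and `fake_answer` stands in for chain.ainvoke so the example runs without API keys.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")

async def gather_limited(func: Callable[[In], Awaitable[Out]],
                         inputs: Sequence[In], limit: int = 5) -> List[Out]:
    """Run func over inputs with at most `limit` calls in flight at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: In) -> Out:
        async with semaphore:
            return await func(item)

    # gather returns results in input order, regardless of completion order
    return list(await asyncio.gather(*(run_one(item) for item in inputs)))

async def fake_answer(question: str) -> str:
    """Stand-in for chain.ainvoke: waits briefly, then returns a string."""
    await asyncio.sleep(0.01)
    return f"answer to {question}"

demo_questions = [f"Question {i}" for i in range(10)]
demo_answers = asyncio.run(gather_limited(fake_answer, demo_questions, limit=3))
print(demo_answers[0])
```

In the pipeline above, the same helper could wrap chain.ainvoke in place of the bare list of tasks. The right limit depends on your rate limits and is something to tune rather than a fixed recommendation.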

Monitoring Index Health

import os

from pinecone import Pinecone

def monitor_index(index_name: str):
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index(index_name)
    
    stats = index.describe_index_stats()
    
    print(f"Total vectors: {stats.total_vector_count:,}")
    print(f"Dimension: {stats.dimension}")
    print(f"Index fullness: {stats.index_fullness:.2%}")
    
    print("\nNamespace breakdown:")
    for namespace, ns_stats in stats.namespaces.items():
        print(f"  {namespace}: {ns_stats.vector_count:,} vectors")

monitor_index(index_name)

Watch for index_fullness approaching 1.0 on pod-based indexes (serverless scales automatically). Query latency above 500ms usually indicates the need to either reduce k or upgrade the pod type.
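
These two rules of thumb can be turned into a small check that runs next to monitor_index. The sketch below is illustrative only: `health_warnings` is a hypothetical function, the 0.9 fullness threshold is an assumed stand-in for "approaching 1.0", and the latency figure must come from your own measurements because describe_index_stats does not report it.

```python
from typing import List

def health_warnings(index_fullness: float, p99_latency_ms: float,
                    is_serverless: bool,
                    fullness_threshold: float = 0.9,
                    latency_threshold_ms: float = 500.0) -> List[str]:
    """Return human-readable warnings based on the thresholds described above."""
    warnings: List[str] = []
    # Fullness only matters for pod-based indexes; serverless scales automatically
    if not is_serverless and index_fullness >= fullness_threshold:
        warnings.append(f"index fullness at {index_fullness:.0%}")
    if p99_latency_ms > latency_threshold_ms:
        warnings.append(f"query latency {p99_latency_ms:.0f}ms exceeds "
                        f"{latency_threshold_ms:.0f}ms")
    return warnings

print(health_warnings(0.95, 650.0, is_serverless=False))
```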

Pinecone Serverless is the Right Default for Cloud RAG

The namespace feature alone is worth it for SaaS builders — you get tenant isolation, environment separation, and collection management without managing separate databases.
