How to Use LangChain with Pinecone Serverless (Cloud RAG)
Deploy cloud-native RAG with LangChain and Pinecone Serverless. Complete guide covering setup, upsert, query, namespaces, metadata filtering, and cost estimates.
Cloud RAG is the architecture that makes AI applications actually scalable. Running your own vector database is fine for prototypes, but the moment you hit millions of documents or need zero-maintenance infrastructure, Pinecone Serverless becomes the obvious choice.
This guide walks you through every step of building a production RAG system with LangChain and Pinecone Serverless — from account setup to namespaced multi-tenant queries to cost optimization.
Before you start, make sure you're comfortable with embedding basics from the Vector database guide and the RAG system tutorial.
What Is Pinecone Serverless?
Pinecone launched its serverless tier in early 2024. Unlike pod-based indexes (which require you to provision p1, p2, or s1 pod types), serverless indexes scale automatically. You pay only for:
- Storage: ~$0.033 per GB per month
- Read units: ~$4 per million read units (1 RU ≈ retrieving 1 vector)
- Write units: ~$2 per million write units
For a RAG application with 100K documents (roughly 1M vectors at 1,536 dimensions), monthly storage runs about $0.33, which corresponds to roughly 10 GB at the storage rate above. Read costs at 10,000 queries/day with k=5 retrieval: 10,000 × 5 × 30 = 1.5M RUs × $4 per million = $6/month. Compare this to a dedicated pod costing $70+/month.
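The arithmetic above can be reproduced with a small helper. This is a rough sketch that reuses the rates quoted in this section and the same simplification of one read unit per retrieved vector; the function name and the 10 GB storage input are illustrative assumptions, not part of any Pinecone SDK.

```python
def estimate_monthly_cost(
    storage_gb: float,
    queries_per_day: int,
    k: int,
    storage_rate: float = 0.033,         # USD per GB per month (rate quoted above)
    read_rate_per_million: float = 4.0,  # USD per million read units (rate quoted above)
    days: int = 30,
) -> dict:
    """Rough monthly estimate, assuming 1 read unit per retrieved vector."""
    read_units = queries_per_day * k * days
    storage_cost = storage_gb * storage_rate
    read_cost = read_units / 1_000_000 * read_rate_per_million
    return {
        "read_units": read_units,
        "storage_usd": round(storage_cost, 2),
        "reads_usd": round(read_cost, 2),
        "total_usd": round(storage_cost + read_cost, 2),
    }


# Assumed inputs: about 10 GB stored, 10,000 queries/day, k=5
estimate = estimate_monthly_cost(storage_gb=10, queries_per_day=10_000, k=5)
print(estimate)
```

Write costs are left out because they depend on how often you re-ingest; add a similar term if your corpus changes frequently.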
Installation and Setup
pip install langchain langchain-community langchain-text-splitters langchain-openai langchain-pinecone pinecone-client python-dotenv beautifulsoup4
import os
from dotenv import load_dotenv
load_dotenv()
# Required environment variables:
# PINECONE_API_KEY=your-pinecone-api-key
# OPENAI_API_KEY=your-openai-api-key
# PINECONE_INDEX_NAME=your-index-name
Creating a Pinecone Serverless Index
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = "langchain-rag-demo"
# Check if index exists, create if not
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # OpenAI text-embedding-ada-002 / text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",        # "aws", "gcp", or "azure"
            region="us-east-1"  # region for your cloud provider
        )
    )
    print(f"Created index: {index_name}")
else:
    print(f"Index {index_name} already exists")
# Get index stats
index = pc.Index(index_name)
print(index.describe_index_stats())
Dimension guide:
- text-embedding-3-small: 1,536 dimensions (default), or 512 with the dimensions param
- text-embedding-3-large: 3,072 dimensions
- text-embedding-ada-002: 1,536 dimensions
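A mismatch between the index dimension and the embedding size only shows up when you upsert, so it can help to check it up front. The sketch below encodes the defaults from the dimension guide; the helper names are hypothetical and nothing here calls OpenAI or Pinecone.

```python
from typing import Optional

# Default output sizes from the dimension guide above.
DEFAULT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def expected_dimension(model: str, requested: Optional[int] = None) -> int:
    """Return the vector size a model is expected to produce.

    `requested` mirrors the optional `dimensions` parameter of the
    text-embedding-3 models; ada-002 has a fixed output size.
    """
    default = DEFAULT_DIMENSIONS[model]
    if requested is None:
        return default
    if model == "text-embedding-ada-002":
        raise ValueError("text-embedding-ada-002 has a fixed size of 1536")
    if not 0 < requested <= default:
        raise ValueError(f"requested must be between 1 and {default}")
    return requested


def check_index_dimension(
    model: str, index_dimension: int, requested: Optional[int] = None
) -> None:
    """Raise early if the index and the embedding model disagree."""
    produced = expected_dimension(model, requested)
    if produced != index_dimension:
        raise ValueError(
            f"{model} produces {produced}-dim vectors, "
            f"but the index expects {index_dimension}"
        )


check_index_dimension("text-embedding-3-small", 1536)               # passes
check_index_dimension("text-embedding-3-small", 512, requested=512)  # passes
```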
Loading Documents and Creating the Vector Store
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Initialize embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536
)
# Load and split documents
loader = WebBaseLoader([
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
])
raw_docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
docs = splitter.split_documents(raw_docs)
print(f"Split into {len(docs)} chunks")
# Create vector store and upsert documents
vectorstore = PineconeVectorStore.from_documents(
    documents=docs,
    embedding=embeddings,
    index_name=index_name,
    namespace="prod"  # namespace for isolation
)
print("Documents upserted to Pinecone")
About namespaces: Each Pinecone index supports multiple namespaces. Think of them as logical partitions — same index, isolated data. Use them for:
- Multi-tenant SaaS (one namespace per customer)
- Environment isolation (dev/staging/prod)
- Document collection separation (docs/blog/support)
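To make the "logical partition" idea concrete, here is a toy in-memory model of namespaces. It is only an illustration of the isolation semantics described above; ToyIndex is a hypothetical stand-in and does not reflect how Pinecone stores data.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for an index, used only to illustrate namespaces."""

    def __init__(self):
        # namespace -> {record_id: text}
        self._data = defaultdict(dict)

    def upsert(self, namespace: str, record_id: str, text: str) -> None:
        self._data[namespace][record_id] = text

    def list_texts(self, namespace: str) -> list:
        # Reads are scoped to one namespace; other partitions are invisible.
        return sorted(self._data.get(namespace, {}).values())


toy = ToyIndex()
toy.upsert("prod", "doc-1", "live doc")
toy.upsert("dev", "doc-1", "draft doc")  # same ID, different namespace

print(toy.list_texts("prod"))     # ['live doc']
print(toy.list_texts("dev"))      # ['draft doc']
print(toy.list_texts("staging"))  # []
```

The same record ID can exist in two namespaces without conflict, which is why a per-tenant or per-environment namespace is enough to keep data sets apart within one index.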
Basic Similarity Search
# Connect to existing index (skip from_documents on second run)
vectorstore = PineconeVectorStore(
    index_name=index_name,
    embedding=embeddings,
    namespace="prod"
)
# Simple similarity search
query = "How do I reset my password?"
results = vectorstore.similarity_search(query, k=5)
for doc in results:
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}\n")
# With similarity scores
results_with_scores = vectorstore.similarity_search_with_score(query, k=5)
for doc, score in results_with_scores:
    print(f"Score: {score:.4f} | {doc.page_content[:150]}")
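With a cosine index, Pinecone returns cosine similarity scores, which range from -1 to 1 (higher means more similar); text embeddings from the same model tend to produce positive scores. What counts as a high-relevance score depends on the embedding model, so treat a cutoff such as 0.85 as a starting point and calibrate it against your own queries.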
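For intuition about the scores, the following sketch computes cosine similarity by hand and applies a cutoff to (document, score) pairs. In general the value lies between -1 and 1. The keep_above helper and the sample pairs are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def keep_above(scored_results, threshold):
    """Keep only (item, score) pairs whose score meets the threshold."""
    return [(item, score) for item, score in scored_results if score >= threshold]


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)

sample = [("reset password guide", 0.91), ("billing FAQ", 0.62)]
print(keep_above(sample, 0.85))  # [('reset password guide', 0.91)]
```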
Metadata Filtering
One of Pinecone's most powerful features is metadata filtering. You can narrow vector search to specific subsets of your data:
from langchain_core.documents import Document
# Upsert with rich metadata
# Note: "version" is stored as a number so that range operators such as $gte can be applied to it
docs_with_metadata = [
    Document(
        page_content="How to configure two-factor authentication in the admin panel.",
        metadata={
            "source": "admin-docs",
            "category": "security",
            "version": 2.0,
            "last_updated": "2026-01-15",
            "tenant_id": "customer_123"
        }
    ),
    Document(
        page_content="Setting up email notifications for billing events.",
        metadata={
            "source": "billing-docs",
            "category": "billing",
            "version": 1.5,
            "last_updated": "2025-11-20",
            "tenant_id": "customer_456"
        }
    )
]
vectorstore.add_documents(docs_with_metadata)
# Filter by category
security_results = vectorstore.similarity_search(
    query="authentication settings",
    k=5,
    filter={"category": "security"}
)
# Filter by tenant (multi-tenant RAG)
tenant_results = vectorstore.similarity_search(
    query="billing notifications",
    k=5,
    filter={"tenant_id": "customer_123"}
)
# Complex filter: category AND version
filtered_results = vectorstore.similarity_search(
    query="two factor auth",
    k=5,
    filter={
        "$and": [
            {"category": {"$eq": "security"}},
            {"version": {"$gte": 2.0}}
        ]
    }
)
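Pinecone supports the $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, and $or operators in metadata filters. The range operators ($gt, $gte, $lt, $lte) compare numbers only, so store any field you want to range-filter as a numeric value rather than a string.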
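To check what a filter expression will match before sending it, you can evaluate it locally against sample metadata. The evaluator below is a simplified sketch of the operator semantics listed above, not Pinecone's implementation; in particular, it treats a missing field as a non-match for every operator, which is an assumption.

```python
import operator

_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if `metadata` satisfies a Pinecone-style filter dict."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            if key not in metadata:
                return False  # simplification: missing field never matches
            for op, expected in condition.items():
                if not _COMPARATORS[op](metadata[key], expected):
                    return False
        else:
            # Shorthand {"field": value} means equality.
            if metadata.get(key) != condition:
                return False
    return True


record = {"category": "security", "version": 2.0, "tenant_id": "customer_123"}
combined = {"$and": [{"category": {"$eq": "security"}}, {"version": {"$gte": 2.0}}]}

print(matches(record, {"category": "security"}))                   # True
print(matches(record, combined))                                   # True
print(matches(record, {"tenant_id": {"$in": ["customer_456"]}}))   # False
```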
Building the RAG Chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI
# Create retriever from vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 6,
        "filter": {"category": "security"}  # optional global filter
    }
)
# RAG prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful documentation assistant.
Answer questions based only on the provided context.
If the answer is not in the context, say "I don't have that information."
Always cite the source document when answering."""),
    ("human", """Context:
{context}
Question: {question}""")
])
def format_docs(docs):
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        formatted.append(f"[{i}] Source: {source}\n{doc.page_content}")
    return "\n\n".join(formatted)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
rag_chain = (
    RunnableParallel({
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    })
    | prompt
    | llm
    | StrOutputParser()
)
# Run a query
answer = rag_chain.invoke("How do I enable two-factor authentication?")
print(answer)
Upsert with Explicit IDs and Batch Processing
For production ingestion pipelines, control IDs and batch size explicitly:
import hashlib
from typing import List
from langchain_core.documents import Document
def doc_to_id(doc: Document) -> str:
    """Generate deterministic ID from document content + source."""
    content = doc.page_content + doc.metadata.get("source", "")
    return hashlib.md5(content.encode()).hexdigest()
def batch_upsert(
    docs: List[Document],
    vectorstore: PineconeVectorStore,
    batch_size: int = 100,
    namespace: str = "prod"
) -> None:
    """Upsert documents in batches with progress tracking."""
    total = len(docs)
    for i in range(0, total, batch_size):
        batch = docs[i:i + batch_size]
        ids = [doc_to_id(doc) for doc in batch]
        vectorstore.add_documents(
            documents=batch,
            ids=ids,
            namespace=namespace
        )
        pct = min(100, (i + batch_size) / total * 100)
        print(f"Upserted {min(i + batch_size, total)}/{total} ({pct:.0f}%)")
# Process a large document collection
from langchain_community.document_loaders import DirectoryLoader, TextLoader
loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
all_docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(all_docs)
print(f"Total chunks to upsert: {len(chunks)}")
batch_upsert(chunks, vectorstore, batch_size=100)
Pinecone's recommended batch size is 100 vectors per upsert call. Larger batches hit size limits; smaller batches increase API overhead.
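Deterministic IDs are what make re-running an ingestion job safe: upserting the same ID again overwrites the record instead of duplicating it. The sketch below demonstrates this against a plain dict standing in for the index. Unlike doc_to_id above, the hypothetical stable_id hashes the source and chunk position rather than the text, so the ID stays the same when the content is edited.

```python
import hashlib
from typing import Dict, Iterator, List, Tuple


def stable_id(source: str, chunk_index: int) -> str:
    """ID derived from where a chunk comes from, not from its text."""
    key = f"{source}#{chunk_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:32]


def batched(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# A dict stands in for the index: writing an existing key overwrites it.
fake_index: Dict[str, str] = {}

chunks = [("guide.txt", i, f"chunk {i}") for i in range(250)]
records: List[Tuple[str, str]] = [
    (stable_id(source, i), text) for source, i, text in chunks
]

for _ in range(2):  # run the whole ingestion twice
    for batch in batched(records, 100):
        fake_index.update(batch)

print(len(fake_index))  # 250, not 500
```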
Updating and Deleting Vectors
# Delete specific documents by ID
vectorstore.delete(ids=["abc123", "def456"])
# Delete all vectors in a namespace
index = pc.Index(index_name)
index.delete(delete_all=True, namespace="dev")
# Update a document (delete + re-add)
# The caller passes the ID the document was originally stored under:
# doc_to_id hashes the content, so it would yield a different ID once the text changes.
def update_document(
    doc_id: str,
    doc: Document,
    vectorstore: PineconeVectorStore,
    namespace: str = "prod"
) -> str:
    # Delete old version
    vectorstore.delete(ids=[doc_id], namespace=namespace)
    # Add new version under the same ID
    vectorstore.add_documents(
        documents=[doc],
        ids=[doc_id],
        namespace=namespace
    )
    return doc_id
updated_doc = Document(
    page_content="Updated: 2FA now supports hardware security keys in addition to TOTP apps.",
    metadata={"source": "admin-docs", "category": "security", "version": 2.1}
)
update_document("abc123", updated_doc, vectorstore)  # "abc123" is a placeholder for the stored ID
Multi-Namespace RAG for Multi-Tenant Applications
from langchain_pinecone import PineconeVectorStore
from typing import List
class MultiTenantRAG:
    def __init__(self, index_name: str, embeddings, llm):
        self.index_name = index_name
        self.embeddings = embeddings
        self.llm = llm
    def get_retriever(self, tenant_id: str, k: int = 5):
        """Get a retriever scoped to a specific tenant namespace."""
        vs = PineconeVectorStore(
            index_name=self.index_name,
            embedding=self.embeddings,
            namespace=f"tenant_{tenant_id}"
        )
        return vs.as_retriever(search_kwargs={"k": k})
    def answer(self, question: str, tenant_id: str) -> str:
        retriever = self.get_retriever(tenant_id)
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Answer based on the provided context only."),
            ("human", "Context:\n{context}\n\nQuestion: {question}")
        ])
        chain = (
            RunnableParallel({
                "context": retriever | format_docs,
                "question": RunnablePassthrough()
            })
            | prompt
            | self.llm
            | StrOutputParser()
        )
        return chain.invoke(question)
    def ingest_for_tenant(self, docs: List[Document], tenant_id: str):
        """Ingest documents into a tenant-specific namespace."""
        vs = PineconeVectorStore(
            index_name=self.index_name,
            embedding=self.embeddings,
            namespace=f"tenant_{tenant_id}"
        )
        # Uses the module-level splitter defined earlier in this guide
        chunks = splitter.split_documents(docs)
        vs.add_documents(chunks)
        print(f"Ingested {len(chunks)} chunks for tenant {tenant_id}")
# Usage
rag = MultiTenantRAG(
    index_name=index_name,
    embeddings=embeddings,
    llm=ChatOpenAI(model="gpt-4o")
)
# Tenant A gets only their data
answer_a = rag.answer("What is my billing cycle?", tenant_id="customer_123")
# Tenant B gets only their data
answer_b = rag.answer("What is my billing cycle?", tenant_id="customer_456")
This pattern is one of the cleanest ways to build multi-tenant AI applications. Each customer's data stays logically isolated at the namespace level while sharing the same underlying index infrastructure.
For the agent side of this architecture, see Build AI agent with LangChain and the AI research agent build.
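Because the namespace is built directly from tenant_id, it is worth validating that value before it reaches the vector store, so that a malformed or empty ID can never map to an unintended partition. A minimal sketch follows; the allowed character set and the 64-character cap are assumptions chosen for illustration, not Pinecone rules.

```python
import re

# Assumed policy: letters, digits, underscore, hyphen; 1 to 64 characters.
_TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def tenant_namespace(tenant_id: str) -> str:
    """Build the namespace name used for a tenant, rejecting odd IDs."""
    if not isinstance(tenant_id, str) or not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"tenant_{tenant_id}"


print(tenant_namespace("customer_123"))  # tenant_customer_123
```

The tenant ID should come from the authenticated session on the server side, never from a value the client can set freely.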
Pinecone Serverless vs Pod-Based vs Self-Hosted
| Feature | Serverless | Pod-Based (p2) | Self-Hosted (Weaviate/Qdrant) |
|---|---|---|---|
| Setup time | 2 minutes | 5 minutes | 30–120 minutes |
| Maintenance | Zero | Low | High |
| Latency (p99) | ~50–200ms | ~10–50ms | ~5–20ms |
| Cost (100K docs) | ~$6/month | ~$70/month | Server cost (~$20–80/month) |
| Max dimensions | 20,000 | 20,000 | 65,535+ |
| Namespaces | Yes | Yes | Collections/tenants |
| Metadata filtering | Yes | Yes | Yes |
| Hybrid search | Beta | Yes | Yes |
| Data residency | AWS/GCP/Azure | AWS/GCP/Azure | Full control |
Cost calculation example (1M vectors, 1,536 dims, 10K queries/day):
- Serverless: $0.33/month storage + $6/month reads = about $6.33/month
- Pod-based (p2.x1): $87/month (fixed)
- Self-hosted on AWS t3.xlarge: $120/month (EC2 + storage + ops time)
Serverless wins below ~5M vectors. At very high query throughput (>100K queries/day), pod-based latency advantages can justify the cost difference.
Hybrid Search (Dense + Sparse)
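Pinecone supports hybrid search combining dense vector similarity with BM25 keyword matching. Sparse-dense queries require an index created with the dotproduct metric, so the cosine index created earlier is not suitable here; the snippet assumes index points to a dotproduct index whose documents were added through the retriever's add_texts method, so that sparse values are stored alongside each dense vector: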
# pip install pinecone-text
from pinecone_text.sparse import BM25Encoder
from langchain_community.retrievers import PineconeHybridSearchRetriever
# Fit BM25 on your corpus
bm25_encoder = BM25Encoder()
bm25_encoder.fit([doc.page_content for doc in docs])
bm25_encoder.dump("bm25_params.json")
# Create hybrid retriever
hybrid_retriever = PineconeHybridSearchRetriever(
    embeddings=embeddings,
    sparse_encoder=bm25_encoder,
    index=index,
    top_k=5,
    alpha=0.5  # 0=pure sparse (BM25), 1=pure dense (embeddings), 0.5=balanced
)
# Hybrid search handles both semantic and keyword queries well
results = hybrid_retriever.invoke("2FA hardware key FIDO2 setup")
Hybrid search is particularly valuable for technical documentation, code search, and any domain with specialized terminology where exact keyword matching matters.
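The alpha parameter is easiest to understand as a convex weighting: the dense query vector is scaled by alpha and the sparse one by (1 - alpha) before the two are scored together. The sketch below shows that weighting on toy vectors; hybrid_scale is a hypothetical helper written for illustration and may differ in detail from what the retriever runs internally.

```python
def hybrid_scale(dense, sparse, alpha):
    """Weight a dense vector by alpha and a sparse vector by (1 - alpha).

    `sparse` uses the {"indices": [...], "values": [...]} layout.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


dense_query = [1.0, 2.0]
sparse_query = {"indices": [3, 7], "values": [0.5, 1.0]}

print(hybrid_scale(dense_query, sparse_query, alpha=0.5))
# ([0.5, 1.0], {'indices': [3, 7], 'values': [0.25, 0.5]})
print(hybrid_scale(dense_query, sparse_query, alpha=1.0))
# ([1.0, 2.0], {'indices': [3, 7], 'values': [0.0, 0.0]})
```

With alpha=1.0 the sparse contribution vanishes and the query behaves like a pure embedding search; lowering alpha gives exact keyword overlap more weight.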
Async Operations for High Throughput
import asyncio
from langchain_pinecone import PineconeVectorStore
async def async_rag_pipeline(questions: list[str], tenant_id: str) -> list[str]:
    vs = PineconeVectorStore(
        index_name=index_name,
        embedding=embeddings,
        namespace=f"tenant_{tenant_id}"
    )
    retriever = vs.as_retriever(search_kwargs={"k": 5})
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based on context only."),
        ("human", "Context:\n{context}\n\nQuestion: {question}")
    ])
    chain = (
        RunnableParallel({
            "context": retriever | format_docs,
            "question": RunnablePassthrough()
        })
        | prompt
        | ChatOpenAI(model="gpt-4o")
        | StrOutputParser()
    )
    # Run all questions concurrently
    tasks = [chain.ainvoke(q) for q in questions]
    answers = await asyncio.gather(*tasks)
    return answers
# Run 10 concurrent queries
questions = [f"Question {i}" for i in range(10)]
answers = asyncio.run(async_rag_pipeline(questions, "customer_123"))
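Async operations are critical for production RAG applications. A synchronous pipeline handles one request at a time per worker, so 100 simultaneous users end up waiting in a queue; with async, the event loop overlaps the network-bound embedding, Pinecone, and LLM calls. This is concurrency on a single thread rather than true parallelism, which is enough here because the work is I/O-bound.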
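Firing every request at once can run into provider rate limits, so a cap on in-flight calls is a common companion to asyncio.gather. The sketch below shows the pattern with a semaphore; fake_rag_call is a stand-in for chain.ainvoke, and the limit of 5 is an arbitrary assumption.

```python
import asyncio


async def fake_rag_call(question: str) -> str:
    """Stand-in for chain.ainvoke: waits briefly, as a network call would."""
    await asyncio.sleep(0.05)
    return f"answer to: {question}"


async def answer_all(questions, max_concurrency: int = 5):
    """Run all questions concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(question: str) -> str:
        async with semaphore:
            return await fake_rag_call(question)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(q) for q in questions))


answers = asyncio.run(answer_all([f"Question {i}" for i in range(10)]))
print(answers[0])   # answer to: Question 0
print(len(answers))  # 10
```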
Monitoring Index Health
from pinecone import Pinecone
def monitor_index(index_name: str):
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index(index_name)
    stats = index.describe_index_stats()
    print(f"Total vectors: {stats.total_vector_count:,}")
    print(f"Dimension: {stats.dimension}")
    print(f"Index fullness: {stats.index_fullness:.2%}")
    print("\nNamespace breakdown:")
    for namespace, ns_stats in stats.namespaces.items():
        print(f"  {namespace}: {ns_stats.vector_count:,} vectors")
monitor_index(index_name)
Watch for index_fullness approaching 1.0 on pod-based indexes (serverless scales automatically). Query latency above 500ms usually indicates the need to either reduce k or upgrade the pod type.
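These two signals can be turned into a simple alert check. The sketch below takes plain numbers so it can be fed from describe_index_stats and from your own latency measurements; the 0.8 fullness limit is an assumed early-warning value, while the 500 ms latency limit comes from the guidance above.

```python
def health_warnings(
    index_fullness: float,
    p99_latency_ms: float,
    fullness_limit: float = 0.8,      # assumed early-warning level below 1.0
    latency_limit_ms: float = 500.0,  # threshold mentioned above
) -> list:
    """Return human-readable warnings for an index; an empty list means healthy."""
    warnings = []
    if index_fullness >= fullness_limit:
        warnings.append(
            f"index fullness {index_fullness:.0%} is at or above {fullness_limit:.0%}"
        )
    if p99_latency_ms > latency_limit_ms:
        warnings.append(
            f"p99 latency {p99_latency_ms:.0f} ms exceeds {latency_limit_ms:.0f} ms"
        )
    return warnings


print(health_warnings(index_fullness=0.10, p99_latency_ms=120.0))  # []
print(health_warnings(index_fullness=0.95, p99_latency_ms=650.0))
```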
Pinecone Serverless is the Right Default for Cloud RAG
If you're building a new RAG application and don't have a specific reason to host your own vector database, start with Pinecone Serverless. The zero-maintenance infrastructure, pay-per-use pricing, and tight LangChain integration make it the fastest path to production.
The namespace feature alone is worth it for SaaS builders — you get tenant isolation, environment separation, and collection management without managing separate databases.
Combine Pinecone Serverless with the OpenAI API integration for embeddings and the Deploy AI model to production guide for deployment patterns. If you're comparing options, the LangChain tutorial 2025 covers ChromaDB and FAISS alternatives.
Frequently Asked Questions
What is the difference between Pinecone Serverless and pod-based? Serverless Pinecone charges only for storage and queries — there are no always-on pods. Pod-based indexes provision dedicated resources with predictable latency. Serverless is cheaper for sporadic workloads; pod-based is better for sustained high-throughput applications.
Can I use multiple namespaces in LangChain with Pinecone? Yes. Pass the namespace parameter when creating a PineconeVectorStore or when calling similarity_search. Namespaces let you isolate data for different tenants, document collections, or environments within a single Pinecone index.
How do I delete vectors from Pinecone using LangChain? You can delete by IDs using vectorstore.delete(ids=['id1', 'id2']), or delete an entire namespace using the Pinecone client directly with index.delete(delete_all=True, namespace='your-namespace').
AiTechWorlds Team