How to Use LangChain with Vertex AI (Google Gemini 2026)
Integrate LangChain with Google Vertex AI and Gemini models. Complete guide covering ChatVertexAI, embeddings, multimodal inputs, function calling, and cost comparison.
Google's Gemini models on Vertex AI offer competitive performance, massive context windows, and first-class multimodal support. If your infrastructure already runs on Google Cloud — or if you need a 1-million-token context window for long-document processing — Vertex AI is the natural choice.
This guide covers everything you need to integrate LangChain with Vertex AI: authentication, the ChatVertexAI and VertexAI classes, embeddings, multimodal inputs, function calling, and a head-to-head cost comparison with OpenAI.
If you're building the same applications with OpenAI, see OpenAI API integration for comparison. The LangChain tutorial 2025 covers the shared LangChain patterns.
Why Vertex AI with LangChain?
Three reasons teams choose Vertex AI over other providers:
- Context window: Gemini 1.5 Pro supports 1M tokens; Gemini 1.5 Flash offers the same 1M window at lower cost
- Google Cloud integration: native access to BigQuery, Cloud Storage, and GCP IAM
- Multimodal: video, audio, image, and text in a single API call
The LangChain langchain-google-vertexai package provides drop-in replacements for OpenAI classes — swap ChatOpenAI → ChatVertexAI and most of your existing code continues working.
Installation and Authentication
pip install langchain langchain-community langchain-google-vertexai google-cloud-aiplatform chromadb pypdf
Authentication Option 1: Application Default Credentials (local development)
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
Authentication Option 2: Service Account (production)
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
os.environ["GOOGLE_CLOUD_PROJECT"] = "your-project-id"
Authentication Option 3: Explicit credentials in code
from google.oauth2 import service_account
from langchain_google_vertexai import ChatVertexAI
credentials = service_account.Credentials.from_service_account_file(
"service-account.json",
scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
llm = ChatVertexAI(
model_name="gemini-1.5-pro",
credentials=credentials,
project="your-project-id",
location="us-central1"
)
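With three authentication options available, it can be unclear which credentials a process will actually pick up. The sketch below follows the Application Default Credentials lookup order (the GOOGLE_APPLICATION_CREDENTIALS variable first, then the file written by gcloud) using only the standard library. The function name describe_adc_source is hypothetical, the file path is the usual Linux/macOS location, and the result is a diagnostic hint, not a substitute for the Google auth library's own resolution.

```python
import os
from pathlib import Path

# Usual location of the file written by `gcloud auth application-default login`
# on Linux/macOS (assumption: Windows keeps it under %APPDATA%\gcloud instead).
GCLOUD_ADC_FILE = Path.home() / ".config" / "gcloud" / "application_default_credentials.json"


def describe_adc_source(env=None, adc_file=GCLOUD_ADC_FILE) -> str:
    """Report which credential source would most likely be picked up first."""
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        return f"credentials file from GOOGLE_APPLICATION_CREDENTIALS: {key_path}"
    if Path(adc_file).exists():
        return f"gcloud user credentials: {adc_file}"
    return "no local credentials found (the metadata server is used when running on GCP)"


print(describe_adc_source())
```

Running this before the first ChatVertexAI call is a quick way to confirm that a production container is using the service account and not a developer's personal login.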
Basic Usage: ChatVertexAI
from langchain_google_vertexai import ChatVertexAI
from langchain_core.messages import HumanMessage, SystemMessage
# Initialize Gemini 1.5 Pro
llm = ChatVertexAI(
model_name="gemini-1.5-pro",
project="your-project-id",
location="us-central1",
temperature=0.1,
max_output_tokens=2048,
)
# Simple invocation
response = llm.invoke([
SystemMessage(content="You are a helpful assistant specialized in cloud architecture."),
HumanMessage(content="What are the key differences between microservices and serverless architectures?")
])
print(response.content)
Available models (2026):
| Model | Context Window | Best For |
|---|---|---|
| gemini-1.5-pro | 1,000,000 tokens | Complex reasoning, long docs |
| gemini-1.5-flash | 1,000,000 tokens | Fast, cost-effective tasks |
| gemini-1.0-pro | 32,768 tokens | Standard chat and generation |
| gemini-1.0-pro-vision | 16,384 tokens | Image + text analysis |
VertexAI for Text Completion (Non-Chat)
For legacy completion-style workflows:
from langchain_google_vertexai import VertexAI
# Text completion (non-chat) model
text_llm = VertexAI(
model_name="gemini-1.5-pro",
project="your-project-id",
location="us-central1",
temperature=0,
max_output_tokens=1024,
top_p=0.8,
top_k=40
)
# Simple completion
result = text_llm.invoke("Explain transformer self-attention in 3 sentences.")
print(result)
# Batch processing
results = text_llm.batch([
"Explain gradient descent",
"What is RLHF?",
"Define embedding in ML"
])
for r in results:
    print(r[:200])
Vertex AI Embeddings
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
# Initialize Vertex AI embeddings
embeddings = VertexAIEmbeddings(
model_name="textembedding-gecko@003", # or "text-embedding-004"
project="your-project-id",
location="us-central1"
)
# Embed a single text
embedding_vector = embeddings.embed_query("What is machine learning?")
print(f"Embedding dimension: {len(embedding_vector)}") # 768 for gecko, 768 for text-embedding-004
# Embed multiple texts
texts = [
"Machine learning is a subset of artificial intelligence",
"Deep learning uses neural networks with many layers",
"Reinforcement learning trains through reward signals"
]
embedded_docs = embeddings.embed_documents(texts)
print(f"Embedded {len(embedded_docs)} documents, dimension {len(embedded_docs[0])}")
# Use with ChromaDB (drop-in for OpenAI embeddings)
docs = [
Document(page_content=text, metadata={"index": i})
for i, text in enumerate(texts)
]
vectorstore = Chroma.from_documents(
documents=docs,
embedding=embeddings,
collection_name="vertex_demo"
)
results = vectorstore.similarity_search("neural networks", k=2)
for doc in results:
    print(doc.page_content)
Available embedding models:
| Model | Dimensions | Best For |
|---|---|---|
| textembedding-gecko@003 | 768 | General purpose |
| text-embedding-004 | 768 | Latest, improved quality |
| textembedding-gecko-multilingual@001 | 768 | 100+ languages |
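To make the similarity search above less of a black box, here is a minimal sketch of ranking documents by cosine similarity in plain Python. The toy 3-dimensional vectors stand in for the 768-dimensional embeddings, the helper names are hypothetical, and the vector store's actual distance metric may differ from cosine depending on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real embeddings.
docs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]
print(top_k(query, docs, k=2))  # indices of the two closest documents
```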
Multimodal Inputs with Gemini Vision
One of Gemini's strongest advantages is native multimodal support:
from langchain_google_vertexai import ChatVertexAI
from langchain_core.messages import HumanMessage
import base64
import mimetypes
# Initialize vision-capable model
vision_llm = ChatVertexAI(
    model_name="gemini-1.5-pro",
    project="your-project-id",
    location="us-central1",
    temperature=0
)
def encode_image_base64(image_path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()
# Image analysis
def analyze_image(image_path: str, question: str) -> str:
    image_data = encode_image_base64(image_path)
    # Match the data URI's MIME type to the file (PNG, JPEG, ...); fall back to JPEG
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    message = HumanMessage(
        content=[
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:{mime_type};base64,{image_data}"
                }
            },
            {
                "type": "text",
                "text": question
            }
        ]
    )
    response = vision_llm.invoke([message])
    return response.content
# Example: analyze a chart
result = analyze_image(
"quarterly_revenue.png",
"What trend do you see in this revenue chart? Identify any notable changes."
)
print(result)
# Video analysis (Gemini 1.5 exclusive feature)
def analyze_video_from_gcs(gcs_uri: str, question: str) -> str:
    """Analyze a video stored in Google Cloud Storage."""
    message = HumanMessage(
        content=[
            {
                "type": "media",
                "file_uri": gcs_uri,  # gs://bucket/video.mp4
                "mime_type": "video/mp4"
            },
            {
                "type": "text",
                "text": question
            }
        ]
    )
    response = vision_llm.invoke([message])
    return response.content
# Analyze a product demo video
video_analysis = analyze_video_from_gcs(
"gs://my-bucket/product-demo.mp4",
"Summarize the main features demonstrated in this product video."
)
Gemini 1.5 accepts entire video files as input, so you can ask questions about a video directly. This enables use cases like meeting summarization, training video indexing, and product demo analysis, which otherwise require extracting frames or transcripts before calling a model that handles only text or images.
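The video example hardcodes "video/mp4", which only holds for MP4 files. As a small sketch, the content part can be built with a MIME type inferred from the file name using the standard library. The helper name build_media_part is hypothetical; the dictionary keys are the ones used in the example above.

```python
import mimetypes


def build_media_part(gcs_uri: str) -> dict:
    """Build a {"type": "media", ...} content part with a guessed MIME type."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Expected a gs:// URI, got: {gcs_uri}")
    # Guess from the object name only, so the gs:// scheme never confuses the lookup.
    filename = gcs_uri.rsplit("/", 1)[-1]
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"Cannot infer a MIME type from: {filename}")
    return {"type": "media", "file_uri": gcs_uri, "mime_type": mime_type}


print(build_media_part("gs://my-bucket/product-demo.mp4"))
```

The returned dictionary can take the place of the hand-written media entry in the HumanMessage content list.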
Function Calling with ChatVertexAI
from langchain_core.tools import tool
from langchain_google_vertexai import ChatVertexAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
@tool
def get_gcp_project_info(project_id: str) -> str:
    """Get information about a GCP project including billing and resources."""
    # Mock implementation
    return f"Project {project_id}: Region us-central1, Budget $500/month, Services: GCS, BigQuery, Vertex AI"
@tool
def query_bigquery(sql: str, project_id: str) -> str:
    """Execute a BigQuery SQL query and return results."""
    # Mock implementation — replace with actual BigQuery client
    return f"Query executed. Result: 1,247 rows returned. Sample: [('2026-01-01', 15234), ('2026-01-02', 16891)]"
@tool
def list_gcs_buckets(project_id: str) -> str:
    """List Cloud Storage buckets in a GCP project."""
    return f"Buckets in {project_id}: ml-training-data, model-artifacts, raw-data, processed-features"
# Build Vertex AI agent
llm = ChatVertexAI(
model_name="gemini-1.5-pro",
project="your-project-id",
location="us-central1",
temperature=0
)
tools = [get_gcp_project_info, query_bigquery, list_gcs_buckets]
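# bind_tools() attaches the tool schemas for direct llm_with_tools.invoke() calls;
# create_tool_calling_agent below binds the tools itself, so it takes the plain llm
llm_with_tools = llm.bind_tools(tools)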
prompt = ChatPromptTemplate.from_messages([
("system", "You are a Google Cloud assistant. Use the available tools to answer questions about GCP resources."),
MessagesPlaceholder(variable_name="chat_history"),
("human", "{input}"),
MessagesPlaceholder(variable_name="agent_scratchpad")
])
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = executor.invoke({
"input": "What BigQuery data do we have in our project and how much storage are we using?",
"chat_history": []
})
print(result["output"])
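AgentExecutor hides the loop that runs the tools the model asks for. The sketch below shows one dispatch step in plain Python, assuming tool requests arrive as dictionaries with "name" and "args" keys (the shape LangChain exposes on a message's tool_calls). The functions are plain stand-ins for the mock tools above, and TOOL_REGISTRY and run_tool_calls are hypothetical names.

```python
def list_gcs_buckets(project_id: str) -> str:
    """Stand-in for the mock tool above (plain function, no @tool decorator)."""
    return f"Buckets in {project_id}: ml-training-data, model-artifacts"


def get_gcp_project_info(project_id: str) -> str:
    """Stand-in for the mock project-info tool."""
    return f"Project {project_id}: Region us-central1"


# Map tool names to callables so a request can be routed by name.
TOOL_REGISTRY = {fn.__name__: fn for fn in (list_gcs_buckets, get_gcp_project_info)}


def run_tool_calls(tool_calls: list[dict]) -> list[str]:
    """Execute each requested tool and collect its string result."""
    results = []
    for call in tool_calls:
        fn = TOOL_REGISTRY.get(call["name"])
        if fn is None:
            results.append(f"Unknown tool: {call['name']}")
            continue
        results.append(fn(**call["args"]))
    return results


# Hand-written example of what a model's tool request might look like.
requested = [{"name": "list_gcs_buckets", "args": {"project_id": "demo-project"}}]
print(run_tool_calls(requested))
```

In the real agent, each result is sent back to the model as a tool message and the loop repeats until the model answers without requesting another tool.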
RAG Pipeline with Vertex AI
Build a complete RAG system using Vertex AI embeddings and Gemini for generation:
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
# Initialize Vertex AI components
embeddings = VertexAIEmbeddings(
model_name="text-embedding-004",
project="your-project-id",
location="us-central1"
)
llm = ChatVertexAI(
model_name="gemini-1.5-pro",
project="your-project-id",
location="us-central1",
temperature=0,
max_output_tokens=4096
)
# Load and index documents
loader = PyPDFLoader("technical_manual.pdf")
pages = loader.load()
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
chunks = splitter.split_documents(pages)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
collection_name="vertex_rag"
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# RAG prompt for Gemini
rag_prompt = ChatPromptTemplate.from_messages([
("system", """You are a technical documentation assistant.
Answer questions based strictly on the provided context.
If the answer is not in the context, say "This information is not in the provided documentation."
Cite specific sections when possible."""),
("human", """Context:
{context}
Question: {question}
Answer:""")
])
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Page {doc.metadata.get('page', '?')}]\n{doc.page_content}"
        for doc in docs
    )
rag_chain = (
RunnableParallel({
"context": retriever | format_docs,
"question": RunnablePassthrough()
})
| rag_prompt
| llm
| StrOutputParser()
)
# Query the RAG system
answer = rag_chain.invoke("What are the safety requirements for high-voltage operations?")
print(answer)
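The chunk_size=1000 and chunk_overlap=100 settings are easier to reason about with a simplified model. The sketch below is a fixed-width sliding window over characters, not the recursive splitter itself: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and sentences, so its boundaries will differ. The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 2500
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # → [1000, 1000, 700]
```

Each chunk repeats the last 100 characters of the previous one, which keeps a sentence that straddles a boundary retrievable from at least one chunk.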
Long-Context Processing with Gemini 1.5 Pro
Gemini's 1M token context window enables "whole-document RAG" — feeding an entire document as context:
from langchain_google_vertexai import ChatVertexAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
llm = ChatVertexAI(
model_name="gemini-1.5-pro",
project="your-project-id",
location="us-central1",
temperature=0,
max_output_tokens=8192
)
# Load an entire book or long document
loader = PyPDFLoader("full_textbook.pdf")
all_pages = loader.load()
full_text = "\n\n".join(page.page_content for page in all_pages)
print(f"Document length: {len(full_text.split())} words")
# → For a 500-page book: ~125,000 words ≈ 166,000 tokens (well within 1M limit)
# Ask questions about the entire document at once
prompt = ChatPromptTemplate.from_messages([
("system", "You are analyzing a complete technical document. Answer questions about its full content."),
("human", """Here is the complete document:
{document}
Question: {question}""")
])
chain = prompt | llm
# No chunking or retrieval needed for documents under ~700K words
answer = chain.invoke({
"document": full_text,
"question": "What are the three most important concepts introduced in Chapter 7, and how do they relate to each other?"
})
print(answer.content)
This is a fundamentally different approach to document QA compared to traditional RAG. For documents under 700K words, you can skip chunking and retrieval entirely and just pass everything to Gemini. The RAG system tutorial compares both approaches with benchmarks.
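A rough pre-check can decide between the two approaches before any API call is made. The sketch below uses the words-to-tokens ratio implied by the estimate above (125,000 words ≈ 166,000 tokens, about 4/3 tokens per word). The ratio is an approximation for English prose, the reserved-token budget is an assumption, and the function names are hypothetical; for exact numbers, count tokens with the model's own tokenizer.

```python
TOKENS_PER_WORD = 4 / 3  # rough English-text ratio; real tokenization varies


def estimate_tokens(text: str) -> int:
    """Approximate token count from a whitespace word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def choose_strategy(text: str, context_limit: int = 1_000_000, reserved_tokens: int = 10_000) -> str:
    """Pick whole-document prompting when the text fits, otherwise fall back to RAG.

    reserved_tokens leaves room for the prompt wrapper and the model's answer
    (an assumed budget, not a documented requirement).
    """
    if estimate_tokens(text) + reserved_tokens <= context_limit:
        return "whole-document"
    return "rag"


book = " ".join(["word"] * 125_000)
print(estimate_tokens(book), choose_strategy(book))
```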
Vertex AI vs OpenAI Pricing Comparison (2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Gemini 1.5 Pro (≤128K) | $1.25 | $5.00 | 1M tokens |
| Gemini 1.5 Pro (>128K) | $2.50 | $10.00 | 1M tokens |
| Gemini 1.5 Flash (≤128K) | $0.075 | $0.30 | 1M tokens |
| Gemini 1.5 Flash (>128K) | $0.15 | $0.60 | 1M tokens |
| GPT-4o | $5.00 | $15.00 | 128K tokens |
| GPT-4o-mini | $0.15 | $0.60 | 128K tokens |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K tokens |
Cost analysis for 10K queries/day (RAG, 4K tokens in / 500 out):
- Gemini 1.5 Flash: $0.075/M × 4K × 10K = $3/day in + $0.30/M × 500 × 10K = $1.50/day out = $4.50/day
- GPT-4o: $5/M × 4K × 10K = $200/day in + $15/M × 500 × 10K = $75/day out = $275/day
- GPT-4o-mini: $0.15/M × 4K × 10K = $6/day in + $0.60/M × 500 × 10K = $3/day out = $9/day
Gemini 1.5 Flash is the most cost-effective option for standard RAG. Gemini 1.5 Pro competes with Claude 3.5 Sonnet at lower per-token pricing.
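The arithmetic above is easy to rerun for your own traffic profile. This is a minimal sketch with the prices copied from the table (short-context tier only); the function name daily_cost and the dictionary layout are illustrative, and current prices should be checked against each provider's pricing page.

```python
# (input USD, output USD) per 1M tokens, copied from the table above.
PRICES_PER_MILLION = {
    "gemini-1.5-flash": (0.075, 0.30),
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def daily_cost(model: str, queries: int = 10_000, tokens_in: int = 4_000, tokens_out: int = 500) -> float:
    """Daily USD cost for a workload, rounded to cents."""
    price_in, price_out = PRICES_PER_MILLION[model]
    cost_in = queries * tokens_in / 1_000_000 * price_in
    cost_out = queries * tokens_out / 1_000_000 * price_out
    return round(cost_in + cost_out, 2)


for name in PRICES_PER_MILLION:
    print(f"{name}: ${daily_cost(name):.2f}/day")
```

With the default workload this reproduces the figures above: $4.50, $275.00, and $9.00 per day.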
Streaming Responses
from langchain_google_vertexai import ChatVertexAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatVertexAI(
model_name="gemini-1.5-flash",
project="your-project-id",
location="us-central1",
streaming=True # Enable streaming
)
prompt = ChatPromptTemplate.from_template(
"Write a detailed explanation of {topic} for a software engineer audience."
)
chain = prompt | llm | StrOutputParser()
# Synchronous streaming
print("Streaming response:")
for chunk in chain.stream({"topic": "transformer attention mechanisms"}):
    print(chunk, end="", flush=True)
print()
# Async streaming
import asyncio
async def stream_async(topic: str):
    print("\nAsync streaming:")
    async for chunk in chain.astream({"topic": topic}):
        print(chunk, end="", flush=True)
    print()
asyncio.run(stream_async("RLHF training process"))
Switching Between Providers
One of LangChain's best features is provider portability. Swap Vertex AI for OpenAI with minimal code changes:
import os
# Switch via environment variable
PROVIDER = os.getenv("LLM_PROVIDER", "vertex")
if PROVIDER == "vertex":
    from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
    llm = ChatVertexAI(
        model_name="gemini-1.5-pro",
        project=os.environ["GCP_PROJECT"],
        location="us-central1"
    )
    embeddings = VertexAIEmbeddings(
        model_name="text-embedding-004",
        project=os.environ["GCP_PROJECT"],
        location="us-central1"
    )
elif PROVIDER == "openai":
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    llm = ChatOpenAI(model="gpt-4o")
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
elif PROVIDER == "anthropic":
    from langchain_anthropic import ChatAnthropic
    from langchain_openai import OpenAIEmbeddings  # Anthropic has no embedding model
    llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# The rest of your RAG chain, agent code, etc. stays identical
This pattern is invaluable for running A/B tests between providers or building provider-agnostic applications. The OpenAI API integration guide covers the OpenAI-specific features that don't map directly to Vertex AI.
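One gap in the switch above is that an unrecognized LLM_PROVIDER value matches no branch, leaving llm undefined until something fails later. A small validation step, sketched here with the standard library only, surfaces the mistake at startup. The function name resolve_provider and the SUPPORTED_PROVIDERS tuple are hypothetical.

```python
import os

SUPPORTED_PROVIDERS = ("vertex", "openai", "anthropic")


def resolve_provider(env=None, default: str = "vertex") -> str:
    """Read LLM_PROVIDER, normalize it, and reject unsupported values."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", default).strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"Unsupported LLM_PROVIDER {provider!r}; expected one of {SUPPORTED_PROVIDERS}"
        )
    return provider


print(resolve_provider({"LLM_PROVIDER": "OpenAI"}))  # → openai
```

Assigning PROVIDER from resolve_provider() keeps the rest of the branching unchanged while making typos such as "vertexai" fail fast.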
Async Batch Processing on Vertex AI
import asyncio
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(
    model_name="gemini-1.5-flash",
    project="your-project-id",
    location="us-central1",
    temperature=0
)
async def process_documents_async(documents: list[str]) -> list[str | BaseException]:
    """Process multiple documents concurrently."""
    prompt = ChatPromptTemplate.from_template(
        "Summarize this document in 2 sentences: {doc}"
    )
    chain = prompt | llm | StrOutputParser()
    # Cap in-flight requests at 20; tune this to your project's quota
    semaphore = asyncio.Semaphore(20)
    async def process_one(doc: str) -> str:
        async with semaphore:
            return await chain.ainvoke({"doc": doc})
    tasks = [process_one(doc) for doc in documents]
    # return_exceptions=True keeps one failure from aborting the batch;
    # failed items come back as exception objects in place of summaries
    return await asyncio.gather(*tasks, return_exceptions=True)
# Process 100 documents in parallel
documents = [f"Technical document {i} with content about ML systems..." for i in range(100)]
summaries = asyncio.run(process_documents_async(documents))
failed = sum(isinstance(s, BaseException) for s in summaries)
print(f"Processed {len(summaries)} documents ({failed} failed)")
For large-scale document processing, Vertex AI's batch API is even more cost-effective — 50% discount on gemini-1.5-flash for asynchronous batch jobs. See Google's Vertex AI Batch Prediction docs for setup.
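Because the batch code above passes return_exceptions=True to asyncio.gather, the returned list can mix summaries with exception objects. The sketch below shows one way to separate the two, using a stub coroutine in place of chain.ainvoke so it runs without any cloud credentials. The names summarize_stub and run_all are hypothetical.

```python
import asyncio


async def summarize_stub(doc: str) -> str:
    """Stand-in for chain.ainvoke; fails on empty input to simulate an API error."""
    await asyncio.sleep(0)
    if not doc:
        raise ValueError("empty document")
    return doc[:20]


async def run_all(documents: list[str], limit: int = 20):
    """Run the stub concurrently and split results into successes and failures."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(doc: str) -> str:
        async with semaphore:
            return await summarize_stub(doc)

    results = await asyncio.gather(*(guarded(d) for d in documents), return_exceptions=True)
    # Key both dicts by input position so failed documents can be retried later.
    successes = {i: r for i, r in enumerate(results) if not isinstance(r, BaseException)}
    failures = {i: r for i, r in enumerate(results) if isinstance(r, BaseException)}
    return successes, failures


ok, failed = asyncio.run(run_all(["first document", "", "third document"]))
print(f"{len(ok)} succeeded, {len(failed)} failed")
```

Keeping the original index for each failure makes it straightforward to resubmit only the documents that did not get a summary.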
Production Considerations
Quotas and Rate Limits:
- Gemini 1.5 Pro: 360 requests/minute, 4M tokens/minute
- Gemini 1.5 Flash: 1,000 requests/minute, 4M tokens/minute
- Request increases via GCP support for production workloads
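A per-minute request quota translates directly into a minimum spacing between calls: 360 requests/minute is one request every 60 / 360 ≈ 0.167 seconds. The sketch below is a simple single-process pacer built on that arithmetic. The class name RequestPacer is hypothetical, it tracks requests only (not the tokens-per-minute limit), and it does not coordinate across multiple workers.

```python
import time


class RequestPacer:
    """Space out calls so they stay under a requests-per-minute quota."""

    def __init__(self, requests_per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following slot one interval after this call's slot.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


pacer = RequestPacer(requests_per_minute=360)  # Gemini 1.5 Pro figure from the list above
print(f"Minimum spacing: {pacer.min_interval:.3f}s")
```

Calling pacer.wait() before each llm.invoke() keeps a sequential loop under the quota; concurrent or multi-process workloads need a shared limiter instead.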
Regional Availability:
- Models are available in us-central1, us-east4, europe-west4, and several others
- Deploy in the same region as your other GCP services to minimize latency and egress costs
Logging and Monitoring:
from langchain_core.callbacks import StdOutCallbackHandler
from langchain_google_vertexai import ChatVertexAI
# Enable Cloud Logging via GCP (automatic when running on GCP)
# For local development, use LangChain callbacks
llm = ChatVertexAI(
model_name="gemini-1.5-pro",
project="your-project-id",
location="us-central1",
callbacks=[StdOutCallbackHandler()] # Log all LLM calls
)
For deploying Vertex AI applications to production, the Deploy AI model to production guide covers Cloud Run, GKE, and serverless deployment patterns. For agent architectures compatible with Vertex AI, see Build AI agent with LangChain and AI agent memory and planning.
Frequently Asked Questions
Do I need a Google Cloud account to use LangChain with Vertex AI?
Yes. Vertex AI requires a Google Cloud project with billing enabled. You authenticate either via Application Default Credentials (gcloud auth application-default login) or a service account JSON key. New GCP accounts receive $300 in free credits, which covers substantial Vertex AI usage for testing.
How does Gemini 1.5 Pro compare to GPT-4o for RAG applications?
Gemini 1.5 Pro has a 1 million token context window (vs 128K for GPT-4o), making it better for whole-document RAG where you want to pass entire PDFs. GPT-4o generally has faster response times and broader tool ecosystem support. For pure context size, Gemini wins; for latency and ecosystem, OpenAI wins.
Can I use Vertex AI embeddings with Pinecone or ChromaDB in LangChain?
Yes. VertexAIEmbeddings is a drop-in replacement for OpenAIEmbeddings in any LangChain vector store integration. Just replace the embeddings parameter with VertexAIEmbeddings() and the rest of your RAG code stays unchanged.
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.