How to Use LangChain with Modal (Serverless GPU 2026)
Deploy LangChain pipelines on Modal's serverless GPU infrastructure — run local LLMs, scale to zero, and cut inference costs with cold-start optimization.
OpenAI's API is convenient until it is not — rate limits, per-token costs that scale linearly with traffic, and data residency requirements all push teams toward self-hosted models. But running your own GPU infrastructure is expensive and operationally complex. Modal threads the needle: it provides on-demand GPU compute that scales to zero when idle and bills by the millisecond of actual usage.
This guide shows how to deploy LangChain pipelines on Modal, including running a local LLM (Llama 3), building a serverless RAG endpoint, and optimizing cold-start performance. You will also get a comparison table of serverless AI platforms.
Before working through this, review Deploy AI model to production for deployment fundamentals and RAG system tutorial for the RAG pipeline we will be deploying.
Why Modal for LangChain Deployments
Traditional deployment options all have trade-offs:
- OpenAI API: Easiest, but expensive at scale and subject to rate limits
- AWS EC2 with GPU: Full control, but you pay for idle time and manage infrastructure
- Kubernetes with GPU nodes: Powerful, but high operational overhead
- Modal: Serverless GPUs, scale to zero, millisecond billing, minimal ops
As a rough illustration, a team running a research assistant that handles 10,000 queries per day might pay around $800/month for a hosted API, while the same workload on Modal with a self-hosted 7B model might cost $120–$200/month. These are ballpark figures, not quotes: the break-even depends heavily on query volume, average tokens per query, GPU seconds per request, and how long idle containers are kept warm.
Installation and Setup
pip install modal langchain langchain-openai
modal token new # authenticates with Modal's cloud
Your Modal token is stored in ~/.modal.toml after running modal token new.
Your First Modal Function
Start with a simple CPU function to understand the Modal programming model:
# app.py
import modal

app = modal.App("langchain-demo")


@app.function()
def greet(name: str) -> str:
    return f"Hello from Modal, {name}!"


@app.local_entrypoint()
def main():
    result = greet.remote("LangChain")
    print(result)
Run it:
modal run app.py
# Hello from Modal, LangChain!
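The function runs in Modal's cloud, not locally. The @app.function() decorator registers greet with the app, and calling greet.remote(...) is what executes it in a Modal container; greet.local(...) would run the same code in your local process.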
Running LangChain with OpenAI on Modal
The simplest LangChain + Modal setup uses the OpenAI API from Modal's cloud:
import modal
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

app = modal.App("langchain-openai")

# Define the container image with required packages
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "langchain",
    "langchain-openai",
    "openai",
)


@app.function(
    image=image,
    secrets=[modal.Secret.from_name("openai-secret")],
    timeout=300,
)
def run_langchain_chain(question: str) -> str:
    """Runs a LangChain chain on Modal's cloud infrastructure."""
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_template(
        "Answer concisely: {question}"
    )
    chain = prompt | llm | StrOutputParser()
    return chain.invoke({"question": question})


@app.local_entrypoint()
def main():
    # Call the remote function
    answer = run_langchain_chain.remote("What is LangChain?")
    print(answer)
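modal.Secret.from_name("openai-secret") references a secret named openai-secret that you created in Modal's dashboard. The secret must expose your API key as the OPENAI_API_KEY environment variable, which is where ChatOpenAI looks for it by default.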
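A missing or misnamed secret usually surfaces as an authentication error deep inside the chain. The sketch below fails earlier with a clearer message. The helper name require_env is hypothetical and not part of Modal or LangChain; it only reads the process environment.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: call it at the top of a Modal function to confirm
    that the attached secret actually exposes the expected key.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check that the Modal "
            "secret is attached to the function and exposes this key."
        )
    return value
```

Calling require_env("OPENAI_API_KEY") as the first line of run_langchain_chain would be one way to use it.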
Running a Local LLM on Modal GPU
This is where Modal becomes genuinely powerful. Deploy Llama 3 8B on a GPU and use it as a LangChain-compatible LLM:
import os

import modal

app = modal.App("langchain-llama3")

# Container image with GPU ML stack
gpu_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch",
        "transformers",
        "accelerate",
        "bitsandbytes",
        "langchain",
        "langchain-community",
        "huggingface_hub",
    )
)

# Volume to cache model weights across container starts
model_volume = modal.Volume.from_name("llama3-weights", create_if_missing=True)
MODEL_DIR = "/models"
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"


@app.function(
    image=gpu_image,
    gpu="A10G",
    volumes={MODEL_DIR: model_volume},
    timeout=600,
    secrets=[modal.Secret.from_name("hf-secret")],  # Hugging Face token
)
def download_model():
    """
    Pre-download model weights to the volume.
    Run this once before deploying.
    """
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=f"{MODEL_DIR}/{MODEL_ID}",
        token=os.environ["HF_TOKEN"],  # key name must match the one in hf-secret
    )
    # Persist the downloaded files so later containers can see them
    model_volume.commit()
    print(f"Model downloaded to {MODEL_DIR}/{MODEL_ID}")
@app.cls(
    image=gpu_image,
    gpu="A10G",
    volumes={MODEL_DIR: model_volume},
    timeout=300,
    # Keep the container warm for 2 minutes after the last request.
    # Recent Modal releases rename this parameter to scaledown_window.
    container_idle_timeout=120,
)
class LlamaLangChain:
    """A LangChain-compatible Llama 3 endpoint on Modal GPU."""

    @modal.enter()
    def load_model(self):
        """Called once when the container starts: loads the model into GPU memory."""
        from transformers import AutoTokenizer, AutoModelForCausalLM
        import torch

        model_path = f"{MODEL_DIR}/{MODEL_ID}"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        print("Llama 3 loaded into GPU memory")

    @modal.method()
    def generate(self, prompt: str, max_new_tokens: int = 512) -> str:
        """Generate text from a prompt."""
        import torch

        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ]
        # Render the chat messages into the model's prompt format
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )
        inputs = self.tokenizer(text, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        # Drop the prompt tokens and decode only the newly generated ones
        generated = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(generated, skip_special_tokens=True)
@app.local_entrypoint()
def main():
    llama = LlamaLangChain()
    response = llama.generate.remote(
        "Explain the difference between RAG and fine-tuning in one paragraph."
    )
    print(response)
Using the Modal LLM in a LangChain Pipeline
To use the Modal-deployed LLM within a standard LangChain chain, wrap it in a custom BaseLLM:
import asyncio
from typing import Any, Optional

import modal
from langchain_core.language_models.llms import BaseLLM
from langchain_core.outputs import LLMResult, Generation


class ModalLlamaLLM(BaseLLM):
    """LangChain-compatible wrapper around a Modal-deployed Llama 3 endpoint."""

    model_name: str = "llama-3-8b"
    max_new_tokens: int = 512

    @property
    def _llm_type(self) -> str:
        return "modal_llama"

    def _generate(
        self,
        prompts: list[str],
        stop: Optional[list[str]] = None,
        **kwargs: Any,
    ) -> LLMResult:
        # Get a reference to the deployed Modal class.
        # The langchain-llama3 app must already be deployed with `modal deploy`.
        LlamaLangChain = modal.Cls.from_name("langchain-llama3", "LlamaLangChain")
        llama = LlamaLangChain()
        generations = []
        for prompt in prompts:
            # Note: `stop` is accepted but not forwarded to the remote endpoint
            response = llama.generate.remote(
                prompt,
                max_new_tokens=self.max_new_tokens,
            )
            generations.append([Generation(text=response)])
        return LLMResult(generations=generations)

    async def _agenerate(
        self,
        prompts: list[str],
        stop: Optional[list[str]] = None,
        **kwargs: Any,
    ) -> LLMResult:
        LlamaLangChain = modal.Cls.from_name("langchain-llama3", "LlamaLangChain")
        llama = LlamaLangChain()
        # Run the blocking remote calls in worker threads so they overlap
        tasks = [
            asyncio.to_thread(
                llama.generate.remote,
                prompt,
                self.max_new_tokens,
            )
            for prompt in prompts
        ]
        responses = await asyncio.gather(*tasks)
        generations = [[Generation(text=r)] for r in responses]
        return LLMResult(generations=generations)


# Use in a standard LangChain pipeline
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ModalLlamaLLM(max_new_tokens=256)
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm | StrOutputParser()
answer = chain.invoke({"question": "What is serverless computing?"})
print(answer)
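The wrapper accepts a stop argument but never uses it, so components that depend on stop sequences (some agent loops, for example) would receive untruncated text. One possible way to honor it on the client side is to cut the response at the earliest stop sequence. The helper name apply_stop is hypothetical; this is a minimal sketch, not part of LangChain.

```python
from typing import Optional


def apply_stop(text: str, stop: Optional[list[str]] = None) -> str:
    """Truncate text at the earliest occurrence of any stop sequence."""
    if not stop:
        return text
    cut = len(text)
    for seq in stop:
        if not seq:
            continue  # ignore empty stop strings
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


print(apply_stop("Answer: 42\nObservation: done", ["\nObservation:"]))
# prints "Answer: 42"
```

Inside _generate, wrapping the remote response as apply_stop(response, stop) before building the Generation would be one place to apply it. Truncating after the fact still pays for the discarded tokens; stopping generation on the GPU would need support in the endpoint itself.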
Building a Serverless RAG Endpoint on Modal
A complete RAG API that runs entirely on Modal:
import modal
from fastapi import FastAPI
from pydantic import BaseModel

app = modal.App("langchain-rag-api")

rag_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "langchain",
        "langchain-openai",
        "langchain-chroma",
        "chromadb",
        "fastapi",
        "uvicorn",
    )
)

chroma_volume = modal.Volume.from_name("chroma-db", create_if_missing=True)
CHROMA_DIR = "/chroma"

web_app = FastAPI()


class QueryRequest(BaseModel):
    question: str
    k: int = 4  # number of documents to retrieve


@app.function(
    image=rag_image,
    volumes={CHROMA_DIR: chroma_volume},
    secrets=[modal.Secret.from_name("openai-secret")],
    timeout=60,
    container_idle_timeout=300,
)
@modal.asgi_app()
def rag_api():
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain_chroma import Chroma
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.output_parsers import StrOutputParser

    # Built once per container and reused across requests
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_store = Chroma(
        collection_name="knowledge_base",
        embedding_function=embeddings,
        persist_directory=CHROMA_DIR,
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_template(
        "Answer based on context only.\n"
        "Context: {context}\n"
        "Question: {question}\n"
    )

    def format_docs(docs):
        return "\n\n".join(d.page_content for d in docs)

    @web_app.post("/query")
    async def query(request: QueryRequest) -> dict:
        # Build the retriever per request so the caller's k is honored
        retriever = vector_store.as_retriever(search_kwargs={"k": request.k})
        rag_chain = (
            {"context": retriever | format_docs,
             "question": RunnablePassthrough()}
            | prompt | llm | StrOutputParser()
        )
        answer = await rag_chain.ainvoke(request.question)
        return {"answer": answer, "question": request.question}

    @web_app.get("/health")
    async def health():
        return {"status": "ok"}

    return web_app
Deploy and call it:
modal deploy app.py
# Deployed: https://your-workspace--langchain-rag-api-rag-api.modal.run
import requests
response = requests.post(
"https://your-workspace--langchain-rag-api-rag-api.modal.run/query",
json={"question": "How does the indexing API work?"}
)
print(response.json()["answer"])
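The first request after a quiet period may hit a cold container and time out on the client side. A small retry wrapper with exponential backoff is one way to absorb that. The sketch below is generic and not tied to Modal; the name call_with_retries and its parameters are hypothetical, and it catches every Exception for brevity, where real code would narrow that to network and timeout errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception = RuntimeError("unreachable")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this in production code
            last_error = exc
            if attempt < attempts - 1:
                # Delays: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

With the client above, this could be used as call_with_retries(lambda: requests.post(url, json=payload, timeout=60).json()), where url and payload are the endpoint and request body already shown.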
GPU Selection Guide
| GPU | VRAM | Best for | Modal cost (approx) |
|---|---|---|---|
| T4 | 16 GB | Embeddings, quantized 7B models | $0.59/hr |
| A10G | 24 GB | 7B–8B models in float16, quantized 13B, production inference | $1.10/hr |
| A100 (40 GB) | 40 GB | 13B in float16, quantized 34B, fine-tuning | $3.04/hr |
| A100 (80 GB) | 80 GB | 34B in float16, quantized 70B, large batches | $3.72/hr |
| H100 | 80 GB | Maximum throughput | $3.96/hr |
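For most LangChain + 7B–8B model deployments, the A10G is the sweet spot. Llama 3 8B in float16 needs roughly 16 GB for weights (8 billion parameters × 2 bytes each), which leaves about 8 GB of the A10G's 24 GB for the KV cache and activations. That is typically enough for moderate concurrency at ordinary context lengths.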
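The sizing rule behind these recommendations is simple arithmetic: parameter count times bytes per parameter, plus headroom for the KV cache and activations. The sketch below makes that explicit. The function names and the 4 GB default headroom are assumptions chosen for illustration, and the 4-bit figure ignores quantization overhead, so treat the results as lower bounds.

```python
# Approximate bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billions: float, dtype: str = "float16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


def fits(
    params_billions: float,
    vram_gb: float,
    dtype: str = "float16",
    headroom_gb: float = 4.0,  # assumed allowance for KV cache and activations
) -> bool:
    """Rough check: do the weights plus headroom fit in the given VRAM?"""
    return weight_memory_gb(params_billions, dtype) + headroom_gb <= vram_gb


print(weight_memory_gb(8))        # 16.0  (Llama 3 8B in float16)
print(fits(8, 24))                # True  (A10G)
print(fits(13, 24))               # False (13B in float16 needs 26 GB)
print(fits(13, 24, dtype="4bit")) # True  (about 6.5 GB of weights)
```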
Serverless AI Platform Comparison
| Platform | GPU support | Scale to zero | Cold start | Container image | Best for |
|---|---|---|---|---|---|
| Modal | Yes (T4–H100) | Yes | 5–30s | Full control | ML inference, research |
| AWS Lambda | No | Yes | ~100ms | Limited (10 GB) | CPU-only pipelines |
| Google Cloud Run | Yes (NVIDIA L4) | Yes | ~500ms (CPU) | Full control | Containerized APIs |
| Replicate | Yes | Yes | 10–60s | Pre-built models | Hosted model APIs |
| RunPod | Yes | Partial | Manual | Full control | Long-running training |
| Banana (defunct) | Yes | Yes | 2–10s | Limited | Was: fast inference |
Modal's combination of GPU support, true scale-to-zero, and container flexibility makes it uniquely suited for LangChain workloads that use local models.
Optimizing Cold Start Performance
Cold starts are the main operational challenge with serverless GPU deployments:
@app.cls(
    image=gpu_image,
    gpu="A10G",
    volumes={MODEL_DIR: model_volume},
    # Keep containers warm between requests
    # (recent Modal releases rename this parameter to scaledown_window)
    container_idle_timeout=300,  # 5 minutes
    # Let each container process up to 10 inputs at once. This does not
    # pre-scale containers; a minimum warm pool is a separate setting
    # (keep_warm, or min_containers in recent Modal releases).
    allow_concurrent_inputs=10,
    # Reserve enough host RAM (in MiB) to avoid OOM while loading the model
    memory=32768,  # 32 GiB
)
class OptimizedLlamaEndpoint:
    @modal.enter()
    def load(self):
        # Use bitsandbytes 4-bit (NF4) quantization: weights take roughly a
        # quarter of their float16 size, while compute still runs in float16
        from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
        import torch

        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
        )
        model_path = f"{MODEL_DIR}/{MODEL_ID}"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            quantization_config=quantization_config,
            device_map="auto",
        )
        print("Model loaded with 4-bit quantization")
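With container_idle_timeout=300, Modal keeps a container warm for 5 minutes after its last request. If requests arrive more often than once every 5 minutes, cold starts become rare. The trade-off is cost: a warm container generally keeps accruing charges while it idles, so a longer timeout buys lower latency with GPU time. For the Build AI chatbot Python use case, this is usually sufficient.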
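To see how the idle timeout interacts with a traffic pattern, it can help to replay request timestamps against a simplified model. The sketch below assumes a single container, ignores model load time and concurrency, and is not a description of Modal's actual scheduler. The function name count_cold_starts is hypothetical.

```python
def count_cold_starts(
    arrival_times: list[float],
    idle_timeout: float,
    request_duration: float = 0.0,
) -> int:
    """Count requests that arrive after the container has scaled down.

    Simplified model: one container, which stays warm for idle_timeout
    seconds after it finishes its most recent request.
    """
    cold = 0
    warm_until = None  # time at which the container would shut down
    for t in sorted(arrival_times):
        if warm_until is None or t > warm_until:
            cold += 1  # no warm container: this request pays a cold start
        warm_until = t + request_duration + idle_timeout
    return cold


arrivals = [0, 100, 500, 1000]  # seconds
print(count_cold_starts(arrivals, idle_timeout=300))  # 3
print(count_cold_starts(arrivals, idle_timeout=600))  # 1
```

Feeding real request logs into a model like this gives a quick estimate of how often users would wait on a cold start at a given timeout.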
Cost Estimation Script
def estimate_modal_cost(
    daily_requests: int,
    avg_generation_time_seconds: float,
    gpu_type: str = "A10G",
) -> dict:
    # Approximate hourly rates in USD, matching the GPU table above
    GPU_HOURLY_RATES = {
        "T4": 0.59,
        "A10G": 1.10,
        "A100_40": 3.04,
        "A100_80": 3.72,
        "H100": 3.96,
    }
    rate = GPU_HOURLY_RATES.get(gpu_type, 1.10)
    # Only time spent generating is counted; idle warm time is not included
    daily_gpu_seconds = daily_requests * avg_generation_time_seconds
    daily_cost = (daily_gpu_seconds / 3600) * rate
    monthly_cost = daily_cost * 30
    return {
        "gpu_type": gpu_type,
        "daily_requests": daily_requests,
        "avg_generation_seconds": avg_generation_time_seconds,
        "daily_gpu_hours": round(daily_gpu_seconds / 3600, 2),
        "daily_cost_usd": round(daily_cost, 2),
        "monthly_cost_usd": round(monthly_cost, 2),
    }


# 10,000 requests/day, 3 seconds each on A10G
estimate = estimate_modal_cost(10000, 3.0, "A10G")
print(f"Monthly cost estimate: ${estimate['monthly_cost_usd']}")
# Monthly cost estimate: $275.0
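Compare this to a hosted API, keeping the units straight. 10,000 requests per day × 500 tokens each is 5 million tokens per day, or 150 million tokens per month. A $750/month API bill therefore corresponds to a blended price of about $5 per million tokens, and at that price the self-hosted Modal deployment saves approximately 63% ((750 − 275) / 750). At $0.00015 per token ($150 per million) the API bill would be far higher, and at small-model prices well under $1 per million tokens it would be far lower, so check your provider's current price list before drawing a conclusion.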
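Token prices are usually quoted per million tokens, and mixing that up with per-token or per-thousand prices shifts the result by orders of magnitude. The sketch below keeps the unit explicit. The function names are hypothetical and the $5 per million tokens is an illustrative price, not a quote from any provider.

```python
def monthly_api_cost(
    daily_requests: int,
    tokens_per_request: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Monthly API spend in USD, with the price expressed per 1M tokens."""
    daily_tokens = daily_requests * tokens_per_request
    return daily_tokens / 1_000_000 * usd_per_million_tokens * days


def savings_fraction(api_monthly: float, self_hosted_monthly: float) -> float:
    """Fraction saved by self-hosting; negative if self-hosting costs more."""
    return 1 - self_hosted_monthly / api_monthly


api = monthly_api_cost(10_000, 500, usd_per_million_tokens=5.0)
print(api)                                    # 750.0
print(round(savings_fraction(api, 275.0), 3)) # 0.633
```

Rerunning this with a real price list and the output of estimate_modal_cost shows quickly which side of the break-even a given workload falls on.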
For more on structuring LangChain for production, see LangChain tutorial 2025 and AI agents explained.
Frequently Asked Questions
What makes Modal different from AWS Lambda or Google Cloud Functions for AI workloads? Standard serverless platforms (Lambda, Cloud Functions) do not support GPUs and have strict memory limits. Modal is purpose-built for ML workloads: it provides GPU access (A10G, A100, H100), supports container images with large model weights, has no cold-start timeout issues for GPU containers, and bills by the millisecond of actual GPU compute used rather than charging for idle time.
How does Modal handle model weight loading on cold starts? Modal uses volume mounts and container image layers to cache model weights. You can pre-download model weights into the container image at build time, or use Modal volumes to persist weights across container restarts. A well-configured setup brings a 7B parameter model from cold start to first token in under 30 seconds.
Can I run LangChain with open-source models on Modal instead of OpenAI? Yes — that is one of Modal's primary use cases. You can run Llama 3, Mistral, Gemma 2, Qwen, or any Hugging Face model on Modal GPUs and expose it as a LangChain-compatible endpoint. This gives you full control over the model, avoids per-token costs for high-volume applications, and keeps your data on infrastructure you control.
AiTechWorlds Team