How to Use LangChain with Modal (Serverless GPU 2026)

⚡ Quick Answer

Deploy LangChain pipelines on Modal's serverless GPU infrastructure — run local LLMs, scale to zero, and cut inference costs with cold-start optimization.

AiTechWorlds Team May 31, 2026 10 min read

#LangChain #Modal #serverless #GPU #deployment

OpenAI's API is convenient until it is not: rate limits, per-token costs that scale linearly with traffic, and data residency requirements all push teams toward self-hosted models. But running your own GPU infrastructure is expensive and operationally complex. Modal offers a middle path: on-demand GPU compute that scales to zero when idle and bills per second of actual usage.

This guide shows how to deploy LangChain pipelines on Modal, including running a local LLM (Llama 3), building a serverless RAG endpoint, and optimizing cold-start performance. You will also get a comparison table of serverless AI platforms.

Before working through this, review Deploy AI model to production for deployment fundamentals and RAG system tutorial for the RAG pipeline we will be deploying.

Traditional deployment options all have trade-offs:

OpenAI API: Easiest, but expensive at scale and subject to rate limits
AWS EC2 with GPU: Full control, but you pay for idle time and manage infrastructure
Kubernetes with GPU nodes: Powerful, but high operational overhead
Modal: Serverless GPUs, scale to zero, per-second billing, minimal ops

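A team running a research assistant that handles 10,000 queries per day might pay on the order of $800/month on OpenAI. On Modal with a self-hosted 7B model, the same workload might cost a few hundred dollars per month; the worked estimate in the Cost Estimation Script section comes to about $275/month on an A10G at 3 seconds per query. These figures are illustrative: the break-even depends heavily on query volume, average tokens per query, and current pricing on both sides.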
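
The break-even can be sketched numerically. The following is a minimal illustration rather than a pricing tool: it assumes a 30-day month, bills only the seconds a GPU spends generating, and takes every price as an input you supply. The function names are hypothetical.

```python
def api_monthly_cost(daily_queries: int, tokens_per_query: int,
                     usd_per_million_tokens: float) -> float:
    """Monthly bill for a pay-per-token API (30-day month)."""
    monthly_tokens = daily_queries * tokens_per_query * 30
    return monthly_tokens / 1_000_000 * usd_per_million_tokens


def gpu_monthly_cost(daily_queries: int, gpu_seconds_per_query: float,
                     usd_per_gpu_hour: float) -> float:
    """Monthly bill for serverless GPU time, counting only busy seconds."""
    monthly_gpu_hours = daily_queries * gpu_seconds_per_query * 30 / 3600
    return monthly_gpu_hours * usd_per_gpu_hour


def break_even_price_per_million(tokens_per_query: int,
                                 gpu_seconds_per_query: float,
                                 usd_per_gpu_hour: float) -> float:
    """API price (USD per 1M tokens) at which both options cost the same.

    In this simplified model both bills grow linearly with volume,
    so the query count cancels out of the comparison.
    """
    gpu_cost_per_query = gpu_seconds_per_query / 3600 * usd_per_gpu_hour
    return gpu_cost_per_query / tokens_per_query * 1_000_000


# 500 tokens and 3 GPU-seconds per query at $1.10 per GPU-hour
print(round(break_even_price_per_million(500, 3.0, 1.10), 2))
```

In this model self-hosting only wins when the API charges more per million tokens than the printed value. Query volume enters in practice through effects the sketch leaves out, such as warm-idle time and cold starts, which weigh more heavily on low-traffic deployments.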

Installation and Setup

pip install modal langchain langchain-openai
modal token new  # authenticates with Modal's cloud

Your Modal token is stored in ~/.modal.toml after running modal token new.

Start with a simple CPU function to understand the Modal programming model:

# app.py
import modal

app = modal.App("langchain-demo")

@app.function()
def greet(name: str) -> str:
    return f"Hello from Modal, {name}!"

@app.local_entrypoint()
def main():
    result = greet.remote("LangChain")
    print(result)

Run it:

modal run app.py
# Hello from Modal, LangChain!

The function body runs in a container in Modal's cloud, not on your machine. The @app.function() decorator registers it with the app, and calling greet.remote() dispatches the call there.

The simplest LangChain + Modal setup uses the OpenAI API from Modal's cloud:

import modal
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

app = modal.App("langchain-openai")

# Define the container image with required packages
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "langchain",
    "langchain-openai",
    "openai"
)

@app.function(
    image=image,
    secrets=[modal.Secret.from_name("openai-secret")],
    timeout=300
)
def run_langchain_chain(question: str) -> str:
    """
    Runs a LangChain chain on Modal's cloud infrastructure.
    """
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_template(
        "Answer concisely: {question}"
    )
    chain = prompt | llm | StrOutputParser()
    return chain.invoke({"question": question})


@app.local_entrypoint()
def main():
    # Call the remote function
    answer = run_langchain_chain.remote("What is LangChain?")
    print(answer)

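modal.Secret.from_name("openai-secret") references a secret named openai-secret that you created in Modal's dashboard. Store your OpenAI API key in it under the key OPENAI_API_KEY: Modal exposes the secret's entries as environment variables inside the container, and ChatOpenAI reads the key from that variable.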
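
A misnamed key inside the secret usually surfaces as an authentication error deep in the chain. A small fail-fast check can make that easier to diagnose. This is an illustrative sketch: require_env is a hypothetical helper, not part of Modal or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper: call it at the start of a Modal function to
    confirm that the attached secret exposes the expected key.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is missing or empty; "
            "check the key name stored in the Modal secret."
        )
    return value
```

Calling require_env("OPENAI_API_KEY") as the first line of run_langchain_chain would stop the function before any model call is attempted.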

This is where Modal becomes genuinely powerful. Deploy Llama 3 8B on a GPU and use it as a LangChain-compatible LLM:

import modal

app = modal.App("langchain-llama3")

# Container image with GPU ML stack
gpu_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch",
        "transformers",
        "accelerate",
        "bitsandbytes",
        "langchain",
        "langchain-community",
        "huggingface_hub"
    )
)

# Volume to cache model weights across container starts
model_volume = modal.Volume.from_name("llama3-weights", create_if_missing=True)
MODEL_DIR = "/models"
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"


@app.function(
    image=gpu_image,
    volumes={MODEL_DIR: model_volume},
    timeout=600,
    # The secret must expose HF_TOKEN; Llama 3 is a gated model on Hugging Face
    secrets=[modal.Secret.from_name("hf-secret")]
)
def download_model():
    """
    Pre-download model weights to the volume (no GPU is needed for this step).
    Run this once before deploying: modal run app.py::download_model
    """
    from huggingface_hub import snapshot_download
    import os

    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=f"{MODEL_DIR}/{MODEL_ID}",
        token=os.environ["HF_TOKEN"]
    )
    # Persist the downloaded files so later containers can read them
    model_volume.commit()
    print(f"Model downloaded to {MODEL_DIR}/{MODEL_ID}")


@app.cls(
    image=gpu_image,
    gpu="A10G",
    volumes={MODEL_DIR: model_volume},
    timeout=300,
    # Keep the container warm for 2 minutes after the last request.
    # Recent Modal releases rename this parameter to scaledown_window;
    # check the version you have installed.
    container_idle_timeout=120
)
class LlamaLangChain:
    """
    A Llama 3 endpoint on a Modal GPU, callable from LangChain through a wrapper.
    """

    @modal.enter()
    def load_model(self):
        """Called when container starts — loads model into GPU memory."""
        from transformers import AutoTokenizer, AutoModelForCausalLM
        import torch

        model_path = f"{MODEL_DIR}/{MODEL_ID}"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        print("Llama 3 loaded into GPU memory")

    @modal.method()
    def generate(self, prompt: str, max_new_tokens: int = 512) -> str:
        """Generate text from a prompt."""
        import torch

        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt").to("cuda")

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )

        generated = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(generated, skip_special_tokens=True)


@app.local_entrypoint()
def main():
    llama = LlamaLangChain()
    response = llama.generate.remote(
        "Explain the difference between RAG and fine-tuning in one paragraph."
    )
    print(response)

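To use the Modal-deployed LLM within a standard LangChain chain, wrap it in a custom BaseLLM subclass. The wrapper looks up the class by app name, so the langchain-llama3 app must already be deployed (modal deploy) before this code runs: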

from langchain_core.language_models.llms import BaseLLM
from langchain_core.outputs import LLMResult, Generation
from typing import Any, Optional
import modal


class ModalLlamaLLM(BaseLLM):
    """
    LangChain-compatible wrapper around a Modal-deployed Llama 3 endpoint.
    """
    model_name: str = "llama-3-8b"
    max_new_tokens: int = 512

    @property
    def _llm_type(self) -> str:
        return "modal_llama"

    def _generate(
        self,
        prompts: list[str],
        stop: Optional[list[str]] = None,
        **kwargs: Any
    ) -> LLMResult:
        # Get a reference to the deployed Modal class.
        # Note: the stop argument is accepted but not applied in this wrapper.
        LlamaLangChain = modal.Cls.from_name("langchain-llama3", "LlamaLangChain")
        llama = LlamaLangChain()

        generations = []
        for prompt in prompts:
            response = llama.generate.remote(
                prompt,
                max_new_tokens=self.max_new_tokens
            )
            generations.append([Generation(text=response)])

        return LLMResult(generations=generations)

    async def _agenerate(
        self,
        prompts: list[str],
        stop: Optional[list[str]] = None,
        **kwargs: Any
    ) -> LLMResult:
        LlamaLangChain = modal.Cls.from_name("langchain-llama3", "LlamaLangChain")
        llama = LlamaLangChain()

        import asyncio
        tasks = [
            asyncio.to_thread(
                llama.generate.remote,
                prompt,
                self.max_new_tokens
            )
            for prompt in prompts
        ]
        responses = await asyncio.gather(*tasks)
        generations = [[Generation(text=r)] for r in responses]
        return LLMResult(generations=generations)


# Use in a standard LangChain pipeline
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ModalLlamaLLM(max_new_tokens=256)
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm | StrOutputParser()

answer = chain.invoke({"question": "What is serverless computing?"})
print(answer)
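
The wrapper above accepts a stop argument but never applies it, so chains or agents that rely on stop sequences would receive untruncated text. One way to honor it is to trim the response on the client side. The helper below is a minimal sketch with a hypothetical name; it is not a LangChain API.

```python
from typing import Optional


def apply_stop_sequences(text: str, stop: Optional[list[str]] = None) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    if not stop:
        return text
    cut = len(text)
    for sequence in stop:
        if not sequence:
            continue  # an empty string would match at index 0
        index = text.find(sequence)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]
```

Inside _generate and _agenerate this would be used as Generation(text=apply_stop_sequences(response, stop)). Trimming after the fact still pays for the tokens generated past the stop point; stopping inside the model's generate call would avoid that but requires changes on the Modal side.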

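A RAG API served from Modal. Retrieval runs inside the Modal container against a Chroma index stored on a volume, while embeddings and generation still call the OpenAI API. The index must already be populated in the chroma-db volume: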

import modal
from fastapi import FastAPI
from pydantic import BaseModel

app = modal.App("langchain-rag-api")

rag_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "langchain",
        "langchain-openai",
        "langchain-chroma",
        "chromadb",
        "fastapi",
        "uvicorn"
    )
)

chroma_volume = modal.Volume.from_name("chroma-db", create_if_missing=True)
CHROMA_DIR = "/chroma"

web_app = FastAPI()


class QueryRequest(BaseModel):
    question: str
    k: int = 4  # accepted but not used below; the retriever is built with a fixed k=4


@app.function(
    image=rag_image,
    volumes={CHROMA_DIR: chroma_volume},
    secrets=[modal.Secret.from_name("openai-secret")],
    timeout=60,
    container_idle_timeout=300
)
@modal.asgi_app()
def rag_api():
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain_chroma import Chroma
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.output_parsers import StrOutputParser

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_store = Chroma(
        collection_name="knowledge_base",
        embedding_function=embeddings,
        persist_directory=CHROMA_DIR
    )
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    prompt = ChatPromptTemplate.from_template("""
Answer based on context only.
Context: {context}
Question: {question}
    """)

    rag_chain = (
        {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
         "question": RunnablePassthrough()}
        | prompt | llm | StrOutputParser()
    )

    @web_app.post("/query")
    async def query(request: QueryRequest) -> dict:
        answer = await rag_chain.ainvoke(request.question)
        return {"answer": answer, "question": request.question}

    @web_app.get("/health")
    async def health():
        return {"status": "ok"}

    return web_app

Deploy and call it:

modal deploy app.py
# Deployed: https://your-workspace--langchain-rag-api-rag-api.modal.run

import requests

response = requests.post(
    "https://your-workspace--langchain-rag-api-rag-api.modal.run/query",
    json={"question": "How does the indexing API work?"}
)
print(response.json()["answer"])
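
The first request after the app has scaled to zero waits through a cold start, so a client with a short timeout may fail where a retry would succeed. The sketch below shows one way to wrap a call in retries with exponential backoff. It uses only the standard library, and call_with_retries is a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff.

    Waits base_delay, then 2x, then 4x, and so on between attempts,
    and re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

With the client above, the call could be wrapped as call_with_retries(lambda: requests.post(url, json=payload, timeout=60)). Retrying on every exception is deliberately broad for a sketch; a real client would likely retry only on timeouts and connection errors.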

GPU Selection Guide

GPU	VRAM	Best for	Modal cost (approx)
T4	16 GB	7B models, embeddings	$0.59/hr
A10G	24 GB	13B models, production inference	$1.10/hr
A100 (40 GB)	40 GB	34B models, fine-tuning	$3.04/hr
A100 (80 GB)	80 GB	70B models, large batches	$3.72/hr
H100	80 GB	Maximum throughput	$3.96/hr

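For most LangChain + 7B model deployments, the A10G is the sweet spot. In float16 the weights take about 2 bytes per parameter, so Llama 3 8B needs roughly 16 GB, which leaves headroom in 24 GB of VRAM for the KV cache and a few concurrent requests. The larger pairings in the table (13B on an A10G, 34B on a 40 GB A100, 70B on an 80 GB A100) fit only with 8-bit or 4-bit quantization.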
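
The sizing rule can be written as a quick back-of-the-envelope check. This is a rough sketch: it counts weights only, and the 20% headroom for the KV cache, activations, and CUDA overhead is an assumed allowance, not a measured figure.

```python
# Approximate bytes per parameter for common precisions
BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billions: float, precision: str = "float16") -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, vram_gb: float, precision: str = "float16",
         headroom_fraction: float = 0.2) -> bool:
    """True if the weights fit while leaving a fraction of VRAM free.

    headroom_fraction is an assumed allowance; real needs vary with
    batch size and context length.
    """
    usable_gb = vram_gb * (1 - headroom_fraction)
    return weight_memory_gb(params_billions, precision) <= usable_gb


print(weight_memory_gb(8))     # Llama 3 8B in float16
print(fits(8, 24))             # 8B in float16 on an A10G
print(fits(13, 24))            # 13B in float16 on an A10G
print(fits(13, 24, "4bit"))    # 13B in 4-bit on an A10G
```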

Serverless AI Platform Comparison

Platform	GPU support	Scale to zero	Cold start	Container image	Best for
Modal	Yes (T4–H100)	Yes	5–30s	Full control	ML inference, research
AWS Lambda	No	Yes	~100ms	Limited (10 GB)	CPU-only pipelines
Google Cloud Run	Limited (e.g. NVIDIA L4)	Yes	~500ms (CPU)	Full control	Containerized APIs
Replicate	Yes	Yes	10–60s	Pre-built models	Hosted model APIs
RunPod	Yes	Partial	Manual	Full control	Long-running training
Banana (defunct)	Yes	Yes	2–10s	Limited	Was: fast inference

Modal's combination of GPU support, true scale-to-zero, and container flexibility makes it uniquely suited for LangChain workloads that use local models.

Optimizing Cold Start Performance

Cold starts are the main operational challenge with serverless GPU deployments:

@app.cls(
    image=gpu_image,
    gpu="A10G",
    volumes={MODEL_DIR: model_volume},
    # Keep containers warm between requests
    container_idle_timeout=300,  # 5 minutes
    # Let each container process up to 10 inputs at once, so bursts
    # do not immediately trigger new cold-starting containers
    allow_concurrent_inputs=10,
    # Increase memory to avoid OOM on model loading
    memory=32768  # 32 GB (the value is in MiB)
)
class OptimizedLlamaEndpoint:
    @modal.enter()
    def load(self):
        # Load the model with 4-bit quantization via bitsandbytes:
        # the weights take roughly a quarter of their float16 size,
        # usually at some cost in output quality and per-token speed
        from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
        import torch

        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4"
        )

        model_path = f"{MODEL_DIR}/{MODEL_ID}"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            quantization_config=quantization_config,
            device_map="auto"
        )
        print("Model loaded with 4-bit quantization")

With container_idle_timeout=300, Modal keeps a container warm for 5 minutes after its last request. If requests arrive well above one per 5 minutes, cold starts become rare; the trade-off is that a warm, idle container is still billed. For the Build AI chatbot Python use case, this is usually sufficient.
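
How rare cold starts become can be estimated under a simple model. The sketch below assumes requests arrive as a Poisson process and are served by a single container, which real traffic only approximates. Under that assumption the gap between consecutive requests is exponentially distributed, so the chance that a gap outlasts the idle timeout is exp(-rate × timeout).

```python
import math


def cold_start_probability(requests_per_hour: float,
                           idle_timeout_seconds: float) -> float:
    """Chance that a request finds no warm container.

    Assumes Poisson arrivals and one container; ignores the time
    spent processing each request.
    """
    if requests_per_hour <= 0:
        return 1.0
    rate_per_second = requests_per_hour / 3600
    return math.exp(-rate_per_second * idle_timeout_seconds)


# Average of one request per 5 minutes vs. one per minute, 300 s timeout
print(round(cold_start_probability(12, 300), 3))
print(round(cold_start_probability(60, 300), 3))
```

Under this model, traffic averaging exactly one request per 5 minutes still leaves a sizeable share of requests cold, while one request per minute makes cold starts uncommon.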

Cost Estimation Script

def estimate_modal_cost(
    daily_requests: int,
    avg_generation_time_seconds: float,
    gpu_type: str = "A10G"
) -> dict:
    GPU_HOURLY_RATES = {
        "T4": 0.59,
        "A10G": 1.10,
        "A100_40": 3.04,
        "A100_80": 3.72,
        "H100": 3.96,
    }

    rate = GPU_HOURLY_RATES.get(gpu_type, 1.10)
    daily_gpu_seconds = daily_requests * avg_generation_time_seconds
    daily_cost = (daily_gpu_seconds / 3600) * rate
    monthly_cost = daily_cost * 30

    return {
        "gpu_type": gpu_type,
        "daily_requests": daily_requests,
        "avg_generation_seconds": avg_generation_time_seconds,
        "daily_gpu_hours": round(daily_gpu_seconds / 3600, 2),
        "daily_cost_usd": round(daily_cost, 2),
        "monthly_cost_usd": round(monthly_cost, 2),
    }

# 10,000 requests/day, 3 seconds each on A10G
estimate = estimate_modal_cost(10000, 3.0, "A10G")
print(f"Monthly cost estimate: ${estimate['monthly_cost_usd']}")
# Monthly cost estimate: ~$275

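Compare this to the equivalent OpenAI cost. The API bill is daily requests × tokens per request × price per token × 30 days, so it moves in direct proportion to the per-token price of the model you choose; recompute it with current pricing. Against an API bill of about $750/month, the $275/month Modal estimate would be a saving of approximately 63% at this volume.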
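
The script above counts only the seconds spent generating. A container that is kept warm between requests is billed for that idle time as well, which can dominate at steady traffic. The following sketch adds it under simplifying assumptions: a single container, requests spaced evenly through the day, and no cold-start load time. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def billed_gpu_hours_per_day(daily_requests: int,
                             avg_generation_seconds: float,
                             idle_timeout_seconds: float) -> float:
    """GPU hours billed per day, including warm-idle time.

    Each request keeps the container alive for its generation time plus
    the idle timeout; once those windows overlap, the container simply
    stays up all day, so the total is capped at 24 hours.
    """
    per_request = avg_generation_seconds + idle_timeout_seconds
    billed_seconds = min(daily_requests * per_request, SECONDS_PER_DAY)
    return billed_seconds / 3600


def monthly_cost_with_idle(daily_requests: int,
                           avg_generation_seconds: float,
                           idle_timeout_seconds: float,
                           usd_per_gpu_hour: float) -> float:
    """30-day cost for one container under the model above."""
    hours = billed_gpu_hours_per_day(
        daily_requests, avg_generation_seconds, idle_timeout_seconds
    )
    return hours * usd_per_gpu_hour * 30


# 10,000 requests/day, 3 s each, 300 s idle timeout, $1.10 per GPU-hour
print(round(monthly_cost_with_idle(10000, 3.0, 300, 1.10), 2))
```

With these inputs the container never scales down, so the estimate is that of an always-on A10G. Shortening the idle timeout lowers billed idle time at the price of more cold starts.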

For more on structuring LangChain for production, see LangChain tutorial 2025 and AI agents explained.

Frequently Asked Questions

What makes Modal different from AWS Lambda or Google Cloud Functions for AI workloads?
Standard serverless platforms (Lambda, Cloud Functions) do not support GPUs and have strict memory limits. Modal is purpose-built for ML workloads: it provides GPU access (T4, A10G, A100, H100), supports container images with large model weights, allows timeouts long enough for a GPU container to load a model, and bills per second of compute rather than for reserved capacity.

How does Modal handle model weight loading on cold starts?
Modal uses volume mounts and container image layers to cache model weights. You can pre-download model weights into the container image at build time, or use Modal volumes to persist weights across container restarts. A well-configured setup can bring a 7B parameter model from cold start to first token in under 30 seconds.

Can I run LangChain with open-source models on Modal instead of OpenAI?
Yes, that is one of Modal's primary use cases. You can run Llama 3, Mistral, Gemma 2, Qwen, or any Hugging Face model on Modal GPUs and expose it as a LangChain-compatible endpoint. This gives you full control over the model, avoids per-token costs for high-volume applications, and keeps your data on infrastructure you control.

Langchain

How to Use LangChain with Modal (Serverless GPU 2026)

⚡ Quick Answer

Deploy LangChain pipelines on Modal's serverless GPU infrastructure — run local LLMs, scale to zero, and cut inference costs with cold-start optimization.

AiTechWorlds Team May 31, 2026 10 min read

#LangChain #Modal #serverless #GPU #deployment

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Before working through this, review Deploy AI model to production for deployment fundamentals and RAG system tutorial for the RAG pipeline we will be deploying.

Traditional deployment options all have trade-offs:

OpenAI API: Easiest, but expensive at scale and subject to rate limits
AWS EC2 with GPU: Full control, but you pay for idle time and manage infrastructure
Kubernetes with GPU nodes: Powerful, but high operational overhead
Modal: Serverless GPUs, scale to zero, millisecond billing, minimal ops

Installation and Setup

pip install modal langchain langchain-openai
modal token new  # authenticates with Modal's cloud

Your Modal token is stored in ~/.modal.toml after running modal token new.

Start with a simple CPU function to understand the Modal programming model:

# app.py
import modal

app = modal.App("langchain-demo")

@app.function()
def greet(name: str) -> str:
    return f"Hello from Modal, {name}!"

@app.local_entrypoint()
def main():
    result = greet.remote("LangChain")
    print(result)

Run it:

modal run app.py
# Hello from Modal, LangChain!

The function runs in Modal's cloud, not locally. The @app.function() decorator is what makes this happen.

The simplest LangChain + Modal setup uses the OpenAI API from Modal's cloud:

import modal
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

app = modal.App("langchain-openai")

# Define the container image with required packages
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "langchain",
    "langchain-openai",
    "openai"
)

@app.function(
    image=image,
    secrets=[modal.Secret.from_name("openai-secret")],
    timeout=300
)
def run_langchain_chain(question: str) -> str:
    """
    Runs a LangChain chain on Modal's cloud infrastructure.
    """
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_template(
        "Answer concisely: {question}"
    )
    chain = prompt | llm | StrOutputParser()
    return chain.invoke({"question": question})


@app.local_entrypoint()
def main():
    # Call the remote function
    answer = run_langchain_chain.remote("What is LangChain?")
    print(answer)

The modal.Secret.from_name("openai-secret") references a secret you created in Modal's dashboard with your OpenAI API key.

This is where Modal becomes genuinely powerful. Deploy Llama 3 8B on a GPU and use it as a LangChain-compatible LLM:

import modal
from typing import Iterator

app = modal.App("langchain-llama3")

# Container image with GPU ML stack
gpu_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch",
        "transformers",
        "accelerate",
        "bitsandbytes",
        "langchain",
        "langchain-community",
        "huggingface_hub"
    )
)

# Volume to cache model weights across container starts
model_volume = modal.Volume.from_name("llama3-weights", create_if_missing=True)
MODEL_DIR = "/models"
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"


@app.function(
    image=gpu_image,
    gpu="A10G",
    volumes={MODEL_DIR: model_volume},
    timeout=600,
    secrets=[modal.Secret.from_name("hf-secret")]  # Hugging Face token
)
def download_model():
    """
    Pre-download model weights to the volume.
    Run this once before deploying.
    """
    from huggingface_hub import snapshot_download
    import os

    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=f"{MODEL_DIR}/{MODEL_ID}",
        token=os.environ["HF_TOKEN"]
    )
    print(f"Model downloaded to {MODEL_DIR}/{MODEL_ID}")


@app.cls(
    image=gpu_image,
    gpu="A10G",
    volumes={MODEL_DIR: model_volume},
    timeout=300,
    container_idle_timeout=120  # Keep container warm for 2 minutes
)
class LlamaLangChain:
    """
    A LangChain-compatible Llama 3 endpoint on Modal GPU.
    """

    @modal.build()
    def build(self):
        """Called at container build time."""
        pass

    @modal.enter()
    def load_model(self):
        """Called when container starts — loads model into GPU memory."""
        from transformers import AutoTokenizer, AutoModelForCausalLM
        import torch

        model_path = f"{MODEL_DIR}/{MODEL_ID}"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        print("Llama 3 loaded into GPU memory")

    @modal.method()
    def generate(self, prompt: str, max_new_tokens: int = 512) -> str:
        """Generate text from a prompt."""
        import torch

        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt").to("cuda")

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )

        generated = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(generated, skip_special_tokens=True)


@app.local_entrypoint()
def main():
    llama = LlamaLangChain()
    response = llama.generate.remote(
        "Explain the difference between RAG and fine-tuning in one paragraph."
    )
    print(response)

To use the Modal-deployed LLM within a standard LangChain chain, wrap it in a custom BaseLLM:

from langchain_core.language_models.llms import BaseLLM
from langchain_core.outputs import LLMResult, Generation
from typing import Any, Optional
import modal


class ModalLlamaLLM(BaseLLM):
    """
    LangChain-compatible wrapper around a Modal-deployed Llama 3 endpoint.
    """
    model_name: str = "llama-3-8b"
    max_new_tokens: int = 512

    @property
    def _llm_type(self) -> str:
        return "modal_llama"

    def _generate(
        self,
        prompts: list[str],
        stop: Optional[list[str]] = None,
        **kwargs: Any
    ) -> LLMResult:
        # Get a reference to the deployed Modal class
        LlamaLangChain = modal.Cls.from_name("langchain-llama3", "LlamaLangChain")
        llama = LlamaLangChain()

        generations = []
        for prompt in prompts:
            response = llama.generate.remote(
                prompt,
                max_new_tokens=self.max_new_tokens
            )
            generations.append([Generation(text=response)])

        return LLMResult(generations=generations)

    async def _agenerate(
        self,
        prompts: list[str],
        stop: Optional[list[str]] = None,
        **kwargs: Any
    ) -> LLMResult:
        LlamaLangChain = modal.Cls.from_name("langchain-llama3", "LlamaLangChain")
        llama = LlamaLangChain()

        import asyncio
        tasks = [
            asyncio.to_thread(
                llama.generate.remote,
                prompt,
                self.max_new_tokens
            )
            for prompt in prompts
        ]
        responses = await asyncio.gather(*tasks)
        generations = [[Generation(text=r)] for r in responses]
        return LLMResult(generations=generations)


# Use in a standard LangChain pipeline
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ModalLlamaLLM(max_new_tokens=256)
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm | StrOutputParser()

answer = chain.invoke({"question": "What is serverless computing?"})
print(answer)

A complete RAG API that runs entirely on Modal:

import modal
from fastapi import FastAPI
from pydantic import BaseModel

app = modal.App("langchain-rag-api")

rag_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "langchain",
        "langchain-openai",
        "langchain-chroma",
        "chromadb",
        "fastapi",
        "uvicorn"
    )
)

chroma_volume = modal.Volume.from_name("chroma-db", create_if_missing=True)
CHROMA_DIR = "/chroma"

web_app = FastAPI()


class QueryRequest(BaseModel):
    question: str
    k: int = 4


@app.function(
    image=rag_image,
    volumes={CHROMA_DIR: chroma_volume},
    secrets=[modal.Secret.from_name("openai-secret")],
    timeout=60,
    container_idle_timeout=300
)
@modal.asgi_app()
def rag_api():
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain_chroma import Chroma
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.output_parsers import StrOutputParser

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_store = Chroma(
        collection_name="knowledge_base",
        embedding_function=embeddings,
        persist_directory=CHROMA_DIR
    )
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    prompt = ChatPromptTemplate.from_template("""
Answer based on context only.
Context: {context}
Question: {question}
    """)

    rag_chain = (
        {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
         "question": RunnablePassthrough()}
        | prompt | llm | StrOutputParser()
    )

    @web_app.post("/query")
    async def query(request: QueryRequest) -> dict:
        answer = await rag_chain.ainvoke(request.question)
        return {"answer": answer, "question": request.question}

    @web_app.get("/health")
    async def health():
        return {"status": "ok"}

    return web_app

Deploy and call it:

modal deploy app.py
# Deployed: https://your-workspace--langchain-rag-api-rag-api.modal.run

import requests

response = requests.post(
    "https://your-workspace--langchain-rag-api-rag-api.modal.run/query",
    json={"question": "How does the indexing API work?"}
)
print(response.json()["answer"])

GPU Selection Guide

GPU	VRAM	Best for	Modal cost (approx)
T4	16 GB	7B models, embeddings	$0.59/hr
A10G	24 GB	13B models, production inference	$1.10/hr
A100 (40 GB)	40 GB	34B models, fine-tuning	$3.04/hr
A100 (80 GB)	80 GB	70B models, large batches	$3.72/hr
H100	80 GB	Maximum throughput	$3.96/hr

For most LangChain + 7B model deployments, the A10G is the sweet spot. It handles concurrent requests without GPU memory pressure and the 24GB VRAM comfortably fits Llama 3 8B in float16.

Serverless AI Platform Comparison

Platform	GPU support	Scale to zero	Cold start	Container image	Best for
Modal	Yes (T4–H100)	Yes	5–30s	Full control	ML inference, research
AWS Lambda	No	Yes	~100ms	Limited (10 GB)	CPU-only pipelines
Google Cloud Run	No	Yes	~500ms	Full control	CPU-only APIs
Replicate	Yes	Yes	10–60s	Pre-built models	Hosted model APIs
RunPod	Yes	Partial	Manual	Full control	Long-running training
Banana (defunct)	Yes	Yes	2–10s	Limited	Was: fast inference

Modal's combination of GPU support, true scale-to-zero, and container flexibility makes it uniquely suited for LangChain workloads that use local models.

Optimizing Cold Start Performance

Cold starts are the main operational challenge with serverless GPU deployments:

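@app.cls(
    image=gpu_image,
    gpu="A10G",
    volumes={MODEL_DIR: model_volume},
    # Keep an idle container alive for 5 minutes after its last request,
    # so closely spaced requests skip the cold start
    # (recent Modal releases call this parameter scaledown_window)
    container_idle_timeout=300,
    # Let one container serve up to 10 requests at once; this reduces how
    # often new containers must cold-start under load, but does not pre-scale
    allow_concurrent_inputs=10,
    # Request 32 GB of CPU RAM (value is in MiB) to avoid OOM while loading the model
    memory=32768
)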
class OptimizedLlamaEndpoint:
    @modal.enter()
    def load(self):
        # Quantize weights to 4-bit (NF4) with bitsandbytes: roughly a quarter
        # of the float16 weight footprint, with compute still done in float16
        from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
        import torch

        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4"
        )

        model_path = f"{MODEL_DIR}/{MODEL_ID}"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            quantization_config=quantization_config,
            device_map="auto"
        )
        print("Model loaded with 4-bit quantization")
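
How much `container_idle_timeout=300` helps depends on traffic. As a rough sketch, assume requests arrive as a Poisson process spread evenly over the day and are served by a single container; both are simplifying assumptions, and `cold_start_fraction` is a hypothetical helper. The chance that the gap before a request exceeds the timeout is then exp(-rate × timeout).

```python
import math


def cold_start_fraction(daily_requests: int, idle_timeout_s: float) -> float:
    """Estimated share of requests that hit a cold start.

    Assumes Poisson arrivals at a constant rate and a single container, so
    gaps between requests are exponentially distributed and
    P(gap > timeout) = exp(-rate * timeout).
    """
    rate_per_s = daily_requests / 86_400  # 86,400 seconds per day
    return math.exp(-rate_per_s * idle_timeout_s)


for daily in (100, 1_000, 10_000):
    share = cold_start_fraction(daily, idle_timeout_s=300)
    print(f"{daily:>6} requests/day -> {share:.1%} cold starts")
```

Under this model a 5-minute timeout nearly eliminates cold starts at 10,000 requests per day, while at 100 requests per day most requests still start cold. Real traffic is burstier than a Poisson process, so treat the output as an order-of-magnitude guide.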

Cost Estimation Script

def estimate_modal_cost(
    daily_requests: int,
    avg_generation_time_seconds: float,
    gpu_type: str = "A10G"
) -> dict:
    GPU_HOURLY_RATES = {
        "T4": 0.59,
        "A10G": 1.10,
        "A100_40": 3.04,
        "A100_80": 3.72,
        "H100": 3.96,
    }

    rate = GPU_HOURLY_RATES.get(gpu_type, 1.10)
    daily_gpu_seconds = daily_requests * avg_generation_time_seconds
    daily_cost = (daily_gpu_seconds / 3600) * rate
    monthly_cost = daily_cost * 30

    return {
        "gpu_type": gpu_type,
        "daily_requests": daily_requests,
        "avg_generation_seconds": avg_generation_time_seconds,
        "daily_gpu_hours": round(daily_gpu_seconds / 3600, 2),
        "daily_cost_usd": round(daily_cost, 2),
        "monthly_cost_usd": round(monthly_cost, 2),
    }

# 10,000 requests/day, 3 seconds each on A10G
# Note: this counts active GPU seconds only, not idle time or cold starts
estimate = estimate_modal_cost(10000, 3.0, "A10G")
print(f"Monthly cost estimate: ${estimate['monthly_cost_usd']}")
# Monthly cost estimate: $275.0

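Compare this to the equivalent OpenAI cost, assuming a blended rate of about $5 per million tokens ($0.000005/token): 10,000 requests/day × 500 tokens each × 30 days × $0.000005/token ≈ $750/month. Against the roughly $275/month Modal estimate, the self-hosted deployment saves approximately 63% at this volume.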
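
The comparison can be reproduced with a few lines. This sketch assumes a blended API price of $5 per million tokens, a hypothetical figure chosen because it yields roughly $750/month at this volume; substitute your provider's actual rate. The function names are illustrative.

```python
def api_monthly_cost(
    daily_requests: int,
    tokens_per_request: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Monthly cost of a per-token API at a flat blended rate."""
    monthly_tokens = daily_requests * tokens_per_request * days
    return monthly_tokens / 1_000_000 * usd_per_million_tokens


def savings_percent(api_cost: float, self_hosted_cost: float) -> float:
    """Savings of self-hosting relative to the API bill, in percent."""
    return (api_cost - self_hosted_cost) / api_cost * 100


api_cost = api_monthly_cost(10_000, 500, usd_per_million_tokens=5.0)
modal_cost = 275.0  # monthly figure from the estimate above
print(f"API: ${api_cost:.2f}/month")                              # API: $750.00/month
print(f"Savings: {savings_percent(api_cost, modal_cost):.1f}%")   # Savings: 63.3%
```

Both costs scale linearly with request volume here, so the percentage saved is driven by the per-token price and the seconds of GPU time per request rather than by volume alone.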

For more on structuring LangChain for production, see LangChain tutorial 2025 and AI agents explained.
