How to Use LangChain with Hugging Face (Local LLMs 2026)
Run open source LLMs locally with LangChain and Hugging Face. Complete guide covering HuggingFacePipeline, Llama, Mistral, and sentence-transformers embeddings.
I spent a week running every major open source model through LangChain before fully appreciating why local LLMs matter. No API costs, no rate limits, no data leaving your machine. If you're building something with sensitive data or need to run thousands of inference calls without a bill that scales with usage, local models are worth the setup time.
This guide shows you exactly how to connect Hugging Face models to LangChain — from loading your first model to building a full RAG pipeline that runs entirely on your own hardware.
If you're newer to LangChain and want the hosted model path first, LangChain tutorial 2025 covers OpenAI integration before local models. Come back here once you're ready to go self-hosted.
Why Run LLMs Locally Through LangChain
The case for local LLMs has gotten much stronger over the past two years. Models like Llama 3, Mistral 7B, and Phi-3 have closed the quality gap with GPT-3.5 significantly. Running them locally through LangChain gives you:
- Zero marginal cost after hardware — no per-token charges
- Complete data privacy — nothing leaves your machine or network
- No rate limits — run as many requests as your hardware allows
- Offline capability — works without internet after initial model download
- Full control — fine-tune, quantize, or modify models as needed
According to the 2025 State of AI report from a16z, enterprises with strict data governance requirements are driving 60%+ of the demand for on-premise LLM deployments.
Setting Up Your Environment
You'll need either a Python virtual environment or conda. The dependencies are heavier than cloud-based setups:
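pip install langchain langchain-community langchain-huggingface langchain-text-splitters
pip install transformers accelerate bitsandbytes
pip install sentence-transformers faiss-cpu  # faiss-cpu backs the FAISS vector store used in the RAG section
# GPU build of PyTorch (CUDA 12.1 wheels):
pip install torch --index-url https://download.pytorch.org/whl/cu121
# For CPU-only:
pip install torch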
Check your hardware setup before loading any model:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("Running on CPU")
# MPS (Apple Silicon)
print(f"MPS available: {torch.backends.mps.is_available()}")
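If you'd rather make that decision once and reuse it, the checks above can be folded into a small helper. This is only a sketch: `pick_device` is a hypothetical name, not something LangChain or PyTorch provides, and falling back to "cpu" when PyTorch isn't installed is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed we can only report CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can then be passed wherever a device name is expected, for example in the `model_kwargs={"device": ...}` argument used for embeddings later in this guide.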
Loading Your First Model: HuggingFacePipeline
HuggingFacePipeline is the primary LangChain wrapper for local Hugging Face models. It loads a model with the transformers pipeline API and exposes it as a standard LangChain LLM.
from langchain_huggingface import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
model_id = "microsoft/Phi-3-mini-4k-instruct"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # float16 halves memory usage
    device_map="auto",          # automatically use GPU if available
    trust_remote_code=True,     # required for some models
)
# Create pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    do_sample=True,
    repetition_penalty=1.1,
)
# Wrap in LangChain
llm = HuggingFacePipeline(pipeline=pipe)
# Test it
response = llm.invoke("Explain what a transformer neural network is in 3 sentences.")
print(response)
The device_map="auto" setting matters: it places model layers on whatever GPUs are available and offloads to CPU when there is no GPU or not enough VRAM. Phi-3-mini has 3.8B parameters, so its float16 weights take roughly 7.6GB. That is a tight fit on an 8GB card once activations and the KV cache are counted, so consider 4-bit quantization (covered next) if you hit out-of-memory errors.
Loading Mistral 7B and Llama 3
For production-quality output, 7B+ models deliver a much better experience. Here's how to load Mistral 7B Instruct:
from langchain_huggingface import HuggingFacePipeline
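from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# 4-bit quantization via bitsandbytes: cuts memory to ~5GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    # Passing load_in_4bit=True directly to from_pretrained is deprecated
    # in recent transformers releases; use quantization_config instead.
    quantization_config=bnb_config,
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.1,
    do_sample=True,
    return_full_text=False,  # return only new tokens, not the prompt
)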
mistral_llm = HuggingFacePipeline(pipeline=pipe)
The load_in_4bit=True flag (powered by bitsandbytes) is a game changer for fitting larger models on consumer hardware. A 7B model that normally needs 14GB in float16 fits in about 5GB with 4-bit quantization, with only a small quality drop.
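Those memory figures follow from simple arithmetic: parameter count times bits per parameter. Here is a rough sketch of that calculation. The helper name `weight_memory_gb` is made up for illustration, and "7B" is treated as exactly 7e9 parameters, which is an approximation.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # nominal "7B"; the exact count differs slightly per model
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)

print(f"float16 weights: {fp16_gb:.1f} GB")  # 14.0 GB
print(f"4-bit weights:   {int4_gb:.1f} GB")  # 3.5 GB
```

The 3.5GB result covers the weights only. The gap up to the roughly 5GB you see in practice comes from things this sketch ignores, such as quantization metadata, any layers kept in higher precision, activations, and the KV cache.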
For Llama 3 8B, the process is nearly identical but you need to accept Meta's license on Hugging Face Hub first:
# After accepting license at huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# Same loading pattern as Mistral
tokenizer = AutoTokenizer.from_pretrained(model_id, token="hf_your_token")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Llama 3 prefers bfloat16
    device_map="auto",
    token="hf_your_token",
)
Prompt Templates for Instruction Models
Chat-tuned models expect specific prompt formats. Using the wrong format gives noticeably worse output. Each model family has its own template:
from langchain_core.prompts import PromptTemplate
# Mistral instruct format
mistral_prompt = PromptTemplate.from_template(
    "[INST] {question} [/INST]"
)
# Llama 3 instruct format
llama3_prompt = PromptTemplate.from_template(
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
# Phi-3 instruct format
phi3_prompt = PromptTemplate.from_template(
    "<|user|>\n{question}<|end|>\n<|assistant|>"
)
# Use in a chain (LCEL pipe syntax)
chain = mistral_prompt | mistral_llm
result = chain.invoke({"question": "What are the main differences between Python 3.11 and 3.12?"})
print(result)
The easiest way to get the right template is to use the tokenizer's apply_chat_template() method:
def format_prompt(tokenizer, user_message: str, system_message: str = "") -> str:
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
formatted = format_prompt(
    tokenizer,
    "Explain quantum entanglement to a 10-year-old",
    "You are a patient science teacher."
)
print(formatted)
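If you can't load a tokenizer (say, you're just building prompt strings in a preprocessing script), a plain lookup table is a workable fallback. The sketch below reuses the three template strings from this section. `TEMPLATES` and `build_prompt` are hypothetical names, and this approach doesn't handle system messages or multi-turn history the way apply_chat_template() does.

```python
# Template strings copied from the PromptTemplate examples in this section.
TEMPLATES = {
    "mistral": "[INST] {question} [/INST]",
    "llama3": (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        "{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    ),
    "phi3": "<|user|>\n{question}<|end|>\n<|assistant|>",
}


def build_prompt(family: str, question: str) -> str:
    """Wrap a single user question in the format for the given model family."""
    try:
        template = TEMPLATES[family]
    except KeyError:
        raise ValueError(f"No template registered for {family!r}") from None
    return template.format(question=question)


print(build_prompt("mistral", "What is RAG?"))
print(build_prompt("phi3", "What is RAG?"))
```

Failing loudly on an unknown family is deliberate here: silently sending an unformatted prompt to an instruction model is exactly the "noticeably worse output" case described above.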
Local Embeddings with sentence-transformers
For RAG pipelines, you need embeddings too. Running embeddings locally with sentence-transformers is fast and free:
import torch
from langchain_huggingface import HuggingFaceEmbeddings
# Fast and accurate for English
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)
# Test
texts = [
    "LangChain is a framework for building LLM applications",
    "Python is a popular programming language",
]
vectors = embeddings.embed_documents(texts)
print(f"Embedding dimension: {len(vectors[0])}")  # 384 for MiniLM
For multilingual tasks:
multilingual_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
For the highest quality English embeddings locally:
high_quality_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)
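The normalize_embeddings=True option used above is worth a quick illustration. Once vectors are scaled to unit length, a plain dot product gives the same number as cosine similarity, which is why normalized embeddings work well with inner-product indexes. The toy 2-D vectors below are invented for the example; real embeddings have hundreds of dimensions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(f"cosine similarity:       {cos_ab:.2f}")          # 0.96
print(f"dot of unit-length vecs: {dot_normalized:.2f}")  # 0.96
```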
Building a Fully Local RAG Pipeline
Here's a complete RAG system using only local models — no API calls, no internet required after setup:
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS  # requires: pip install faiss-cpu
from langchain_community.document_loaders import TextLoader
# The splitter now lives in the langchain-text-splitters package;
# the old langchain.text_splitter path is deprecated.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch
# --- 1. Load and chunk documents ---
loader = TextLoader("company_docs.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")
# --- 2. Create local embeddings and vector store ---
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
)
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(
    search_type="mmr",  # maximum marginal relevance
    search_kwargs={"k": 4, "fetch_k": 10},
)
# --- 3. Load local LLM ---
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    # quantization_config replaces the deprecated load_in_4bit=True keyword
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    return_full_text=False,
)
llm = HuggingFacePipeline(pipeline=pipe)
# --- 4. Build RAG chain ---
rag_prompt = PromptTemplate.from_template(
    "[INST] Answer the question based ONLY on the following context. "
    "If the answer is not in the context, say 'I don't know from the provided documents.'\n\n"
    "Context:\n{context}\n\n"
    "Question: {question} [/INST]"
)
def format_docs(docs):
    # Join the retrieved chunks into one context string
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
# --- 5. Query it ---
answer = rag_chain.invoke("What is the company's refund policy?")
print(answer)
This connects naturally with the vector database guide if you want to replace FAISS with a persistent vector store like ChromaDB or Qdrant.
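The retriever above uses search_type="mmr" (maximum marginal relevance), which deserves a closer look because it changes which chunks reach the model. The idea is to fetch a pool of candidates (fetch_k), then greedily pick k of them, rewarding similarity to the query and penalizing similarity to chunks already picked. The following is a simplified pure-Python sketch of that idea, not LangChain's actual implementation. The function names, the toy 2-D vectors, and the 0.5 trade-off weight are all assumptions for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return indices of k docs, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already chosen (0 if nothing chosen yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


def unit(angle_deg):
    """Toy 2-D 'embedding' pointing at the given angle."""
    rad = math.radians(angle_deg)
    return [math.cos(rad), math.sin(rad)]


query = unit(0)
docs = [unit(20), unit(25), unit(-40)]  # docs 0 and 1 are near-duplicates

print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With lambda_mult=1.0 the selection is pure similarity search and returns the two near-duplicates. At 0.5 the second slot goes to the less similar but more distinct document, which tends to give the LLM broader context for the same number of chunks.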
Using GGUF Models with llama.cpp
For CPU inference or when you want the most memory-efficient option, GGUF-quantized models via llama.cpp are significantly faster than standard PyTorch inference on CPU:
pip install llama-cpp-python
# If you have an NVIDIA GPU (older llama-cpp-python releases used -DLLAMA_CUBLAS=on instead):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
# Download a GGUF model first (--local-dir . puts the file where model_path expects it):
# huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
#   mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .
llm = LlamaCpp(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    temperature=0.1,
    max_tokens=512,
    n_gpu_layers=35,  # layers to offload to GPU (0 = CPU only)
    n_ctx=4096,       # context window
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=False,
)
# Use exactly like any other LangChain LLM
result = llm.invoke("What is the capital of France?")
The Q4_K_M quantization is a good balance — it cuts a 7B model from ~14GB to ~4.5GB with minimal quality loss.
Batching for Throughput
When you need to process many documents or queries, batching is much more efficient than calling the model one at a time:
from langchain_huggingface import HuggingFacePipeline
# Batched generation pads inputs to equal length, so the tokenizer needs a
# pad token; some tokenizers ship without one.
if pipe.tokenizer.pad_token is None:
    pipe.tokenizer.pad_token = pipe.tokenizer.eos_token
# Enable batching
llm_batch = HuggingFacePipeline(
    pipeline=pipe,
    batch_size=4,  # process 4 inputs simultaneously
)
# Process multiple queries at once
questions = [
    "What is machine learning?",
    "Explain neural networks",
    "What is backpropagation?",
    "Define gradient descent",
    "What is an activation function?",
    "Explain attention mechanism",
]
# batch() sends the prompts through the pipeline in groups of batch_size
responses = llm_batch.batch(questions)
for q, r in zip(questions, responses):
    print(f"Q: {q}")
    print(f"A: {r[:100]}...")
    print()
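To see what batch_size=4 means for the six questions above, here is a tiny sketch of the grouping. `chunked` is a hypothetical helper written for this illustration; it mirrors the general idea of slicing a list into fixed-size groups rather than reproducing LangChain's internals.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


questions = [
    "What is machine learning?",
    "Explain neural networks",
    "What is backpropagation?",
    "Define gradient descent",
    "What is an activation function?",
    "Explain attention mechanism",
]

batches = list(chunked(questions, 4))
print([len(batch) for batch in batches])  # [4, 2]
```

Six prompts become one full group of four and a partial group of two. If VRAM is tight, lowering batch_size trades throughput for a smaller peak memory footprint.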
Model Comparison Table
| Model | Parameters | VRAM (4-bit) | Speed (tok/s GPU) | Best For |
|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | ~1GB | 120+ | Experiments, CPU |
| Phi-3-mini | 3.8B | ~2.5GB | 80+ | Reasoning, low VRAM |
| Mistral 7B Instruct | 7B | ~4.5GB | 50+ | General tasks |
| Llama 3 8B Instruct | 8B | ~5GB | 45+ | Instruction following |
| Llama 3 70B Instruct | 70B | ~40GB | 12+ | Near-GPT-4 quality |
| Mixtral 8x7B | ~47B total (~13B active per token) | ~26GB | 18+ | Highest quality local |
Speeds are approximate on an RTX 3090 with 4-bit quantization. CPU speeds will be 5-10x slower.
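One way to put the table to work is a quick lookup that suggests the biggest model your card can hold. The sketch below copies the 4-bit VRAM figures from the table. The function name and the 1.5GB headroom reserved for activations and the KV cache are assumptions you should tune for your own workload.

```python
# (model, approximate 4-bit VRAM in GB), taken from the table above
MODELS_4BIT_VRAM_GB = [
    ("TinyLlama 1.1B", 1.0),
    ("Phi-3-mini", 2.5),
    ("Mistral 7B Instruct", 4.5),
    ("Llama 3 8B Instruct", 5.0),
    ("Mixtral 8x7B", 26.0),
    ("Llama 3 70B Instruct", 40.0),
]


def largest_model_that_fits(vram_gb, headroom_gb=1.5):
    """Return the most VRAM-hungry model that still fits, or None."""
    budget = vram_gb - headroom_gb
    fitting = [(name, need) for name, need in MODELS_4BIT_VRAM_GB if need <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda item: item[1])[0]


print(largest_model_that_fits(8))   # Llama 3 8B Instruct
print(largest_model_that_fits(24))  # Llama 3 8B Instruct
print(largest_model_that_fits(48))  # Llama 3 70B Instruct
print(largest_model_that_fits(2))   # None
```

Note the jump between 5GB and 26GB: a 24GB card lands on the same model as an 8GB card by this measure, though the extra memory still buys longer contexts and larger batches.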
Switching Between Local and Cloud Models
One of the most useful patterns is building your code so you can swap between local and cloud models with a single variable change:
import os
from enum import Enum
class ModelBackend(Enum):
    OPENAI = "openai"
    MISTRAL_LOCAL = "mistral_local"
    LLAMA_LOCAL = "llama_local"
def get_llm(backend: ModelBackend):
    # Imports are kept inside each branch so unused backends need not be installed
    if backend == ModelBackend.OPENAI:
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini", temperature=0)
    elif backend == ModelBackend.MISTRAL_LOCAL:
        from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
        from langchain_huggingface import HuggingFacePipeline
        import torch
        model_id = "mistralai/Mistral-7B-Instruct-v0.3"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.float16, device_map="auto",
            quantization_config=BitsAndBytesConfig(
                load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
        )
        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                        max_new_tokens=512, return_full_text=False)
        return HuggingFacePipeline(pipeline=pipe)
    elif backend == ModelBackend.LLAMA_LOCAL:
        from langchain_community.llms import LlamaCpp
        return LlamaCpp(model_path="./llama-3-8b.Q4_K_M.gguf",
                        n_gpu_layers=35, n_ctx=4096)
    raise ValueError(f"Unsupported backend: {backend}")
# Swap backends with one line
BACKEND = os.getenv("LLM_BACKEND", "openai")
llm = get_llm(ModelBackend(BACKEND))
This pairs well with the approach in OpenAI API integration — start with OpenAI during prototyping, then switch to a local model for production data privacy.
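One wrinkle with swapping backends: chat models such as ChatOpenAI return a message object with the text in its `.content` attribute, while HuggingFacePipeline and LlamaCpp return plain strings. Inside a chain, ending with StrOutputParser() (as in the RAG example) smooths this over. For direct invoke() calls, a small normalizer does the same job. The sketch below is illustrative only: `to_text` is a hypothetical helper, and `FakeChatMessage` is a stand-in so the snippet runs without any model installed.

```python
from dataclasses import dataclass


@dataclass
class FakeChatMessage:
    """Stand-in for a chat model reply, which carries its text in `.content`."""
    content: str


def to_text(result) -> str:
    """Return the generated text whether the backend gave a str or a message."""
    if isinstance(result, str):
        return result
    content = getattr(result, "content", None)
    if isinstance(content, str):
        return content
    raise TypeError(f"Don't know how to extract text from {type(result).__name__}")


print(to_text("Paris"))                   # local LLM style output
print(to_text(FakeChatMessage("Paris")))  # chat model style output
```

With this in place, downstream code can call `to_text(llm.invoke(prompt))` and stay the same regardless of which backend the environment variable selects.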
Fine-tuning a Model with LoRA
For domain-specific tasks, a few hours of fine-tuning often beats prompt engineering alone:
# Requires: pip install peft trl datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
# Recommended PEFT step before attaching adapters to a quantized model
model = prepare_model_for_kbit_training(model)
# LoRA configuration: trains only a small adapter
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank: higher = more trainable parameters
    lora_alpha=32,      # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
)
model = get_peft_model(model, lora_config)
# Prints trainable vs. total parameter counts (well under 1% with this config)
model.print_trainable_parameters()
training_args = TrainingArguments(
    output_dir="./fine-tuned-mistral",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=50,
    save_strategy="epoch",
)
# Trainer handles the training loop.
# Note: SFTTrainer argument names differ across trl releases; newer versions
# expect processing_class= instead of tokenizer= and take the sequence
# length through SFTConfig. Check the docs for your installed version.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # placeholder: a Hugging Face Dataset you supply
    tokenizer=tokenizer,
    max_seq_length=2048,
)
trainer.train()
trainer.save_model()
After training, load the adapter on top of the base model:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
fine_tuned = PeftModel.from_pretrained(base_model, "./fine-tuned-mistral")
The Hugging Face transformers tutorial covers the full fine-tuning workflow in more depth.
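It helps to see why a LoRA adapter is so small. For each adapted weight matrix of shape d_out × d_in, LoRA trains two thin matrices (r × d_in and d_out × r), so the count is r × (d_in + d_out). The sketch below applies that to the q_proj/v_proj setup from this section. The layer dimensions are my reading of Mistral 7B's published config (hidden size 4096, 32 layers, 8 key/value heads of dimension 128) and should be treated as assumptions. The output of print_trainable_parameters() on your machine is the authoritative number.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


# Assumed Mistral 7B dimensions (see lead-in)
hidden_size = 4096
num_layers = 32
kv_dim = 8 * 128  # grouped-query attention: v_proj maps 4096 -> 1024
rank = 16

per_layer = (
    lora_params(hidden_size, hidden_size, rank)  # q_proj: 4096 -> 4096
    + lora_params(hidden_size, kv_dim, rank)     # v_proj: 4096 -> 1024
)
total_trainable = per_layer * num_layers

print(f"Per layer: {per_layer:,}")        # 212,992
print(f"Total:     {total_trainable:,}")  # 6,815,744
```

Under these assumptions the adapter is a few million parameters against roughly seven billion in the base model, which is why LoRA checkpoints are measured in megabytes rather than gigabytes.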
Monitoring Performance and Memory
When running local models, keeping an eye on resource usage is important:
import time
from functools import wraps
import psutil
import torch
def profile_inference(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Reset the GPU peak-memory counter so the reading covers only this call
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if torch.cuda.is_available():
            peak_mem = torch.cuda.max_memory_allocated() / 1e9
            print(f"GPU peak memory: {peak_mem:.2f} GB")
        # Resident memory of this Python process (useful for CPU inference)
        rss = psutil.Process().memory_info().rss / 1e9
        print(f"Process RAM: {rss:.2f} GB")
        print(f"Inference time: {elapsed:.2f}s")
        return result
    return wrapper
@profile_inference
def run_inference(prompt: str) -> str:
    return llm.invoke(prompt)
result = run_inference("Summarize the history of artificial intelligence")
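Wall-clock time alone is hard to compare across prompts, so it's often more useful to track tokens per second, the same unit used in the comparison table. Here's a minimal sketch. `measure_throughput` is a hypothetical helper, and `fake_generate` with its whitespace word count is a stand-in so the example runs without a model. With a real setup you'd pass something like `llm.invoke` and a tokenizer-based counter instead.

```python
import time


def measure_throughput(generate, prompt, count_tokens):
    """Time one generation call and report output tokens per second."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    n_tokens = count_tokens(text)
    rate = n_tokens / elapsed if elapsed > 0 else float("inf")
    return {"tokens": n_tokens, "seconds": elapsed, "tokens_per_second": rate}


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call: sleeps briefly and returns fixed text."""
    time.sleep(0.01)
    return "local models are practical " * 5


stats = measure_throughput(
    fake_generate,
    "Summarize the history of artificial intelligence",
    count_tokens=lambda text: len(text.split()),  # crude: whitespace words
)
print(f"{stats['tokens']} tokens in {stats['seconds']:.3f}s "
      f"({stats['tokens_per_second']:.0f} tok/s)")
```

Counting whitespace-separated words undercounts real tokens, so treat numbers from this stand-in as illustrative. For figures comparable to the table, count with the model's own tokenizer.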
Conclusion
Running LangChain with local Hugging Face models is genuinely practical in 2026. Models like Mistral 7B and Llama 3 8B deliver quality that was GPT-3.5 territory just two years ago, and they run on hardware most developers already have access to.
Start with Phi-3-mini if you have limited VRAM or want something quick to test. Move to Mistral 7B or Llama 3 8B Instruct with 4-bit quantization for production-quality work. Use GGUF models via llama.cpp for CPU deployments or maximum memory efficiency.
The combination of local models, sentence-transformers embeddings, and FAISS gives you a complete AI stack that runs on-premise without any external API dependencies — and that's a real advantage for privacy-sensitive applications. Check out Build AI agent with LangChain to see how to wrap these local models in full agent architectures.
Frequently Asked Questions
Can I run a Hugging Face model locally without a GPU?
Yes, but with limitations. Smaller models like Phi-3-mini (3.8B parameters) and TinyLlama (1.1B) run acceptably on a modern CPU. Quantized versions (GGUF format via llama.cpp) help significantly. For anything above 7B parameters, you'll want at least an 8GB GPU for reasonable performance.
What is HuggingFacePipeline in LangChain?
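HuggingFacePipeline is a LangChain wrapper around the Hugging Face transformers pipeline API. It loads a model locally and exposes it as a standard LangChain LLM, so it drops into the same chains you'd build around a hosted model, with no API calls. It is a text-completion LLM that returns plain strings; if your code expects a chat-model interface like ChatOpenAI's, wrap it in ChatHuggingFace from the same langchain-huggingface package.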
Which local model is best for RAG and Q&A tasks?
Mistral 7B Instruct and Llama 3 8B Instruct are currently the strongest 7-8B models for instruction following and RAG. If you need something smaller, Phi-3-mini (3.8B) punches well above its weight on reasoning. For embeddings, all-MiniLM-L6-v2 is fast and accurate enough for most RAG pipelines.
AiTechWorlds Team