How to Use LangChain with Hugging Face (Local LLMs 2026)

⚡ Quick Answer

Run open source LLMs locally with LangChain and Hugging Face. Complete guide covering HuggingFacePipeline, Llama, Mistral, and sentence-transformers embeddings.

AiTechWorlds Team May 31, 2026 12 min read

I spent a week running every major open source model through LangChain before fully appreciating why local LLMs matter. No API costs, no rate limits, no data leaving your machine. If you're building something with sensitive data or need to run thousands of inference calls without a bill that scales with usage, local models are worth the setup time.

This guide shows you exactly how to connect Hugging Face models to LangChain — from loading your first model to building a full RAG pipeline that runs entirely on your own hardware.

If you're new to LangChain and want the hosted-model path first, the LangChain tutorial 2025 covers OpenAI integration. Come back here once you're ready to go self-hosted.

Why Run LLMs Locally Through LangChain

The case for local LLMs has gotten much stronger over the past two years. Models like Llama 3, Mistral 7B, and Phi-3 have closed the quality gap with GPT-3.5 significantly. Running them locally through LangChain gives you:

Zero marginal cost after hardware — no per-token charges
Complete data privacy — nothing leaves your machine or network
No rate limits — run as many requests as your hardware allows
Offline capability — works without internet after initial model download
Full control — fine-tune, quantize, or modify models as needed

According to the 2025 State of AI report from a16z, enterprises with strict data governance requirements are driving 60%+ of the demand for on-premise LLM deployments.

Setting Up Your Environment

You'll need either a Python virtual environment or conda. The dependencies are heavier than cloud-based setups:

pip install langchain langchain-community langchain-huggingface
pip install transformers accelerate bitsandbytes
pip install sentence-transformers
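pip install torch --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1
# For CPU-only (Linux/Windows; on macOS a plain "pip install torch" is enough):
pip install torch --index-url https://download.pytorch.org/whl/cpu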

Check your hardware setup before loading any model:

import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("Running on CPU")

# MPS (Apple Silicon)
print(f"MPS available: {torch.backends.mps.is_available()}")

Loading Your First Model: HuggingFacePipeline

HuggingFacePipeline is the primary LangChain wrapper for local Hugging Face models. It loads a model with the transformers pipeline API and exposes it as a standard LangChain LLM.

from langchain_huggingface import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,    # float16 halves memory usage
    device_map="auto",             # automatically use GPU if available
    trust_remote_code=True,        # required for some models
)

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    do_sample=True,
    repetition_penalty=1.1,
)

# Wrap in LangChain
llm = HuggingFacePipeline(pipeline=pipe)

# Test it
response = llm.invoke("Explain what a transformer neural network is in 3 sentences.")
print(response)

The device_map="auto" setting matters: it places model layers across the available GPUs and offloads the remainder to CPU when VRAM runs out (it relies on the accelerate package installed earlier). Phi-3-mini has 3.8B parameters, so its float16 weights take roughly 7.6GB. On a machine with 8GB of VRAM that is a tight fit, and a few layers may spill over to CPU.

Loading Mistral 7B and Llama 3

For production-quality output, 7B+ models deliver a much better experience. Here's how to load Mistral 7B Instruct:

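from langchain_huggingface import HuggingFacePipeline
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit quantization via bitsandbytes (needs a CUDA GPU); cuts memory to ~5GB.
# Recent transformers releases expect the setting in a BitsAndBytesConfig
# rather than as a bare load_in_4bit=True keyword.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
)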

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.1,
    do_sample=True,
    return_full_text=False,  # return only new tokens, not the prompt
)

mistral_llm = HuggingFacePipeline(pipeline=pipe)

The load_in_4bit=True flag (powered by bitsandbytes) is a game changer for fitting larger models on consumer hardware. A 7B model that normally needs 14GB in float16 fits in about 5GB with 4-bit quantization, with only a small quality drop.
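The arithmetic behind those figures is easy to check. The sketch below is a back-of-the-envelope estimate, not a measurement: estimate_weight_memory_gb is a hypothetical helper, and it counts only the weights. Activations, the KV cache, and quantization metadata come on top, which is why real 4-bit usage lands nearer 5GB than the raw 3.5GB.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A 7B-parameter model at two precisions (weights only, no runtime overhead).
fp16_gb = estimate_weight_memory_gb(7e9, 16)
int4_gb = estimate_weight_memory_gb(7e9, 4)

print(f"float16 weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:   {int4_gb:.1f} GB")
```

The same helper gives a quick first answer to whether a model is worth downloading for a given GPU.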

For Llama 3 8B, the process is nearly identical but you need to accept Meta's license on Hugging Face Hub first:

# After accepting license at huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Same loading pattern as Mistral.
# Read the access token from the environment instead of hardcoding it.
import os

hf_token = os.environ["HF_TOKEN"]
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Llama 3 prefers bfloat16
    device_map="auto",
    token=hf_token,
)

Prompt Templates for Instruction Models

Chat-tuned models expect specific prompt formats. Using the wrong format gives noticeably worse output. Each model family has its own template:

from langchain_core.prompts import PromptTemplate

# Mistral instruct format
mistral_prompt = PromptTemplate.from_template(
    "[INST] {question} [/INST]"
)

# Llama 3 instruct format
llama3_prompt = PromptTemplate.from_template(
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Phi-3 instruct format
phi3_prompt = PromptTemplate.from_template(
    "<|user|>\n{question}<|end|>\n<|assistant|>"
)

# Use in a chain (LCEL pipe syntax)
chain = mistral_prompt | mistral_llm

result = chain.invoke({"question": "What are the main differences between Python 3.11 and 3.12?"})
print(result)

The easiest way to get the right template is to use the tokenizer's apply_chat_template() method:

def format_prompt(tokenizer, user_message: str, system_message: str = "") -> str:
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

formatted = format_prompt(
    tokenizer,
    "Explain quantum entanglement to a 10-year-old",
    "You are a patient science teacher."
)
print(formatted)
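To make it clearer what apply_chat_template() is doing, here is a minimal pure-Python sketch of a chat template. It is an illustration only: render_phi3_style is a hypothetical function that mimics the <|role|> ... <|end|> layout shown earlier, whereas the real template is a Jinja template shipped with each tokenizer and may differ in details such as whitespace.

```python
from typing import Dict, List

Message = Dict[str, str]


def render_phi3_style(messages: List[Message]) -> str:
    """Render chat messages in a <|role|>\\n...<|end|> layout (illustrative)."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    # The trailing assistant tag plays the role of add_generation_prompt=True:
    # it tells the model that its own turn starts here.
    return "".join(parts) + "<|assistant|>"


messages = [
    {"role": "system", "content": "You are a patient science teacher."},
    {"role": "user", "content": "Explain quantum entanglement to a 10-year-old"},
]
prompt = render_phi3_style(messages)
print(prompt)
```

In real code, prefer the tokenizer's own template; hand-written formats like this one drift out of sync when a model family changes its special tokens.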

Local Embeddings with sentence-transformers

For RAG pipelines, you need embeddings too. Running embeddings locally with sentence-transformers is fast and free:

import torch
from langchain_huggingface import HuggingFaceEmbeddings

# Fast and accurate for English
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

# Test
texts = [
    "LangChain is a framework for building LLM applications",
    "Python is a popular programming language",
]
vectors = embeddings.embed_documents(texts)
print(f"Embedding dimension: {len(vectors[0])}")  # 384 for MiniLM

For multilingual tasks:

multilingual_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

For the highest quality English embeddings locally:

high_quality_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)
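The normalize_embeddings option used above is worth a short illustration. Once vectors are scaled to unit length, a plain dot product equals cosine similarity, so inner-product search and cosine search rank results the same way. The sketch below uses toy two-dimensional vectors and hypothetical helper names rather than real embeddings.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization the two quantities agree (up to floating-point error).
print(f"cosine similarity:        {cosine(a, b):.4f}")
print(f"dot of normalized inputs: {dot(normalize(a), normalize(b)):.4f}")
```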

Building a Fully Local RAG Pipeline

Here's a complete RAG system using only local models — no API calls, no internet required after setup:

from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import torch

# --- 1. Load and chunk documents ---
loader = TextLoader("company_docs.txt")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")

# --- 2. Create local embeddings and vector store ---
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
)

vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(
    search_type="mmr",       # maximum marginal relevance
    search_kwargs={"k": 4, "fetch_k": 10},
)

# --- 3. Load local LLM ---
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    return_full_text=False,
)
llm = HuggingFacePipeline(pipeline=pipe)

# --- 4. Build RAG chain ---
rag_prompt = PromptTemplate.from_template(
    "[INST] Answer the question based ONLY on the following context. "
    "If the answer is not in the context, say 'I don't know from the provided documents.'\n\n"
    "Context:\n{context}\n\n"
    "Question: {question} [/INST]"
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# --- 5. Query it ---
answer = rag_chain.invoke("What is the company's refund policy?")
print(answer)

This connects naturally with the vector database guide if you want to replace FAISS with a persistent vector store like ChromaDB or Qdrant.
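The retriever above uses search_type="mmr" with k=4 and fetch_k=10: it fetches the 10 nearest chunks, then keeps 4 that balance relevance against redundancy. The sketch below is a textbook greedy version of maximal marginal relevance on toy 2D vectors, not the code FAISS or LangChain actually runs. The function names and the lambda_mult value of 0.5 are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def mmr_select(query: Vector, candidates: List[Vector], k: int,
               lambda_mult: float = 0.5) -> List[int]:
    """Greedily pick k candidate indices, trading relevance for diversity."""
    relevance = [cosine(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(angle_deg: float) -> List[float]:
    """Unit vector at the given angle, used as a stand-in for an embedding."""
    rad = math.radians(angle_deg)
    return [math.cos(rad), math.sin(rad)]


query = unit(0)
candidates = [unit(10), unit(12), unit(-40)]  # 0 and 1 are near-duplicates

picked = mmr_select(query, candidates, k=2, lambda_mult=0.5)
print(picked)
```

With lambda_mult=1.0 the diversity penalty disappears and the selection falls back to plain similarity ranking, which would keep both near-duplicates.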

Using GGUF Models with llama.cpp

For CPU inference or when you want the most memory-efficient option, GGUF-quantized models via llama.cpp are significantly faster than standard PyTorch inference on CPU:

pip install llama-cpp-python
# If you have an NVIDIA GPU (older llama-cpp-python releases used -DLLAMA_CUBLAS=on):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

# Download a GGUF model into the current directory first:
# huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
#   mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .

llm = LlamaCpp(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    temperature=0.1,
    max_tokens=512,
    n_gpu_layers=35,       # layers to offload to GPU (0 = CPU only)
    n_ctx=4096,            # context window
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=False,
)

# Use exactly like any other LangChain LLM
result = llm.invoke("What is the capital of France?")

The Q4_K_M quantization is a good balance — it cuts a 7B model from ~14GB to ~4.5GB with minimal quality loss.
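One detail of the LlamaCpp settings above deserves a worked number: the prompt and the generated tokens share the same n_ctx window. The sketch below computes how many tokens are left for the prompt (including any retrieved context) once max_tokens is reserved for the answer. The function name prompt_token_budget is hypothetical.

```python
def prompt_token_budget(n_ctx: int, max_tokens: int) -> int:
    """Tokens available for the prompt after reserving room for the output."""
    if max_tokens >= n_ctx:
        raise ValueError("max_tokens must be smaller than n_ctx")
    return n_ctx - max_tokens


# Values taken from the LlamaCpp configuration shown earlier.
budget = prompt_token_budget(n_ctx=4096, max_tokens=512)
print(f"Prompt budget: {budget} tokens")
```

For RAG prompts this budget is what limits how many chunks can be stuffed into the context.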

Batching for Throughput

When you need to process many documents or queries, batching is much more efficient than calling the model one at a time:

from langchain_huggingface import HuggingFacePipeline

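# Batching needs a padding token; many causal LMs ship without one,
# so reuse the end-of-sequence token when it is missing.
if pipe.tokenizer.pad_token_id is None:
    pipe.tokenizer.pad_token_id = pipe.tokenizer.eos_token_id

# Enable batching
llm_batch = HuggingFacePipeline(
    pipeline=pipe,
    batch_size=4,  # process up to 4 inputs per forward pass
)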

# Process multiple queries at once
questions = [
    "What is machine learning?",
    "Explain neural networks",
    "What is backpropagation?",
    "Define gradient descent",
    "What is an activation function?",
    "Explain attention mechanism",
]

# batch() processes all in optimal batches
responses = llm_batch.batch(questions)
for q, r in zip(questions, responses):
    print(f"Q: {q}")
    print(f"A: {r[:100]}...")
    print()
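To see what batch_size=4 means for the six questions above, here is a minimal sketch of splitting a list of inputs into consecutive batches. It illustrates the general idea with a hypothetical chunked helper and is not a copy of the wrapper's internals.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


questions = [f"question {i}" for i in range(6)]
batch_sizes = [len(batch) for batch in chunked(questions, batch_size=4)]
print(batch_sizes)
```

The last batch is smaller than the rest, so total runtime is driven by the number of batches rather than by the number of inputs alone.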

Model Comparison Table

Model	Parameters	VRAM (4-bit)	Speed (tok/s GPU)	Best For
TinyLlama 1.1B	1.1B	~1GB	120+	Experiments, CPU
Phi-3-mini	3.8B	~2.5GB	80+	Reasoning, low VRAM
Mistral 7B Instruct	7B	~4.5GB	50+	General tasks
Llama 3 8B Instruct	8B	~5GB	45+	Instruction following
Llama 3 70B Instruct	70B	~40GB	12+	Near-GPT-4 quality
Mixtral 8x7B	~47B total (~13B active)	~26GB	18+	Highest quality local

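Speeds are approximate for an RTX 3090 with 4-bit quantization. The Llama 3 70B and Mixtral rows exceed a single 3090's 24GB of VRAM, so they need multiple GPUs or partial CPU offload. CPU speeds will be 5-10x slower.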
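Throughput figures are easier to reason about as response times. The sketch below converts the table's approximate tokens-per-second number for Mistral 7B into seconds for a 512-token answer, then applies the 5-10x CPU slowdown. It is arithmetic on the approximate figures above, not a benchmark, and generation_seconds is a hypothetical helper.

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady decoding rate."""
    return n_tokens / tokens_per_second


GPU_TOK_S = 50.0  # approximate Mistral 7B Instruct figure from the table

gpu_seconds = generation_seconds(512, GPU_TOK_S)
cpu_best = generation_seconds(512, GPU_TOK_S / 5)    # 5x slower
cpu_worst = generation_seconds(512, GPU_TOK_S / 10)  # 10x slower

print(f"GPU: {gpu_seconds:.1f}s")
print(f"CPU: {cpu_best:.1f}s to {cpu_worst:.1f}s")
```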

Switching Between Local and Cloud Models

One of the most useful patterns is building your code so you can swap between local and cloud models with a single variable change:

import os
from enum import Enum

class ModelBackend(Enum):
    OPENAI = "openai"
    MISTRAL_LOCAL = "mistral_local"
    LLAMA_LOCAL = "llama_local"

def get_llm(backend: ModelBackend):
    if backend == ModelBackend.OPENAI:
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini", temperature=0)
    
    elif backend == ModelBackend.MISTRAL_LOCAL:
        from transformers import (AutoTokenizer, AutoModelForCausalLM,
                                  BitsAndBytesConfig, pipeline)
        from langchain_huggingface import HuggingFacePipeline
        import torch
        model_id = "mistralai/Mistral-7B-Instruct-v0.3"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, device_map="auto",
            quantization_config=BitsAndBytesConfig(
                load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
        )
        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                       max_new_tokens=512, return_full_text=False)
        return HuggingFacePipeline(pipeline=pipe)
    
    elif backend == ModelBackend.LLAMA_LOCAL:
        from langchain_community.llms import LlamaCpp
        return LlamaCpp(model_path="./llama-3-8b.Q4_K_M.gguf",
                       n_gpu_layers=35, n_ctx=4096)

    # Fail loudly instead of silently returning None
    raise ValueError(f"Unsupported backend: {backend}")

# Swap backends with one line
BACKEND = os.getenv("LLM_BACKEND", "openai")
llm = get_llm(ModelBackend(BACKEND))

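This pairs well with the approach in the OpenAI API integration guide: start with OpenAI during prototyping, then switch to a local model in production for data privacy. One caveat: ChatOpenAI returns a message object while the local LLM wrappers return plain strings, so end each chain with StrOutputParser() to get a string from every backend.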
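Because the backend name comes from an environment variable, a typo would otherwise surface as a bare enum error. The sketch below shows one way to validate the value up front. The enum mirrors the one above, while parse_backend is a hypothetical helper added for illustration.

```python
import os
from enum import Enum


class ModelBackend(Enum):
    OPENAI = "openai"
    MISTRAL_LOCAL = "mistral_local"
    LLAMA_LOCAL = "llama_local"


def parse_backend(raw: str) -> ModelBackend:
    """Turn a config string into a ModelBackend, with a readable error."""
    try:
        return ModelBackend(raw.strip().lower())
    except ValueError:
        valid = ", ".join(b.value for b in ModelBackend)
        raise ValueError(
            f"Unknown LLM_BACKEND {raw!r}; expected one of: {valid}"
        ) from None


backend = parse_backend(os.getenv("LLM_BACKEND", "openai"))
print(f"Using backend: {backend.value}")
```

Normalizing case and whitespace keeps values such as "Mistral_Local" from failing in deployment configs.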

Fine-tuning a Model with LoRA

For domain-specific tasks, a few hours of fine-tuning often beats prompt engineering alone:

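from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
# Make the 4-bit model safe to train on top of (QLoRA-style setup)
model = prepare_model_for_kbit_training(model)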

# LoRA configuration — trains only a small adapter
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank — higher = more parameters
    lora_alpha=32,     # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with this config well under 1% is trainable

training_args = TrainingArguments(
    output_dir="./fine-tuned-mistral",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=50,
    save_strategy="epoch",
)

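# Trainer handles the training loop.
# Note: these keyword arguments match older trl releases; newer ones move
# max_seq_length into SFTConfig and rename tokenizer to processing_class.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # placeholder: a Hugging Face Dataset you supply
    tokenizer=tokenizer,
    max_seq_length=2048,
)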

trainer.train()
trainer.save_model()

After training, load the adapter on top of the base model:

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
fine_tuned = PeftModel.from_pretrained(base_model, "./fine-tuned-mistral")

The Hugging Face transformers tutorial covers the full fine-tuning workflow in more depth.
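To get a feel for how small a LoRA adapter is, the sketch below counts the parameters added by the r=16 configuration on q_proj and v_proj. The layer shapes are assumptions based on the published Mistral 7B architecture (hidden size 4096, 8 key/value heads of 128 dimensions, 32 layers), so treat the result as an estimate and compare it with what print_trainable_parameters() reports on your own run.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a d_out x d_in weight."""
    return r * (d_in + d_out)


# Assumed Mistral 7B shapes (not read from the model at runtime).
HIDDEN = 4096     # model width; q_proj maps HIDDEN -> HIDDEN
KV_DIM = 8 * 128  # v_proj maps HIDDEN -> KV_DIM (grouped-query attention)
N_LAYERS = 32
R = 16            # LoRA rank from the config above

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = per_layer * N_LAYERS

print(f"Adapter parameters per layer: {per_layer:,}")
print(f"Adapter parameters in total:  {total:,}")
```

Doubling r doubles the adapter size, which is the main lever when trading adapter capacity against memory.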

Monitoring Performance and Memory

When running local models, keeping an eye on resource usage is important:

import torch
import psutil
import time
from functools import wraps

def profile_inference(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        gpu = torch.cuda.is_available()
        if gpu:
            torch.cuda.reset_peak_memory_stats()
        
        psutil.cpu_percent(interval=None)  # reset the CPU usage counter
        start = time.perf_counter()
        
        result = func(*args, **kwargs)
        
        elapsed = time.perf_counter() - start
        cpu_used = psutil.cpu_percent(interval=None)  # average since the reset
        
        if gpu:
            peak_mem = torch.cuda.max_memory_allocated() / 1e9
            print(f"GPU peak memory: {peak_mem:.2f} GB")
        
        print(f"CPU utilization: {cpu_used:.0f}%")
        print(f"Inference time: {elapsed:.2f}s")
        return result
    return wrapper

@profile_inference
def run_inference(prompt: str) -> str:
    return llm.invoke(prompt)

result = run_inference("Summarize the history of artificial intelligence")

Conclusion

Running LangChain with local Hugging Face models is genuinely practical in 2026. Models like Mistral 7B and Llama 3 8B deliver quality that was GPT-3.5 territory just two years ago, and they run on hardware most developers already have access to.

Start with Phi-3-mini if you have limited VRAM or want something quick to test. Move to Mistral 7B or Llama 3 8B Instruct with 4-bit quantization for production-quality work. Use GGUF models via llama.cpp for CPU deployments or maximum memory efficiency.

The combination of local models, sentence-transformers embeddings, and FAISS gives you a complete AI stack that runs on-premise without any external API dependencies — and that's a real advantage for privacy-sensitive applications. Check out Build AI agent with LangChain to see how to wrap these local models in full agent architectures.

Frequently Asked Questions

Can I run a Hugging Face model locally without a GPU?

Yes, but with limitations. Smaller models like Phi-3-mini (3.8B parameters) and TinyLlama (1.1B) run acceptably on a modern CPU. Quantized versions (GGUF format via llama.cpp) help significantly. For anything above 7B parameters, you'll want at least an 8GB GPU for reasonable performance.

What is HuggingFacePipeline in LangChain?

HuggingFacePipeline is a LangChain wrapper around the Hugging Face transformers pipeline API. It loads a model locally and exposes it as a standard LangChain LLM, so it plugs into the same chains as a hosted model with no API calls. It is a text-completion LLM rather than a chat model, so code that expects chat messages (as ChatOpenAI does) needs it wrapped in ChatHuggingFace.

Which local model is best for RAG and Q&A tasks?

Mistral 7B Instruct and Llama 3 8B Instruct are currently the strongest 7-8B models for instruction following and RAG. If you need something smaller, Phi-3-mini (3.8B) punches well above its weight on reasoning. For embeddings, all-MiniLM-L6-v2 is fast and accurate enough for most RAG pipelines.

Langchain

How to Use LangChain with Hugging Face (Local LLMs 2026)

⚡ Quick Answer

Run open source LLMs locally with LangChain and Hugging Face. Complete guide covering HuggingFacePipeline, Llama, Mistral, and sentence-transformers embeddings.

AiTechWorlds Team May 31, 2026 12 min read

#LangChain #Hugging Face #Local LLM #Open Source AI

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

This guide shows you exactly how to connect Hugging Face models to LangChain — from loading your first model to building a full RAG pipeline that runs entirely on your own hardware.

If you're newer to LangChain and want the hosted model path first, LangChain tutorial 2025 covers OpenAI integration before local models. Come back here once you're ready to go self-hosted.

Why Run LLMs Locally Through LangChain

Zero marginal cost after hardware — no per-token charges
Complete data privacy — nothing leaves your machine or network
No rate limits — run as many requests as your hardware allows
Offline capability — works without internet after initial model download
Full control — fine-tune, quantize, or modify models as needed

According to the 2025 State of AI report from a16z, enterprises with strict data governance requirements are driving 60%+ of the demand for on-premise LLM deployments.

Setting Up Your Environment

You'll need either a Python virtual environment or conda. The dependencies are heavier than cloud-based setups:

pip install langchain langchain-community langchain-huggingface
pip install transformers accelerate bitsandbytes
pip install sentence-transformers
pip install torch --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1
# For CPU-only:
pip install torch

Check your hardware setup before loading any model:

import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("Running on CPU")

# MPS (Apple Silicon)
print(f"MPS available: {torch.backends.mps.is_available()}")

Loading Your First Model: HuggingFacePipeline

HuggingFacePipeline is the primary LangChain wrapper for local Hugging Face models. It loads a model with the transformers pipeline API and exposes it as a standard LangChain LLM.

from langchain_huggingface import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,    # float16 halves memory usage
    device_map="auto",             # automatically use GPU if available
    trust_remote_code=True,        # required for some models
)

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
    do_sample=True,
    repetition_penalty=1.1,
)

# Wrap in LangChain
llm = HuggingFacePipeline(pipeline=pipe)

# Test it
response = llm.invoke("Explain what a transformer neural network is in 3 sentences.")
print(response)

Loading Mistral 7B and Llama 3

For production-quality output, 7B+ models deliver a much better experience. Here's how to load Mistral 7B Instruct:

from langchain_huggingface import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_4bit=True,  # 4-bit quantization — cuts memory to ~5GB
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.1,
    do_sample=True,
    return_full_text=False,  # return only new tokens, not the prompt
)

mistral_llm = HuggingFacePipeline(pipeline=pipe)

For Llama 3 8B, the process is nearly identical but you need to accept Meta's license on Hugging Face Hub first:

# After accepting license at huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Same loading pattern as Mistral
tokenizer = AutoTokenizer.from_pretrained(model_id, token="hf_your_token")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Llama 3 prefers bfloat16
    device_map="auto",
    token="hf_your_token",
)

Prompt Templates for Instruction Models

Chat-tuned models expect specific prompt formats. Using the wrong format gives noticeably worse output. Each model family has its own template:

from langchain_core.prompts import PromptTemplate

# Mistral instruct format
mistral_prompt = PromptTemplate.from_template(
    "[INST] {question} [/INST]"
)

# Llama 3 instruct format
llama3_prompt = PromptTemplate.from_template(
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Phi-3 instruct format
phi3_prompt = PromptTemplate.from_template(
    "<|user|>\n{question}<|end|>\n<|assistant|>"
)

# Use in a chain (LCEL pipe syntax)
chain = mistral_prompt | mistral_llm

result = chain.invoke({"question": "What are the main differences between Python 3.11 and 3.12?"})
print(result)

The easiest way to get the right template is to use the tokenizer's apply_chat_template() method:

def format_prompt(tokenizer, user_message: str, system_message: str = "") -> str:
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

formatted = format_prompt(
    tokenizer,
    "Explain quantum entanglement to a 10-year-old",
    "You are a patient science teacher."
)
print(formatted)

Local Embeddings with sentence-transformers

For RAG pipelines, you need embeddings too. Running embeddings locally with sentence-transformers is fast and free:

from langchain_huggingface import HuggingFaceEmbeddings

# Fast and accurate for English
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

# Test
texts = [
    "LangChain is a framework for building LLM applications",
    "Python is a popular programming language",
]
vectors = embeddings.embed_documents(texts)
print(f"Embedding dimension: {len(vectors[0])}")  # 384 for MiniLM

For multilingual tasks:

multilingual_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

For the highest quality English embeddings locally:

high_quality_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)

Building a Fully Local RAG Pipeline

Here's a complete RAG system using only local models — no API calls, no internet required after setup:

from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import torch

# --- 1. Load and chunk documents ---
loader = TextLoader("company_docs.txt")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")

# --- 2. Create local embeddings and vector store ---
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
)

vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(
    search_type="mmr",       # maximum marginal relevance
    search_kwargs={"k": 4, "fetch_k": 10},
)

# --- 3. Load local LLM ---
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_4bit=True,
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    return_full_text=False,
)
llm = HuggingFacePipeline(pipeline=pipe)

# --- 4. Build RAG chain ---
rag_prompt = PromptTemplate.from_template(
    "[INST] Answer the question based ONLY on the following context. "
    "If the answer is not in the context, say 'I don't know from the provided documents.'\n\n"
    "Context:\n{context}\n\n"
    "Question: {question} [/INST]"
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# --- 5. Query it ---
answer = rag_chain.invoke("What is the company's refund policy?")
print(answer)

This connects naturally with the vector database guide if you want to replace FAISS with a persistent vector store like ChromaDB or Qdrant.
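
The retriever above uses search_type="mmr", which fetches fetch_k candidates and then keeps k of them, trading relevance against redundancy. The following is a minimal pure-Python sketch of that selection idea on toy vectors. The function names and the lambda_mult value are illustrative assumptions, and LangChain's own implementation may differ in its details.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, candidate_vecs, k=4, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    Each step picks the candidate that maximizes
    lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, selected).
    """
    remaining = list(range(len(candidate_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy 2-D vectors: candidates 0 and 1 are near-duplicates, 2 points elsewhere
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

diverse = mmr_select(query, candidates, k=2, lambda_mult=0.3)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With lambda_mult=1.0 the selection reduces to plain similarity ranking and returns both near-duplicates. Lowering it makes the second pick skip the duplicate in favor of the chunk that adds new information.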

Using GGUF Models with llama.cpp

For CPU inference or when you want the most memory-efficient option, GGUF-quantized models via llama.cpp are significantly faster than standard PyTorch inference on CPU:

pip install llama-cpp-python
# If you have an NVIDIA GPU (recent releases use GGML_CUDA; older ones used LLAMA_CUBLAS):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

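# Download a GGUF model first. --local-dir places the file in the current
# directory (instead of the Hugging Face cache) so model_path below resolves:
# huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
#   mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .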

llm = LlamaCpp(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    temperature=0.1,
    max_tokens=512,
    n_gpu_layers=35,       # layers to offload to GPU (0 = CPU only)
    n_ctx=4096,            # context window
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=False,
)

# Use exactly like any other LangChain LLM
result = llm.invoke("What is the capital of France?")

The Q4_K_M quantization is a good balance: it cuts a 7B model from ~14GB at FP16 to ~4.5GB with minimal quality loss.
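
The size figures above follow from simple arithmetic: parameter count times bits per weight. Here is a rough sketch of that calculation. The 4.85 bits-per-weight average for Q4_K_M and the 7.24B parameter count for Mistral 7B are approximations assumed for illustration, and real GGUF files also carry some metadata.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7.24e9          # assumed parameter count for Mistral 7B
Q4_K_M_BITS = 4.85         # assumed average bits per weight for Q4_K_M

fp16_gb = estimate_size_gb(N_PARAMS, 16)
q4_gb = estimate_size_gb(N_PARAMS, Q4_K_M_BITS)
print(f"FP16: {fp16_gb:.1f} GB, Q4_K_M: {q4_gb:.1f} GB")
```

The same function gives a quick first check on whether a larger model is worth downloading for your hardware. Leave extra room for the context window, which is not counted here.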

Batching for Throughput

When you need to process many documents or queries, batching is much more efficient than calling the model one at a time:

from langchain_huggingface import HuggingFacePipeline

# Enable batching
llm_batch = HuggingFacePipeline(
    pipeline=pipe,
    batch_size=4,  # process 4 inputs simultaneously
)

# Process multiple queries at once
questions = [
    "What is machine learning?",
    "Explain neural networks",
    "What is backpropagation?",
    "Define gradient descent",
    "What is an activation function?",
    "Explain attention mechanism",
]

# batch() processes all in optimal batches
responses = llm_batch.batch(questions)
for q, r in zip(questions, responses):
    print(f"Q: {q}")
    print(f"A: {r[:100]}...")
    print()
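
With six questions and batch_size=4, the inputs are presumably handled in two groups: one of four and one of two. The sketch below shows that slicing in plain Python. The helper name chunked is hypothetical, and this is an illustration of the idea rather than LangChain's actual code.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

questions = [f"question {i}" for i in range(6)]
batches = list(chunked(questions, batch_size=4))
print([len(b) for b in batches])  # → [4, 2]
```

Larger batches raise throughput until GPU memory runs out, so it is worth increasing batch_size gradually while watching peak memory.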

Model Comparison Table

Model	Parameters	VRAM (4-bit)	Speed (tok/s GPU)	Best For
TinyLlama 1.1B	1.1B	~1GB	120+	Experiments, CPU
Phi-3-mini	3.8B	~2.5GB	80+	Reasoning, low VRAM
Mistral 7B Instruct	7B	~4.5GB	50+	General tasks
Llama 3 8B Instruct	8B	~5GB	45+	Instruction following
Llama 3 70B Instruct	70B	~40GB	12+	Near-GPT-4 quality
Mixtral 8x7B	~47B total (~13B active)	~26GB	18+	Highest quality local

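Speeds are approximate on an RTX 3090 with 4-bit quantization. A single 3090 has 24GB of VRAM, so the Mixtral 8x7B and Llama 3 70B rows need more than one such card or partial CPU offloading. CPU speeds will be 5-10x slower.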
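
To turn the table into a quick decision, the sketch below picks the largest listed model that fits a given VRAM budget. The VRAM numbers are copied from the table. The helper name and the 1.5GB headroom for the KV cache and activations are assumptions, so adjust them for your context length.

```python
# (name, approximate VRAM in GB at 4-bit), copied from the table above
MODELS = [
    ("TinyLlama 1.1B", 1.0),
    ("Phi-3-mini", 2.5),
    ("Mistral 7B Instruct", 4.5),
    ("Llama 3 8B Instruct", 5.0),
    ("Mixtral 8x7B", 26.0),
    ("Llama 3 70B Instruct", 40.0),
]

def largest_model_that_fits(vram_gb: float, headroom_gb: float = 1.5):
    """Return the biggest listed model whose weights fit with headroom to spare."""
    budget = vram_gb - headroom_gb
    fitting = [(name, need) for name, need in MODELS if need <= budget]
    return max(fitting, key=lambda m: m[1])[0] if fitting else None

for vram in (2, 8, 24, 48):
    print(f"{vram}GB -> {largest_model_that_fits(vram)}")
```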

Switching Between Local and Cloud Models

One of the most useful patterns is building your code so you can swap between local and cloud models with a single variable change:

import os
from enum import Enum

class ModelBackend(Enum):
    OPENAI = "openai"
    MISTRAL_LOCAL = "mistral_local"
    LLAMA_LOCAL = "llama_local"

def get_llm(backend: ModelBackend):
    if backend == ModelBackend.OPENAI:
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini", temperature=0)
    
    elif backend == ModelBackend.MISTRAL_LOCAL:
        from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
        import torch
        model_id = "mistralai/Mistral-7B-Instruct-v0.3"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        # 4-bit settings go through BitsAndBytesConfig (needs bitsandbytes + CUDA)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            quantization_config=BitsAndBytesConfig(
                load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16,
            ),
        )
        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                       max_new_tokens=512, return_full_text=False)
        from langchain_huggingface import HuggingFacePipeline
        return HuggingFacePipeline(pipeline=pipe)
    
    elif backend == ModelBackend.LLAMA_LOCAL:
        from langchain_community.llms import LlamaCpp
        return LlamaCpp(model_path="./llama-3-8b.Q4_K_M.gguf",
                       n_gpu_layers=35, n_ctx=4096)

# Swap backends with one line
BACKEND = os.getenv("LLM_BACKEND", "openai")
llm = get_llm(ModelBackend(BACKEND))

This pairs well with the approach in OpenAI API integration — start with OpenAI during prototyping, then switch to a local model for production data privacy.
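
One detail worth handling when the backend comes from an environment variable: ModelBackend(value) raises a bare ValueError on a typo. Below is a small sketch of a stricter parser. The parse_backend helper is hypothetical, and it fails loudly rather than silently falling back to a cloud model, which matters when the point of the local backend is data privacy.

```python
import os
from enum import Enum

class ModelBackend(Enum):
    OPENAI = "openai"
    MISTRAL_LOCAL = "mistral_local"
    LLAMA_LOCAL = "llama_local"

def parse_backend(raw: str) -> ModelBackend:
    """Map a raw string to a ModelBackend, tolerating case and stray whitespace."""
    try:
        return ModelBackend(raw.strip().lower())
    except ValueError:
        valid = ", ".join(b.value for b in ModelBackend)
        raise ValueError(
            f"Unknown LLM_BACKEND {raw!r}; expected one of: {valid}"
        ) from None

# Reads the same variable as the snippet above, defaulting to "openai"
backend = parse_backend(os.getenv("LLM_BACKEND", "openai"))
print(backend)
```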

Fine-tuning a Model with LoRA

For domain-specific tasks, a few hours of fine-tuning often beats prompt engineering alone:

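from transformers import (
    AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the base model in 4-bit (needs bitsandbytes + CUDA)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
# Recommended before attaching adapters to a quantized model
model = prepare_model_for_kbit_training(model)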

# LoRA configuration — trains only a small adapter
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank — higher = more parameters
    lora_alpha=32,     # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. The exact numbers depend on
# the model and LoRA settings; the trainable share is typically well under 1%.

training_args = TrainingArguments(
    output_dir="./fine-tuned-mistral",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=50,
    save_strategy="epoch",
)

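# Trainer handles the training loop.
# Note: SFTTrainer argument names vary across trl versions; newer releases
# may expect tokenizer and sequence-length settings under different names,
# so check the documentation for your installed version.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # placeholder: a Hugging Face Dataset you supply
    tokenizer=tokenizer,
    max_seq_length=2048,
)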

trainer.train()
trainer.save_model()

After training, load the adapter on top of the base model:

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
fine_tuned = PeftModel.from_pretrained(base_model, "./fine-tuned-mistral")

The Hugging Face transformers tutorial covers the full fine-tuning workflow in more depth.
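
To see why LoRA trains such a small fraction of the model, it helps to count the adapter parameters by hand. Each adapted layer gains two low-rank matrices, A (r x d_in) and B (d_out x r). The sketch below applies that to the configuration above (r=16 on q_proj and v_proj). The Mistral 7B shapes it uses are stated as assumptions in the comments.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumed Mistral 7B shapes: hidden size 4096, 32 decoder layers,
# 8 key/value heads of dimension 128 (so v_proj maps 4096 -> 1024).
HIDDEN, KV_DIM, LAYERS, RANK = 4096, 1024, 32, 16

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")  # → 6,815,744 trainable LoRA parameters
```

Doubling r doubles this count, and adding more target_modules grows it further, so both settings trade adapter capacity against memory and training time.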

Monitoring Performance and Memory

When running local models, track peak GPU memory and latency per call so you can spot regressions before they turn into out-of-memory errors:

import torch
import psutil
import time
from functools import wraps

def profile_inference(func):
    """Decorator that reports latency and peak memory for one call."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Reset the peak counter so the reading covers only this call
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()

        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start

        if torch.cuda.is_available():
            peak_mem = torch.cuda.max_memory_allocated() / 1e9
            print(f"GPU peak memory: {peak_mem:.2f} GB")

        # Resident memory of this Python process (useful for CPU-only inference)
        ram_gb = psutil.Process().memory_info().rss / 1e9
        print(f"Process RAM: {ram_gb:.2f} GB")
        print(f"Inference time: {elapsed:.2f}s")
        return result
    return wrapper

@profile_inference
def run_inference(prompt: str) -> str:
    return llm.invoke(prompt)

result = run_inference("Summarize the history of artificial intelligence")
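
Latency alone is hard to compare across prompts, so tokens per second (the unit used in the comparison table) is often more useful. Below is a minimal sketch of measuring it. The measure_throughput helper and the stand-in generator are hypothetical, included so the example runs without a model loaded.

```python
import time

def measure_throughput(generate, prompt, count_tokens):
    """Time one generate(prompt) call and return (text, tokens_per_second)."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    n_tokens = count_tokens(text)
    return text, (n_tokens / elapsed if elapsed > 0 else float("inf"))

# Stand-in for a real model call: sleeps briefly, then returns fixed text
def fake_generate(prompt):
    time.sleep(0.05)
    return "local models keep your data on your own hardware"

# Whitespace splitting is a crude token count used only for this demo
text, tps = measure_throughput(fake_generate, "hi", lambda s: len(s.split()))
print(f"{tps:.1f} tokens/s")
```

With a real model, pass llm.invoke as generate and count tokens with the model's tokenizer, for example lambda s: len(tokenizer.encode(s)).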

Conclusion

Frequently Asked Questions

Can I run a Hugging Face model locally without a GPU?

What is HuggingFacePipeline in LangChain?

Which local model is best for RAG and Q&A tasks?
