Hugging Face Transformers Tutorial: Complete Guide to Using Pretrained Models
Hugging Face Transformers tutorial — load, fine-tune, and deploy pretrained models for text classification, generation, summarization, and translation with practical Python examples.
When I started working with language models, the barrier was enormous: implementing BERT from scratch, managing training loops, debugging tensor shapes. Hugging Face Transformers changed this entirely.
The library gives you access to hundreds of thousands of pretrained models through a consistent API. Today, sentiment analysis on a custom dataset takes about 50 lines of Python, and fine-tuning BERT for text classification takes an afternoon. This guide covers the patterns that come up in most real projects: pipelines, the Auto classes, text generation, fine-tuning, LoRA, and the Hub.
Installation
pip install transformers datasets evaluate accelerate
pip install torch torchvision # Or tensorflow
pip install peft # For efficient fine-tuning
# Optional: quantized model loading (typically requires a CUDA GPU)
pip install bitsandbytes # 4-bit/8-bit quantization
The Pipeline API: One-Line Inference
from transformers import pipeline
# Sentiment analysis
sentiment = pipeline("sentiment-analysis")
result = sentiment("I absolutely loved this product!")
print(result) # [{'label': 'POSITIVE', 'score': 0.9998}]
# Batch processing
texts = [
    "This is great!",
    "Not what I expected.",
    "Completely useless.",
]
results = sentiment(texts)
# Named entity recognition
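# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Apple CEO Tim Cook announced new products in San Francisco.")
for e in entities:
    print(f"{e['entity_group']}: {e['word']} ({e['score']:.2f})")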
# Question answering
qa = pipeline("question-answering")
context = "LangChain is a framework for building LLM applications. It was created in 2022."
answer = qa(question="When was LangChain created?", context=context)
print(f"Answer: {answer['answer']} (score: {answer['score']:.3f})")
# Text generation
generator = pipeline("text-generation", model="gpt2", max_length=100)
output = generator("The future of artificial intelligence is")
print(output[0]["generated_text"])
# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = """[Your long article text here...]"""
summary = summarizer(long_text, max_length=130, min_length=30)
print(summary[0]["summary_text"])
# Translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
translated = translator("Hello, how are you doing today?")
print(translated[0]["translation_text"]) # "Bonjour, comment allez-vous aujourd'hui?"
# Zero-shot classification (no training needed)
classifier = pipeline("zero-shot-classification")
result = classifier(
    "I need to cancel my subscription immediately.",
    candidate_labels=["account management", "billing", "technical support", "complaint"]
)
print(result["labels"][0]) # Most likely label (labels are sorted by score, highest first)
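For batches, the pipeline returns one dict per input, in the same order as the inputs, so results can be paired back with their texts. A minimal sketch, assuming the `label`/`score` result shape shown above. The scores below are hand-written placeholders rather than real model output, and `flag_low_confidence` and the 0.9 threshold are hypothetical choices for illustration:

```python
def flag_low_confidence(texts, results, threshold=0.9):
    """Pair each text with its prediction and mark low-confidence ones for review."""
    rows = []
    for text, res in zip(texts, results):
        rows.append({
            "text": text,
            "label": res["label"],
            "score": res["score"],
            "needs_review": res["score"] < threshold,
        })
    return rows

# Placeholder results in the shape pipeline("sentiment-analysis") returns.
# These scores are illustrative, not produced by a model.
sample_texts = ["This is great!", "Not what I expected."]
sample_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.71},
]

for row in flag_low_confidence(sample_texts, sample_results):
    print(row)
```

In a real project you would pass `texts` and the output of `sentiment(texts)` instead of the placeholders.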
AutoModel and AutoTokenizer
For more control:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Tokenization
text = "This movie was absolutely fantastic!"
inputs = tokenizer(
    text,
    return_tensors="pt", # PyTorch tensors
    truncation=True, # Truncate to max_length
    padding=True, # Pad to same length in batch
    max_length=512
)
print(f"Input IDs shape: {inputs['input_ids'].shape}")
print(f"Token IDs: {inputs['input_ids'][0][:10].tolist()}")
# Forward pass (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
probabilities = torch.softmax(logits, dim=1)
predicted_class = torch.argmax(logits, dim=1).item()
labels = model.config.id2label
print(f"Prediction: {labels[predicted_class]} ({probabilities[0][predicted_class]:.3f})")
# Batch processing efficiently
texts = ["I love this!", "I hate this.", "It's okay.", "Absolutely amazing!"]
inputs = tokenizer(
    texts,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.softmax(outputs.logits, dim=1)
for text, pred in zip(texts, predictions):
    label = labels[torch.argmax(pred).item()]
    confidence = torch.max(pred).item()
    print(f"[{label} {confidence:.2f}] {text}")
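The softmax and argmax steps are what turn raw logits into a label and a confidence. A dependency-free sketch of the same arithmetic, using made-up logits. The values and the `id2label` mapping here are assumptions for illustration; the real ones come from `outputs.logits` and `model.config.id2label`:

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input and a two-class head.
example_logits = [-2.0, 3.0]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax over class indices
print(f"Prediction: {id2label[predicted]} ({probs[predicted]:.3f})")
```

Because softmax preserves ordering, taking the argmax of the logits or of the probabilities gives the same class.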
Text Generation with LLaMA
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Gated model: accept the license on the model page and log in to the Hub first
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto", # Distribute across available GPUs
)
# Format as instruction following (LLaMA 3.1 chat format)
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to find prime numbers up to n."}
]
# Apply chat template (handles special tokens automatically)
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# The template already inserts the beginning-of-text token, so don't add special tokens again
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3, # Lower = more deterministic for code
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1 # Avoid repetitive output
)
# Decode only the generated tokens (not the prompt)
new_tokens = outputs[0][inputs['input_ids'].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
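The `temperature` and `top_p` arguments reshape the next-token distribution before sampling. A simplified, dependency-free sketch of that idea over a toy four-token vocabulary. The logits are made up, the helper names are hypothetical, and the library's real implementation works on tensors and differs in detail:

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by temperature, then softmax. Lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches top_p,
    then renormalize. Returns {token_index: probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens

for t in (1.0, 0.3):
    probs = apply_temperature(toy_logits, t)
    candidates = top_p_filter(probs, top_p=0.9)
    print(f"temperature={t}: top token p={max(probs):.2f}, candidates kept={sorted(candidates)}")
```

At temperature 0.3 almost all of the probability mass lands on the top token, so the top-p filter keeps a single candidate; at temperature 1.0 three candidates survive. That is why low temperatures suit code generation, where you want the most likely continuation.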
Fine-Tuning for Classification
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding,
    pipeline, # Used at the end to run the fine-tuned model
)
from datasets import Dataset
import evaluate
import numpy as np
# Prepare your data
train_data = {
    "text": ["Great product!", "Terrible experience", "Works as expected", "Love it!", "Waste of money"],
    "label": [1, 0, 1, 1, 0] # 1 = positive, 0 = negative
}
eval_data = {
    "text": ["Really enjoyed it", "Not satisfied"],
    "label": [1, 0]
}
train_dataset = Dataset.from_dict(train_data)
eval_dataset = Dataset.from_dict(eval_data)
# Load model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
# Tokenize dataset (no padding here; the data collator pads each batch)
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=128
    )
train_tokenized = train_dataset.map(tokenize_function, batched=True)
eval_tokenized = eval_dataset.map(tokenize_function, batched=True)
# Data collator (handles padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Metrics
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
# Training arguments
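training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=100, # Sized for a real dataset; the toy data above finishes in 3 steps
    weight_decay=0.01,
    learning_rate=2e-5,
    eval_strategy="epoch", # Named evaluation_strategy in older transformers releases
    save_strategy="epoch", # Must match eval_strategy when load_best_model_at_end=True
    load_best_model_at_end=True,
    logging_dir="./logs",
)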
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    processing_class=tokenizer, # Use tokenizer=tokenizer on older transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("./my-sentiment-model")
# Use fine-tuned model
fine_tuned = pipeline("sentiment-analysis", model="./my-sentiment-model")
print(fine_tuned("This is wonderful!"))
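One thing to check when adapting these arguments is how `warmup_steps` compares with the total number of optimizer steps. A rough sketch of linear warmup followed by linear decay, assuming the Trainer's default linear schedule, a single device, and no gradient accumulation. `linear_schedule_lr` and `total_optimizer_steps` are hypothetical helpers written for illustration, not library functions:

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Steps per epoch (rounded up) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs

# The toy dataset above: 5 examples, batch size 8, 3 epochs.
toy_steps = total_optimizer_steps(num_examples=5, batch_size=8, epochs=3)
peak_toy_lr = max(linear_schedule_lr(s, 2e-5, 100, toy_steps) for s in range(toy_steps))
print(f"toy run: {toy_steps} steps, highest learning rate reached = {peak_toy_lr:.1e}")

# A hypothetical 8,000-example dataset with the same settings.
real_steps = total_optimizer_steps(num_examples=8000, batch_size=8, epochs=3)
lr_after_warmup = linear_schedule_lr(100, 2e-5, 100, real_steps)
print(f"larger run: {real_steps} steps, lr at step 100 = {lr_after_warmup:.1e}")
```

With the five-example toy dataset the run ends long before warmup completes, so the learning rate never gets close to 2e-5. On a dataset with thousands of examples, 100 warmup steps are a small fraction of training and the full learning rate is reached early.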
Efficient Fine-Tuning with PEFT/LoRA
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer
# Load a larger model that would normally be too big to fine-tune
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16, # LoRA rank (lower = fewer parameters)
    lora_alpha=32, # Scaling factor
    lora_dropout=0.1,
    target_modules=["q", "v"] # Which weight matrices to add LoRA to
)
# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. With r=16 on q and v, expect roughly
# 4.7M trainable parameters on top of about 783M frozen ones, around 0.6%
# (r=8 would give 2,359,296, about 0.30%).
# Either way, well under 1% of parameters are trained!
# Now train as normal — much smaller memory footprint
# ... (rest of training is identical to standard Trainer usage)
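# Save LoRA adapter only (much smaller than full model)
model.save_pretrained("./lora-adapter")
# Later: load base model + LoRA adapter together
from peft import PeftModel
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")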
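The trainable-parameter count can be estimated by hand: each adapted weight matrix of shape (d_out, d_in) gets two low-rank factors with r * (d_in + d_out) parameters in total. A back-of-the-envelope sketch, assuming the flan-t5-large layout as I understand it (24 encoder and 24 decoder blocks, 1024-dimensional q and v projections, cross-attention in every decoder block). Treat those dimensions as assumptions and confirm against `model.print_trainable_parameters()`:

```python
def lora_params_per_matrix(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

def lora_trainable_params(num_matrices, d_in, d_out, r):
    """Total LoRA parameters when every adapted matrix has the same shape."""
    return num_matrices * lora_params_per_matrix(d_in, d_out, r)

# Assumed flan-t5-large layout (not read from the model config here):
encoder_attention_blocks = 24        # one self-attention per encoder block
decoder_attention_blocks = 24 * 2    # self-attention + cross-attention per decoder block
targets_per_block = 2                # the "q" and "v" projections
num_matrices = (encoder_attention_blocks + decoder_attention_blocks) * targets_per_block
d_model = 1024

for r in (8, 16):
    n = lora_trainable_params(num_matrices, d_model, d_model, r)
    print(f"r={r}: {n:,} trainable LoRA parameters")
```

Under these assumptions r=8 gives 2,359,296 and r=16 gives 4,718,592 trainable parameters. Doubling the rank doubles the adapter size, and both stay well under 1% of a model with roughly 783M parameters.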
Pushing Models to Hugging Face Hub
from huggingface_hub import notebook_login
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Log in with a write token from huggingface.co/settings/tokens
notebook_login() # In a terminal, run instead: huggingface-cli login
# Push model to Hub
model.push_to_hub("your-username/my-sentiment-model")
tokenizer.push_to_hub("your-username/my-sentiment-model")
# Pull your model anywhere
model = AutoModelForSequenceClassification.from_pretrained("your-username/my-sentiment-model")
Conclusion
The Hugging Face ecosystem (Transformers, Datasets, PEFT, Evaluate, and the Hub) forms a complete ML platform. The pipeline API makes inference trivial; AutoModel gives you full control; Trainer handles the fine-tuning boilerplate; PEFT enables large model fine-tuning on consumer hardware.
The pattern that works in practice: start with a pretrained model from the Hub, fine-tune with LoRA on your domain data, evaluate with Evaluate metrics, and deploy via the Transformers pipeline or a Text Generation Inference (TGI) server.
For using Hugging Face models in RAG pipelines, see our RAG system tutorial. For understanding the transformer architecture these models are built on, see our transformer architecture guide.
Frequently Asked Questions
What is the Hugging Face Transformers library?
An open-source Python library that gives access to 500,000+ pretrained models on the Hugging Face Hub through a consistent API. It supports BERT, GPT, LLaMA, T5, and hundreds of other architectures for NLP, vision, and audio. Models download automatically the first time you load them by name.
What is the difference between pipeline() and AutoModel?
Pipeline: one line for standard tasks, handles tokenization and output processing automatically. AutoModel: lower-level control, required for custom tasks, fine-tuning, or non-standard processing. Start with pipeline; use AutoModel when you need more control.
How do I fine-tune a Hugging Face model?
Load the model and tokenizer, tokenize your dataset, define TrainingArguments, create a Trainer, and call trainer.train(). For memory efficiency, use PEFT/LoRA, which typically trains under 1% of the parameters. Fine-tuning BERT-base for classification can take as little as 10-30 minutes on a single GPU, depending on dataset size and sequence length.
What is PEFT and when should I use it?
Parameter-Efficient Fine-Tuning: a family of methods, LoRA being the most common, that train a small fraction of the parameters (often under 1%) so that large models fit on consumer GPUs. Use it when the model is too large for full fine-tuning, or when you want multiple specialized adapters on one base model. QLoRA combines LoRA with a 4-bit quantized base model to fit even larger models.
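The memory argument for quantization is simple arithmetic on the weights alone. A rough sketch, assuming an 8-billion-parameter model and counting only weight storage. Activations, optimizer state, and quantization overhead are ignored, so real usage is higher:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 8e9  # assumed model size, for illustration only

for name, bits in [("float32", 32), ("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>8}: ~{weight_memory_gb(num_params, bits):.0f} GB")
```

Going from bfloat16 to 4-bit cuts weight storage by a factor of four, which is what moves a model of this size from data-center GPUs toward a single consumer card.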
How do I use Hugging Face models for text generation?
Load the model with AutoModelForCausalLM, use tokenizer.apply_chat_template() for instruction-tuned models, and call model.generate() with sampling parameters such as temperature and top_p. For production, serve the model with TGI (Text Generation Inference), which exposes an OpenAI-compatible API.
AiTechWorlds Team