Hugging Face Transformers Tutorial: Complete Guide to Using Pretrained Models

Hugging Face Transformers tutorial — load, fine-tune, and deploy pretrained models for text classification, generation, summarization, and translation with practical Python examples.

AiTechWorlds Team
May 27, 2026 7 min read

When I started working with language models, the barrier was enormous — implementing BERT from scratch, managing training loops, debugging tensor shapes. Hugging Face Transformers changed this entirely.

The library gives you access to hundreds of thousands of pretrained models through one consistent API. Today, sentiment analysis on a custom dataset takes about 50 lines of Python, and fine-tuning BERT for text classification takes an afternoon. This guide covers the patterns that come up in most real projects.


Installation

pip install transformers datasets evaluate accelerate
pip install torch torchvision  # Or tensorflow
pip install peft  # For efficient fine-tuning

# Optional: quantized model loading on a GPU
pip install bitsandbytes  # 4-bit/8-bit quantization

The Pipeline API: One-Line Inference

from transformers import pipeline

# Sentiment analysis
sentiment = pipeline("sentiment-analysis")
result = sentiment("I absolutely loved this product!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch processing
texts = [
    "This is great!",
    "Not what I expected.",
    "Completely useless.",
]
results = sentiment(texts)

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")  # merges word pieces into whole entities (replaces the deprecated grouped_entities=True)
entities = ner("Apple CEO Tim Cook announced new products in San Francisco.")
for e in entities:
    print(f"{e['entity_group']}: {e['word']} ({e['score']:.2f})")

# Question answering
qa = pipeline("question-answering")
context = "LangChain is a framework for building LLM applications. It was created in 2022."
answer = qa(question="When was LangChain created?", context=context)
print(f"Answer: {answer['answer']} (score: {answer['score']:.3f})")

# Text generation
generator = pipeline("text-generation", model="gpt2", max_length=100)
output = generator("The future of artificial intelligence is")
print(output[0]["generated_text"])

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = """[Your long article text here...]"""
summary = summarizer(long_text, max_length=130, min_length=30)
print(summary[0]["summary_text"])

# Translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
translated = translator("Hello, how are you doing today?")
print(translated[0]["translation_text"])  # "Bonjour, comment allez-vous aujourd'hui?"

# Zero-shot classification (no training needed)
classifier = pipeline("zero-shot-classification")
result = classifier(
    "I need to cancel my subscription immediately.",
    candidate_labels=["account management", "billing", "technical support", "complaint"]
)
print(result["labels"][0])  # Most likely label
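Pipeline results are plain Python lists and dicts, so post-processing them needs no special tooling. As a minimal sketch, the helper below sends low-confidence predictions to a fallback label. The function name `route_predictions`, the 0.80 threshold, and the sample scores are illustrative assumptions, not library output.

```python
def route_predictions(results, threshold=0.80, fallback="NEEDS_REVIEW"):
    """Replace the label of any prediction scoring below `threshold`."""
    routed = []
    for item in results:
        label = item["label"] if item["score"] >= threshold else fallback
        routed.append({"label": label, "score": item["score"]})
    return routed


# Hand-written sample shaped like the output of sentiment(texts); scores are made up.
sample_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.62},
    {"label": "NEGATIVE", "score": 0.97},
]

for row in route_predictions(sample_results):
    print(f"{row['label']:<12} {row['score']:.2f}")
```

In a real project you would pass the list returned by the pipeline call instead of `sample_results`, and tune the threshold against a labeled validation set.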

AutoModel and AutoTokenizer

For more control:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenization
text = "This movie was absolutely fantastic!"
inputs = tokenizer(
    text,
    return_tensors="pt",       # PyTorch tensors
    truncation=True,            # Truncate to max_length
    padding=True,               # Pad to same length in batch
    max_length=512
)

print(f"Input IDs shape: {inputs['input_ids'].shape}")
print(f"Token IDs: {inputs['input_ids'][0][:10].tolist()}")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
probabilities = torch.softmax(logits, dim=1)
predicted_class = torch.argmax(logits, dim=1).item()

labels = model.config.id2label
print(f"Prediction: {labels[predicted_class]} ({probabilities[0][predicted_class]:.3f})")

# Batch processing efficiently
texts = ["I love this!", "I hate this.", "It's okay.", "Absolutely amazing!"]

inputs = tokenizer(
    texts,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.softmax(outputs.logits, dim=1)
for text, pred in zip(texts, predictions):
    label = labels[torch.argmax(pred).item()]
    confidence = torch.max(pred).item()
    print(f"[{label} {confidence:.2f}] {text}")
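The softmax and argmax steps above are what turn raw logits into a label and a confidence. The sketch below reproduces that arithmetic in plain Python so the transformation is visible without tensors; the logit values are made up for illustration and the class ordering is an assumption.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a two-class model: index 0 = NEGATIVE, index 1 = POSITIVE.
example_logits = [-2.1, 2.4]
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print(f"probabilities: {[round(p, 3) for p in probs]}")
print(f"predicted class index: {predicted}")
```

Because softmax preserves ordering, taking the argmax of the logits or of the probabilities picks the same class; the probabilities only matter when you need a confidence value.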

Text Generation with LLaMA

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # Distribute across available GPUs
)

# Format as instruction following (LLaMA 3.1 chat format)
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to find prime numbers up to n."}
]

# Apply chat template (handles special tokens automatically)
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,        # Lower = more deterministic for code
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1   # Avoid repetitive output
)

# Decode only the generated tokens (not the prompt)
new_tokens = outputs[0][inputs['input_ids'].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
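To see what `temperature` and `top_p` do to the next-token distribution, here is a simplified sketch in plain Python. It is not the library's implementation: the helper names and the four-token toy vocabulary are assumptions made for illustration.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by `temperature` (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches `top_p`,
    then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [3.0, 2.0, 1.0, 0.0]  # made-up scores for a 4-token vocabulary
for temperature in (1.0, 0.3):
    probs = apply_temperature(toy_logits, temperature)
    survivors = top_p_filter(probs, top_p=0.9)
    print(f"temperature={temperature}: candidate tokens {sorted(survivors)}")
```

Lowering the temperature concentrates probability on the top token, so fewer candidates survive the top-p cut. That is why a low temperature suits code generation, where you want near-deterministic output.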

Fine-Tuning for Classification

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding,
    pipeline,  # used at the end to run the fine-tuned model
)
from datasets import Dataset
import evaluate
import numpy as np

# Prepare your data
train_data = {
    "text": ["Great product!", "Terrible experience", "Works as expected", "Love it!", "Waste of money"],
    "label": [1, 0, 1, 1, 0]  # 1 = positive, 0 = negative
}
eval_data = {
    "text": ["Really enjoyed it", "Not satisfied"],
    "label": [1, 0]
}

train_dataset = Dataset.from_dict(train_data)
eval_dataset = Dataset.from_dict(eval_data)

# Load model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=128
    )

train_tokenized = train_dataset.map(tokenize_function, batched=True)
eval_tokenized = eval_dataset.map(tokenize_function, batched=True)

# Data collator (handles padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Metrics
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=100,
    weight_decay=0.01,
    learning_rate=2e-5,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_dir="./logs",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    processing_class=tokenizer,  # older transformers releases call this argument tokenizer=
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("./my-sentiment-model")

# Use fine-tuned model
fine_tuned = pipeline("sentiment-analysis", model="./my-sentiment-model")
print(fine_tuned("This is wonderful!"))
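Note that `tokenize_function` above does not pad; `DataCollatorWithPadding` pads each batch to its own longest sequence when the batch is assembled. The sketch below illustrates that idea in plain Python. It is not the library's code: `pad_batch`, the pad id of 0, and the token ids are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids; real ids come from the tokenizer.
batch = pad_batch([[101, 2307, 102], [101, 6659, 3325, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed `max_length` keeps short batches short, which saves compute when most inputs are well under the length limit.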

Efficient Fine-Tuning with PEFT/LoRA

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer

# Load a larger model that would normally be too big to fine-tune
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                    # LoRA rank (lower = fewer parameters)
    lora_alpha=32,           # Scaling factor
    lora_dropout=0.1,
    target_modules=["q", "v"]  # Which weight matrices to add LoRA to
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable parameter count, the total parameter count, and the trainable percentage.
# With LoRA on only the q and v projections, well under 1% of the parameters are trained.

# Now train as normal — much smaller memory footprint
# ... (rest of training is identical to standard Trainer usage)

# Save LoRA adapter only (much smaller than full model)
model.save_pretrained("./lora-adapter")
# Later: load base model + LoRA adapter together
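The size of a LoRA adapter follows directly from the rank: each adapted weight matrix gains two small matrices, A (rank x d_in) and B (d_out x rank). The sketch below works through that arithmetic. The layer shapes and counts are assumptions for a T5-large-style model; check `model.config` for the real values.

```python
def lora_trainable_params(d_in, d_out, rank, num_modules):
    """Each adapted matrix gains A (rank x d_in) and B (d_out x rank)."""
    return num_modules * rank * (d_in + d_out)


# Assumed T5-large-style shapes: 1024-wide q/v projections, 24 encoder layers
# (self-attention) and 24 decoder layers (self- plus cross-attention).
attention_blocks = 24 + 24 * 2
adapted_modules = attention_blocks * 2  # "q" and "v" in every block

for rank in (8, 16):
    count = lora_trainable_params(1024, 1024, rank, adapted_modules)
    print(f"r={rank}: {count:,} trainable LoRA parameters")
```

Doubling the rank doubles the adapter size, so `r` is the main knob for trading adapter capacity against memory and checkpoint size.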

Pushing Models to Hugging Face Hub

from huggingface_hub import notebook_login
from transformers import AutoModelForSequenceClassification

# Log in with an access token from huggingface.co/settings/tokens
notebook_login()  # Or run `huggingface-cli login` in a terminal

# Push model to Hub
model.push_to_hub("your-username/my-sentiment-model")
tokenizer.push_to_hub("your-username/my-sentiment-model")

# Pull your model anywhere
model = AutoModelForSequenceClassification.from_pretrained("your-username/my-sentiment-model")

Conclusion

The Hugging Face ecosystem — Transformers, Datasets, PEFT, Evaluate, Hub — forms a complete ML platform. The pipeline API makes inference trivial; AutoModel gives you full control; Trainer handles the fine-tuning boilerplate; PEFT enables large model fine-tuning on consumer hardware.

The pattern that works in practice: start with a pretrained model from the Hub, fine-tune with LoRA on your domain data, evaluate with Evaluate metrics, and deploy via the Transformers pipeline or TGI server.

For using Hugging Face models in RAG pipelines, see our RAG system tutorial. For understanding the transformer architecture these models are built on, see our transformer architecture guide.


Frequently Asked Questions

What is the Hugging Face Transformers library?

An open-source Python library that loads pretrained models from the Hugging Face Hub (500,000+ of them) through a consistent API. It supports BERT, GPT, LLaMA, T5, and hundreds of other architectures for NLP, vision, and audio. Models download automatically the first time you reference them by name.

What is the difference between pipeline() and AutoModel?

Pipeline: one line for standard tasks, handles tokenization and output processing automatically. AutoModel: lower-level control, required for custom tasks, fine-tuning, or non-standard processing. Start with pipeline; use AutoModel when you need more control.

How do I fine-tune a Hugging Face model?

Load model + tokenizer, tokenize your dataset, define TrainingArguments, create Trainer, call trainer.train(). For memory efficiency: use PEFT/LoRA (trains <1% of parameters). On a small dataset, fine-tuning a BERT-base classifier typically takes on the order of 10-30 minutes on a single GPU; larger datasets take proportionally longer.

What is PEFT and when should I use it?

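Parameter-Efficient Fine-Tuning: a family of methods, LoRA being the most widely used, that train a small set of added weights (often under 1% of the total) while the base model stays frozen. That is what lets large models fit on consumer GPUs. Use it when the model is too large for full fine-tuning, or when you want multiple specialized adapters from one base model. QLoRA adds 4-bit quantization of the base model for even larger models.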
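A rough back-of-the-envelope calculation shows why quantization matters here. The sketch below estimates memory for the weights alone at different precisions; it assumes a round 8 billion parameters and ignores activations, gradients, optimizer state, and quantization overhead, so treat the figures as lower bounds.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_8b = 8_000_000_000  # round figure for an "8B" model
for name, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gb(params_8b, bits):.0f} GB")
```

Each halving of precision halves the weight footprint, which is the difference between needing a data-center GPU and fitting on a single consumer card.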

How do I use Hugging Face models for text generation?

Load with AutoModelForCausalLM, use tokenizer.apply_chat_template() for instruction-tuned models, call model.generate() with temperature/top_p parameters. For production: use TGI (text-generation-inference) server for OpenAI-compatible API.
