Web Scraping with Python: A Gentle Introduction for Beginners
A beginner-friendly Python web scraping guide using requests and BeautifulSoup: extract data from websites, handle pagination, and store results in 2025.
I remember the moment web scraping became real for me: I needed a list of every Python package on PyPI with their download counts. The data existed on a webpage. The alternative was copying 4,000 rows manually.
I wrote a scraper in 45 minutes. It ran in 30 seconds. I had my data.
Web scraping is that kind of tool — once you know the fundamentals, it unlocks data that would otherwise require hours of manual copying. This guide covers the fundamentals: making requests, parsing HTML, handling the common challenges, and storing what you find.
What Web Scraping Actually Is
When you visit a website, your browser sends an HTTP request and receives HTML, CSS, and JavaScript in response. Your browser renders that into a visual page.
Web scraping does the same thing programmatically: send a request, receive the HTML, and extract the specific data you want from it.
The basic pipeline:
- Use requests to download the HTML page
- Use BeautifulSoup to parse the HTML
- Find the HTML elements containing your data
- Extract the data and clean it
- Store it (CSV, database, JSON)
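The steps above can be sketched end to end on a small inline HTML string, so nothing is downloaded. The HTML snippet, the `product` and `price` class names, and the `parse_products` helper are illustrative assumptions, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (step 1 is skipped here).
SAMPLE_HTML = """
<html><body>
  <div class="product"><h2> Widget </h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$19.50</span></div>
</body></html>
"""


def parse_products(html: str) -> list[dict]:
    """Parse HTML, find the product elements, and return cleaned records."""
    soup = BeautifulSoup(html, "html.parser")                # step 2: parse
    records = []
    for div in soup.find_all("div", class_="product"):       # step 3: find elements
        name = div.find("h2").text.strip()                   # step 4: extract and clean
        price = float(div.find("span", class_="price").text.lstrip("$"))
        records.append({"name": name, "price": price})
    return records


products = parse_products(SAMPLE_HTML)
print(products)  # step 5 would write these records to CSV, a database, or JSON
```

Each later section of this guide fills in one of these steps against a real page.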
Setup
pip install requests beautifulsoup4
Step 1: Making Your First Request
import requests
url = "https://books.toscrape.com/" # A practice website for scraping
response = requests.get(url)
print(response.status_code) # 200 means success
print(response.text[:500]) # First 500 characters of HTML
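In a real script it usually helps to fail loudly on a bad response and to avoid waiting forever on a slow server. A minimal sketch, assuming a hypothetical helper named `fetch_html` and a 10-second timeout chosen only for illustration:

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    # Without a timeout, requests can wait indefinitely for a response.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://books.toscrape.com/")` should return the same HTML as `response.text` in the example above.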
Adding Headers (Important)
Some websites block requests without a proper User-Agent header. Always include one:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
Step 2: Parsing HTML with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# Find elements
title = soup.find("title")
print(title.text)
# Find all elements of a type
all_links = soup.find_all("a")
print(f"Found {len(all_links)} links")
# Find by CSS class
highlighted_items = soup.find_all("div", class_="highlight")
# Find by ID
header = soup.find("div", id="header")
The Two Most Important Methods
# .find() — returns the first match
first_paragraph = soup.find("p")
# .find_all() — returns ALL matches as a list
all_paragraphs = soup.find_all("p")
# CSS selector approach (often more precise)
# select_one() returns first match, select() returns all
book_titles = soup.select("article.product_pod h3 a")
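One detail worth seeing in action: the single-result methods return None when nothing matches, while the multi-result methods return an empty list. A small sketch on hypothetical markup (the HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a product listing.
html = """
<article class="product_pod"><h3><a href="/a" title="Book A">Book A</a></h3></article>
<article class="product_pod"><h3><a href="/b" title="Book B">Book B</a></h3></article>
<p>Not a product</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.select_one("article.product_pod h3 a")  # a single Tag, or None
all_links = soup.select("article.product_pod h3 a")       # a list of Tags
missing = soup.find("table")                              # None: no <table> here

titles = [a["title"] for a in all_links]
print(first_link["title"], titles, missing)
```

Because of this, it is safer to check a `find()` or `select_one()` result for None before indexing into it.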
Step 3: Extracting Data
# Get element text
element = soup.find("h1")
text = element.text # Raw text with whitespace
text = element.text.strip() # Cleaned text
# Get element attributes
link = soup.find("a")
href = link["href"] # Get attribute value
href = link.get("href", "#") # Safe get with default
# Get nested data
product = soup.find("article", class_="product_pod")
if product:
    name = product.find("h3").find("a")["title"]
    price = product.find("p", class_="price_color").text
    rating = product.find("p", class_="star-rating")["class"][1]
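The price and rating extracted this way are still raw strings (a price with a currency symbol, a rating as a word such as "Three"). A minimal cleaning sketch follows; the helper names `parse_price` and `parse_rating` are hypothetical, and the price parser assumes a "." decimal separator.

```python
from typing import Optional

# Map the rating word found in the class attribute to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(raw: str) -> float:
    """Convert a price string such as '£51.77' to a float."""
    # Keep only ASCII digits and the decimal point; drops symbols and spaces.
    cleaned = "".join(ch for ch in raw if ch in "0123456789.")
    return float(cleaned)


def parse_rating(word: str) -> Optional[int]:
    """Convert a rating word such as 'Three' to 3, or None if unrecognised."""
    return RATING_WORDS.get(word)


print(parse_price("£51.77"), parse_rating("Three"))
```

Storing numbers instead of strings makes later sorting and filtering much simpler.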
Full Example: Scraping a Book Catalog
Let's scrape book titles, prices, and ratings from books.toscrape.com — a practice website designed for scraping.
import csv
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def scrape_books(base_url: str = "https://books.toscrape.com/") -> list[dict]:
    books = []
    page_url = base_url
    while page_url:
        print(f"Scraping: {page_url}")
        response = requests.get(
            page_url,
            headers={"User-Agent": "Mozilla/5.0 (educational scraper)"},
            timeout=10,
        )
        response.raise_for_status()  # Stop on 4xx/5xx instead of parsing an error page
        # Pass bytes so BeautifulSoup can detect the page encoding itself
        # (requests may guess wrong when the server sends no charset)
        soup = BeautifulSoup(response.content, "html.parser")
        # Extract books from this page
        for article in soup.select("article.product_pod"):
            books.append({
                "title": article.find("h3").find("a")["title"],
                "price": article.find("p", class_="price_color").text.strip(),
                "rating": article.find("p", class_="star-rating")["class"][1],
                "availability": article.find("p", class_="availability").text.strip(),
            })
        # Get the next page URL
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # The href is relative to the current page, so resolve it against page_url
            page_url = urljoin(page_url, next_btn["href"])
        else:
            page_url = None
        time.sleep(1)  # Be polite: wait between requests
    return books

def save_to_csv(books: list[dict], filename: str):
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "rating", "availability"])
        writer.writeheader()
        writer.writerows(books)
    print(f"Saved {len(books)} books to {filename}")

if __name__ == "__main__":
    books = scrape_books()
    save_to_csv(books, "books.csv")
    print(f"Total books scraped: {len(books)}")
Run this and you'll have a CSV of 1,000 books with prices and ratings.
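A note on pagination links: a relative "next" href is resolved against the page it appears on, not against the site root. The sketch below uses the standard library's `urljoin`; the two URLs are assumed values chosen to illustrate a page that lives inside a `catalogue/` folder.

```python
from urllib.parse import urljoin

# Assumed example values: a page inside a "catalogue/" folder whose
# "next" link is relative to that folder.
current_page = "https://books.toscrape.com/catalogue/page-2.html"
next_href = "page-3.html"

# urljoin resolves the href against the current page, as a browser would.
resolved = urljoin(current_page, next_href)
print(resolved)
# https://books.toscrape.com/catalogue/page-3.html

# Plain concatenation with the site root drops the folder.
concatenated = "https://books.toscrape.com/" + next_href
print(concatenated)
# https://books.toscrape.com/page-3.html
```

If a scraper stops after a page or two, a wrongly built next-page URL is one of the first things to check.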
Step 4: Handling Common Challenges
Finding the Right CSS Selectors
Right-click on the data you want in Chrome/Firefox → "Inspect" → find the HTML element. Look for:
- The element type (div, span, p, a, li)
- The class attribute
- The id attribute
To test a CSS selector before writing code, open the browser DevTools Console and run $$("your.selector"); it returns the elements that match.
Handling Pages That Require Scrolling (JavaScript)
If the data isn't in the page source (right-click → View Page Source and search for your data), it's loaded by JavaScript. Use Playwright instead:
from playwright.sync_api import sync_playwright

def scrape_js_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_load_state("networkidle")  # Wait for JS to finish
        content = page.content()
        browser.close()
    return content
Install: pip install playwright && playwright install chromium
Handling Authentication (Login Required)
session = requests.Session()
# Login
session.post("https://example.com/login", data={
    "username": "your_username",
    "password": "your_password"
})
# Now make authenticated requests
response = session.get("https://example.com/protected-page")
Rate Limiting and Error Handling
import time

import requests

def get_with_retry(url: str, max_retries: int = 3) -> requests.Response:
    for attempt in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        if response.status_code == 200:
            return response
        elif response.status_code == 429:  # Too Many Requests
            wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        else:
            print(f"Error {response.status_code}. Attempt {attempt + 1}/{max_retries}")
    raise Exception(f"Failed to fetch {url} after {max_retries} attempts")
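Servers sometimes send a Retry-After header with a 429 response, stating how long to wait. The sketch below shows one way to prefer that value and fall back to exponential backoff. The helper name `backoff_seconds` and the 60-second cap are assumptions for illustration, and the date form of Retry-After is deliberately ignored here.

```python
from typing import Optional


def backoff_seconds(attempt: int, retry_after: Optional[str] = None,
                    base: float = 1.0, cap: float = 60.0) -> float:
    """Return how long to wait before retry number `attempt` (0-based)."""
    # Prefer the server's Retry-After value when it is a plain number of seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    # Otherwise use exponential backoff: base, 2*base, 4*base, ... up to cap.
    return min(base * (2 ** attempt), cap)


print([backoff_seconds(n) for n in range(4)])  # [1.0, 2.0, 4.0, 8.0]
print(backoff_seconds(0, retry_after="30"))    # 30.0
```

Inside a retry loop this could be called as `backoff_seconds(attempt, response.headers.get("Retry-After"))`.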
Storing Scraped Data
CSV (Simple, Universal)
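import csv

# data is the list of dicts produced by your scraper,
# each with "name", "price", and "url" keys
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(data)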
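JSON, the third storage option mentioned in the pipeline, is a reasonable alternative when records are nested or will be read by another program. A minimal sketch with the standard library; the records and the `data.json` filename are made up for illustration.

```python
import json

# Hypothetical records in the same shape as the CSV example's data.
data = [
    {"name": "Widget", "price": 9.99, "url": "https://example.com/widget"},
    {"name": "Gadget", "price": 19.5, "url": "https://example.com/gadget"},
]

with open("data.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps characters such as "£" readable in the file.
    json.dump(data, f, ensure_ascii=False, indent=2)

# Read the file back to confirm it round-trips.
with open("data.json", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == data)
```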
SQLite (For Larger Datasets)
import sqlite3

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY,
        name TEXT,
        price REAL,
        url TEXT UNIQUE
    )
""")
for item in data:
    cursor.execute(
        "INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)",
        (item["name"], item["price"], item["url"])
    )
conn.commit()
conn.close()
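The combination of a UNIQUE column and INSERT OR IGNORE is what lets you re-run a scraper without piling up duplicates. A small sketch against an in-memory database shows the effect; the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, name TEXT, price REAL, url TEXT UNIQUE)"
)

# Hypothetical rows; the second and third share a URL.
rows = [
    ("Widget", 9.99, "https://example.com/widget"),
    ("Gadget", 19.5, "https://example.com/gadget"),
    ("Gadget (again)", 18.0, "https://example.com/gadget"),
]
conn.executemany(
    "INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)", rows
)
conn.commit()

# The third row is skipped because its url already exists.
count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
names = [r[0] for r in conn.execute("SELECT name FROM products ORDER BY id")]
print(count, names)
conn.close()
```

Note that the first row seen for a URL wins; later rows with the same URL are dropped rather than used to update the price.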
Scraping Ethics and Best Practices
- Check robots.txt first: https://example.com/robots.txt. If it says don't scrape, don't.
- Don't overload servers: Add 1–3 second delays between requests
- Identify yourself: A descriptive User-Agent is courteous
- Use the API if it exists: Many sites have official APIs that are better than scraping
- Scrape public data: Don't scrape data behind login unless you're scraping your own data
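The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical robots.txt from a string so it runs offline; the rules, the URLs, and the user-agent name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-educational-scraper"  # assumed name for this sketch
allowed = parser.can_fetch(USER_AGENT, "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")
print(allowed, blocked)
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads and parses the file instead.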
For a portfolio project that applies these skills, see our guide on Python projects that get developer jobs.
Frequently Asked Questions
Is web scraping legal?
Public data with proper rate limiting is generally fine. Check robots.txt, respect Terms of Service, don't scrape personal data. Use official APIs when available.
requests vs. Playwright?
requests + BeautifulSoup for static HTML (faster, simpler). Playwright for JavaScript-rendered content.
How do I avoid getting blocked?
Add a User-Agent header, wait 1–3 seconds between requests, and handle 429 errors with exponential backoff.
What's the best scraping library?
Beginners: requests + BeautifulSoup. JS sites: Playwright. Large-scale: Scrapy.
Final Thoughts
Web scraping is a practical superpower for data collection. Once you understand the request → parse → extract pipeline, you can pull data from almost any public website.
The most important habits: always add a User-Agent, always add delays, and always check whether an API exists before scraping. These three practices keep you ethical and keep your scraper running.
For applying your scraped data with pandas analysis, see our Python data science roadmap. And for building automation pipelines that use these scraping scripts on a schedule, our Python automation scripts guide covers scheduling and pipeline construction.
AiTechWorlds Team