Python Web Scraping Guide 2026 — BeautifulSoup, Requests & Playwright
Learn web scraping with Python from scratch. Master BeautifulSoup, Requests, and Playwright to extract data from any website. Complete 2026 guide with real projects.
Python Web Scraping Guide 2026 — Extract Data from Any Website
Here is a skill that changes everything. Once you know how to scrape the web with Python, you can pull prices from e-commerce sites, monitor job boards, collect research data, track sports scores, or build your own news aggregator — all automatically, while you sleep.
Web scraping is one of those Python superpowers that opens doors everywhere, from data science to automation to freelance projects. And in 2026, with tools like Playwright making it easier than ever to handle even complex JavaScript-heavy sites, there has never been a better time to learn it.
This guide walks you from absolute beginner to building real scraping projects.
What Is Web Scraping?
Web scraping is the process of automatically extracting data from websites. Instead of manually copying information, you write a Python script that:
- Sends an HTTP request to a webpage
- Downloads the HTML content
- Parses the HTML to find specific data
- Saves the data in a usable format (CSV, JSON, database)
Think of it as teaching Python to read a website the way you do — but a thousand times faster.
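The four steps above can be sketched end to end. To keep this runnable offline, the sketch below replaces the network request with a hard-coded HTML string and parses it with the standard library's html.parser rather than BeautifulSoup. The sample markup and the TitleCollector class are made up for illustration.

```python
import json
from html.parser import HTMLParser

# Steps 1-2 (request + download) are replaced by a hard-coded page
# so the example runs without network access.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First post</h2>
  <h2 class="title">Second post</h2>
  <h2 class="other">Sidebar</h2>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Hypothetical parser: collects the text of <h2 class="title"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: list[str] = []
        self._inside_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "h2" and ("class", "title") in attrs:
            self._inside_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside_title = False

    def handle_data(self, data):
        if self._inside_title and data.strip():
            self.titles.append(data.strip())


# Step 3: parse the HTML to find specific data
collector = TitleCollector()
collector.feed(SAMPLE_HTML)

# Step 4: save the data in a usable format (a JSON string here)
output = json.dumps({"titles": collector.titles}, indent=2)
print(output)
```

The rest of this guide does the same thing with requests and BeautifulSoup, which need far less code for the parsing step.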
The Python Scraping Toolkit
Before diving in, understand which tool to reach for:
| Tool | Best For | Handles JavaScript? |
|---|---|---|
| requests | Fetching HTML pages | No |
| BeautifulSoup | Parsing HTML structure | No |
| lxml | Fast HTML/XML parsing | No |
| Playwright | Modern JavaScript SPAs | Yes |
| Selenium | Browser automation, legacy JS | Yes |
| Scrapy | Large-scale crawling pipelines | No |
For most projects: requests + BeautifulSoup for static sites, Playwright for dynamic sites.
Setup
pip install requests beautifulsoup4 lxml playwright
playwright install chromium
Part 1: Scraping Static Websites
Static websites serve the full HTML content in the initial response. Wikipedia, most blogs, and many news sites and e-commerce product pages work this way.
Your First Scraper
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> BeautifulSoup:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raises exception for 4xx/5xx errors
    return BeautifulSoup(response.content, "lxml")

# Example: scrape a Wikipedia page
soup = scrape_page("https://en.wikipedia.org/wiki/Python_(programming_language)")
title = soup.find("h1", id="firstHeading").text
print(f"Title: {title}")
Always set a User-Agent header. Many sites block requests that don't look like real browsers.
Navigating HTML with BeautifulSoup
soup = scrape_page("https://books.toscrape.com")
# Find by tag
all_h1 = soup.find_all("h1")
# Find by CSS class
books = soup.find_all("article", class_="product_pod")
# Find by attribute
link = soup.find("a", href=True)
# CSS selector (most flexible)
prices = soup.select("p.price_color")
titles = soup.select("h3 > a")
for i, (title, price) in enumerate(zip(titles[:5], prices[:5])):
    print(f"{i+1}. {title['title']} — {price.text.strip()}")
Real Project: Scrape Book Prices
import requests
from bs4 import BeautifulSoup
import csv
import time

# Reuses scrape_page() from "Your First Scraper" above
BASE_URL = "https://books.toscrape.com/catalogue/"

def scrape_books(max_pages: int = 5) -> list[dict]:
    books = []
    for page in range(1, max_pages + 1):
        url = f"{BASE_URL}page-{page}.html"
        soup = scrape_page(url)
        for article in soup.select("article.product_pod"):
            title = article.select_one("h3 > a")["title"]
            price = article.select_one("p.price_color").text.strip()
            # The class list looks like ["star-rating", "Three"]
            rating_word = article.select_one("p.star-rating")["class"][1]
            books.append({
                "title": title,
                "price": price,
                "rating": rating_word,
            })
        print(f"Scraped page {page} — {len(books)} books so far")
        time.sleep(1)  # Be polite, don't hammer the server
    return books

def save_to_csv(books: list[dict], filename: str) -> None:
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
        writer.writeheader()
        writer.writerows(books)
    print(f"Saved {len(books)} books to {filename}")

books = scrape_books(max_pages=3)
save_to_csv(books, "books.csv")
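The scraper above stores price and rating as raw text, for example "£51.77" and "Three". If you plan to sort or average them, a small cleanup step helps. The sketch below is one way to do it: the names parse_price, clean_book, and RATING_WORDS are illustrative, and it assumes prices are a currency symbol followed by a plain decimal number.

```python
import re
from decimal import Decimal

# books.toscrape.com encodes ratings as a CSS class word ("One" to "Five")
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text: str) -> Decimal:
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No number found in price: {text!r}")
    return Decimal(match.group())


def clean_book(book: dict) -> dict:
    """Return a copy of a scraped book with numeric price and rating."""
    return {
        "title": book["title"],
        "price": parse_price(book["price"]),
        "rating": RATING_WORDS.get(book["rating"], 0),  # 0 means unknown rating
    }


# Sample record in the shape produced by scrape_books()
raw = {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three"}
print(clean_book(raw))
```

Decimal is used instead of float so that prices keep their exact two-decimal value when summed or compared.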
Part 2: Scraping Dynamic JavaScript Websites
Modern web apps use React, Vue, or Angular. The HTML served initially is mostly empty — data loads via JavaScript after the page loads. requests only sees that empty shell.
Playwright solves this by controlling a real browser (Chromium/Firefox/WebKit).
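A quick way to tell whether a page needs a browser is to check whether text you can see in the browser is present in the raw HTML that requests receives. The sketch below is a rough heuristic, not a reliable detector; the helper name needs_browser and both sample pages are made up.

```python
def needs_browser(raw_html: str, expected_text: str) -> bool:
    """Heuristic: True if text visible in the browser is absent from the raw HTML."""
    return expected_text.lower() not in raw_html.lower()


# Hypothetical responses: a server-rendered page and an empty SPA shell
static_html = "<html><body><h1>Laptop Pro 15</h1><p>$999</p></body></html>"
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(needs_browser(static_html, "Laptop Pro 15"))  # False
print(needs_browser(spa_shell, "Laptop Pro 15"))    # True
```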
Playwright Setup
from playwright.sync_api import sync_playwright

def scrape_dynamic(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Set a realistic desktop viewport
        page.set_viewport_size({"width": 1280, "height": 720})
        # "networkidle" waits until there have been no network connections for 500 ms
        page.goto(url, wait_until="networkidle")
        content = page.content()  # Get full rendered HTML
        browser.close()
    return content
Waiting for Dynamic Content
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_spa(url: str, wait_selector: str) -> BeautifulSoup:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Wait until the specific element appears (timeout in milliseconds)
        page.wait_for_selector(wait_selector, timeout=10000)
        html = page.content()
        browser.close()
    return BeautifulSoup(html, "lxml")

# Example: wait for a product grid to load
soup = scrape_spa("https://example-shop.com/products", ".product-grid")
products = soup.select(".product-card")
Interacting with Pages
from playwright.sync_api import sync_playwright

def search_and_scrape(query: str) -> list[dict]:
    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Placeholder URL and selectors: books.toscrape.com has no search box,
        # so point this at a site that has one and adjust the selectors
        page.goto("https://example-shop.com")
        # Type in a search box
        page.fill("input[name='q']", query)
        page.press("input[name='q']", "Enter")
        page.wait_for_load_state("networkidle")
        # Extract results
        for item in page.query_selector_all(".product_pod"):
            title = item.query_selector("h3 > a").get_attribute("title")
            price = item.query_selector(".price_color").inner_text()
            results.append({"title": title, "price": price})
        browser.close()
    return results
Part 3: Handling Pagination
Most real scrapers need to follow pagination — going through page 1, 2, 3... until all data is collected.
import requests
from bs4 import BeautifulSoup
import time

def scrape_all_pages(base_url: str) -> list[dict]:
    items = []
    page = 1
    while True:
        url = f"{base_url}?page={page}"
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        if response.status_code == 404:
            print(f"Reached end at page {page}")
            break
        response.raise_for_status()  # Fail loudly on other HTTP errors
        soup = BeautifulSoup(response.content, "lxml")
        products = soup.select(".product-item")
        if not products:
            break  # No more items
        for product in products:
            items.append({
                "name": product.select_one(".name").text.strip(),
                "price": product.select_one(".price").text.strip(),
            })
        print(f"Page {page}: {len(products)} items")
        page += 1
        time.sleep(1)  # Rate limiting: at least 1 second between requests
    return items
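The loop above builds each URL with f"{base_url}?page={page}", which assumes base_url has no query string of its own. A sketch of a more defensive URL builder using only urllib.parse follows; the helper name with_page and the example URLs are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_page(base_url: str, page: int, param: str = "page") -> str:
    """Return base_url with its page query parameter set to `page`."""
    parts = urlparse(base_url)
    # Keep existing query parameters, dropping any old page value
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page)))
    return urlunparse(parts._replace(query=urlencode(query)))


print(with_page("https://example-shop.com/products", 2))
# https://example-shop.com/products?page=2
print(with_page("https://example-shop.com/products?sort=price&page=1", 3))
# https://example-shop.com/products?sort=price&page=3
```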
Part 4: Storing Scraped Data
Save to CSV
import csv

def save_csv(data: list[dict], filename: str) -> None:
    if not data:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
Save to JSON
import json

def save_json(data: list[dict], filename: str) -> None:
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
Save to SQLite
import sqlite3

def save_to_db(data: list[dict], db_path: str, table: str) -> None:
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    if data:
        # Table and column names are inserted into the SQL text directly,
        # so only pass names you control (never scraped strings)
        columns = ", ".join(data[0].keys())
        placeholders = ", ".join(["?" for _ in data[0]])
        cursor.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
        rows = [tuple(row.values()) for row in data]
        cursor.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        conn.commit()
    conn.close()
    print(f"Saved {len(data)} rows to {db_path}")
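save_to_db places the table and column names directly into the SQL text, which is fine for names you control but risky if they ever come from scraped data. A hedged variant that validates identifiers first and uses the connection as a context manager might look like this. The function name save_rows and the identifier rule are assumptions of this sketch, not part of sqlite3.

```python
import re
import sqlite3

# Assumed rule: letters, digits, and underscores only, not starting with a digit
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def save_rows(conn: sqlite3.Connection, data: list[dict], table: str) -> int:
    """Insert rows into `table`, creating it if needed. Returns rows written."""
    if not data:
        return 0
    columns = list(data[0].keys())
    for name in [table, *columns]:
        if not IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unsafe SQL identifier: {name!r}")
    col_sql = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # Read values by column name so every row matches the column order
    rows = [tuple(row[col] for col in columns) for row in data]
    with conn:  # commits on success, rolls back on error
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_sql})")
        conn.executemany(
            f"INSERT INTO {table} ({col_sql}) VALUES ({placeholders})", rows
        )
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
saved = save_rows(conn, [{"title": "Book A", "price": "£10.00"}], "books")
print(saved, conn.execute("SELECT title, price FROM books").fetchall())
conn.close()
```

Values still go through "?" placeholders, as in the original function; only the identifiers need the extra check because SQLite cannot bind them as parameters.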
Part 5: Being a Responsible Scraper
Bad scraping gets your IP banned and can harm small websites. Follow these rules:
Always check robots.txt:
import urllib.robotparser
from urllib.parse import urljoin, urlparse

def is_allowed(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(base, "/robots.txt"))
    rp.read()  # Downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)
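To see how can_fetch behaves without hitting a real site, you can feed RobotFileParser a robots.txt body directly through parse(). The rules and URLs below are made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines

print(rp.can_fetch("*", "https://example.com/products"))      # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
print(rp.crawl_delay("*"))                                     # 2
```

When a site declares a Crawl-delay, it is a reasonable lower bound for the pause between your requests.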
Rate limiting with randomized delays:
import random
import time

import requests

def polite_get(url: str, min_delay: float = 1.0, max_delay: float = 3.0) -> requests.Response:
    # Sleep a random interval so requests are not sent at a fixed rhythm
    time.sleep(random.uniform(min_delay, max_delay))
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
    return response
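polite_get waits a random interval but never retries a failed request. A minimal sketch of exponential backoff, where the wait doubles after each failed attempt, could look like the following. It accepts any zero-argument callable so it stays independent of the HTTP library; the name retry_with_backoff and its parameters are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)      # 1s, 2s, 4s, ...
            delay += random.uniform(0, delay * 0.1)  # small jitter
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


# Demo with a fake fetch that fails twice, then succeeds
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits: list[float] = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, len(waits))  # ok 2
```

With requests, the callable would typically call response.raise_for_status() so that 429 or 5xx responses raise and trigger a retry.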
Rules to follow:
- Add time.sleep() between requests (minimum 1 second)
- Identify yourself with a meaningful User-Agent
- Respect robots.txt
- Never scrape login-protected content without authorization
- Don't scrape personal data (names, emails, phone numbers) without clear legal basis
Common Scraping Problems and Solutions
| Problem | Cause | Solution |
|---|---|---|
| 403 Forbidden | No User-Agent / bot detection | Add realistic headers |
| Empty results | JavaScript rendering | Switch to Playwright |
| IP banned | Too many requests | Add delays, use proxies |
| Data missing | Page not fully loaded | Use wait_for_selector |
| Encoding errors | Non-UTF-8 content | Use response.content not .text |
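The last row is easier to see with a concrete case. When a server sends a text response without a charset, requests falls back to ISO-8859-1 for response.text, which garbles UTF-8 bytes. Passing response.content (the raw bytes) to BeautifulSoup lets the parser detect the encoding itself. The sketch below reproduces the garbling with plain bytes; the sample string is made up.

```python
# Bytes as a UTF-8 server would send them
raw_bytes = "Café £5".encode("utf-8")

# Decoding with the wrong charset produces mojibake
as_latin1 = raw_bytes.decode("iso-8859-1")

# Decoding with the right charset recovers the text
as_utf8 = raw_bytes.decode("utf-8")

print(as_latin1)  # CafÃ© Â£5
print(as_utf8)    # Café £5
```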
What to Build Next
Web scraping is most powerful when combined with data analysis. Once you can collect data, learn how to analyze it — our Python Pandas tutorial shows you how to process CSV files and find insights.
For automating scraping jobs to run on a schedule, check out our Python automation scripts guide — it covers scheduling tasks with schedule and deploying scripts to run 24/7.
If you are still new to Python, start with the Python beginners roadmap first to build a solid foundation before tackling scraping projects.
Your Scraping Project Roadmap
| Level | Project | Skills Learned |
|---|---|---|
| Beginner | Scrape book titles + prices | requests, BeautifulSoup, CSV |
| Intermediate | Multi-page news aggregator | Pagination, error handling, JSON |
| Advanced | E-commerce price tracker | Playwright, SQLite, scheduling |
| Pro | Social media monitor | Authentication, rate limiting, async |
Start with books.toscrape.com — it is specifically designed for scraping practice. Build your first working scraper today, and you will be amazed what you can build from there.
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.