
Python Web Scraping Guide 2026 — BeautifulSoup, Requests & Playwright

Learn web scraping with Python from scratch. Master BeautifulSoup, Requests, and Playwright to extract data from any website. Complete 2026 guide with real projects.

AiTechWorlds Team
May 1, 2026 · 8 min read · Updated May 15, 2026

Python Web Scraping Guide 2026 — Extract Data from Any Website

Here is a skill that changes everything. Once you know how to scrape the web with Python, you can pull prices from e-commerce sites, monitor job boards, collect research data, track sports scores, or build your own news aggregator — all automatically, while you sleep.

Web scraping is one of those Python superpowers that opens doors everywhere, from data science to automation to freelance projects. And in 2026, with tools like Playwright making it easier than ever to handle even complex JavaScript-heavy sites, there has never been a better time to learn it.

This guide walks you from absolute beginner to building real scraping projects.


What Is Web Scraping?

Web scraping is the process of automatically extracting data from websites. Instead of manually copying information, you write a Python script that:

  1. Sends an HTTP request to a webpage
  2. Downloads the HTML content
  3. Parses the HTML to find specific data
  4. Saves the data in a usable format (CSV, JSON, database)

Think of it as teaching Python to read a website the way you do — but a thousand times faster.
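
As a rough illustration of steps 3 and 4, the sketch below parses a small inline HTML string instead of a live page, so it runs offline. The SAMPLE_HTML snippet, the parse_items helper, and the CSS selectors are all made up for this example; steps 1 and 2 (the HTTP request) are covered in Part 1.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that steps 1 and 2 would download.
SAMPLE_HTML = """
<ul id="items">
  <li><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">19.50</span></li>
</ul>
"""

def parse_items(html: str) -> list[dict]:
    """Step 3: locate each item and pull out the fields we care about."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    items = []
    for li in soup.select("#items > li"):
        items.append({
            "name": li.select_one(".name").get_text(strip=True),
            "price": float(li.select_one(".price").get_text(strip=True)),
        })
    return items

# Step 4: serialize the result to a usable format (JSON here).
items = parse_items(SAMPLE_HTML)
print(json.dumps(items, indent=2))
```

Keeping the parsing in its own function means you can test it against saved HTML without touching the network.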


The Python Scraping Toolkit

Before diving in, understand which tool to reach for:

| Tool | Best For | Handles JavaScript? |
|------|----------|---------------------|
| requests | Fetching HTML pages | No |
| BeautifulSoup | Parsing HTML structure | No |
| lxml | Fast HTML/XML parsing | No |
| Playwright | Modern JavaScript SPAs | Yes |
| Selenium | Browser automation, legacy JS | Yes |
| Scrapy | Large-scale crawling pipelines | No |

For most projects: requests + BeautifulSoup for static sites, Playwright for dynamic sites.


Setup

pip install requests beautifulsoup4 lxml playwright
playwright install chromium

Part 1: Scraping Static Websites

Static websites serve the full HTML content in the initial response. Many news sites, blogs, e-commerce product pages, and Wikipedia work this way.

Your First Scraper

import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> BeautifulSoup:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raises exception for 4xx/5xx errors
    return BeautifulSoup(response.content, "lxml")

# Example: scrape a Wikipedia page
soup = scrape_page("https://en.wikipedia.org/wiki/Python_(programming_language)")
title = soup.find("h1", id="firstHeading").text
print(f"Title: {title}")

Always set a User-Agent header. Many sites block requests that don't look like real browsers.

soup = scrape_page("https://books.toscrape.com")

# Find by tag
all_h1 = soup.find_all("h1")

# Find by CSS class
books = soup.find_all("article", class_="product_pod")

# Find by attribute
link = soup.find("a", href=True)

# CSS selector (most flexible)
prices = soup.select("p.price_color")
titles = soup.select("h3 > a")

for i, (title, price) in enumerate(zip(titles[:5], prices[:5])):
    print(f"{i+1}. {title['title']} — {price.text.strip()}")
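
Links pulled out of a page are often relative, so they usually need to be resolved against the page URL before you can fetch them. Here is a minimal sketch using the standard library's urljoin; the absolute_links helper and the sample hrefs are illustrative, not taken from a real page.

```python
from urllib.parse import urljoin

def absolute_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]

links = absolute_links(
    "https://books.toscrape.com/catalogue/page-2.html",
    ["page-3.html", "../index.html", "/static/logo.png"],  # sample hrefs
)
for link in links:
    print(link)
```

With BeautifulSoup you would typically collect the hrefs with something like [a["href"] for a in soup.select("a[href]")] and pass them in.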

Real Project: Scrape Book Prices

import requests
from bs4 import BeautifulSoup
import csv
import time

# Reuses scrape_page() from "Your First Scraper" above
BASE_URL = "https://books.toscrape.com/catalogue/"

def scrape_books(max_pages: int = 5) -> list[dict]:
    books = []
    
    for page in range(1, max_pages + 1):
        url = f"{BASE_URL}page-{page}.html"
        soup = scrape_page(url)
        
        for article in soup.select("article.product_pod"):
            title = article.select_one("h3 > a")["title"]
            price = article.select_one("p.price_color").text.strip()
            rating_word = article.select_one("p.star-rating")["class"][1]
            
            books.append({
                "title": title,
                "price": price,
                "rating": rating_word,
            })
        
        print(f"Scraped page {page} — {len(books)} books so far")
        time.sleep(1)  # Be polite — don't hammer the server
    
    return books

def save_to_csv(books: list[dict], filename: str) -> None:
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
        writer.writeheader()
        writer.writerows(books)
    print(f"Saved {len(books)} books to {filename}")

books = scrape_books(max_pages=3)
save_to_csv(books, "books.csv")
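
The scraper above stores the price as text (for example "£51.77") and the rating as a word. If you plan to sort or average those values, converting them first helps. This is one possible sketch: parse_price and rating_to_int are hypothetical helpers, the rating words One through Five are assumed from the star-rating class names the site appears to use, and commas are assumed to be thousands separators.

```python
import re
from decimal import Decimal

# Assumed mapping from the star-rating class name to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(text: str) -> Decimal:
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    return Decimal(match.group())

def rating_to_int(word: str) -> int:
    """Map a rating word ('Three') to its numeric value (3)."""
    return RATING_WORDS[word]

print(parse_price("£51.77"), rating_to_int("Three"))
```

Decimal avoids the rounding surprises that float can introduce when you add up prices.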

Part 2: Scraping Dynamic JavaScript Websites

Many modern web apps are built with frameworks such as React, Vue, or Angular. The HTML served initially is often a nearly empty shell, and the data loads via JavaScript after the page loads. requests only sees that empty shell.

Playwright solves this by controlling a real browser (Chromium/Firefox/WebKit).
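
Before reaching for a browser, it can be worth checking whether the data is already in the raw HTML. The sketch below is a simple heuristic, not a definitive test: appears_in_html is a hypothetical helper, and the two HTML strings are invented stand-ins for what response.text might contain.

```python
def appears_in_html(html: str, expected_text: str) -> bool:
    """Return True if text you can see in the browser is present in the raw HTML."""
    return expected_text.casefold() in html.casefold()

# Invented examples: a server-rendered page and an empty SPA shell.
static_html = "<html><body><h1>Dune</h1><p class='price'>$9.99</p></body></html>"
spa_shell = "<html><body><div id='root'></div><script src='/app.js'></script></body></html>"

print(appears_in_html(static_html, "Dune"))  # True: requests + BeautifulSoup is enough
print(appears_in_html(spa_shell, "Dune"))    # False: content is rendered by JavaScript
```

If the text is missing from the raw HTML but visible in your browser, the page most likely needs Playwright.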

Playwright Setup

from playwright.sync_api import sync_playwright

def scrape_dynamic(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        # Set a realistic desktop viewport
        page.set_viewport_size({"width": 1280, "height": 720})
        
        # "networkidle" waits until the network has been quiet for about 500 ms
        page.goto(url, wait_until="networkidle")
        content = page.content()  # Get full rendered HTML
        
        browser.close()
        return content

Waiting for Dynamic Content

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_spa(url: str, wait_selector: str) -> BeautifulSoup:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        
        # Wait until the specific element appears
        page.wait_for_selector(wait_selector, timeout=10000)
        
        html = page.content()
        browser.close()
    
    return BeautifulSoup(html, "lxml")

# Example: wait for a product grid to load
soup = scrape_spa("https://example-shop.com/products", ".product-grid")
products = soup.select(".product-card")

Interacting with Pages

from playwright.sync_api import sync_playwright

def search_and_scrape(query: str) -> list[dict]:
    results = []
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
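        # books.toscrape.com has no search box, so the URL and the selectors
        # below are placeholders: swap in the ones for your target site.
        page.goto("https://example-shop.com")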
        
        # Type in a search box
        page.fill("input[name='q']", query)
        page.press("input[name='q']", "Enter")
        page.wait_for_load_state("networkidle")
        
        # Extract results
        for item in page.query_selector_all(".product_pod"):
            title = item.query_selector("h3 > a").get_attribute("title")
            price = item.query_selector(".price_color").inner_text()
            results.append({"title": title, "price": price})
        
        browser.close()
    
    return results

Part 3: Handling Pagination

Most real scrapers need to follow pagination — going through page 1, 2, 3... until all data is collected.

import requests
from bs4 import BeautifulSoup
import time

def scrape_all_pages(base_url: str) -> list[dict]:
    items = []
    page = 1
    
    while True:
        url = f"{base_url}?page={page}"
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        
        if response.status_code == 404:
            print(f"Reached end at page {page}")
            break
        response.raise_for_status()  # Stop on any other 4xx/5xx error
        
        soup = BeautifulSoup(response.content, "lxml")
        products = soup.select(".product-item")
        
        if not products:
            break  # No more items
        
        for product in products:
            items.append({
                "name": product.select_one(".name").text.strip(),
                "price": product.select_one(".price").text.strip(),
            })
        
        print(f"Page {page}: {len(products)} items")
        page += 1
        time.sleep(1)  # Rate limiting: at least 1 second between requests
    
    return items
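
Not every site exposes a page number in the URL. Another common pattern is to follow the "next" link until it disappears. The sketch below assumes the link lives in an li element with class "next" (adjust the selector for your site), and it takes a fetch function as a parameter so the demo can run offline against a fake site. crawl_next_links and FAKE_SITE are illustrative names.

```python
from collections.abc import Callable
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def crawl_next_links(start_url: str, fetch: Callable[[str], str],
                     max_pages: int = 50) -> list[str]:
    """Follow 'next' links from start_url and return the visited URLs in order."""
    visited: list[str] = []
    url = start_url
    # Stop when there is no next link, a link loops back, or the cap is hit.
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        next_link = soup.select_one("li.next > a[href]")  # assumed markup
        url = urljoin(url, next_link["href"]) if next_link else ""
    return visited

# Offline demo: a dict stands in for the network.
FAKE_SITE = {
    "https://example.com/list/page-1.html":
        '<ul><li class="next"><a href="page-2.html">next</a></li></ul>',
    "https://example.com/list/page-2.html": "<p>last page, no next link</p>",
}
print(crawl_next_links("https://example.com/list/page-1.html", FAKE_SITE.__getitem__))
```

In a real scraper, fetch could be something like lambda u: requests.get(u, timeout=10).text, with a delay between calls.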

Part 4: Storing Scraped Data

Save to CSV

import csv

def save_csv(data: list[dict], filename: str) -> None:
    if not data:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
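
One caveat with taking the header from data[0]: csv.DictWriter raises a ValueError if a later row has a key the first row lacks. A possible workaround, sketched here with a hypothetical write_csv helper, is to collect the column names across all rows and leave missing values blank. It writes to any open text stream, so the demo uses an in-memory buffer.

```python
import csv
import io

def write_csv(data: list[dict], stream) -> None:
    """Write rows to an open text stream, tolerating rows with missing keys."""
    fieldnames: list[str] = []
    for row in data:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    writer = csv.DictWriter(stream, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(data)

rows = [
    {"title": "Dune", "price": "9.99"},
    {"title": "Emma", "price": "4.50", "rating": "Four"},
]
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with newline="" and encoding="utf-8", as in save_csv above.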

Save to JSON

import json

def save_json(data: list[dict], filename: str) -> None:
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
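
For long-running scrapes, writing one JSON object per line (often called JSON Lines) lets you append each record as it arrives, so a crash does not lose everything collected so far. This is a minimal sketch; append_jsonl, load_jsonl, and the sample records are illustrative.

```python
import json
import os
import tempfile

def append_jsonl(record: dict, filename: str) -> None:
    """Append one record as a single line of JSON."""
    with open(filename, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_jsonl(filename: str) -> list[dict]:
    """Read every non-empty line back into a list of dicts."""
    with open(filename, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo in a temporary directory so nothing is left in the working folder.
path = os.path.join(tempfile.mkdtemp(), "books.jsonl")
append_jsonl({"title": "Dune", "price": "£9.99"}, path)
append_jsonl({"title": "Emma", "price": "£4.50"}, path)
print(load_jsonl(path))
```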

Save to SQLite

import sqlite3

def save_to_db(data: list[dict], db_path: str, table: str) -> None:
    # Table and column names are interpolated into the SQL, so pass only
    # trusted identifiers. Row values go through "?" placeholders.
    conn = sqlite3.connect(db_path)
    try:
        if data:
            columns = ", ".join(data[0].keys())
            placeholders = ", ".join("?" for _ in data[0])
            conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
            
            # Assumes every row has the same keys in the same order
            rows = [tuple(row.values()) for row in data]
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        
        conn.commit()
    finally:
        conn.close()  # Release the file even if an insert fails
    print(f"Saved {len(data)} rows to {db_path}")
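
If you run the same scraper repeatedly, plain INSERT statements will pile up duplicate rows. One way to avoid that, sketched below, is a primary key plus INSERT OR IGNORE. The books table layout and the choice of title as the key are assumptions for this example; a product URL or ID is usually a safer key on real sites.

```python
import sqlite3

def save_books(conn: sqlite3.Connection, books: list[dict]) -> int:
    """Insert books, skipping titles already stored. Returns the rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "title TEXT PRIMARY KEY, price TEXT, rating TEXT)"
    )
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO books (title, price, rating) "
            "VALUES (:title, :price, :rating)",
            books,
        )
    return conn.total_changes - before

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
batch = [{"title": "Dune", "price": "£9.99", "rating": "Four"}]
print(save_books(conn, batch))  # first run inserts the row
print(save_books(conn, batch))  # second run finds a duplicate and skips it
conn.close()
```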

Part 5: Being a Responsible Scraper

Bad scraping gets your IP banned and can harm small websites. Follow these rules:

Always check robots.txt:

import urllib.robotparser
from urllib.parse import urljoin, urlparse

def is_allowed(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(base, "/robots.txt"))
    rp.read()  # Downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)
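
To see how the parser interprets rules without hitting a real server, you can feed it robots.txt text directly with parse(). The rules, the "my-scraper" user agent, and the example.com URLs below are all hypothetical. The sketch also reads the Crawl-delay value, which some sites set to tell crawlers how long to wait between requests.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("my-scraper", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("my-scraper", "https://example.com/private/report.html"))    # False
print(rp.crawl_delay("my-scraper"))                                             # 2
```

crawl_delay() returns None when the file sets no delay, so fall back to your own default in that case.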

Rate limiting with randomized delays:

import random
import time

import requests

def polite_get(url: str, min_delay: float = 1.0, max_delay: float = 3.0) -> requests.Response:
    # Sleep for a random interval so requests don't arrive in a fixed rhythm
    time.sleep(random.uniform(min_delay, max_delay))
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
    return response
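
polite_get spaces requests out, but it does not retry when one fails. For retries, a common approach is exponential backoff: double the wait after each failure, up to a cap, with some randomness so many clients do not retry in lockstep. The sketch below is one way to do it. backoff_delay and retry_with_backoff are hypothetical helpers, and catching every Exception is a simplification; in a real scraper you would likely catch requests.RequestException and retry only on statuses such as 429 or 503.

```python
import random
import time
from collections.abc import Callable

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Ceiling for the wait before retry `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))

def retry_with_backoff(action: Callable[[], object], max_attempts: int = 4,
                       base: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep):
    """Call action() until it succeeds, raising the wait ceiling after each failure."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Jitter: pick a random wait between 0 and the exponential ceiling.
            sleep(random.uniform(0, backoff_delay(attempt, base)))

print([backoff_delay(n) for n in range(5)])  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

A call might look like retry_with_backoff(lambda: polite_get(url)), with polite_get adjusted to call raise_for_status() so that HTTP errors trigger a retry.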

Rules to follow:

  • Add time.sleep() between requests (minimum 1 second)
  • Identify yourself with a meaningful User-Agent
  • Respect robots.txt
  • Never scrape login-protected content without authorization
  • Don't scrape personal data (names, emails, phone numbers) without clear legal basis

Common Scraping Problems and Solutions

| Problem | Cause | Solution |
|---------|-------|----------|
| 403 Forbidden | No User-Agent / bot detection | Add realistic headers |
| Empty results | JavaScript rendering | Switch to Playwright |
| IP banned | Too many requests | Add delays, use proxies |
| Data missing | Page not fully loaded | Use wait_for_selector |
| Encoding errors | Non-UTF-8 content | Use response.content not .text |

What to Build Next

Web scraping is most powerful when combined with data analysis. Once you can collect data, learn how to analyze it — our Python Pandas tutorial shows you how to process CSV files and find insights.

For automating scraping jobs to run on a schedule, check out our Python automation scripts guide — it covers scheduling tasks with schedule and deploying scripts to run 24/7.

If you are still new to Python, start with the Python beginners roadmap first to build a solid foundation before tackling scraping projects.


Your Scraping Project Roadmap

| Level | Project | Skills Learned |
|-------|---------|----------------|
| Beginner | Scrape book titles + prices | requests, BeautifulSoup, CSV |
| Intermediate | Multi-page news aggregator | Pagination, error handling, JSON |
| Advanced | E-commerce price tracker | Playwright, SQLite, scheduling |
| Pro | Social media monitor | Authentication, rate limiting, async |

Start with books.toscrape.com — it is specifically designed for scraping practice. Build your first working scraper today, and you will be amazed what you can build from there.

Frequently Asked Questions

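Is web scraping legal?

Scraping publicly available data is generally permitted in many jurisdictions, but the rules vary by country and by site. Always check a site's robots.txt file and Terms of Service. Avoid scraping personal data, scraping login-protected pages without permission, or sending requests at rates that harm the server.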