Web Scraping with Python: A Gentle Introduction for Beginners

A beginner-friendly guide to web scraping in Python with requests and BeautifulSoup: extract data from websites, handle pagination, and store the results.

AiTechWorlds Team
May 27, 2026 7 min read
I remember the moment web scraping became real for me: I needed a list of every Python package on PyPI with their download counts. The data existed on a webpage. The alternative was copying 4,000 rows manually.

I wrote a scraper in 45 minutes. It ran in 30 seconds. I had my data.

Web scraping is that kind of tool — once you know the fundamentals, it unlocks data that would otherwise require hours of manual copying. This guide covers the fundamentals: making requests, parsing HTML, handling the common challenges, and storing what you find.


What Web Scraping Actually Is

When you visit a website, your browser sends an HTTP request and receives HTML, CSS, and JavaScript in response. Your browser renders that into a visual page.

Web scraping does the same thing programmatically: send a request, receive the HTML, and extract the specific data you want from it.

The basic pipeline:

  1. Use requests to download the HTML page
  2. Use BeautifulSoup to parse the HTML
  3. Find the HTML elements containing your data
  4. Extract the data and clean it
  5. Store it (CSV, database, JSON)
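
Before installing anything, steps 2 to 5 can be seen in miniature using only the standard library. The sketch below is an illustration, not the approach used in the rest of this guide: it uses Python's built-in html.parser as a stand-in for BeautifulSoup, and the SAMPLE_HTML markup, the BookParser class, and the run_pipeline function are all hypothetical names invented for the example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this (step 1).
SAMPLE_HTML = """
<ul>
  <li class="book"><span class="title">Dune</span><span class="price"> £9.99 </span></li>
  <li class="book"><span class="title">Emma</span><span class="price"> £4.50 </span></li>
</ul>
"""


class BookParser(HTMLParser):
    """Steps 2-3: walk the HTML and collect the text of the spans we care about."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per <li class="book">
        self._field = None   # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "book":
            self.rows.append({})
        elif tag == "span" and css_class in ("title", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def run_pipeline(html: str) -> str:
    parser = BookParser()
    parser.feed(html)   # step 2: parse, step 3: find the elements
    parser.close()
    cleaned = [         # step 4: strip whitespace, turn the price into a number
        {"title": r["title"].strip(), "price": float(r["price"].strip().lstrip("£"))}
        for r in parser.rows
    ]
    buffer = io.StringIO()  # step 5: store as CSV (in memory here)
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(cleaned)
    return buffer.getvalue()


csv_text = run_pipeline(SAMPLE_HTML)
print(csv_text)
```

BeautifulSoup replaces the hand-written parser class with one-line lookups, which is why the rest of this guide uses it.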

Setup

pip install requests beautifulsoup4

Step 1: Making Your First Request

import requests

url = "https://books.toscrape.com/"  # A practice website for scraping
response = requests.get(url)

print(response.status_code)  # 200 means success
print(response.text[:500])   # First 500 characters of HTML

Adding Headers (Important)

By default, requests identifies itself as python-requests/<version>, and some websites block that. Set a User-Agent header explicitly:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)

Step 2: Parsing HTML with BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

# Find elements
title = soup.find("title")
print(title.text)

# Find all elements of a type
all_links = soup.find_all("a")
print(f"Found {len(all_links)} links")

# Find by CSS class
highlighted_items = soup.find_all("div", class_="highlight")

# Find by ID
header = soup.find("div", id="header")

The Two Most Important Methods

# .find() — returns the first match
first_paragraph = soup.find("p")

# .find_all() — returns ALL matches as a list
all_paragraphs = soup.find_all("p")

# CSS selector approach (often more precise)
# select_one() returns first match, select() returns all
book_titles = soup.select("article.product_pod h3 a")

Step 3: Extracting Data

# Get element text
element = soup.find("h1")
text = element.text          # Raw text with whitespace
text = element.text.strip()  # Cleaned text

# Get element attributes
link = soup.find("a")
href = link["href"]          # Get attribute value
href = link.get("href", "#") # Safe get with default

# Get nested data
product = soup.find("article", class_="product_pod")
if product:
    name = product.find("h3").find("a")["title"]
    price = product.find("p", class_="price_color").text
    rating = product.find("p", class_="star-rating")["class"][1]
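
The values extracted above are still raw strings: the price carries a currency symbol and the rating is a word taken from the CSS class. A minimal cleaning sketch follows. The helper names parse_price and parse_rating are hypothetical, and it assumes prices look like "£51.77" (no thousands separators) and ratings are the words One through Five.

```python
import re

# Rating words used as CSS classes, e.g. class="star-rating Three".
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(raw: str) -> float:
    """Pull the first decimal number out of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"No number found in {raw!r}")
    return float(match.group())


def parse_rating(word: str) -> int:
    """Convert a rating word to an integer, or 0 if the word is unknown."""
    return RATING_WORDS.get(word, 0)


print(parse_price(" £51.77 "))  # 51.77
print(parse_rating("Three"))    # 3
```

Searching for the number, rather than stripping a specific symbol, also copes with prices whose currency symbol was garbled by an encoding mismatch.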

Full Example: Scraping a Book Catalog

Let's scrape book titles, prices, and ratings from books.toscrape.com — a practice website designed for scraping.

import csv
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def scrape_books(base_url: str = "https://books.toscrape.com/") -> list[dict]:
    books = []
    page_url = base_url

    while page_url:
        print(f"Scraping: {page_url}")
        response = requests.get(
            page_url,
            headers={"User-Agent": "Mozilla/5.0 (educational scraper)"},
            timeout=10,
        )
        response.raise_for_status()  # Stop on 4xx/5xx instead of parsing an error page
        # Pass bytes so BeautifulSoup detects the encoding itself (avoids a garbled "£")
        soup = BeautifulSoup(response.content, "html.parser")

        # Extract books from this page
        for article in soup.select("article.product_pod"):
            books.append({
                "title": article.find("h3").find("a")["title"],
                "price": article.find("p", class_="price_color").text.strip(),
                "rating": article.find("p", class_="star-rating")["class"][1],
                "availability": article.find("p", class_="availability").text.strip(),
            })

        # Get the next page URL
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # The href is relative to the current page, so resolve it against page_url
            page_url = urljoin(page_url, next_btn["href"])
        else:
            page_url = None
        
        time.sleep(1)  # Be polite — wait between requests
    
    return books

def save_to_csv(books: list[dict], filename: str):
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "rating", "availability"])
        writer.writeheader()
        writer.writerows(books)
    print(f"Saved {len(books)} books to {filename}")

if __name__ == "__main__":
    books = scrape_books()
    save_to_csv(books, "books.csv")
    print(f"Total books scraped: {len(books)}")

Run this and you'll have a CSV of 1,000 books with prices and ratings.
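
Pagination links are often relative to the page they appear on, not to the site root, so each href is best resolved against the URL of the page it came from. A small sketch of the difference, using urllib.parse.urljoin from the standard library. The URLs follow the layout of the practice site and should be treated as illustrative.

```python
from urllib.parse import urljoin

# Illustrative URLs in the style of the practice site.
current_page = "https://books.toscrape.com/catalogue/page-2.html"
next_href = "page-3.html"  # value of the "next" link's href on that page

# Resolved against the current page, the link stays inside /catalogue/.
resolved = urljoin(current_page, next_href)
print(resolved)
# https://books.toscrape.com/catalogue/page-3.html

# Plain concatenation with the site root drops the /catalogue/ segment.
concatenated = "https://books.toscrape.com/" + next_href
print(concatenated)
# https://books.toscrape.com/page-3.html
```

If a scraper stops after only a page or two, a wrongly built next-page URL is one of the first things to check.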


Step 4: Handling Common Challenges

Finding the Right CSS Selectors

Right-click on the data you want in Chrome/Firefox → "Inspect" → find the HTML element. Look for:

  • The element type (div, span, p, a, li)
  • The class attribute
  • The id attribute

The browser DevTools → Console → $$("your.selector") lets you test CSS selectors before writing code.

Handling Pages That Require Scrolling (JavaScript)

If the data isn't in the page source (right-click → View Page Source, then search for your data), it is loaded by JavaScript after the page arrives, and requests will never see it. Use Playwright instead:

from playwright.sync_api import sync_playwright

def scrape_js_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_load_state("networkidle")  # Wait for JS to finish
        content = page.content()
        browser.close()
    return content

Install: pip install playwright && playwright install chromium

Handling Authentication (Login Required)

import requests

# A Session keeps cookies between requests, so the login carries over
session = requests.Session()

# Login
session.post("https://example.com/login", data={
    "username": "your_username",
    "password": "your_password"
})

# Now make authenticated requests
response = session.get("https://example.com/protected-page")

Rate Limiting and Error Handling

import time

import requests

def get_with_retry(url: str, max_retries: int = 3) -> requests.Response:
    for attempt in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        
        if response.status_code == 200:
            return response
        elif response.status_code == 429:  # Too Many Requests
            wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        else:
            print(f"Error {response.status_code}. Attempt {attempt + 1}/{max_retries}")
    
    raise Exception(f"Failed to fetch {url} after {max_retries} attempts")
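
Some servers say how long to wait by sending a Retry-After header with a 429 response. The sketch below is one way to honor it when computing the delay. The function name backoff_delay and the 60-second cap are assumptions made for illustration, and only the "number of seconds" form of the header is handled; the date form falls back to exponential backoff.

```python
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[str] = None, cap: float = 60.0) -> float:
    """Seconds to wait before retrying, where `attempt` starts at 0."""
    if retry_after is not None and retry_after.strip().isdigit():
        # The server asked for a specific pause, given in whole seconds.
        return min(float(retry_after.strip()), cap)
    # Otherwise fall back to exponential backoff: 1s, 2s, 4s, ... up to the cap.
    return min(float(2 ** attempt), cap)


for attempt in range(4):
    print(attempt, backoff_delay(attempt))

print(backoff_delay(0, retry_after="10"))
```

With requests, the header value would come from response.headers.get("Retry-After"), which returns None when the server did not send one.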

Storing Scraped Data

CSV (Simple, Universal)

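import csv

# `data` is the list of dicts your scraper collected, with keys matching fieldnames
with open("data.csv", "w", newline="", encoding="utf-8") as f: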
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(data)

SQLite (For Larger Datasets)

import sqlite3

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY,
        name TEXT,
        price REAL,
        url TEXT UNIQUE
    )
""")

for item in data:
    cursor.execute(
        "INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)",
        (item["name"], item["price"], item["url"])
    )

conn.commit()
conn.close()
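
JSON is the third storage option named in the pipeline, and it suits nested or irregular records. A minimal sketch with the standard json module follows. The records and the file name are hypothetical, and the file is written to the system temp directory to avoid cluttering the working directory.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as the CSV and SQLite examples.
data = [
    {"name": "Sample Book", "price": 12.5, "url": "https://example.com/books/1"},
    {"name": "Café Stories", "price": 8.0, "url": "https://example.com/books/2"},
]

path = os.path.join(tempfile.gettempdir(), "scraped_data.json")

# ensure_ascii=False keeps characters like "é" readable in the file.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Read the file back to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == data)  # True
```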

Scraping Ethics and Best Practices

  1. Check robots.txt first: https://example.com/robots.txt — if it says don't scrape, don't
  2. Don't overload servers: Add 1–3 second delays between requests
  3. Identify yourself: A descriptive User-Agent is courteous
  4. Use the API if it exists: Many sites have official APIs that are better than scraping
  5. Scrape public data: Don't scrape data behind login unless you're scraping your own data
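
The robots.txt check in point 1 can be automated with urllib.robotparser from the standard library. The sketch below parses a hypothetical robots.txt given as a string so that it runs offline. The rules and the user-agent name my-learning-scraper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would use the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-learning-scraper"  # hypothetical name for your scraper

allowed = parser.can_fetch(USER_AGENT, "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report.html")
delay = parser.crawl_delay(USER_AGENT)

print(allowed)  # True
print(blocked)  # False
print(delay)    # 2
```

Against a live site, parser.set_url("https://example.com/robots.txt") followed by parser.read() downloads the file instead of parsing a string.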

For a portfolio project that applies these skills, see our guide on Python projects that get developer jobs.


Frequently Asked Questions

Is web scraping legal?

Scraping public data at a polite request rate is generally fine, but legality depends on what you scrape and how. Check robots.txt, respect the site's Terms of Service, don't scrape personal data, and use the official API when one is available.

requests vs. Playwright?

Use requests + BeautifulSoup for static HTML; it is faster and simpler. Use Playwright when the content is rendered by JavaScript.

How do I avoid getting blocked?

Send a User-Agent header, add a delay of 1–3 seconds between requests, and handle 429 errors with exponential backoff.

What's the best scraping library?

For beginners: requests + BeautifulSoup. For JavaScript-heavy sites: Playwright. For large-scale crawling: Scrapy.


Final Thoughts

Web scraping is a practical superpower for data collection. Once you understand the request → parse → extract pipeline, you can pull data from almost any public website.

The most important habits: always add a User-Agent, always add delays, and always check whether an API exists before scraping. These three practices keep you ethical and keep your scraper running.

For applying your scraped data with pandas analysis, see our Python data science roadmap. And for building automation pipelines that use these scraping scripts on a schedule, our Python automation scripts guide covers scheduling and pipeline construction.

Frequently Asked Questions

Is web scraping legal?

Web scraping legality depends on what you scrape and how you scrape it. Generally acceptable: scraping publicly available information (no login required), respecting robots.txt, and leaving personal data alone. Potentially problematic: ignoring Terms of Service that prohibit scraping, using scraped data commercially, and overwhelming servers with requests. The safest approach is to check robots.txt before scraping, add delays between requests, avoid personal data, and check whether the site offers an API instead. Many sites (Twitter, Reddit, GitHub) provide official APIs, which are the preferred way to access their data.