How to Build a LangChain Agent That Browses the Web (2026)

⚡ Quick Answer

Learn to build a LangChain web browsing agent using Playwright, newspaper3k, and FireCrawl with rate limiting, multi-page crawling, and real code examples.

AiTechWorlds Team May 31, 2026 12 min read

#LangChain #Playwright #Web Scraping #AI Agent #Python

Most LangChain agents I see in tutorials are search-and-summarize setups. They call a search API, get some text back, and write a response. That works for simple questions. But when you need an agent that can actually navigate a site, click through pagination, fill forms, or extract structured data from dynamic JavaScript pages — that's a different problem entirely.

I built a research assistant that needed to pull data from sites that don't have APIs. Some rendered JavaScript, some required navigation, one needed scroll-to-load content. Playwright handled all of it. This guide covers how to wire Playwright into a LangChain agent properly, including the rate limiting patterns that keep you from getting IP-banned within five minutes.

If you haven't built a basic LangChain agent yet, start with the Build AI agent with LangChain guide first. If you're interested in retrieval over the content you scrape, the RAG system tutorial pairs naturally with this guide.

Why Web Browsing Agents Are Harder Than They Look

Search tools give you preprocessed text. A web browsing agent has to deal with the raw internet — JavaScript rendering, cookie consent popups, lazy-loaded content, infinite scroll, login walls, and sites that actively detect and block scrapers. The gap between "I'll just use requests" and "this actually works on modern websites" is substantial.

There are three main approaches:

requests + BeautifulSoup — Fast, lightweight, works on static HTML. Fails on JS-rendered content.
Playwright / Selenium — Full browser automation. Handles JavaScript rendering, clicks, form fills, and scrolling. Slower and heavier.
FireCrawl / Jina Reader — Managed scraping services. They deal with the hard parts for you.

For a LangChain agent, you usually want a combination: FireCrawl for general browsing, Playwright for sites that need interactive navigation.
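
One way to decide between the cheap static path and a full browser is to fetch the raw HTML first and check how much visible text it actually contains. The sketch below is a heuristic under stated assumptions, not a definitive detector: the function name `needs_browser` and the 200-character threshold are illustrative choices, and some pages will fool it in either direction.

```python
from html.parser import HTMLParser


class _VisibleTextParser(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def needs_browser(html: str, min_visible_chars: int = 200) -> bool:
    """Heuristic: True when static HTML carries too little visible text.

    The threshold is an assumption for illustration; tune it per site.
    """
    parser = _VisibleTextParser()
    parser.feed(html)
    visible = " ".join(parser.chunks)
    return len(visible) < min_visible_chars
```

If `needs_browser` returns True for a page fetched with a plain HTTP client, that is a hint the content is rendered client-side and the URL is better routed to Playwright or a managed service.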

Setting Up Playwright for LangChain

pip install playwright langchain langchain-openai langchain-community langgraph beautifulsoup4 tavily-python
playwright install chromium

Here's a basic Playwright-based scraping tool:

import asyncio
from playwright.async_api import async_playwright
from langchain.tools import tool
from bs4 import BeautifulSoup
import re

async def scrape_page_async(url: str, wait_for: str = None, timeout: int = 30000) -> dict:
    """Core async scraping function using Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"]
        )
        
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )
        
        page = await context.new_page()
        
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
            
            # Wait for specific element if needed
            if wait_for:
                await page.wait_for_selector(wait_for, timeout=timeout)
            
            # Handle cookie consent banners
            consent_selectors = [
                "button[id*='accept']",
                "button[class*='accept']",
                "[data-testid='cookie-accept']"
            ]
            for selector in consent_selectors:
                try:
                    btn = await page.query_selector(selector)
                    if btn:
                        await btn.click()
                        await page.wait_for_timeout(500)
                        break
                except Exception:
                    # Best effort: a missing or unclickable banner is not fatal
                    pass
            
            # Get the full rendered HTML
            html = await page.content()
            title = await page.title()
            
            # Extract text using BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")
            
            # Remove noise
            for tag in soup(["script", "style", "nav", "footer", "aside", "header"]):
                tag.decompose()
            
            text = soup.get_text(separator="\n", strip=True)
            # Clean up excessive whitespace
            text = re.sub(r'\n{3,}', '\n\n', text)
            
            return {
                "url": url,
                "title": title,
                "content": text[:8000],  # limit to ~8k chars
                "success": True
            }
            
        except Exception as e:
            return {"url": url, "error": str(e), "success": False}
        finally:
            await browser.close()

@tool
def playwright_scrape(url: str) -> str:
    """
    Fetches and extracts text content from any web page, including JavaScript-rendered sites.
    Use when you need to read a specific URL. Returns the page title and main content.
    
    Args:
        url: The full URL to scrape (must start with http:// or https://)
    """
    if not url.startswith(("http://", "https://")):
        return "Error: URL must start with http:// or https://"
    
    result = asyncio.run(scrape_page_async(url))
    
    if result["success"]:
        return f"Title: {result['title']}\n\nContent:\n{result['content']}"
    else:
        return f"Failed to scrape {url}: {result['error']}"
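
One caveat with calling `asyncio.run()` inside a tool: it raises a RuntimeError when an event loop is already running in the same thread, which is the case in Jupyter notebooks and inside async web servers. A minimal sketch of a workaround follows. The helper name `run_coro_sync` is hypothetical, and running the coroutine on a helper thread is only one of several possible approaches.

```python
import asyncio
import threading
from typing import Any, Coroutine


def run_coro_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code.

    Uses asyncio.run() directly when no loop is running in this thread;
    otherwise runs the coroutine on a fresh event loop in a helper thread.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running here, so asyncio.run() is safe
        return asyncio.run(coro)

    outcome = {}

    def _worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the caller below
            outcome["error"] = exc

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join()  # note: this blocks the calling loop until the work finishes
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Inside the tool, `run_coro_sync(scrape_page_async(url))` could stand in for the `asyncio.run(...)` call. Because the caller blocks while the helper thread works, this suits notebooks and quick experiments more than high-throughput servers.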

Multi-Page Crawling Pattern

Single-page scraping is just the start. Research agents often need to follow links, paginate through results, or crawl a site's structure. Here's a controlled crawling tool:

import time
import random
from urllib.parse import urljoin, urlparse
from typing import List

async def crawl_site_async(
    start_url: str,
    max_pages: int = 5,
    same_domain_only: bool = True,
    delay_range: tuple = (1.5, 3.5)
) -> List[dict]:
    """Crawls multiple pages with rate limiting."""
    
    visited = set()
    results = []
    queue = [start_url]
    base_domain = urlparse(start_url).netloc
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (compatible; ResearchBot/1.0)"
        )
        
        while queue and len(visited) < max_pages:
            url = queue.pop(0)
            
            if url in visited:
                continue
            
            visited.add(url)
            
            # Rate limiting — critical to avoid blocks
            delay = random.uniform(*delay_range)
            await asyncio.sleep(delay)
            
            page = await context.new_page()
            
            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=20000)
                html = await page.content()
                title = await page.title()
                
                soup = BeautifulSoup(html, "html.parser")
                for tag in soup(["script", "style", "nav", "footer"]):
                    tag.decompose()
                
                text = soup.get_text(separator="\n", strip=True)[:5000]
                
                results.append({
                    "url": url,
                    "title": title,
                    "content": text
                })
                
                # Find links for the queue
                links = soup.find_all("a", href=True)
                for link in links:
                    href = urljoin(url, link["href"])
                    parsed = urlparse(href)
                    
                    # Filter conditions
                    if not href.startswith("http"):
                        continue
                    if same_domain_only and parsed.netloc != base_domain:
                        continue
                    if href in visited:
                        continue
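                    # Match on the path suffix so ".png" inside a query string
                    # or directory name does not exclude a normal page
                    if parsed.path.lower().endswith((".pdf", ".jpg", ".png", ".zip")):
                        continue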
                    
                    queue.append(href)
                    
            except Exception as e:
                results.append({"url": url, "error": str(e)})
            finally:
                await page.close()
        
        await browser.close()
    
    return results

@tool
def crawl_website(start_url: str, max_pages: int = 3) -> str:
    """
    Crawls a website starting from the given URL, following internal links.
    Respects rate limits. Use for researching a website's content across multiple pages.
    Max pages capped at 10 for safety.
    
    Args:
        start_url: The URL to start crawling from
        max_pages: Maximum number of pages to visit (default 3, max 10)
    """
    max_pages = min(max_pages, 10)  # safety cap
    
    results = asyncio.run(crawl_site_async(start_url, max_pages=max_pages))
    
    output = []
    for i, r in enumerate(results, 1):
        if "error" in r:
            output.append(f"Page {i}: ERROR - {r['url']}: {r['error']}")
        else:
            output.append(f"Page {i}: {r['title']}\nURL: {r['url']}\n{r['content'][:1000]}\n---")
    
    return "\n".join(output)
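
The crawler's `visited` set compares raw URL strings, so `https://site/docs`, `https://site/docs/`, and `https://site/docs#intro` count as three different pages and can each consume part of the page budget. A small normalization step helps. The sketch below assumes that a trailing slash and a fragment never change the content, which holds for most sites but not all; `normalize_url` is a hypothetical helper name.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of url for de-duplication."""
    url, _fragment = urldefrag(url)        # drop "#section" anchors
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"   # treat "/docs" and "/docs/" alike
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),              # host names are case-insensitive
        path,
        parts.query,                       # keep the query: ?page=2 is a new page
        "",
    ))
```

Applying `normalize_url` to `start_url` and to every `href` before the `visited` check and the `queue.append` call would keep the crawl from revisiting the same page under different spellings.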

newspaper3k for Article Extraction

When you're specifically scraping news articles or blog posts, newspaper3k does a much cleaner job than raw HTML parsing. It's purpose-built for article content extraction.

pip install newspaper3k lxml[html_clean]

from newspaper import Article
from langchain.tools import tool

@tool
def extract_article(url: str) -> str:
    """
    Extracts the main article content from a news article or blog post URL.
    Returns the title, author, publish date, and cleaned article text.
    Much cleaner than raw scraping for article URLs.
    
    Args:
        url: URL of the news article or blog post
    """
    try:
        article = Article(url)
        article.download()
        article.parse()
        # Generates summary and keywords; requires NLTK's "punkt" tokenizer data
        article.nlp()
        
        output = []
        output.append(f"Title: {article.title}")
        
        if article.authors:
            output.append(f"Authors: {', '.join(article.authors)}")
        
        if article.publish_date:
            output.append(f"Published: {article.publish_date.strftime('%Y-%m-%d')}")
        
        output.append(f"Summary: {article.summary}")
        output.append(f"Keywords: {', '.join(article.keywords[:10])}")
        output.append(f"\nFull Text:\n{article.text[:4000]}")
        
        return "\n".join(output)
        
    except Exception as e:
        return f"Failed to extract article from {url}: {str(e)}"

FireCrawl Integration

FireCrawl is a managed scraping API that handles JavaScript rendering, anti-bot measures, and content cleaning. It's the right choice when you want scraping to just work without maintaining browser infrastructure.

pip install firecrawl-py

from firecrawl import FirecrawlApp
from langchain.tools import tool
import os

os.environ["FIRECRAWL_API_KEY"] = "your-firecrawl-key"

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

@tool
def firecrawl_scrape(url: str) -> str:
    """
    Scrapes a web page using FireCrawl — handles JavaScript, anti-bot measures automatically.
    Returns clean markdown content. Best for complex or protected sites.
    
    Args:
        url: The URL to scrape
    """
    try:
        result = app.scrape_url(
            url,
            params={
                "formats": ["markdown"],
                "onlyMainContent": True,
                "waitFor": 2000
            }
        )
        
        if result.get("markdown"):
            return result["markdown"][:6000]
        else:
            return f"FireCrawl returned no content for {url}"
            
    except Exception as e:
        return f"FireCrawl error for {url}: {str(e)}"

@tool
def firecrawl_crawl(start_url: str, max_pages: int = 5) -> str:
    """
    Crawls multiple pages starting from start_url using FireCrawl.
    Handles JavaScript and bot protection automatically.
    
    Args:
        start_url: URL to start crawling from
        max_pages: Maximum pages to crawl (default 5)
    """
    try:
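        # Parameter names below follow the firecrawl-py v1 API (the same
        # generation as the "formats" option used in firecrawl_scrape);
        # check them against the version you have installed.
        crawl_result = app.crawl_url(
            start_url,
            params={
                "limit": min(max_pages, 10),
                "excludePaths": ["blog/*"],
                "scrapeOptions": {
                    "formats": ["markdown"],
                    "onlyMainContent": True
                }
            }
        )
        
        pages = crawl_result.get("data", [])
        output = []
        
        for page in pages:
            # v1 responses carry the page URL in metadata.sourceURL
            source_url = page.get("metadata", {}).get("sourceURL", "unknown")
            output.append(f"URL: {source_url}")
            output.append(f"Content: {page.get('markdown', '')[:1000]}")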
            output.append("---")
        
        return "\n".join(output)
        
    except Exception as e:
        return f"FireCrawl crawl error: {str(e)}"

Rate Limiting Best Practices

Getting IP-banned is a real issue. Here's a rate limiting wrapper you should use around any scraping tool:

import time
import random
from functools import wraps
from collections import defaultdict
from threading import Lock

class RateLimiter:
    """Domain-specific rate limiter to avoid overwhelming any single site."""
    
    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request = defaultdict(float)
        self.lock = Lock()
    
    def wait(self, domain: str):
        with self.lock:
            now = time.time()
            last = self.last_request[domain]
            
            if last > 0:
                elapsed = now - last
                required_delay = random.uniform(self.min_delay, self.max_delay)
                
                if elapsed < required_delay:
                    sleep_time = required_delay - elapsed
                    time.sleep(sleep_time)
            
            self.last_request[domain] = time.time()

rate_limiter = RateLimiter(min_delay=1.5, max_delay=4.0)

def rate_limited_scrape(url: str) -> str:
    """Scrape with automatic rate limiting per domain."""
    from urllib.parse import urlparse
    domain = urlparse(url).netloc
    rate_limiter.wait(domain)
    
    result = asyncio.run(scrape_page_async(url))
    return result.get("content", result.get("error", "Unknown error"))
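
The `RateLimiter` above uses `time.sleep()` under a single lock, which is fine for synchronous tools but would stall an event loop if called from async code such as `crawl_site_async`. As a sketch of the same idea for asyncio, assuming one limiter instance per event loop (`AsyncRateLimiter` is a hypothetical name, not part of any library):

```python
import asyncio
import random
import time
from collections import defaultdict


class AsyncRateLimiter:
    """Per-domain delay for asyncio code; waiting never blocks the event loop."""

    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last_request = {}
        self._locks = defaultdict(asyncio.Lock)  # one lock per domain

    async def wait(self, domain: str) -> None:
        # Requests to the same domain queue up; other domains are unaffected
        async with self._locks[domain]:
            last = self._last_request.get(domain)
            if last is not None:
                required = random.uniform(self.min_delay, self.max_delay)
                remaining = required - (time.monotonic() - last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last_request[domain] = time.monotonic()
```

Because each domain has its own lock, a slow site only delays requests to that site. Inside an async crawler, `await limiter.wait(urlparse(url).netloc)` before each `page.goto(url)` would apply it.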

Building the Complete Web Browsing Agent

import os
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

os.environ["OPENAI_API_KEY"] = "your-key"
os.environ["TAVILY_API_KEY"] = "your-key"

# Combine search + browsing tools
tavily_search = TavilySearchResults(max_results=5)

tools = [
    tavily_search,
    playwright_scrape,
    extract_article,
    crawl_website,
    # firecrawl_scrape,  # uncomment if you have a FireCrawl key
]

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Add memory so agent remembers what it scraped
memory = MemorySaver()
agent = create_react_agent(llm, tools, checkpointer=memory)

config = {"configurable": {"thread_id": "research-session-1"}}

# Example: multi-step research task
messages = [("human", """
Research the latest developments in AI agent frameworks. 
1. Search for recent news about LangChain and LangGraph
2. Visit the LangChain blog and extract the 3 most recent posts
3. Summarize the key themes across what you find
""")]

result = agent.invoke({"messages": messages}, config=config)
print(result["messages"][-1].content)
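
When a multi-step task like this returns a surprising answer, it helps to see which tools the agent actually called and in what order. The helper below is a small sketch that walks the returned message list and reads the `tool_calls` attribute that LangChain AI messages carry; `list_tool_calls` is a hypothetical name, and messages without tool calls are simply skipped.

```python
def list_tool_calls(messages) -> list:
    """Return (tool_name, args) pairs in the order the agent made them."""
    calls = []
    for message in messages:
        # Human and tool messages have no tool_calls; AI messages may have several
        for call in getattr(message, "tool_calls", None) or []:
            calls.append((call["name"], call["args"]))
    return calls
```

For the run above, `list_tool_calls(result["messages"])` gives a quick trace, for example to confirm the agent searched before it scraped rather than guessing a URL.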

Comparison Table: Playwright vs Selenium vs requests for AI Agents

Feature	Playwright	Selenium	requests
JavaScript rendering	Full support	Full support	None
Async support	Native	Limited (synchronous API)	No (use aiohttp or httpx instead)
Speed	Fast	Slow	Very fast
Memory usage	Medium	High	Low
Anti-detection	Moderate	Poor	Good with headers
Setup complexity	Low	Medium	None
LangChain integration	Good	Possible	Easy
Best for	Modern JS sites	Legacy compatibility	Static HTML
Headless stability	Excellent	Moderate	N/A

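Playwright is generally reported to be faster than Selenium for typical page interactions, though the size of the gap depends on the site and the workload, so benchmark your own pages before relying on a specific multiplier. For AI agents that need to process dozens of pages per research task, even a modest per-page saving adds up.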

Also worth knowing: most anti-bot services detect Selenium more readily than Playwright, because Selenium leaves more browser fingerprint artifacts. If your agent is getting blocked, switching from Selenium to Playwright often helps.

Handling Common Edge Cases

Infinite scroll pages: The agent needs to scroll before the content loads.

async def scrape_infinite_scroll(url: str, scroll_count: int = 3) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        
        for _ in range(scroll_count):
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # wait for content to load
        
        html = await page.content()
        await browser.close()
        
        soup = BeautifulSoup(html, "html.parser")
        return soup.get_text(separator="\n", strip=True)[:8000]
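
A fixed `scroll_count` either wastes time on short pages or stops too early on long ones. One alternative is to keep scrolling until the page height stops growing, with a hard cap as a safety net. The sketch below takes an already-open Playwright `page` and uses only `page.evaluate` and `page.wait_for_timeout`; the function name `scroll_until_stable` and its default values are illustrative assumptions.

```python
async def scroll_until_stable(page, max_scrolls: int = 10, pause_ms: int = 2000) -> int:
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (never more than max_scrolls).
    """
    last_height = await page.evaluate("document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(pause_ms)  # give lazy content time to load
        scrolls += 1
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume the end was reached
        last_height = new_height
    return scrolls
```

In `scrape_infinite_scroll`, a call to `await scroll_until_stable(page)` could replace the fixed `for` loop. Sites that load content slowly may need a longer `pause_ms`, otherwise the height check can stop too early.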

Login-protected pages:

async def scrape_with_login(url: str, login_url: str, username: str, password: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        # Log in first
        await page.goto(login_url)
        await page.fill('input[type="email"]', username)
        await page.fill('input[type="password"]', password)
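        await page.click('button[type="submit"]')
        # Playwright's Python API has no page.wait_for_navigation();
        # wait for the post-login page to finish loading instead
        await page.wait_for_load_state("networkidle")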
        
        # Now navigate to the protected page
        await page.goto(url)
        content = await page.content()
        await browser.close()
        
        soup = BeautifulSoup(content, "html.parser")
        return soup.get_text(separator="\n", strip=True)[:8000]

For more on building complete research agents, see the AI research agent build and AI agent memory and planning guides. If you want to store and search what your agent scrapes, the Vector database guide covers setting up a retrieval layer over scraped content.

The LangChain tutorial 2025 also has a section on tool chaining that applies directly here — you'll often want your agent to search, then scrape the top results, then synthesize.

Conclusion

Building a web browsing agent that actually works on the modern web requires more than wrapping requests.get() in a tool. Playwright handles JavaScript-rendered content, FireCrawl handles the anti-bot complexity, and newspaper3k cleans up article extraction. Put them together with proper rate limiting and you have a research agent that can genuinely browse the web.

The rate limiting piece is non-negotiable. Scrape too fast and you'll get blocked, then your agent silently fails with no useful data. Build rate limiting in from the start, respect robots.txt, and your agent will stay operational long-term.
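
Respecting robots.txt does not need a third-party library: Python's standard `urllib.robotparser` can answer whether a URL may be fetched. The sketch below parses rules from a string so it runs without network access; the helper name `build_robots_checker` and the `ResearchBot` user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "ResearchBot"):
    """Return a function that tells whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Example rules; a real crawler would load the site's own robots.txt
rules = "User-agent: *\nDisallow: /private/\n"
allowed = build_robots_checker(rules)
print(allowed("https://example.com/blog/post"))
print(allowed("https://example.com/private/data"))
```

Against a live site, `RobotFileParser` can fetch the file itself via `set_url()` followed by `read()`. Checking `allowed(url)` before queueing a link keeps the crawler out of paths the site has asked bots to avoid.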

Try the CrewAI tutorial if you want to build a multi-agent research system where one agent browses while another synthesizes the findings.

Frequently Asked Questions

Is Playwright better than Selenium for LangChain agents?

Playwright generally outperforms Selenium for LangChain agents because it handles modern JavaScript-heavy sites better, has async support built-in, and is faster. Selenium has broader browser compatibility but its synchronous API makes it awkward in async agent loops.

How do I avoid getting blocked when scraping with a LangChain agent?

Use randomized delays between requests (1-5 seconds), rotate user agents, respect robots.txt, and consider using FireCrawl, which handles much of the anti-bot work for you. Never make more than 1 request per second to the same domain.

Can a LangChain web browsing agent handle login-protected pages?

Yes, with Playwright you can automate form fills, click login buttons, and maintain session cookies across requests. Store credentials securely in environment variables and implement proper session management to avoid repeated logins.
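
As a sketch of the environment-variable approach, the helper below reads a username and password and fails loudly when either is missing, so credentials never appear in the source or in a prompt. The variable names (`SCRAPER_USERNAME`, `SCRAPER_PASSWORD`) and the function name `load_credentials` are assumptions for illustration.

```python
import os
from typing import Mapping, Tuple


def load_credentials(
    prefix: str = "SCRAPER",
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, str]:
    """Read <PREFIX>_USERNAME and <PREFIX>_PASSWORD from the environment."""
    username = env.get(f"{prefix}_USERNAME")
    password = env.get(f"{prefix}_PASSWORD")
    # Collect the names of any variables that are unset or empty
    missing = [
        name
        for name, value in (
            (f"{prefix}_USERNAME", username),
            (f"{prefix}_PASSWORD", password),
        )
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return username, password
```

The returned pair can be passed straight to `scrape_with_login`. Passing `env` explicitly makes the helper easy to test without touching the real environment.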

Share this article:Facebook Twitter/X LinkedIn Telegram WhatsApp

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

📱 Follow on Telegram 🐦 Follow on X Learn More →

search relevance ranking showing scores — LangChain advanced RAG retrieval strategies

Agent Development

10 LangChain Retrieval Strategies for Better RAG Results

Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.

May 31, 2026 13 min read

AI agent architecture with memory and tool connections — LangChain agent memory tools

Agent Development

Build a LangChain Agent with Memory and Tools (Full Example)

Build a complete LangChain conversational agent with persistent memory, multiple tools, and step-by-step trace — from setup to a production-ready implementation with code.

May 31, 2026 14 min read

developer coding AI agent decision loop — LangChain agent types ZeroShot ReAct Conversational

Agent Development

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

Understand every major LangChain agent type — ZeroShotAgent, ReAct, ConversationalAgent, and more — with Python code and agent trace walkthroughs.

May 31, 2026 13 min read

FastAPI server running LangChain endpoint — deploy LangChain FastAPI REST streaming

Agent Development

How to Deploy a LangChain App as a FastAPI REST Endpoint

Serve a LangChain app as a production FastAPI REST endpoint with streaming, async chains, error handling, and Docker deployment — full Python code included.

May 31, 2026 11 min read

Go deeper on this topic

InterviewPython BookAI Agent Development Guide BookBuilding AI Apps: Developer's Guide CourseAI Agent Development Course QuizPython Basics QuizPython OOP Concepts

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources

Join Free Channel

No spam. Leave anytime.

Langchain

How to Build a LangChain Agent That Browses the Web (2026)

⚡ Quick Answer

Learn to build a LangChain web browsing agent using Playwright, newspaper3k, and FireCrawl with rate limiting, multi-page crawling, and real code examples.

AiTechWorlds Team May 31, 2026 12 min read

#LangChain #Playwright #Web Scraping #AI Agent #Python

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Why Web Browsing Agents Are Harder Than They Look

There are three main approaches:

requests + BeautifulSoup — Fast, lightweight, works on static HTML. Fails on JS-rendered content.
Playwright / Selenium — Full browser automation. Handles everything. Slower and heavier.
FireCrawl / Jina Reader — Managed scraping services. They deal with the hard parts for you.

For a LangChain agent, you usually want a combination: FireCrawl for general browsing, Playwright for sites that need interactive navigation.

Setting Up Playwright for LangChain

pip install playwright langchain langchain-openai langchain-community
playwright install chromium

Here's a basic Playwright-based scraping tool:

import asyncio
from playwright.async_api import async_playwright
from langchain.tools import tool
from bs4 import BeautifulSoup
import re

async def scrape_page_async(url: str, wait_for: str = None, timeout: int = 30000) -> dict:
    """Core async scraping function using Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"]
        )
        
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )
        
        page = await context.new_page()
        
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
            
            # Wait for specific element if needed
            if wait_for:
                await page.wait_for_selector(wait_for, timeout=timeout)
            
            # Handle cookie consent banners
            consent_selectors = [
                "button[id*='accept']",
                "button[class*='accept']",
                "[data-testid='cookie-accept']"
            ]
            for selector in consent_selectors:
                try:
                    btn = await page.query_selector(selector)
                    if btn:
                        await btn.click()
                        await page.wait_for_timeout(500)
                        break
                except:
                    pass
            
            # Get the full rendered HTML
            html = await page.content()
            title = await page.title()
            
            # Extract text using BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")
            
            # Remove noise
            for tag in soup(["script", "style", "nav", "footer", "aside", "header"]):
                tag.decompose()
            
            text = soup.get_text(separator="\n", strip=True)
            # Clean up excessive whitespace
            text = re.sub(r'\n{3,}', '\n\n', text)
            
            return {
                "url": url,
                "title": title,
                "content": text[:8000],  # limit to ~8k chars
                "success": True
            }
            
        except Exception as e:
            return {"url": url, "error": str(e), "success": False}
        finally:
            await browser.close()

@tool
def playwright_scrape(url: str) -> str:
    """
    Fetches and extracts text content from any web page, including JavaScript-rendered sites.
    Use when you need to read a specific URL. Returns the page title and main content.
    
    Args:
        url: The full URL to scrape (must start with http:// or https://)
    """
    if not url.startswith(("http://", "https://")):
        return "Error: URL must start with http:// or https://"
    
    result = asyncio.run(scrape_page_async(url))
    
    if result["success"]:
        return f"Title: {result['title']}\n\nContent:\n{result['content']}"
    else:
        return f"Failed to scrape {url}: {result['error']}"

Multi-Page Crawling Pattern

Single-page scraping is just the start. Research agents often need to follow links, paginate through results, or crawl a site's structure. Here's a controlled crawling tool:

import time
import random
from urllib.parse import urljoin, urlparse
from typing import List

async def crawl_site_async(
    start_url: str,
    max_pages: int = 5,
    same_domain_only: bool = True,
    delay_range: tuple = (1.5, 3.5)
) -> List[dict]:
    """Crawls multiple pages with rate limiting."""
    
    visited = set()
    results = []
    queue = [start_url]
    base_domain = urlparse(start_url).netloc
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (compatible; ResearchBot/1.0)"
        )
        
        while queue and len(visited) < max_pages:
            url = queue.pop(0)
            
            if url in visited:
                continue
            
            visited.add(url)
            
            # Rate limiting — critical to avoid blocks
            delay = random.uniform(*delay_range)
            await asyncio.sleep(delay)
            
            page = await context.new_page()
            
            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=20000)
                html = await page.content()
                title = await page.title()
                
                soup = BeautifulSoup(html, "html.parser")
                for tag in soup(["script", "style", "nav", "footer"]):
                    tag.decompose()
                
                text = soup.get_text(separator="\n", strip=True)[:5000]
                
                results.append({
                    "url": url,
                    "title": title,
                    "content": text
                })
                
                # Find links for the queue
                links = soup.find_all("a", href=True)
                for link in links:
                    href = urljoin(url, link["href"])
                    parsed = urlparse(href)
                    
                    # Filter conditions
                    if not href.startswith("http"):
                        continue
                    if same_domain_only and parsed.netloc != base_domain:
                        continue
                    if href in visited:
                        continue
                    if any(ext in href for ext in [".pdf", ".jpg", ".png", ".zip"]):
                        continue
                    
                    queue.append(href)
                    
            except Exception as e:
                results.append({"url": url, "error": str(e)})
            finally:
                await page.close()
        
        await browser.close()
    
    return results

@tool
def crawl_website(start_url: str, max_pages: int = 3) -> str:
    """
    Crawls a website starting from the given URL, following internal links.
    Respects rate limits. Use for researching a website's content across multiple pages.
    Max pages capped at 10 for safety.
    
    Args:
        start_url: The URL to start crawling from
        max_pages: Maximum number of pages to visit (default 3, max 10)
    """
    max_pages = min(max_pages, 10)  # safety cap
    
    results = asyncio.run(crawl_site_async(start_url, max_pages=max_pages))
    
    output = []
    for i, r in enumerate(results, 1):
        if "error" in r:
            output.append(f"Page {i}: ERROR - {r['url']}: {r['error']}")
        else:
            output.append(f"Page {i}: {r['title']}\nURL: {r['url']}\n{r['content'][:1000]}\n---")
    
    return "\n".join(output)

newspaper3k for Article Extraction

When you're specifically scraping news articles or blog posts, newspaper3k does a much cleaner job than raw HTML parsing. It's purpose-built for article content extraction.

pip install newspaper3k lxml[html_clean]

from newspaper import Article
from langchain.tools import tool

@tool
def extract_article(url: str) -> str:
    """
    Extracts the main article content from a news article or blog post URL.
    Returns the title, author, publish date, and cleaned article text.
    Much cleaner than raw scraping for article URLs.
    
    Args:
        url: URL of the news article or blog post
    """
    try:
        article = Article(url)
        article.download()
        article.parse()
        
        output = [f"Title: {article.title}"]
        
        if article.authors:
            output.append(f"Authors: {', '.join(article.authors)}")
        
        if article.publish_date:
            output.append(f"Published: {article.publish_date.strftime('%Y-%m-%d')}")
        
        # nlp() builds the summary and keywords. It needs NLTK's "punkt"
        # tokenizer data, so treat it as optional and keep the parsed text if it fails.
        try:
            article.nlp()
            output.append(f"Summary: {article.summary}")
            output.append(f"Keywords: {', '.join(article.keywords[:10])}")
        except Exception as nlp_error:
            output.append(f"Summary unavailable: {nlp_error}")
        
        output.append(f"\nFull Text:\n{article.text[:4000]}")
        
        return "\n".join(output)
        
    except Exception as e:
        return f"Failed to extract article from {url}: {str(e)}"

FireCrawl Integration

pip install firecrawl-py

from firecrawl import FirecrawlApp
from langchain.tools import tool
import os

# Placeholder only: in real use, set FIRECRAWL_API_KEY in your shell or a .env file
os.environ["FIRECRAWL_API_KEY"] = "your-firecrawl-key"

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

@tool
def firecrawl_scrape(url: str) -> str:
    """
    Scrapes a web page using FireCrawl — handles JavaScript, anti-bot measures automatically.
    Returns clean markdown content. Best for complex or protected sites.
    
    Args:
        url: The URL to scrape
    """
    try:
        result = app.scrape_url(
            url,
            params={
                "formats": ["markdown"],
                "onlyMainContent": True,
                "waitFor": 2000
            }
        )
        
        if result.get("markdown"):
            return result["markdown"][:6000]
        else:
            return f"FireCrawl returned no content for {url}"
            
    except Exception as e:
        return f"FireCrawl error for {url}: {str(e)}"

@tool
def firecrawl_crawl(start_url: str, max_pages: int = 5) -> str:
    """
    Crawls multiple pages starting from start_url using FireCrawl.
    Handles JavaScript and bot protection automatically.
    
    Args:
        start_url: URL to start crawling from
        max_pages: Maximum pages to crawl (default 5)
    """
    try:
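        # v1-style crawl options, matching the scrape_url params format above.
        # Assumes firecrawl-py 1.x; other releases name these options differently.
        crawl_result = app.crawl_url(
            start_url,
            params={
                "limit": min(max_pages, 10),
                "excludePaths": ["blog/*"],
                "scrapeOptions": {
                    "formats": ["markdown"],
                    "onlyMainContent": True
                }
            }
        )
        
        pages = crawl_result.get("data", [])
        output = []
        
        for page in pages:
            # v1 responses carry the page URL in metadata.sourceURL
            page_url = page.get("metadata", {}).get("sourceURL") or page.get("url", "unknown")
            output.append(f"URL: {page_url}")
            output.append(f"Content: {page.get('markdown', '')[:1000]}")
            output.append("---")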
        
        return "\n".join(output)
        
    except Exception as e:
        return f"FireCrawl crawl error: {str(e)}"

Rate Limiting Best Practices

Getting IP-banned is a real issue. Here's a rate limiting wrapper you should use around any scraping tool:

import asyncio
import time
import random
from collections import defaultdict
from threading import Lock

class RateLimiter:
    """Domain-specific rate limiter to avoid overwhelming any single site."""
    
    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request = defaultdict(float)
        self.lock = Lock()
    
    def wait(self, domain: str):
        # Reserve the next slot under the lock, then sleep outside it so a
        # delay for one domain does not block requests to other domains.
        with self.lock:
            now = time.monotonic()
            delay = random.uniform(self.min_delay, self.max_delay)
            last = self.last_request.get(domain)
            next_allowed = now if last is None else max(now, last + delay)
            self.last_request[domain] = next_allowed
            sleep_time = next_allowed - now
        
        if sleep_time > 0:
            time.sleep(sleep_time)

rate_limiter = RateLimiter(min_delay=1.5, max_delay=4.0)

def rate_limited_scrape(url: str) -> str:
    """Scrape with automatic rate limiting per domain."""
    from urllib.parse import urlparse
    domain = urlparse(url).netloc
    rate_limiter.wait(domain)
    
    result = asyncio.run(scrape_page_async(url))
    return result.get("content", result.get("error", "Unknown error"))
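To put the same per-domain delay around several scraping functions without repeating the wait call in each one, a decorator is one option. The following is a minimal sketch, not the article's implementation: the rate_limited name, its parameters, and the fetch_stub function are all hypothetical, and it assumes the wrapped function takes the URL as its first argument. It uses a fixed interval and is not thread-safe, unlike the RateLimiter class above.

```python
import time
from functools import wraps
from typing import Callable, Dict
from urllib.parse import urlparse


def rate_limited(min_interval: float = 1.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
    """Decorator factory: space out calls to the same domain.

    Hypothetical helper. clock and sleep are injectable so the
    behaviour can be checked without real waiting.
    """
    last_call: Dict[str, float] = {}

    def decorator(func):
        @wraps(func)
        def wrapper(url: str, *args, **kwargs):
            domain = urlparse(url).netloc
            previous = last_call.get(domain)
            if previous is not None:
                remaining = min_interval - (clock() - previous)
                if remaining > 0:
                    sleep(remaining)
            # Record the time of this request for the next call.
            last_call[domain] = clock()
            return func(url, *args, **kwargs)
        return wrapper
    return decorator


@rate_limited(min_interval=1.5)
def fetch_stub(url: str) -> str:
    # Stand-in for a real scraping function.
    return f"fetched {url}"
```

Apply the decorator to the plain scraping function that a tool calls internally, so every path into the scraper goes through the same delay.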

Building the Complete Web Browsing Agent

import os
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

# Placeholders only: in real use, set these in your shell or a .env file
os.environ["OPENAI_API_KEY"] = "your-key"
os.environ["TAVILY_API_KEY"] = "your-key"

# Combine search + browsing tools
tavily_search = TavilySearchResults(max_results=5)

tools = [
    tavily_search,
    playwright_scrape,
    extract_article,
    crawl_website,
    # firecrawl_scrape,  # uncomment if you have a FireCrawl key
]

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Add memory so agent remembers what it scraped
memory = MemorySaver()
agent = create_react_agent(llm, tools, checkpointer=memory)

config = {"configurable": {"thread_id": "research-session-1"}}

# Example: multi-step research task
messages = [("human", """
Research the latest developments in AI agent frameworks. 
1. Search for recent news about LangChain and LangGraph
2. Visit the LangChain blog and extract the 3 most recent posts
3. Summarize the key themes across what you find
""")]

result = agent.invoke({"messages": messages}, config=config)
print(result["messages"][-1].content)

Comparison Table: Playwright vs Selenium vs requests for AI Agents

Feature	Playwright	Selenium	requests
JavaScript rendering	Full support	Full support	None
Async support	Native	Limited	None (requests is synchronous; use aiohttp or httpx)
Speed	Fast	Slow	Very fast
Memory usage	Medium	High	Low
Anti-detection	Moderate	Poor (better with undetected-chromedriver)	Good with headers
Setup complexity	Low	Medium	None
LangChain integration	Good	Possible	Easy
Best for	Modern JS sites	Legacy compatibility	Static HTML
Headless stability	Excellent	Moderate	N/A

Playwright is generally reported to be faster than Selenium for typical page interactions, though the margin depends on the site and the workload, so measure it on your own targets. For AI agents that need to process dozens of pages per research task, the difference adds up.
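Because requests is so much cheaper than a headless browser, one common pattern is to fetch the static HTML first and only fall back to Playwright when the page looks empty without JavaScript. The sketch below shows a rough heuristic for that decision using only the standard library. The function names and the 200-character threshold are assumptions for illustration, not values from any library.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Return the number of characters of human-visible text in the HTML."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def needs_browser(html: str, min_chars: int = 200) -> bool:
    """Guess whether a page needs JavaScript rendering.

    min_chars is an arbitrary threshold; tune it for your targets.
    """
    return visible_text_length(html) < min_chars
```

A page that returns little more than an empty root element and a script bundle will fall below the threshold and can be handed to the Playwright tool, while article-style pages stay on the fast path.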

Handling Common Edge Cases

Infinite scroll pages: content loads only as the page scrolls, so the scraper has to scroll before it reads the HTML.

async def scrape_infinite_scroll(url: str, scroll_count: int = 3) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        
        for _ in range(scroll_count):
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # wait for content to load
        
        html = await page.content()
        await browser.close()
        
        soup = BeautifulSoup(html, "html.parser")
        return soup.get_text(separator="\n", strip=True)[:8000]

Login-protected pages:

async def scrape_with_login(url: str, login_url: str, username: str, password: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
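        # Log in first (the selectors are examples; adjust them to the site's form)
        await page.goto(login_url)
        await page.fill('input[type="email"]', username)
        await page.fill('input[type="password"]', password)
        await page.click('button[type="submit"]')
        # Playwright's Python API has no wait_for_navigation(); wait for the
        # post-login page to settle instead
        await page.wait_for_load_state("networkidle")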
        
        # Now navigate to the protected page
        await page.goto(url)
        content = await page.content()
        await browser.close()
        
        soup = BeautifulSoup(content, "html.parser")
        return soup.get_text(separator="\n", strip=True)[:8000]

The LangChain tutorial 2025 also has a section on tool chaining that applies directly here — you'll often want your agent to search, then scrape the top results, then synthesize.

Conclusion

Try the CrewAI tutorial if you want to build a multi-agent research system where one agent browses while another synthesizes the findings.

Frequently Asked Questions

Is Playwright better than Selenium for LangChain agents?

How do I avoid getting blocked when scraping with a LangChain agent?

Can a LangChain web browsing agent handle login-protected pages?
