How to Build a LangChain Agent That Browses the Web (2026)
Learn to build a LangChain web browsing agent using Playwright, newspaper3k, and FireCrawl with rate limiting, multi-page crawling, and real code examples.
Most LangChain agents I see in tutorials are search-and-summarize setups. They call a search API, get some text back, and write a response. That works for simple questions. But when you need an agent that can actually navigate a site, click through pagination, fill forms, or extract structured data from dynamic JavaScript pages — that's a different problem entirely.
I built a research assistant that needed to pull data from sites that don't have APIs. Some rendered JavaScript, some required navigation, one needed scroll-to-load content. Playwright handled all of it. This guide covers how to wire Playwright into a LangChain agent properly, including the rate limiting patterns that keep you from getting IP-banned within five minutes.
If you haven't built a basic LangChain agent yet, start with "Build AI agent with LangChain" first. If you're interested in retrieval over the content you scrape, the "RAG system tutorial" pairs naturally with this guide.
Why Web Browsing Agents Are Harder Than They Look
Search tools give you preprocessed text. A web browsing agent has to deal with the raw internet — JavaScript rendering, cookie consent popups, lazy-loaded content, infinite scroll, login walls, and sites that actively detect and block scrapers. The gap between "I'll just use requests" and "this actually works on modern websites" is substantial.
There are three main approaches:
- requests + BeautifulSoup — Fast, lightweight, works on static HTML. Fails on JS-rendered content.
- Playwright / Selenium — Full browser automation. Handles everything. Slower and heavier.
- FireCrawl / Jina Reader — Managed scraping services. They deal with the hard parts for you.
For a LangChain agent, you usually want a combination: FireCrawl for general browsing, Playwright for sites that need interactive navigation.
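One way to decide when a cheap static fetch is not enough is to look at the HTML it returned. The sketch below is a rough heuristic, not a rule: it assumes that a page with very little visible text and at least one script tag is a JavaScript shell that needs a real browser. The names `looks_js_rendered` and `pick_fetcher`, and the 200-character threshold, are illustrative assumptions.

```python
from html.parser import HTMLParser


class _TextStats(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: little visible text plus at least one script suggests a JS shell."""
    stats = _TextStats()
    stats.feed(html)
    return stats.text_chars < min_text_chars and stats.script_tags > 0


def pick_fetcher(html: str) -> str:
    """Return which approach to escalate to (labels are illustrative)."""
    return "playwright" if looks_js_rendered(html) else "static"
```

In practice you would run this on the response from a plain HTTP fetch and only launch a browser when it returns "playwright". Expect false positives on very short static pages, so tune the threshold for the sites you care about.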
Setting Up Playwright for LangChain
pip install playwright langchain langchain-openai langchain-community langgraph beautifulsoup4
playwright install chromium
Here's a basic Playwright-based scraping tool:
import asyncio
import re
from typing import Optional

from bs4 import BeautifulSoup
from langchain.tools import tool
from playwright.async_api import async_playwright


async def scrape_page_async(url: str, wait_for: Optional[str] = None, timeout: int = 30000) -> dict:
    """Core async scraping function using Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"]
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )
        page = await context.new_page()
        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
            # Wait for specific element if needed
            if wait_for:
                await page.wait_for_selector(wait_for, timeout=timeout)
            # Handle cookie consent banners
            consent_selectors = [
                "button[id*='accept']",
                "button[class*='accept']",
                "[data-testid='cookie-accept']"
            ]
            for selector in consent_selectors:
                try:
                    btn = await page.query_selector(selector)
                    if btn:
                        await btn.click()
                        await page.wait_for_timeout(500)
                        break
                except Exception:
                    # A missing or unclickable banner should never abort the scrape
                    pass
            # Get the full rendered HTML
            html = await page.content()
            title = await page.title()
            # Extract text using BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")
            # Remove noise
            for tag in soup(["script", "style", "nav", "footer", "aside", "header"]):
                tag.decompose()
            text = soup.get_text(separator="\n", strip=True)
            # Clean up excessive whitespace
            text = re.sub(r'\n{3,}', '\n\n', text)
            return {
                "url": url,
                "title": title,
                "content": text[:8000],  # limit to ~8k chars
                "success": True
            }
        except Exception as e:
            return {"url": url, "error": str(e), "success": False}
        finally:
            await browser.close()


@tool
def playwright_scrape(url: str) -> str:
    """
    Fetches and extracts text content from any web page, including JavaScript-rendered sites.
    Use when you need to read a specific URL. Returns the page title and main content.
    Args:
        url: The full URL to scrape (must start with http:// or https://)
    """
    if not url.startswith(("http://", "https://")):
        return "Error: URL must start with http:// or https://"
    # Note: asyncio.run() raises RuntimeError if an event loop is already running
    result = asyncio.run(scrape_page_async(url))
    if result["success"]:
        return f"Title: {result['title']}\n\nContent:\n{result['content']}"
    else:
        return f"Failed to scrape {url}: {result['error']}"
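One caveat with calling `asyncio.run()` inside a tool: it raises `RuntimeError` when an event loop is already running in the same thread, which is the case in Jupyter notebooks and inside async agent loops. A minimal sketch of a workaround is below. The helper name `run_coro_sync` is my own, and the approach (run the coroutine on a fresh loop in a worker thread) is one option among several.

```python
import asyncio
import threading
from typing import Any, Coroutine


def run_coro_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code.

    Uses asyncio.run() directly when no loop is running in this thread;
    otherwise runs the coroutine on a fresh event loop in a worker thread.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running in this thread: the simple path works
        return asyncio.run(coro)

    outcome: dict = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the caller below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # blocks the calling thread until the coroutine finishes
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

With this in place, the tool body could call `run_coro_sync(scrape_page_async(url))` instead of `asyncio.run(...)`. Note that the calling thread still blocks while the scrape runs, so in a fully async agent it is cleaner to expose the coroutine as an async tool instead.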
Multi-Page Crawling Pattern
Single-page scraping is just the start. Research agents often need to follow links, paginate through results, or crawl a site's structure. Here's a controlled crawling tool:
import random
from urllib.parse import urljoin, urlparse
from typing import List


async def crawl_site_async(
    start_url: str,
    max_pages: int = 5,
    same_domain_only: bool = True,
    delay_range: tuple = (1.5, 3.5)
) -> List[dict]:
    """Crawls multiple pages with rate limiting."""
    visited = set()
    results = []
    queue = [start_url]
    base_domain = urlparse(start_url).netloc
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (compatible; ResearchBot/1.0)"
        )
        while queue and len(visited) < max_pages:
            url = queue.pop(0)
            if url in visited:
                continue
            visited.add(url)
            # Rate limiting: critical to avoid blocks
            delay = random.uniform(*delay_range)
            await asyncio.sleep(delay)
            page = await context.new_page()
            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=20000)
                html = await page.content()
                title = await page.title()
                soup = BeautifulSoup(html, "html.parser")
                for tag in soup(["script", "style", "nav", "footer"]):
                    tag.decompose()
                text = soup.get_text(separator="\n", strip=True)[:5000]
                results.append({
                    "url": url,
                    "title": title,
                    "content": text
                })
                # Find links for the queue
                links = soup.find_all("a", href=True)
                for link in links:
                    href = urljoin(url, link["href"])
                    parsed = urlparse(href)
                    # Filter conditions
                    if not href.startswith("http"):
                        continue
                    if same_domain_only and parsed.netloc != base_domain:
                        continue
                    # Skip pages already visited or already waiting in the queue
                    if href in visited or href in queue:
                        continue
                    # Match on the path so query strings don't hide the extension
                    if parsed.path.lower().endswith((".pdf", ".jpg", ".png", ".zip")):
                        continue
                    queue.append(href)
            except Exception as e:
                results.append({"url": url, "error": str(e)})
            finally:
                await page.close()
        await browser.close()
    return results


@tool
def crawl_website(start_url: str, max_pages: int = 3) -> str:
    """
    Crawls a website starting from the given URL, following internal links.
    Respects rate limits. Use for researching a website's content across multiple pages.
    Max pages capped at 10 for safety.
    Args:
        start_url: The URL to start crawling from
        max_pages: Maximum number of pages to visit (default 3, max 10)
    """
    max_pages = min(max_pages, 10)  # safety cap
    results = asyncio.run(crawl_site_async(start_url, max_pages=max_pages))
    output = []
    for i, r in enumerate(results, 1):
        if "error" in r:
            output.append(f"Page {i}: ERROR - {r['url']}: {r['error']}")
        else:
            output.append(f"Page {i}: {r['title']}\nURL: {r['url']}\n{r['content'][:1000]}\n---")
    return "\n".join(output)
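A crawler like this tends to waste its page budget on near-duplicate URLs such as `/docs`, `/docs/`, and `/docs#intro`. Here is a small sketch of URL normalization plus a single filter function that could sit in front of the queue. The helper names are hypothetical, and treating a trailing slash as insignificant is an assumption that holds on most sites but not all.

```python
from urllib.parse import urldefrag, urlparse

SKIP_EXTENSIONS = (".pdf", ".jpg", ".png", ".zip")


def normalize_url(url: str) -> str:
    """Drop the #fragment and any trailing slash so equivalent URLs compare equal."""
    clean, _fragment = urldefrag(url)
    return clean.rstrip("/")


def should_enqueue(url: str, base_domain: str, seen: set) -> bool:
    """True if the URL is new, on the same domain, and not a binary file."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.netloc != base_domain:
        return False
    if parsed.path.lower().endswith(SKIP_EXTENSIONS):
        return False
    return normalize_url(url) not in seen
```

To use it, store `normalize_url(url)` in the `seen` set whenever a URL is queued or visited, and call `should_enqueue` before appending a link.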
newspaper3k for Article Extraction
When you're specifically scraping news articles or blog posts, newspaper3k does a much cleaner job than raw HTML parsing. It's purpose-built for article content extraction.
pip install newspaper3k lxml[html_clean]
from newspaper import Article
from langchain.tools import tool


@tool
def extract_article(url: str) -> str:
    """
    Extracts the main article content from a news article or blog post URL.
    Returns the title, author, publish date, and cleaned article text.
    Much cleaner than raw scraping for article URLs.
    Args:
        url: URL of the news article or blog post
    """
    try:
        article = Article(url)
        article.download()
        article.parse()
        # nlp() generates summary and keywords; it needs NLTK's "punkt"
        # tokenizer data, e.g. via nltk.download("punkt")
        article.nlp()
        output = []
        output.append(f"Title: {article.title}")
        if article.authors:
            output.append(f"Authors: {', '.join(article.authors)}")
        if article.publish_date:
            output.append(f"Published: {article.publish_date.strftime('%Y-%m-%d')}")
        output.append(f"Summary: {article.summary}")
        output.append(f"Keywords: {', '.join(article.keywords[:10])}")
        output.append(f"\nFull Text:\n{article.text[:4000]}")
        return "\n".join(output)
    except Exception as e:
        return f"Failed to extract article from {url}: {str(e)}"
FireCrawl Integration
FireCrawl is a managed scraping API that handles JavaScript rendering, anti-bot measures, and content cleaning. It's the right choice when you want scraping to just work without maintaining browser infrastructure.
pip install firecrawl-py
import os

from firecrawl import FirecrawlApp
from langchain.tools import tool

# Placeholder only; in real use, set the key in your environment instead
os.environ["FIRECRAWL_API_KEY"] = "your-firecrawl-key"
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Assumption: the dict-based `params` API of firecrawl-py 1.x.
# Parameter names and return types differ in other SDK versions,
# so check the docs for the version you have installed.


@tool
def firecrawl_scrape(url: str) -> str:
    """
    Scrapes a web page using FireCrawl, which handles JavaScript and anti-bot measures automatically.
    Returns clean markdown content. Best for complex or protected sites.
    Args:
        url: The URL to scrape
    """
    try:
        result = app.scrape_url(
            url,
            params={
                "formats": ["markdown"],
                "onlyMainContent": True,
                "waitFor": 2000
            }
        )
        if result.get("markdown"):
            return result["markdown"][:6000]
        else:
            return f"FireCrawl returned no content for {url}"
    except Exception as e:
        return f"FireCrawl error for {url}: {str(e)}"


@tool
def firecrawl_crawl(start_url: str, max_pages: int = 5) -> str:
    """
    Crawls multiple pages starting from start_url using FireCrawl.
    Handles JavaScript and bot protection automatically.
    Args:
        start_url: URL to start crawling from
        max_pages: Maximum pages to crawl (default 5)
    """
    try:
        crawl_result = app.crawl_url(
            start_url,
            params={
                "limit": min(max_pages, 10),
                "excludePaths": ["blog/*"],
                "scrapeOptions": {
                    "formats": ["markdown"],
                    "onlyMainContent": True
                }
            }
        )
        pages = crawl_result.get("data", [])
        output = []
        for page in pages:
            # In the 1.x response shape the page URL sits under metadata.sourceURL
            source_url = page.get("metadata", {}).get("sourceURL", "unknown")
            output.append(f"URL: {source_url}")
            output.append(f"Content: {page.get('markdown', '')[:1000]}")
            output.append("---")
        return "\n".join(output)
    except Exception as e:
        return f"FireCrawl crawl error: {str(e)}"
Rate Limiting Best Practices
Getting IP-banned is a real issue. Here's a rate limiting wrapper you should use around any scraping tool:
import time
import random
from functools import wraps
from collections import defaultdict
from threading import Lock


class RateLimiter:
    """Domain-specific rate limiter to avoid overwhelming any single site."""

    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request = defaultdict(float)
        self.lock = Lock()

    def wait(self, domain: str):
        with self.lock:
            now = time.time()
            last = self.last_request[domain]
            if last > 0:
                elapsed = now - last
                required_delay = random.uniform(self.min_delay, self.max_delay)
                if elapsed < required_delay:
                    sleep_time = required_delay - elapsed
                    time.sleep(sleep_time)
            self.last_request[domain] = time.time()


rate_limiter = RateLimiter(min_delay=1.5, max_delay=4.0)


def rate_limited_scrape(url: str) -> str:
    """Scrape with automatic rate limiting per domain."""
    from urllib.parse import urlparse
    domain = urlparse(url).netloc
    rate_limiter.wait(domain)
    result = asyncio.run(scrape_page_async(url))
    return result.get("content", result.get("error", "Unknown error"))
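If you have several scraping functions, wrapping each one by hand gets repetitive. A decorator is one way to apply the same per-domain throttle everywhere. The sketch below is a simplified variant of the idea: it uses a fixed minimum interval rather than a randomized one, is not thread-safe, and assumes the decorated function takes the URL as its first argument. The `clock` and `sleep` parameters are my own addition so the behaviour can be tested without real waiting.

```python
import time
from functools import wraps
from typing import Callable, Dict
from urllib.parse import urlparse


def rate_limited(min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
    """Decorator: enforce at least min_interval seconds between calls per domain."""
    last_call: Dict[str, float] = {}  # domain -> time of the most recent call

    def decorator(func):
        @wraps(func)
        def wrapper(url: str, *args, **kwargs):
            domain = urlparse(url).netloc
            previous = last_call.get(domain)
            if previous is not None:
                remaining = min_interval - (clock() - previous)
                if remaining > 0:
                    sleep(remaining)
            last_call[domain] = clock()
            return func(url, *args, **kwargs)
        return wrapper
    return decorator
```

Usage would look like putting `@rate_limited(2.0)` above any function whose first parameter is a URL. Calls to different domains do not delay each other, only repeat calls to the same domain do.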
Building the Complete Web Browsing Agent
import os
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
os.environ["OPENAI_API_KEY"] = "your-key"
os.environ["TAVILY_API_KEY"] = "your-key"
# Combine search + browsing tools
tavily_search = TavilySearchResults(max_results=5)
tools = [
    tavily_search,
    playwright_scrape,
    extract_article,
    crawl_website,
    # firecrawl_scrape,  # uncomment if you have a FireCrawl key
]
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Add memory so agent remembers what it scraped
memory = MemorySaver()
agent = create_react_agent(llm, tools, checkpointer=memory)
config = {"configurable": {"thread_id": "research-session-1"}}
# Example: multi-step research task
messages = [("human", """
Research the latest developments in AI agent frameworks.
1. Search for recent news about LangChain and LangGraph
2. Visit the LangChain blog and extract the 3 most recent posts
3. Summarize the key themes across what you find
""")]
result = agent.invoke({"messages": messages}, config=config)
print(result["messages"][-1].content)
Comparison Table: Playwright vs Selenium vs requests for AI Agents
| Feature | Playwright | Selenium | requests |
|---|---|---|---|
| JavaScript rendering | Full support | Full support | None |
| Async support | Native | Limited (sync API) | No (use aiohttp or httpx) |
| Speed | Fast | Slow | Very fast |
| Memory usage | Medium | High | Low |
| Anti-detection | Moderate | Poor (better with undetected-chromedriver) | Good with headers |
| Setup complexity | Low | Medium | None |
| LangChain integration | Good | Possible | Easy |
| Best for | Modern JS sites | Legacy compatibility | Static HTML |
| Headless stability | Excellent | Moderate | N/A |
Commonly cited comparisons put Playwright at roughly 3x faster than Selenium for typical page interactions, though the exact gap depends on the workload, so treat that figure as a rough guide. For AI agents that need to process dozens of pages per research task, the difference adds up.
Also worth knowing: most anti-bot services detect Selenium more readily than Playwright, because Selenium leaves more browser fingerprint artifacts. If your agent is getting blocked, switching from Selenium to Playwright often helps.
Handling Common Edge Cases
Infinite scroll pages: The agent needs to scroll before the content loads.
async def scrape_infinite_scroll(url: str, scroll_count: int = 3) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        for _ in range(scroll_count):
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # wait for content to load
        html = await page.content()
        await browser.close()
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)[:8000]
Login-protected pages:
async def scrape_with_login(url: str, login_url: str, username: str, password: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Log in first
        await page.goto(login_url)
        await page.fill('input[type="email"]', username)
        await page.fill('input[type="password"]', password)
        # In the Python API, expect_navigation() wraps the action that
        # triggers the navigation (there is no page.wait_for_navigation())
        async with page.expect_navigation():
            await page.click('button[type="submit"]')
        # Now navigate to the protected page
        await page.goto(url)
        content = await page.content()
        await browser.close()
    soup = BeautifulSoup(content, "html.parser")
    return soup.get_text(separator="\n", strip=True)[:8000]
For more on building complete research agents, see "AI research agent build" and "AI agent memory and planning". If you want to store and search what your agent scrapes, the "Vector database guide" covers setting up a retrieval layer over scraped content.
The LangChain tutorial 2025 also has a section on tool chaining that applies directly here — you'll often want your agent to search, then scrape the top results, then synthesize.
Conclusion
Building a web browsing agent that actually works on the modern web requires more than wrapping requests.get() in a tool. Playwright handles JavaScript-rendered content, FireCrawl handles the anti-bot complexity, and newspaper3k cleans up article extraction. Put them together with proper rate limiting and you have a research agent that can genuinely browse the web.
The rate limiting piece is non-negotiable. Scrape too fast and you'll get blocked, then your agent silently fails with no useful data. Build rate limiting in from the start, respect robots.txt, and your agent will stay operational long-term.
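Respecting robots.txt does not need a third-party library. Here is a minimal sketch using the standard library's `urllib.robotparser`. The function name `build_robots_checker` is hypothetical, and the default user agent token `ResearchBot` is borrowed from the crawler's user agent string earlier in this guide.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "ResearchBot"):
    """Return a function that says whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed
```

This version takes the robots.txt text as input so it is easy to test. Against a live site you could instead call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()`, then cache one checker per domain and consult it before every request.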
Try the CrewAI tutorial if you want to build a multi-agent research system where one agent browses while another synthesizes the findings.
Frequently Asked Questions
Is Playwright better than Selenium for LangChain agents?
Playwright generally outperforms Selenium for LangChain agents because it handles modern JavaScript-heavy sites better, has async support built-in, and is faster. Selenium has broader browser compatibility but its synchronous API makes it awkward in async agent loops.
How do I avoid getting blocked when scraping with a LangChain agent?
Use randomized delays between requests (1-5 seconds), rotate user agents, respect robots.txt, and consider using FireCrawl, which handles most anti-bot measures for you. As a ceiling, never make more than 1 request per second to the same domain.
Can a LangChain web browsing agent handle login-protected pages?
Yes, with Playwright you can automate form fills, click login buttons, and maintain session cookies across requests. Store credentials securely in environment variables and implement proper session management to avoid repeated logins.
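As a rough illustration of the environment-variable advice, the helper below reads a username and password for a given site and fails loudly if either is missing. The `<PREFIX>_USERNAME` / `<PREFIX>_PASSWORD` naming scheme and the function name are assumptions made for this example, not a convention from any library.

```python
import os
from typing import Mapping, Tuple


def load_credentials(prefix: str, env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read <PREFIX>_USERNAME and <PREFIX>_PASSWORD from the environment."""
    user_key = f"{prefix.upper()}_USERNAME"
    pass_key = f"{prefix.upper()}_PASSWORD"
    missing = [key for key in (user_key, pass_key) if not env.get(key)]
    if missing:
        # Name the missing variables, but never echo credential values
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env[user_key], env[pass_key]
```

The returned pair can be passed straight into a login routine such as `scrape_with_login`, which keeps secrets out of both the source code and the agent's prompt.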
AiTechWorlds Team