Web Scraping Best Practices 2026: Architecture, Ethics, and Production Patterns

Yassine El Haddad
Software Developer & Automation Specialist

I build production AI agents, web scrapers, and automation pipelines. Most of what I publish here comes from the actual problems they run into: proxies that get banned, anti-bot stacks that fingerprint your client, RAG that drifts when the underlying data moves. Stack: Python, TypeScript, Go, FastAPI, LangChain, Crawlee, Playwright, deployed on AWS, GCP, and Cloudflare.

Quick answer: Production-grade web scraping requires (1) checking for APIs first, (2) choosing the right tool for the site's rendering complexity, (3) respecting rate limits with delays and backoff, (4) using proxy rotation for blocked targets, (5) building schema-validated extraction with retries, and (6) logging structured metrics for monitoring. The practices below are ranked by impact.

Production-grade web scraping is engineering, not just writing selectors. This guide covers the patterns that separate throwaway scripts from scrapers that run reliably for months.

1. Check for APIs First

Before writing a scraper, verify the site doesn't have a public API:

# Look for API endpoints in the browser's network tab, or probe a likely path:
curl "https://target.com/api/products" -H "Accept: application/json"

API-first extraction is:

  • 10–100× faster (no HTML parsing)
  • More reliable (structured output, versioned)
  • Often allowed by ToS when scraping isn't
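
Once a candidate endpoint responds, it helps to confirm that it really returns JSON rather than an HTML error page. The following is a minimal sketch of that check; `looks_like_json_api` is a hypothetical helper name, and the status, Content-Type header, and body are assumed to come from whatever HTTP client you already use.

```python
import json


def looks_like_json_api(status, content_type, body):
    """Heuristic: True if a response looks like a usable JSON API reply."""
    if status != 200:
        return False
    # Accept application/json as well as vendor types such as application/vnd.api+json
    if "json" not in (content_type or "").lower():
        return False
    try:
        json.loads(body)
    except (TypeError, ValueError):
        # Declared as JSON but the body does not parse
        return False
    return True
```

A positive result only says the endpoint is worth investigating; the site's ToS and any documented rate limits still apply.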

2. Use Resilient Selector Strategy

Avoid selectors that break when the site updates:

| Selector Type | Stability | Example |
| --- | --- | --- |
| id | High | #product-price |
| data-* attributes | High | [data-testid="price"] |
| Semantic HTML | Medium | h1, article, time |
| Class names | Low | .css-3xk23f |
| XPath with text | Medium | //span[contains(text(),"Price")] |

Write multiple fallback selectors:

function extractPrice(doc) {
  return (
    doc.querySelector('#priceblock_ourprice')?.textContent ??
    doc.querySelector('[data-asin-price]')?.getAttribute('data-asin-price') ??
    doc.querySelector('.price .a-offscreen')?.textContent ??
    null
  );
}

3. Implement Retry with Exponential Backoff

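import logging
import random
import time

def fetch_with_retry(url, session, max_retries=5):
    """Fetch url with a requests-style session, retrying with exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = session.get(url, timeout=30)
            if resp.status_code == 200:
                return resp
            if resp.status_code in (429, 503):
                # Rate limited or overloaded: back off exponentially, with jitter
                wait = (2 ** attempt) + random.uniform(0, 1)
                logging.warning(f"Rate limited. Retrying in {wait:.1f}s ({attempt+1}/{max_retries})")
                time.sleep(wait)
                continue
            resp.raise_for_status()  # raises on other 4xx/5xx statuses
        except Exception:
            # Network or HTTP error: re-raise once the last attempt has failed
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    return None  # retries exhausted without a 200 response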
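
The wait calculation above can be pulled into its own function so that it is capped and can honor a server-supplied Retry-After value. This is a sketch under assumptions: `backoff_delay`, its `base` and `cap` parameters, and the 60-second cap are illustrative choices, and only the delta-seconds form of Retry-After is handled.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Return the number of seconds to wait before retry `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Prefer the server's own hint when it is given in seconds
            return max(0.0, min(float(retry_after), cap))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch
    # Exponential growth, capped, plus up to one second of jitter
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)
```

Inside `fetch_with_retry`, the value could be passed as `backoff_delay(attempt, resp.headers.get("Retry-After"))`, assuming a requests-style response object.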

4. Rate Limit Yourself

Respect the site. Aggressive scraping can cause service degradation, legal exposure, and immediate IP bans.

Guidance by site type:

| Site Type | Requests/second | Delay Between Requests |
| --- | --- | --- |
| Small sites (under 10K pages/day traffic) | 0.5–1 req/s | 1–2s |
| Medium sites | 1–3 req/s | 0.3–1s |
| Large sites with public data | 3–10 req/s | 0.1–0.3s |
| API endpoints (documented) | Per rate limit | Per docs |
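
One way to enforce these delays is a small per-domain throttle that sleeps until the minimum gap since the last request has passed. The sketch below is an illustration, not a library API: `DomainThrottle` is a hypothetical class, and the `clock` and `sleep` parameters exist only so the behavior can be tested without real waiting.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # domain -> time of the most recent request

    def wait(self, url):
        """Block until `url`'s domain may be hit again; return seconds slept."""
        domain = urlparse(url).netloc
        waited = 0.0
        last = self._last.get(domain)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last[domain] = self._clock()
        return waited
```

Call `throttle.wait(url)` immediately before each request. For a small site from the table, `DomainThrottle(min_delay=1.5)` would fall inside the 1–2s range.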

5. Rotate Proxies Correctly

Don't send an entire crawl through a single IP. For stateless requests, rotate per request or per small batch:

import itertools

proxy_pool = itertools.cycle([
    "http://user:pass@proxy1:7777",
    "http://user:pass@proxy2:7777",
    "http://user:pass@proxy3:7777",
])

def get_proxy():
    # Advance the pool once so HTTP and HTTPS traffic share the same proxy
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

For sticky sessions (login flows), pin to one IP for the session duration.
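
Rotation and sticky sessions can live in the same pool. The sketch below assumes a hypothetical `ProxyPool` class and caller-chosen session ids such as `"login-1"`; the returned dict follows the `proxies` mapping shape used by requests.

```python
import itertools


class ProxyPool:
    """Round-robin proxy rotation with optional per-session pinning."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._sticky = {}  # session id -> pinned proxy URL

    def get(self, session_id=None):
        if session_id is None:
            # Stateless request: take the next proxy in the rotation
            proxy = next(self._cycle)
        else:
            # Sticky session: pin on first use, then reuse the same proxy
            if session_id not in self._sticky:
                self._sticky[session_id] = next(self._cycle)
            proxy = self._sticky[session_id]
        return {"http": proxy, "https": proxy}

    def release(self, session_id):
        """Forget a pinned proxy once the session (e.g. a login flow) ends."""
        self._sticky.pop(session_id, None)
```

Releasing the session when the login flow finishes keeps the pin table from growing over a long run.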


6. Handle Dynamic Content Properly

| Content Type | Best Approach |
| --- | --- |
| Lazy-loaded images | Scroll to trigger load before extracting |
| Infinite scroll | Scroll in loop, detect end by count or element |
| Load-on-click | Simulate click, wait for network idle |
| AJAX/XHR content | Intercept network requests directly |
| Login-gated content | Use session cookies, not per-request auth |

// Scroll to the bottom repeatedly until the page height stops growing
async function scrollToEnd(page, maxScrolls = 50) {
  let prevHeight = 0;
  // maxScrolls (an arbitrary default) bounds the loop on feeds that never end
  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1500); // give lazy content time to load
    const newHeight = await page.evaluate(() => document.body.scrollHeight);
    if (newHeight === prevHeight) break; // nothing new loaded: end reached
    prevHeight = newHeight;
  }
}

7. Store Incrementally, Not at the End

Write to storage after each item or batch — never buffer all results in memory:

// Good: push immediately
await Actor.pushData(item);

// Bad: buffer and push at the end (loses data on crash)
const results = [];
results.push(item);
await Actor.pushData(results); // Only at the end

Use Apify's Dataset or Crawlee's built-in pushData() for crash-safe incremental storage.
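
Outside Apify or Crawlee, the same idea can be approximated by appending each item to a JSON Lines file as soon as it is extracted. This is a minimal sketch; `append_item`, `load_items`, and the file path are hypothetical names, not part of any library.

```python
import json


def append_item(path, item):
    """Append one scraped item as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
        # The file is flushed and closed on exit from the with-block,
        # so a later crash does not lose this item.


def load_items(path):
    """Read back every item written so far, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is an independent JSON document, a crash mid-run can at worst leave one partial trailing line rather than corrupting earlier items.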


8. Deduplication

Track what you've scraped to avoid reprocessing:

import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Crawlee deduplicates by URL automatically
// For custom dedup key:
await queue.addRequest({
  url: 'https://example.com/product/123',
  uniqueKey: 'product-123', // Custom key
});

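Outside Crawlee, the simplest dedup is an in-memory set. Back it with a database or key-value store if it has to survive restarts:

seen_urls = set()

def is_new(url):
    """Return True the first time a URL is seen, False afterwards."""
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True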
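
For dedup that persists across runs, one option is a single-table SQLite store. The sketch below uses only the standard library; `SeenStore` and the `seen` table name are illustrative, and the default `":memory:"` path should be replaced with a file path for real persistence.

```python
import sqlite3


class SeenStore:
    """Persistent dedup: remembers which keys have already been scraped."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)")
        self._db.commit()

    def is_new(self, key):
        """Record `key`; return True only the first time it is seen."""
        cur = self._db.execute(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)", (key,)
        )
        self._db.commit()
        # rowcount is 1 when the row was inserted, 0 when the key already existed
        return cur.rowcount == 1
```

Doing the check and the insert in one statement avoids a race between "have I seen this?" and "mark it as seen" when several workers share the file.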

9. Monitor and Alert

Production scrapers need monitoring:

// Log structured data for monitoring
const metrics = {
  url,
  statusCode: response.status,
  itemsExtracted: items.length,
  durationMs: Date.now() - startTime,
  proxyCountry: proxy.country,
};
console.log(JSON.stringify(metrics));

Set up alerts for:

  • 0 items extracted (site layout changed)
  • Error rate > 10% (IP or anti-bot issue)
  • Run duration > 2× baseline (performance regression)
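
The three alert rules above can be expressed as a small function over per-run totals. This is a sketch under assumptions: the field names in `run` (`items_extracted`, `requests`, `errors`, `duration_ms`) and the alert labels are hypothetical, and the baseline duration is whatever you measure from healthy runs.

```python
def check_alerts(run, baseline_duration_ms):
    """Return the list of alert names triggered by one scraper run."""
    alerts = []
    if run["items_extracted"] == 0:
        alerts.append("no_items")  # likely a site layout change
    total = run["requests"]
    if total and run["errors"] / total > 0.10:
        alerts.append("high_error_rate")  # likely IP or anti-bot issue
    if run["duration_ms"] > 2 * baseline_duration_ms:
        alerts.append("slow_run")  # performance regression
    return alerts
```

The returned names can be forwarded to whichever alerting channel you already use.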

10. Respect robots.txt and Terms of Service

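from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://target.com/robots.txt")
rp.read()  # fetches and parses robots.txt over the network

# `url` is the page you intend to fetch; `scrape` is your own function
if rp.can_fetch("*", url):
    scrape(url)
else:
    print(f"Blocked by robots.txt: {url}")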

Key rules:

  • Check robots.txt for Disallow: directives
  • Review ToS for scraping restrictions
  • Never store or share personal data beyond your stated purpose (GDPR/CCPA)
  • Only scrape publicly available data
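
robots.txt can also carry a Crawl-delay hint, which the standard-library parser exposes. The sketch below parses an inline sample instead of fetching over the network; `SAMPLE_ROBOTS`, `build_parser`, and the `my-bot` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from robots.txt text that is already in hand."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


rp = build_parser(SAMPLE_ROBOTS)
allowed = rp.can_fetch("my-bot", "https://example.com/products")
blocked = rp.can_fetch("my-bot", "https://example.com/private/report")
delay = rp.crawl_delay("my-bot")  # None when no Crawl-delay is declared
```

When `delay` is not `None`, it is a reasonable lower bound for the per-request delay used when rate limiting that site.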


Frequently Asked Questions

The highest-impact practices are: (1) check for public APIs before scraping, (2) respect rate limits with delays and backoff, (3) use proxy rotation for blocked or high-volume targets, (4) validate extracted data against a schema, (5) implement retries with exponential backoff, and (6) monitor for extraction failures and layout changes. Together, these six practices address most production failure modes.

To reduce the chance of being blocked, rotate User-Agent strings, add random delays between requests (1–5 seconds per domain), use residential proxies for sites with strict IP policies, and don't send concurrent requests from a single IP. For sites with JavaScript challenges (Cloudflare, DataDome), use a stealth headless browser or a managed unblocking service.

Use a managed scraping API (Apify, Bright Data, Firecrawl) if: you're scraping protected sites (Cloudflare, CAPTCHA), you need scale above 10K pages/day, or you want clean Markdown/JSON output for AI pipelines. Build your own with Playwright/Crawlee/Scrapy if: you need full control, you're scraping unprotected HTML, or you have specific parsing logic that APIs don't support.

Use CSS selectors (faster) for stable, well-structured HTML. Use XPath for complex hierarchies or when CSS selectors aren't expressive enough. For LLM pipelines, convert to Markdown first using a library like readability-lxml or send HTML to Firecrawl's /scrape endpoint. Always validate output against a schema (Zod in JS, Pydantic in Python) before storage.
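
In Python, Pydantic would be the usual choice for schema validation. To stay dependency-free, the sketch below shows the same idea with a standard-library dataclass. The `Product` fields and the `validate_product` helper are hypothetical and should be adapted to your own records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    url: str
    title: str
    price: float


def validate_product(raw):
    """Return (Product, None) for a valid record, or (None, error message)."""
    try:
        url = str(raw["url"])
        title = str(raw["title"]).strip()
        # Strip common price formatting before converting to a number
        price = float(str(raw["price"]).replace("$", "").replace(",", ""))
    except (KeyError, ValueError) as exc:
        return None, f"invalid record: {exc!r}"
    if not url.startswith(("http://", "https://")):
        return None, "invalid record: url must be absolute"
    if not title:
        return None, "invalid record: empty title"
    if price < 0:
        return None, "invalid record: negative price"
    return Product(url, title, price), None
```

Records that fail validation are better logged and set aside than stored, since a sudden rise in failures is often the first sign of a layout change.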

Common mistakes and fixes

Scrapers work in development but fail randomly in production.

Wait explicitly for the element or network state you need instead of relying on fixed time.sleep() or setTimeout() delays. Use selector fallbacks (try multiple selectors). Implement retry logic with exponential backoff. Log all failures with request URL and response status.

Scraper output is inconsistent — same URL returns different data.

Sites A/B test layouts. Use multiple extraction strategies: try primary selector, fall back to secondary, log to a 'needs_review' queue if both fail.
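
The primary, secondary, then review-queue flow might look like the sketch below. The regex extractors stand in for real selector logic from your HTML parser, and every name here (`by_data_attr`, `by_price_class`, `extract_with_fallbacks`, the `needs_review` list) is hypothetical.

```python
import re


def by_data_attr(html):
    """Primary strategy: read the price from a data-price attribute."""
    m = re.search(r'data-price="([^"]+)"', html)
    return m.group(1) if m else None


def by_price_class(html):
    """Secondary strategy: read the text of an element with class="price"."""
    m = re.search(r'class="price"[^>]*>([^<]+)<', html)
    return m.group(1).strip() if m else None


def extract_with_fallbacks(html, strategies, needs_review, url):
    """Try each (name, extractor) in order; queue the URL if all fail."""
    for name, extractor in strategies:
        value = extractor(html)
        if value is not None:
            return name, value  # also report which layout variant matched
    needs_review.append(url)
    return None, None
```

Returning the name of the strategy that matched makes it possible to log how often each layout variant appears, which helps spot an A/B test rolling out.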