Web Scraping Best Practices 2026: Architecture, Ethics, and Production Patterns

March 19, 2026

Software Developer & Automation Specialist

I build production AI agents, web scrapers, and automation pipelines. Most of what I publish here comes from the actual problems they run into: proxies that get banned, anti-bot stacks that fingerprint your client, RAG that drifts when the underlying data moves. Stack: Python, TypeScript, Go, FastAPI, LangChain, Crawlee, Playwright, deployed on AWS, GCP, and Cloudflare.

Quick answer: Production-grade web scraping requires (1) checking for APIs first, (2) choosing the right tool for the site's rendering complexity, (3) respecting rate limits with delays and backoff, (4) using proxy rotation for blocked targets, (5) building schema-validated extraction with retries, and (6) logging structured metrics for monitoring. The practices below are ranked by impact.

Production-grade web scraping is engineering, not just writing selectors. This guide covers the patterns that separate throwaway scripts from scrapers that run reliably for months.

1. Check for APIs First

Before writing a scraper, verify the site doesn't have a public API:

# Watch the browser's Network tab for JSON (XHR/fetch) responses, or probe a likely endpoint directly:
curl "https://target.com/api/products" -H "Accept: application/json"

API-first extraction is:

10–100× faster (no HTML parsing)
More reliable (structured output, versioned)
Often allowed by ToS when scraping isn't
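
The curl check above can also be scripted when there are several candidate endpoints to try. The following is a minimal sketch using only the standard library. `looks_like_json_api` and `probe` are hypothetical helper names, not part of any library, and the URL you pass in is whatever endpoint you spotted in the Network tab.

```python
import json
import urllib.request


def looks_like_json_api(content_type: str, body: bytes) -> bool:
    """Return True if a response looks like a JSON API payload."""
    if "json" not in content_type.lower():
        return False
    try:
        parsed = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    # Data endpoints normally return a top-level object or array.
    return isinstance(parsed, (dict, list))


def probe(url: str, timeout: float = 10.0) -> bool:
    """Fetch url with an Accept: application/json header and classify the reply.

    Raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return looks_like_json_api(resp.headers.get("Content-Type", ""), resp.read())
```

A `True` result only says the endpoint answers with JSON. Whether you are allowed to use it is still a ToS question.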

2. Use Resilient Selector Strategy

Avoid selectors that break when the site updates:

Selector Type	Stability	Example
`id`	High	`#product-price`
`data-*` attributes	High	`[data-testid="price"]`
Semantic HTML	Medium	`h1`, `article`, `time`
Class names	Low	`.css-3xk23f`
XPath with text	Medium	`//span[contains(text(),"Price")]`

Write multiple fallback selectors:

function extractPrice(doc) {
  return (
    doc.querySelector('#priceblock_ourprice')?.textContent ??
    doc.querySelector('[data-asin-price]')?.getAttribute('data-asin-price') ??
    doc.querySelector('.price .a-offscreen')?.textContent ??
    null
  );
}

3. Implement Retry with Exponential Backoff

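import logging
import random
import time

import requests

RETRYABLE_STATUS = (429, 503)

def fetch_with_retry(url, session, max_retries=5):
    """GET url with a requests.Session, retrying transient failures with backoff and jitter."""
    for attempt in range(max_retries):
        last_attempt = attempt == max_retries - 1
        try:
            resp = session.get(url, timeout=30)
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            # Network-level failure: retry unless attempts are exhausted.
            if last_attempt:
                raise
        else:
            if resp.status_code not in RETRYABLE_STATUS:
                # Success returns here; permanent errors (403, 404, ...) raise immediately.
                resp.raise_for_status()
                return resp
            if last_attempt:
                resp.raise_for_status()  # still rate limited: surface the error
        wait = (2 ** attempt) + random.uniform(0, 1)
        logging.warning("Retrying %s in %.1fs (%d/%d)", url, wait, attempt + 1, max_retries)
        time.sleep(wait)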
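
Servers that answer 429 or 503 sometimes send a `Retry-After` header, and honouring it is usually better than guessing. The helper below is a sketch of how the wait time could be computed. `backoff_delay` is a hypothetical name, the `base` and `cap` defaults are assumptions, and it uses the "full jitter" variant (a random wait between 0 and the exponential ceiling) rather than the additive jitter in the function above.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After given as seconds: honour it, but never exceed the cap.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    # Full jitter: pick uniformly between 0 and the capped exponential ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

In a retry loop this would be called as `time.sleep(backoff_delay(attempt, resp.headers.get("Retry-After")))`.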

4. Rate Limit Yourself

Respect the site. Aggressive scraping can cause service degradation, legal exposure, and immediate IP bans.

Guidance by site type:

Site Type	Requests/second	Delay Between Requests
Small sites (under 10K page views/day)	0.5–1 req/s	1–2s
Medium sites	1–3 req/s	0.3–1s
Large sites with public data	3–10 req/s	0.1–0.3s
API endpoints (documented)	Per rate limit	Per docs
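
One way to enforce the delays in the table is a small per-host throttle that every request passes through. This is a sketch under assumptions: `DomainThrottle` is a hypothetical class, and the clock and sleep functions are injectable only so the behaviour can be tested without real waiting.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}  # host -> time of the previous request

    def wait(self, url: str) -> float:
        """Block until url's host may be hit again; return the seconds slept."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last.get(host)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.min_delay - (now - last))
        if pause > 0:
            self._sleep(pause)
        # Record when this request is actually allowed to go out.
        self._last[host] = now + pause
        return pause
```

For a small site, `throttle = DomainThrottle(min_delay=1.5)` followed by `throttle.wait(url)` before each request keeps you inside the 1–2s row. This version is single-threaded; a concurrent crawler would need a lock around `wait`.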

5. Rotate Proxies Correctly

When rotating proxies, do not keep one IP for an entire scraping run. Rotate per request or per small batch:

import itertools

proxy_pool = itertools.cycle([
    "http://user:pass@proxy1:7777",
    "http://user:pass@proxy2:7777",
    "http://user:pass@proxy3:7777",
])

def get_proxy():
    # Advance the pool once so HTTP and HTTPS traffic share the same proxy.
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

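The exception is sticky sessions such as login flows: pin those to one IP for the session duration, because changing IPs mid-session often invalidates cookies or triggers re-authentication.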
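
For sticky sessions, one simple approach is to derive the proxy from the session identifier, so the same session always lands on the same IP without keeping a lookup table. This is a sketch: `sticky_proxy` and `PROXIES` are hypothetical names, and the proxy URLs are the same placeholders used in the rotation example.

```python
import hashlib
from typing import Dict, List

# Placeholder endpoints; replace with your provider's proxy URLs.
PROXIES: List[str] = [
    "http://user:pass@proxy1:7777",
    "http://user:pass@proxy2:7777",
    "http://user:pass@proxy3:7777",
]


def sticky_proxy(session_id: str, proxies: List[str] = PROXIES) -> Dict[str, str]:
    """Map a session id to one proxy deterministically."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(proxies)
    proxy = proxies[index]
    return {"http": proxy, "https": proxy}
```

The mapping changes if the pool is resized, so this fits pools that stay fixed for the life of a session. Many proxy providers also offer sticky sessions on their side, which avoids the problem entirely.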

6. Handle Dynamic Content Properly

Content Type	Best Approach
Lazy-loaded images	Scroll to trigger load before extracting
Infinite scroll	Scroll in loop, detect end by count or element
Load-on-click	Simulate click, wait for network idle
AJAX/XHR content	Intercept network requests directly
Login-gated content	Use session cookies, not per-request auth

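// Scroll until the page height stops growing (infinite scroll)
async function scrollToEnd(page, maxScrolls = 50) {
  let prevHeight = 0;
  // Cap the iterations so a truly endless feed cannot hang the run
  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1500); // give lazy-loaded content time to arrive
    const newHeight = await page.evaluate(() => document.body.scrollHeight);
    if (newHeight === prevHeight) break; // no new content was added
    prevHeight = newHeight;
  }
}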

7. Store Incrementally, Not at the End

Write to storage after each item or batch. Never buffer all results in memory until the end of the run:

// Good: push immediately
await Actor.pushData(item);

// Bad: buffer and push at the end (loses data on crash)
const results = [];
results.push(item);
await Actor.pushData(results); // Only at the end

Use Apify's Dataset or Crawlee's built-in pushData() for crash-safe incremental storage.
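
Outside Apify or Crawlee, the same idea can be sketched in Python by appending each item to a JSON Lines file as soon as it is extracted. `append_item` and `load_items` are hypothetical helpers, and the file path is an assumption.

```python
import json
import os
from typing import Any, Dict, List


def append_item(path: str, item: Dict[str, Any]) -> None:
    """Append one record as a JSON line and flush it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(item, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to persist the line before moving on


def load_items(path: str) -> List[Dict[str, Any]]:
    """Read back everything written so far, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

If the scraper crashes after item 4,000, the first 4,000 lines are already on disk. Calling `fsync` on every item is the safest and slowest choice; flushing once per batch is a reasonable trade-off at higher volumes.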

8. Deduplication

Track what you've scraped to avoid reprocessing:

import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Crawlee deduplicates by URL automatically
// For custom dedup key:
await queue.addRequest({
  url: 'https://example.com/product/123',
  uniqueKey: 'product-123', // Custom key
});

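Outside Crawlee, the simplest dedup is an in-memory set. It is lost when the process exits, so back it with a database if it must persist across runs: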

seen_urls = set()

def is_new(url):
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True
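
To make dedup survive restarts, the set can be swapped for a table with a unique key. The sketch below uses SQLite from the standard library. `SeenStore` and the `seen.db` file name are assumptions for illustration.

```python
import sqlite3


class SeenStore:
    """Persistent URL dedup backed by SQLite."""

    def __init__(self, path: str = "seen.db") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
        self.conn.commit()

    def is_new(self, url: str) -> bool:
        """Record url and return True the first time it is seen, False afterwards."""
        # The primary key makes the insert a no-op for URLs already stored.
        cur = self.conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
        self.conn.commit()
        return cur.rowcount == 1
```

Because the check and the insert are a single statement, there is no window in which two workers sharing the database can both claim the same URL.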

9. Monitor and Alert

Production scrapers need monitoring:

// Log structured data for monitoring
const metrics = {
  url,
  statusCode: response.status,
  itemsExtracted: items.length,
  durationMs: Date.now() - startTime,
  proxyCountry: proxy.country,
};
console.log(JSON.stringify(metrics));

Set up alerts for:

0 items extracted (site layout changed)
Error rate > 10% (IP or anti-bot issue)
Run duration > 2× baseline (performance regression)
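
The three alert conditions above are easy to encode as a check that runs at the end of each job. This is a sketch: `RunStats` and `check_alerts` are hypothetical names, and the baseline duration is assumed to come from your own history of previous runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunStats:
    requests: int
    errors: int
    items_extracted: int
    duration_s: float


def check_alerts(stats: RunStats, baseline_duration_s: float) -> List[str]:
    """Return the names of the alert conditions that this run triggered."""
    alerts = []
    if stats.items_extracted == 0:
        alerts.append("zero_items")  # likely a layout change
    if stats.requests > 0 and stats.errors / stats.requests > 0.10:
        alerts.append("high_error_rate")  # likely an IP or anti-bot issue
    if stats.duration_s > 2 * baseline_duration_s:
        alerts.append("slow_run")  # performance regression
    return alerts
```

The returned list can be forwarded to whatever alerting channel you already use.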

10. Respect robots.txt and Terms of Service

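from urllib.robotparser import RobotFileParser

url = "https://target.com/products"  # example page you intend to fetch

rp = RobotFileParser()
rp.set_url("https://target.com/robots.txt")
rp.read()  # downloads and parses robots.txt

# "*" checks the rules that apply to all crawlers; pass your own user-agent string if you send one
if rp.can_fetch("*", url):
    scrape(url)  # your own fetch/extract function
else:
    print(f"Blocked by robots.txt: {url}")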

Key rules:

Check robots.txt for Disallow: directives
Review ToS for scraping restrictions
Never store or share personal data beyond your stated purpose (GDPR/CCPA)
Only scrape publicly available data
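
To see how `Disallow:` rules are evaluated without hitting the network, `RobotFileParser` can also parse robots.txt content you already have. The sketch below uses a made-up robots.txt and a hypothetical `my-scraper` user agent. It also reads the `Crawl-delay` value, which some sites publish and which can feed your rate limiter.

```python
from urllib.robotparser import RobotFileParser

# Example content only; real sites publish their own rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)
allowed = rp.can_fetch("my-scraper", "https://target.com/products/1")
blocked = rp.can_fetch("my-scraper", "https://target.com/admin/users")
delay = rp.crawl_delay("my-scraper")  # None when the site sets no Crawl-delay
```

With the rules above, `allowed` is `True`, `blocked` is `False`, and `delay` is `2`.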

FAQ

The highest-impact practices are: (1) check for public APIs before scraping, (2) respect rate limits with delays and backoff, (3) use proxy rotation for blocked or high-volume targets, (4) validate extracted data against a schema, (5) implement retries with exponential backoff, and (6) monitor for extraction failures and layout changes. These six practices cover 90% of production failure modes.

Rotate User-Agent strings, add random delays between requests (1–5 seconds per domain), use residential proxies for sites with strict IP policies, and don't send concurrent requests from a single IP. For sites with JavaScript challenges (Cloudflare, DataDome), use a stealth headless browser or a managed unblocking service.

Use a managed scraping API (Apify, Bright Data, Firecrawl) if: you're scraping protected sites (Cloudflare, CAPTCHA), you need scale above 10K pages/day, or you want clean Markdown/JSON output for AI pipelines. Build your own with Playwright/Crawlee/Scrapy if: you need full control, you're scraping unprotected HTML, or you have specific parsing logic that APIs don't support.

Use CSS selectors (faster) for stable, well-structured HTML. Use XPath for complex hierarchies or when CSS selectors aren't expressive enough. For LLM pipelines, extract the main content first with a library like readability-lxml and convert it to Markdown (for example with markdownify or html2text), or send the URL to Firecrawl's /scrape endpoint, which can return Markdown directly. Always validate output against a schema (Zod in JS, Pydantic in Python) before storage.
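
As a rough illustration of schema validation before storage, the sketch below uses a standard-library dataclass as a stand-in for Pydantic so that it runs without extra dependencies. The `Product` fields and the `parse_product` helper are assumptions about what a product scraper might extract, not a fixed schema.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class Product:
    url: str
    title: str
    price: Decimal

    def __post_init__(self) -> None:
        # Reject records that would silently pollute the dataset.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"bad url: {self.url!r}")
        if not self.title.strip():
            raise ValueError("empty title")
        if not self.price.is_finite() or self.price < 0:
            raise ValueError(f"bad price: {self.price}")


def parse_product(raw: dict) -> Product:
    """Coerce a raw scraped dict into a Product, or raise ValueError."""
    try:
        text = str(raw["price"]).replace("$", "").replace(",", "").strip()
        price = Decimal(text)
    except (KeyError, InvalidOperation) as exc:
        raise ValueError(f"bad price in {raw!r}") from exc
    return Product(url=str(raw.get("url", "")), title=str(raw.get("title", "")), price=price)
```

A record that fails validation is a signal worth counting: a sudden rise in `ValueError`s usually means a selector now matches the wrong element.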
