Web Scraping Best Practices 2026: Architecture, Ethics, and Production Patterns
Quick answer: Production-grade web scraping requires (1) checking for APIs first, (2) choosing the right tool for the site's rendering complexity, (3) respecting rate limits with delays and backoff, (4) using proxy rotation for blocked targets, (5) building schema-validated extraction with retries, and (6) logging structured metrics for monitoring. The practices below are ranked by impact.
Production-grade web scraping is engineering, not just writing selectors. This guide covers the patterns that separate throwaway scripts from scrapers that run reliably for months.
1. Check for APIs First
Before writing a scraper, check whether the site already exposes a public API:
# Look for JSON responses in the browser's network tab, or probe a likely endpoint:
curl "https://target.com/api/products" -H "Accept: application/json"
API-first extraction is:
- 10–100× faster (no HTML parsing)
- More reliable (structured output, versioned)
- Often allowed by ToS when scraping isn't
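A probe like the curl call above can return an HTML error or block page even when JSON was requested, so it helps to confirm that the response really is JSON. The sketch below is one possible check, not a standard tool: `looks_like_json_api` is a hypothetical helper, and it assumes you have already fetched the response and have its Content-Type header and body as strings.

```python
import json


def looks_like_json_api(content_type: str, body: str) -> bool:
    """Return True if a response appears to be a JSON API payload.

    Hypothetical helper: checks the Content-Type header first, then
    confirms the body actually parses as a JSON object or array.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/json" and not media_type.endswith("+json"):
        return False
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, (dict, list))
```

If the check passes, it is usually worth reading the site's API documentation and terms before building on the endpoint.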
2. Use Resilient Selector Strategy
Avoid selectors that break when the site updates:
| Selector Type | Stability | Example |
|---|---|---|
| id | High | #product-price |
| data-* attributes | High | [data-testid="price"] |
| Semantic HTML | Medium | h1, article, time |
| Class names | Low | .css-3xk23f |
| XPath with text | Medium | //span[contains(text(),"Price")] |
Write multiple fallback selectors:
function extractPrice(doc) {
  // Try the most stable selector first, then fall back to weaker ones
  return (
    doc.querySelector('#priceblock_ourprice')?.textContent ??
    doc.querySelector('[data-asin-price]')?.getAttribute('data-asin-price') ??
    doc.querySelector('.price .a-offscreen')?.textContent ??
    null
  );
}
3. Implement Retry with Exponential Backoff
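import logging
import random
import time
import requests
def fetch_with_retry(url, session, max_retries=5):
    """GET url, retrying 429/503 and network errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = session.get(url, timeout=30)
        except requests.RequestException as exc:
            # Timeouts and connection errors are retried; re-raise on the last attempt
            if attempt == max_retries - 1:
                raise
            reason = type(exc).__name__
        else:
            if resp.status_code not in (429, 503):
                resp.raise_for_status()  # other 4xx/5xx fail fast instead of retrying
                return resp
            reason = f"HTTP {resp.status_code}"
        if attempt < max_retries - 1:
            wait = (2 ** attempt) + random.uniform(0, 1)  # backoff plus jitter
            logging.warning(f"{reason}. Retrying in {wait:.1f}s ({attempt+1}/{max_retries})")
            time.sleep(wait)
    return None  # still rate limited after max_retries attempts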
4. Rate Limit Yourself
Respect the site. Aggressive scraping can cause service degradation, legal exposure, and immediate IP bans.
Guidance by site type:
| Site Type | Requests/second | Delay Between Requests |
|---|---|---|
| Small sites (under 10K pages/day traffic) | 0.5–1 req/s | 1–2s |
| Medium sites | 1–3 req/s | 0.3–1s |
| Large sites with public data | 3–10 req/s | 0.1–0.3s |
| API endpoints (documented) | Per rate limit | Per docs |
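The delays in the table can be enforced in code rather than left to scattered `sleep` calls. The following is a minimal sketch of a per-host throttle, assuming a single-threaded scraper; `DomainThrottle` and its parameters are illustrative names, not part of any library.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_s: float, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_s = min_delay_s
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time the last request was allowed

    def wait(self, url: str) -> float:
        """Block until the host may be hit again; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay_s - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` immediately before each request keeps every host at or below the chosen rate; for a small site, `DomainThrottle(min_delay_s=1.5)` would match the first row of the table.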
5. Rotate Proxies Correctly
For stateless requests, don't keep a whole crawl on a single IP; rotate per request or per small batch:
import itertools
proxy_pool = itertools.cycle([
    "http://user:pass@proxy1:7777",
    "http://user:pass@proxy2:7777",
    "http://user:pass@proxy3:7777",
])
def get_proxy():
    # Advance the pool once so HTTP and HTTPS traffic share the same proxy
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}
For sticky sessions (login flows), pin to one IP for the session duration.
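Both behaviours, rotation for stateless requests and pinning for sticky sessions, can live behind one small interface. This is a sketch under stated assumptions: `ProxyAssigner`, `for_request`, and `release` are hypothetical names, and the returned dict follows the `proxies` format used by the `requests` library.

```python
import itertools


class ProxyAssigner:
    """Rotate proxies per request, but pin one proxy per sticky session."""

    def __init__(self, proxy_urls):
        self._pool = itertools.cycle(proxy_urls)
        self._sticky = {}  # session_id -> pinned proxy URL

    def for_request(self, session_id=None):
        """Return a requests-style proxies dict.

        With no session_id, the next proxy in the pool is used.
        With a session_id, the first proxy assigned is reused every time.
        """
        if session_id is None:
            proxy = next(self._pool)
        else:
            if session_id not in self._sticky:
                self._sticky[session_id] = next(self._pool)
            proxy = self._sticky[session_id]
        return {"http": proxy, "https": proxy}

    def release(self, session_id):
        """Forget a pinned proxy once the session (e.g. a login flow) ends."""
        self._sticky.pop(session_id, None)
```

A login flow would pass the same `session_id` on every call and call `release` when it finishes, while ordinary page fetches call `for_request()` with no argument.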
6. Handle Dynamic Content Properly
| Content Type | Best Approach |
|---|---|
| Lazy-loaded images | Scroll to trigger load before extracting |
| Infinite scroll | Scroll in loop, detect end by count or element |
| Load-on-click | Simulate click, wait for network idle |
| AJAX/XHR content | Intercept network requests directly |
| Login-gated content | Use session cookies, not per-request auth |
// Scroll to bottom for infinite scroll (capped so an endless feed cannot loop forever)
async function scrollToEnd(page, maxScrolls = 50) {
  let prevHeight = 0;
  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1500); // give lazy content time to load
    const newHeight = await page.evaluate(() => document.body.scrollHeight);
    if (newHeight === prevHeight) break; // height stopped growing: reached the end
    prevHeight = newHeight;
  }
}
7. Store Incrementally, Not at the End
Write to storage after each item or batch — never buffer all results in memory:
// Good: push immediately
await Actor.pushData(item);
// Bad: buffer and push at the end (loses data on crash)
const results = [];
results.push(item);
await Actor.pushData(results); // Only at the end
Use Apify's Dataset or Crawlee's built-in pushData() for crash-safe incremental storage.
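Outside the Apify and Crawlee ecosystem, the same crash-safe pattern can be approximated by appending one JSON line per item. The sketch below assumes a local file is an acceptable store; `append_item` and `load_items` are hypothetical helper names.

```python
import json
import os


def append_item(path: str, item: dict) -> None:
    """Append one scraped item as a JSON line and flush it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(item, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to persist the line before continuing


def load_items(path: str) -> list:
    """Read back every item written so far (e.g. after a restart)."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Syncing after every item trades some throughput for durability; flushing once per small batch is a reasonable middle ground for high-volume runs.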
8. Deduplication
Track what you've scraped to avoid reprocessing:
import { RequestQueue } from 'crawlee';
const queue = await RequestQueue.open();
// Crawlee deduplicates by URL automatically
// For custom dedup key:
await queue.addRequest({
  url: 'https://example.com/product/123',
  uniqueKey: 'product-123', // Custom key
});
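For dedup outside Crawlee, the same check can be done with a set (in-memory here; back it with a database table when it must persist across runs):
seen_urls = set()
def is_new(url):
    """Return True the first time a URL is seen, False afterwards."""
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True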
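The set above lives only in memory, so it is lost when the process restarts. A minimal sketch of a persistent variant using SQLite from the standard library follows; the `SeenStore` class, the `seen` table, and the `seen.db` file name are illustrative assumptions.

```python
import sqlite3


class SeenStore:
    """Persistent URL dedup backed by a SQLite table."""

    def __init__(self, db_path: str = "seen.db"):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)"
        )
        self._conn.commit()

    def is_new(self, key: str) -> bool:
        """Record key and return True only the first time it is seen."""
        cur = self._conn.execute(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)", (key,)
        )
        self._conn.commit()
        return cur.rowcount == 1  # 0 means the key already existed
```

Doing the check and the insert in one statement avoids the gap between "look up" and "add" that a separate SELECT followed by INSERT would leave open.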
9. Monitor and Alert
Production scrapers need monitoring:
// Log structured data for monitoring
const metrics = {
  url,
  statusCode: response.status,
  itemsExtracted: items.length,
  durationMs: Date.now() - startTime,
  proxyCountry: proxy.country,
};
console.log(JSON.stringify(metrics));
Set up alerts for:
- 0 items extracted (site layout changed)
- Error rate > 10% (IP or anti-bot issue)
- Run duration > 2× baseline (performance regression)
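These three thresholds translate directly into a per-run check. The sketch below is one way to express them, assuming run statistics are already being collected; `RunStats` and `check_alerts` are hypothetical names, and the messages would normally be forwarded to whatever alerting channel you use.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    items_extracted: int
    requests_total: int
    requests_failed: int
    duration_s: float


def check_alerts(stats: RunStats, baseline_duration_s: float) -> list:
    """Return a list of alert messages for one run (empty means healthy)."""
    alerts = []
    if stats.items_extracted == 0:
        alerts.append("no items extracted: site layout may have changed")
    if stats.requests_total > 0:
        error_rate = stats.requests_failed / stats.requests_total
        if error_rate > 0.10:
            alerts.append(f"error rate {error_rate:.0%}: possible IP or anti-bot issue")
    if stats.duration_s > 2 * baseline_duration_s:
        alerts.append("run took more than 2x baseline: performance regression")
    return alerts
```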
10. Respect robots.txt and Terms of Service
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://target.com/robots.txt")
rp.read()  # downloads and parses robots.txt
if rp.can_fetch("*", url):
    scrape(url)
else:
    print(f"Blocked by robots.txt: {url}")
Key rules:
- Check robots.txt for Disallow: directives
- Review ToS for scraping restrictions
- Never store or share personal data beyond your stated purpose (GDPR/CCPA)
- Only scrape publicly available data
FAQ
What are the most important web scraping best practices?
The highest-impact practices are: (1) check for public APIs before scraping, (2) respect rate limits with delays and backoff, (3) use proxy rotation for blocked or high-volume targets, (4) validate extracted data against a schema, (5) implement retries with exponential backoff, and (6) monitor for extraction failures and layout changes. These six practices cover 90% of production failure modes.
How do I avoid getting blocked while scraping?
Rotate User-Agent strings, add random delays between requests (1–5 seconds per domain), use residential proxies for sites with strict IP policies, and don't send concurrent requests from a single IP. For sites with JavaScript challenges (Cloudflare, DataDome), use a stealth headless browser or a managed unblocking service.
Should I use a managed scraping API or build my own scraper?
Use a managed scraping API (Apify, Bright Data, Firecrawl) if: you're scraping protected sites (Cloudflare, CAPTCHA), you need scale above 10K pages/day, or you want clean Markdown/JSON output for AI pipelines. Build your own with Playwright/Crawlee/Scrapy if: you need full control, you're scraping unprotected HTML, or you have specific parsing logic that APIs don't support.
Should I use CSS selectors or XPath?
Use CSS selectors (faster) for stable, well-structured HTML. Use XPath for complex hierarchies or when CSS selectors aren't expressive enough. For LLM pipelines, convert to Markdown first: extract the main content with a library like readability-lxml and convert that to Markdown, or send the URL to Firecrawl's /scrape endpoint. Always validate output against a schema (Zod in JS, Pydantic in Python) before storage.
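Schema validation before storage can be sketched with the standard library alone. The example below is a stand-in for what Pydantic or Zod would do, not a replacement for them; the `Product` fields and the `validate_product` helper are assumptions made for illustration, and the price cleaning only handles a dollar sign and comma thousands separators.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class Product:
    url: str
    title: str
    price: Decimal


def validate_product(raw: dict) -> Product:
    """Validate one scraped record; raise ValueError if it is unusable."""
    url = str(raw.get("url") or "").strip()
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"invalid url: {url!r}")
    title = str(raw.get("title") or "").strip()
    if not title:
        raise ValueError("missing title")
    # Strip a currency symbol and thousands separators before parsing.
    price_text = str(raw.get("price") or "").replace("$", "").replace(",", "").strip()
    try:
        price = Decimal(price_text)
    except InvalidOperation:
        raise ValueError(f"invalid price: {raw.get('price')!r}") from None
    if not price.is_finite() or price < 0:
        raise ValueError("price out of range")
    return Product(url=url, title=title, price=price)
```

Records that raise `ValueError` are better logged and counted than silently dropped, since a sudden spike in rejections is often the first sign of a layout change.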
