Anti-Scraping Techniques: How Websites Detect and Block Bots
Anti-scraping techniques are the detection and blocking methods websites use to tell automated bots apart from real users. They span IP reputation and rate limiting, TLS and HTTP fingerprinting, CAPTCHAs, JavaScript browser fingerprinting, behavioral analysis, and honeypot traps, usually layered together so a request must pass every check to succeed.
Modern anti-bot systems no longer rely on a single signal. They run dozens of checks across the network, TLS, HTTP, JavaScript, and behavioral layers, and a scraper that passes nine out of ten will still be blocked on the tenth. This page is a working reference to the detection stack in 2026 and the countermeasures that actually move the needle.
If you have not read a provider's threat report in the last six months, assume your mental model is stale. Cloudflare, DataDome, Kasada, and Akamai ship detection updates weekly.
The detection stack, in order of execution
Checks fire in sequence. Fail an early check and later layers never run. That is why curl sometimes gets a cleaner response than a misconfigured headless Chrome.
| Layer | What it inspects | Who does this well |
|---|---|---|
| Network | IP reputation, ASN classification, proxy/VPN detection | Cloudflare IP Reputation, Akamai Client Reputation |
| TLS | JA3/JA4 hash, cipher suite order, ALPN, extensions | Cloudflare, Akamai Bot Manager, F5 Shape |
| HTTP/2 | SETTINGS frame order, HPACK dynamic table, pseudo-header order (Akamai's akamai_fingerprint) | Akamai, Cloudflare |
| HTTP/1.1 | Header casing, order, Sec-Fetch-*, Sec-Ch-Ua-* client hints | DataDome, PerimeterX (now HUMAN) |
| JS environment | navigator.webdriver, CDP leaks, canvas/WebGL/audio hashes, font enumeration | Cloudflare Turnstile, DataDome, Kasada |
| Behavioral | Mouse deltas, scroll cadence, keystroke timing, session depth | Kasada, HUMAN, Shape Security |
The practical consequence: you cannot fix a TLS fingerprint problem with a better proxy, and you cannot fix a behavioral problem with a better browser. Diagnose the layer first.
Detection technique, signal, and mitigation at a glance
Use this as a triage table. Match the symptom you see (block page, silent failure, CAPTCHA) to the technique, then jump to the section that explains the responsible countermeasure.
| Detection technique | Signal it watches for | Responsible mitigation |
|---|---|---|
| Rate limiting | Request volume and frequency per IP, burst patterns | Conservative concurrency, randomized delays, respect Retry-After and robots.txt crawl limits |
| IP reputation / ASN blocking | Datacenter ASN, known proxy ranges, threat-intel scores | Residential or mobile IPs via Apify Proxy, sticky sessions, geo-matched exits |
| CAPTCHA / invisible challenge | Ambiguous risk score on a request or session | Clean browser session, realistic pacing, solver or managed unblocker as last resort |
| TLS fingerprinting (JA3/JA4) | ClientHello cipher and extension order that no real browser emits | curl-cffi, got-scraping, or real browser automation |
| HTTP/2 and header fingerprinting | Frame settings, pseudo-header order, header casing and order | Browser automation or impersonation libraries, never hand-reordered headers |
| Browser fingerprinting | navigator.webdriver, canvas/WebGL/audio hashes, CDP leaks | Patchright, Nodriver, Camoufox, Crawlee fingerprint injection |
| Behavioral analysis | Mouse path linearity, scroll cadence, time-to-first-click | Bezier mouse motion, human pacing, one coherent session per profile |
| Honeypots and traps | Following hidden links and filling hidden fields | Visibility-aware crawling, target real user-visible selectors only |
1. IP reputation and rate limiting
What it catches. Request volume per IP, ASN classification (AWS/GCP/DigitalOcean are instantly datacenter-flagged), and threat-intel reputation scores. A single IP hitting /products/?page=1…500 in sequence is trivial to block.
What works in 2026:
- Residential or mobile IPs for any target with serious protection. Cloudflare, DataDome, and Akamai all maintain ASN allow/deny lists; datacenter IPs are pre-classified and cannot be rehabilitated by slowing down.
- Sticky sessions for flows that span multiple requests (login, cart, checkout). Session-consistency checks flag sudden IP changes mid-session as proxy hopping.
- Concurrency that matches the target, not your rig. High-traffic e-commerce can absorb 10–20 parallel sessions per IP tier; a lightly trafficked directory site blocks at two.
- Apify Proxy with Crawlee's SessionPool handles rotation, session binding, and automatic retiring of burned sessions without hand-rolling state machines.
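The pacing advice above (randomized delays, honoring Retry-After) can be sketched with the standard library alone. This is a minimal illustration, not Crawlee's or Apify's implementation; the function names `retry_after_seconds` and `next_delay`, and the `base` and `cap` defaults, are assumptions chosen for the example.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After header (delta-seconds or HTTP-date) into seconds."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: caller falls back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before the next request to the same host.

    If the server sent Retry-After, wait at least that long plus a little
    jitter. Otherwise use capped exponential backoff with randomization so
    requests do not arrive on a fixed cadence.
    """
    server_wait = retry_after_seconds(retry_after)
    if server_wait is not None:
        return server_wait + rng.uniform(0.0, 1.0)
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(ceiling / 2, ceiling)
```

A caller would sleep for `next_delay(attempt, response_headers.get("Retry-After"))` between requests to the same host, resetting `attempt` after a successful response.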
For choosing between datacenter, residential, and mobile pools and sizing your rotation, see the proxy strategy learning path.
2. CAPTCHAs and invisible challenges
Visual CAPTCHAs are now the fallback, not the first line. The major systems score you invisibly first and only escalate to a puzzle when the invisible signals are ambiguous.
The 2026 landscape:
- Cloudflare Turnstile: non-interactive in most cases. Runs a chain of proof-of-work puzzles, proof-of-space, web-API probing, and browser-quirk checks before deciding whether to surface a checkbox (per Cloudflare's own docs). Turnstile can be embedded on non-Cloudflare sites, which is why you now see it on targets that aren't behind the CDN.
- reCAPTCHA v3: no puzzle, ever. Scores each request 0.0–1.0 based on behavioral telemetry accumulated across Google properties. The site decides what happens below its chosen threshold, and many simply reject the request.
- hCaptcha Enterprise: behavioral risk score plus optional puzzle, with machine-learning risk classifiers.
- DataDome: server-side scoring of every request, with a challenge page ("Please verify you are a human") when the score crosses the threshold. Ships new detection rules on roughly a weekly cadence.
- Kasada: client-side JS payload that is deliberately obfuscated and re-minted. It targets anti-bot tooling directly by detecting known bypass libraries.
What works:
- Avoid triggering them in the first place. Every one of these systems prefers to let traffic through, because false positives hurt the protected site too. Real browser TLS, real headers, realistic pacing, and a clean IP get you past the invisible scoring layer on most sites.
- Solver services (2Captcha, CapSolver, Anti-Captcha) for reCAPTCHA v2 image challenges and hCaptcha puzzles. Solvers for Turnstile exist but are a moving target. Cloudflare ships mitigations, and services scramble to keep up.
- Managed unblockers (Bright Data Web Unlocker, Zyte API, Apify's pre-built Store Actors) when the above doesn't cut it. You pay per successful request instead of managing the arms race yourself.
For architecture-level patterns, see Engineering Bypass Architecture for Deep-Packet WAFs.
3. TLS fingerprinting (JA3/JA4)
This is the single most common reason a "working in curl, failing in Python" scraper exists.
How it works. JA3 hashes the ClientHello (TLS version, cipher suites, extensions, elliptic curves, elliptic curve point formats) in the order they appear. JA4 (FoxIO's 2023 successor, now widely deployed) adds ALPN, SNI presence, and splits TLS and QUIC into distinct signatures. Every HTTP library has a distinctive fingerprint:
- python-requests + urllib3 → recognized JA3, instantly flagged.
- aiohttp, httpx → distinct fingerprints, also classified.
- Node's built-in http/https → classified.
- Real Chrome 120+, Firefox 120+ → what you want to look like.
Cloudflare and Akamai Bot Manager both inspect TLS fingerprints at the edge, before your request reaches origin. If your JA3/JA4 doesn't match your claimed User-Agent (Chrome UA + Python JA3), you are blocked before JavaScript runs.
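To make the JA3 construction concrete, here is a minimal sketch of how the fingerprint string and hash are derived from already-parsed ClientHello fields. It assumes the fields have been extracted elsewhere (no packet parsing is shown), and the example values in the usage note are illustrative, not captured from any particular client.

```python
import hashlib

# GREASE values (RFC 8701) are randomized per connection; JA3 ignores them.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: five comma-separated fields, each a
    dash-joined list of decimal values in the order the client sent them."""

    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join([
        str(version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """JA3 is the MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

Because order is part of the input, two clients offering the same cipher suites in a different order produce different hashes: `ja3_hash(771, [4865, 4866], [0, 23], [29], [0])` differs from `ja3_hash(771, [4866, 4865], [0, 23], [29], [0])`. That is why changing a User-Agent string does nothing for a library whose TLS stack emits its own ordering.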
What works:
- curl-cffi (Python): binds to a patched curl that emulates Chrome, Firefox, and Safari TLS stacks. The current gold standard for HTTP-only scraping of TLS-fingerprinted targets.
- got-scraping (Node): what Crawlee uses under the hood. Ships Chrome-like TLS and HTTP/2 fingerprints.
- Real browser automation (Playwright, Patchright, Camoufox, Nodriver): uses Chromium/Firefox's actual TLS stack, so the fingerprint matches by default.
- Do not use requests, urllib3, or aiohttp against TLS-fingerprinted targets. There is no config that fixes it without swapping the TLS layer.
4. HTTP/2 and header fingerprinting
Beyond TLS, the start of an HTTP/2 connection is itself fingerprintable. Akamai's akamai_fingerprint hashes the SETTINGS frame values, WINDOW_UPDATE increment, priority frame structure, and pseudo-header order (Chrome sends :method, :authority, :scheme, :path; many other clients use a different order). Hand-written HTTP/2 clients rarely replicate Chrome's exact sequence.
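As a rough sketch of how those four components combine into one signature, the function below joins them into a single pipe-separated string. The layout is modeled on the publicly described Akamai-style format and should be treated as an assumption, not Akamai's production implementation; the function name and the sample values are hypothetical and not measured from any specific browser build.

```python
def http2_fingerprint(settings, window_update, priorities, pseudo_headers):
    """Combine HTTP/2 connection traits into one comparable string.

    settings:       list of (setting_id, value) in the order the client sent them
    window_update:  connection-level WINDOW_UPDATE increment, or None if absent
    priorities:     list of (stream_id, exclusive, depends_on, weight) tuples
    pseudo_headers: pseudo-header names in the order sent, e.g. [":method", ...]
    """
    settings_part = ";".join(f"{key}:{value}" for key, value in settings)
    window_part = str(window_update) if window_update else "00"
    priority_part = ",".join(
        ":".join(str(field) for field in frame) for frame in priorities
    ) or "0"
    # Each pseudo-header is reduced to its initial: m, a, s, p.
    pseudo_part = ",".join(name.lstrip(":")[0] for name in pseudo_headers)
    return "|".join([settings_part, window_part, priority_part, pseudo_part])
```

With illustrative inputs, `http2_fingerprint([(1, 65536), (4, 6291456)], 15663105, [], [":method", ":authority", ":scheme", ":path"])` yields `1:65536;4:6291456|15663105|0|m,a,s,p`. A client that sends the same settings but orders its pseudo-headers `:method, :path, :authority, :scheme` ends in `m,p,a,s` and no longer matches.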
On HTTP/1.1, the giveaways are:
- Header casing. Over HTTP/1.1, real browsers send User-Agent, not user-agent; a client that lowercases every header name stands out. (HTTP/2 lowercases all header names by specification, so casing is only a signal on HTTP/1.1.)
- Header order. Chrome sends Host, Connection, sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform, Upgrade-Insecure-Requests, User-Agent, Accept, Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-User, Sec-Fetch-Dest, Accept-Encoding, Accept-Language in that order. Most libraries emit a different order.
- Missing client hints. Sec-Ch-Ua, Sec-Ch-Ua-Mobile, Sec-Ch-Ua-Platform are sent by default by Chromium-based browsers over HTTPS; their absence with a Chrome UA string is a contradiction.
- Sec-Fetch-* semantics. These describe the request context (navigation vs. subresource, cross-site vs. same-origin). Sending Sec-Fetch-Site: none on what claims to be a navigation following a link is a contradiction.
What works: full browser automation or libraries specifically built for impersonation (curl-cffi, got-scraping). Do not try to manually reorder headers in requests, because the underlying urllib3 normalizes them.
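The header contradictions described in this section are cheap to test for, which is why they are checked so widely. The sketch below shows the kind of logic a detector might apply to an HTTP/1.1 request that arrived over HTTPS. The function name and flag labels are hypothetical, and treating the presence of a Referer header as evidence of a followed link is a simplifying assumption.

```python
CLIENT_HINTS = ("sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform")


def header_contradictions(headers):
    """Return a list of contradiction labels for one HTTP/1.1 request.

    headers: list of (name, value) pairs exactly as received, in wire order.
    """
    found = []
    lowered = {name.lower(): value for name, value in headers}

    # A Chrome User-Agent without the default client hints is inconsistent.
    if "Chrome/" in lowered.get("user-agent", ""):
        if any(hint not in lowered for hint in CLIENT_HINTS):
            found.append("chrome-ua-without-client-hints")

    # Sec-Fetch-Site: none means a user-initiated load (address bar, bookmark),
    # which does not fit a request that also claims to come from another page.
    if lowered.get("sec-fetch-site") == "none" and "referer" in lowered:
        found.append("sec-fetch-site-none-with-referer")

    # Browsers mix casing on HTTP/1.1 (User-Agent, sec-ch-ua); all-lowercase
    # names suggest a library or an HTTP/2-style header set replayed verbatim.
    if headers and all(name == name.lower() for name, _ in headers):
        found.append("all-lowercase-header-names")

    return found
```

Running your own outgoing headers through a check like this before deployment is a quick way to catch the mismatches that impersonation libraries exist to avoid.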
5. Browser and environment fingerprinting
Once JavaScript runs, the site collects a signature from the runtime. The canonical checks in 2026:
- navigator.webdriver: true in unpatched Playwright/Puppeteer, so it must be deleted or shadowed.
- CDP and driver leaks: Playwright and Puppeteer talk to Chrome via the Chrome DevTools Protocol, and automation in general leaves detectable traces (ChromeDriver's window.cdc_* variables, __playwright__ bindings, Runtime.enable side-effects). Patchright and Nodriver patch these; stock Playwright does not.
- Canvas hash: rendering text with specific fonts produces a pixel pattern that varies by OS/GPU. Headless Chrome on Linux produces a fingerprint that's shared across millions of scrapers.
- WebGL renderer string: SwiftShader in headless mode vs. ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0, D3D11) on a real Windows machine.
- Audio context fingerprint: OfflineAudioContext produces a deterministic sample that's OS/browser-specific.
- Font enumeration: headless Chrome ships with a limited font set, while real desktops have 200+ installed fonts.
- Screen, viewport, and devicePixelRatio: headless defaults (1280×720, DPR 1) are rare on real users.
- Timezone/locale mismatch: Intl.DateTimeFormat().resolvedOptions().timeZone must match your proxy's geo-IP.
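The checks listed above are individually simple; what matters is that they are evaluated together. The sketch below shows how collected attributes could be turned into flags. The dictionary keys, the flag names, and the `min_fonts` threshold are assumptions made for illustration and do not reflect any vendor's real schema or scoring.

```python
def fingerprint_flags(fp, proxy_timezone, min_fonts=50):
    """Return the list of suspicious traits in one collected fingerprint.

    fp: dict of attributes gathered by client-side JavaScript (hypothetical keys).
    proxy_timezone: IANA timezone implied by the request's geo-IP lookup.
    """
    flags = []

    if fp.get("webdriver"):
        flags.append("webdriver")

    # Software rendering is typical of headless Chrome without a GPU.
    if "SwiftShader" in fp.get("webgl_renderer", ""):
        flags.append("software-webgl")

    # Default automation viewport combined with DPR 1.
    if (fp.get("width"), fp.get("height")) == (1280, 720) and fp.get("device_pixel_ratio") == 1:
        flags.append("headless-default-viewport")

    # Browser timezone should agree with where the IP appears to be.
    if fp.get("timezone") != proxy_timezone:
        flags.append("timezone-mismatch")

    if fp.get("font_count", 0) < min_fonts:
        flags.append("sparse-fonts")

    return flags
```

A stock headless session routed through a German residential proxy would typically raise several of these at once, which is why fixing only one attribute (for example, overriding navigator.webdriver) rarely changes the outcome.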
What works:
- Patchright or Nodriver: Patchright is a patched Playwright, and Nodriver (Python) drives Chrome over CDP without WebDriver. Both remove CDP leaks and webdriver flags below the page's JavaScript layer, not via JS injection. JS-based stealth plugins are increasingly detected because the patching itself is observable.
- Camoufox: patched Firefox for scraping, less commonly fingerprinted than Chromium forks.
- Managed browser services like Bright Data Scraping Browser and Apify's browser pool handle fingerprint rotation and patching as a service.
- Crawlee's fingerprint injection: integrates with fingerprint-suite to rotate realistic, internally consistent fingerprints across sessions.
Test your own browser at browserleaks.com or creepjs before deploying at scale.
If your target also renders content client-side, the browser you use to clear fingerprinting doubles as your rendering engine. See scraping dynamic websites for handling JavaScript-loaded data.
6. Behavioral analysis
This is where Kasada, HUMAN (formerly PerimeterX), and Shape Security earn their keep. Instead of a single-request check, they observe a session over time: mouse trajectory curves (real mouse movement has jitter and acceleration; page.mouse.move() is linear), scroll velocity, time-to-first-click, keystroke inter-arrival times, and focus/blur patterns.
Static fingerprint randomization does not help here. The signal is in the dynamics.
What works:
- Human-like motion: libraries like playwright-mouse-helper or ghost-cursor generate Bezier-curve mouse paths with jitter.
- Realistic pacing: don't click within 50ms of DOM ready. Real users take 500–3000ms to orient before the first interaction.
- Scroll before click: most elements are scrolled into view by a real user before being clicked.
- Session discipline: one session = one coherent behavioral profile. Mixing rapid-fire requests with human-like ones in the same cookie jar is itself a signal.
- Accept the cost: against Kasada-tier protection, managed services (Apify Store Actors, Bright Data Web Unlocker, Zyte API) are often cheaper than the engineering hours needed to stay ahead.
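To illustrate the difference between a linear move and a human-like one, the sketch below generates a cubic Bezier path with eased timing and small jitter, plus a simple linearity measure of the sort a detector might compute. It is a standalone illustration, not the algorithm used by ghost-cursor or any vendor; the function names and the `spread` and `jitter` defaults are assumptions.

```python
import math
import random


def bezier_path(start, end, steps=30, spread=0.3, jitter=1.5, rng=None):
    """Return steps+1 points from start to end along a randomized cubic Bezier."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    reach = spread * math.hypot(x3 - x0, y3 - y0)

    def control(fraction):
        # Control point near the straight line, pushed off it by a random offset.
        return (
            x0 + (x3 - x0) * fraction + rng.uniform(-reach, reach),
            y0 + (y3 - y0) * fraction + rng.uniform(-reach, reach),
        )

    (x1, y1), (x2, y2) = control(0.3), control(0.7)
    points = []
    for i in range(steps + 1):
        t = i / steps
        t = t * t * (3 - 2 * t)  # smoothstep: slow start, fast middle, slow end
        u = 1 - t
        x = u**3 * x0 + 3 * u * u * t * x1 + 3 * u * t * t * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t**3 * y3
        if 0 < i < steps:  # keep the endpoints exact
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((x, y))
    return points


def max_deviation(path):
    """Largest distance of any point from the straight start-to-end line."""
    (x0, y0), (x1, y1) = path[0], path[-1]
    length = math.hypot(x1 - x0, y1 - y0) or 1.0
    return max(
        abs((x1 - x0) * (y0 - py) - (x0 - px) * (y1 - y0)) / length
        for px, py in path
    )
```

A path interpolated in a straight line has a deviation of zero, while `bezier_path((0, 0), (400, 300))` does not. In a browser crawler, each point would be fed to the automation library's mouse-move call with a short, varying pause between points.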
7. Honeypots and trap requests
Hidden links styled with display:none, visibility:hidden, zero opacity, or moved off-screen catch crawlers that follow every <a href>. Some sites go further: hidden form fields that should remain empty, or API endpoints advertised only in robots.txt as Disallow entries.
What works: visibility-aware crawling. Crawlee's enqueueLinks honors CSS visibility by default when using a browser crawler. HTTP-only crawlers have no rendering engine, so computed styles are unavailable; stick to selectors that describe real user-visible navigation (nav a, main a[href^="/product/"]) rather than blanket a[href].
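For HTTP-only crawlers, a partial safeguard is to skip links that are hidden by inline markup. The sketch below uses only the standard library and catches the hidden attribute and inline display:none, visibility:hidden, and opacity:0 styles, including when they are set on an ancestor. It is a heuristic under those assumptions: it cannot see rules from external stylesheets or off-screen positioning, and the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|opacity\s*:\s*0(?:\.0+)?\s*(?:;|$)",
    re.IGNORECASE,
)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs):
    attrs = dict(attrs)
    if "hidden" in attrs:
        return True
    return bool(HIDDEN_STYLE.search(attrs.get("style") or ""))


class VisibleLinkParser(HTMLParser):
    """Collect hrefs from <a> tags that are not hidden by inline markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._open = []  # (tag, hidden) for every open non-void element

    def handle_starttag(self, tag, attrs):
        # An element is hidden if it, or any open ancestor, is hidden.
        hidden = _is_hidden(attrs) or any(flag for _, flag in self._open)
        if tag == "a" and not hidden:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self._open.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        for index in range(len(self._open) - 1, -1, -1):
            if self._open[index][0] == tag:
                del self._open[index:]
                break


def visible_links(html):
    parser = VisibleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Combining this filter with narrow selectors is safer than either alone, since a trap link hidden by a stylesheet class will pass the inline check but is unlikely to sit inside the site's real navigation container.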
Choosing a countermeasure stack by target difficulty
| Target difficulty | Typical signals in play | Stack that usually works |
|---|---|---|
| Easy: small sites, open APIs | Rate limiting only | httpx or got-scraping + datacenter proxies, random delays |
| Medium: mainstream e-commerce, SaaS marketing sites | Rate limiting + TLS fingerprinting + basic Cloudflare | curl-cffi or Crawlee's HttpCrawler + residential proxies + sticky sessions |
| Hard: LinkedIn, Amazon product pages, Instagram, Cloudflare Managed Challenge | Everything above + full browser fingerprinting + behavioral scoring | Patchright/Nodriver + residential proxies + Crawlee's PlaywrightCrawler with fingerprint injection, or a managed Scraping Browser |
| Very hard: Kasada/HUMAN/DataDome-protected, banking, ticketing | Behavioral detection, constant rule updates | Apify Store Actor (maintained against updates), Bright Data Web Unlocker, Zyte API, or commercial solver-as-a-service |
The economic decision at the "very hard" tier is almost always: buy, don't build. The engineering hours to reverse-engineer a Kasada payload cost more than a year of Bright Data Web Unlocker at any reasonable volume.
A note on ethics and compliance
Every technique on this page can be used against targets that actively don't want to be scraped. That's a legal and ethical question, not a technical one. Before deploying: check the target's ToS, respect robots.txt where it applies, rate-limit conservatively, scrape public data only, and consult counsel for anything near PII, copyrighted content, or CFAA-relevant jurisdictions. Apify's Store Actors for major platforms are built with these considerations in mind; rolling your own means you own the risk.
TLS fingerprinting is, by a wide margin, the most common reason a scraper gets blocked. A Python requests or Node http client produces a JA3/JA4 signature that no real browser emits, and Cloudflare and Akamai block at the edge before any HTTP logic runs. Switching to curl-cffi (Python) or got-scraping (Node), or using real browser automation, fixes the majority of 'it worked yesterday' regressions.
Cloudflare Turnstile does not always show a visible challenge. Per Cloudflare's documentation, Turnstile runs non-interactive challenges (proof-of-work, proof-of-space, web-API probes, browser-quirk checks) and only surfaces a visible checkbox when the invisible signals are ambiguous. A clean, real browser session with a residential IP usually passes without any user interaction.
Residential proxies alone rarely get past DataDome. DataDome scores server-side on every request using device fingerprint, behavioral telemetry, and session depth. Residential IPs clear the ASN check but don't address fingerprinting or behavioral signals. The working stack is Patchright or Nodriver (for clean fingerprints) plus residential proxies plus realistic interaction timing. Even then, DataDome ships weekly detection updates, so expect to revisit.
JA3 (2017) hashes the TLS ClientHello: version, ciphers, extensions, elliptic curves, EC point formats. JA4 (FoxIO, 2023) is a structured fingerprint that separates TLS from QUIC, includes ALPN and SNI presence, and produces a readable prefix plus hash. JA4 is now widely deployed at Cloudflare and Akamai; if you're building impersonation logic in 2026, target JA4.
The JS-injection stealth plugins (puppeteer-extra-plugin-stealth, playwright-stealth) are increasingly detected because the act of patching is itself observable, and DataDome and Kasada look for the override. Tools that work below the page's JavaScript layer, such as Patchright (a patched Playwright), Nodriver, and Camoufox (a patched Firefox), make changes that page scripts cannot see directly, and are the current effective choice for browser-based scraping against serious protection.
The core set of anti-scraping techniques is rate limiting and IP reputation scoring at the network layer, TLS fingerprinting (JA3/JA4) and HTTP/2 header fingerprinting at the protocol layer, CAPTCHAs and invisible challenges, JavaScript browser fingerprinting (canvas, WebGL, audio, navigator.webdriver), behavioral analysis of mouse and scroll dynamics, and honeypot traps. Serious anti-bot vendors layer all of these, so passing one check is not enough.
Scrape only public data, read and respect robots.txt and the site's Terms of Service, rate-limit conservatively and honor Retry-After headers, identify a real contact where possible, and avoid PII or copyrighted content without permission. Pacing requests to a level the site can comfortably serve is both the most ethical approach and the most reliable way to stay below detection thresholds. For anything legally sensitive, consult counsel before deploying.



