
Anti-Scraping Techniques: How Websites Detect and Block Bots

Anti-scraping techniques are the detection and blocking methods websites use to tell automated bots apart from real users. They span IP reputation and rate limiting, TLS and HTTP fingerprinting, CAPTCHAs, JavaScript browser fingerprinting, behavioral analysis, and honeypot traps, usually layered together so a request must pass every check to succeed.

Modern anti-bot systems no longer rely on a single signal. They run dozens of checks in parallel across the network, TLS, HTTP, JavaScript, and behavioral layers, and a scraper that passes nine out of ten will still be blocked on the tenth. This page is a working reference to the detection stack in 2026 and the countermeasures that actually move the needle.

If you have not read a provider's threat report in the last six months, assume your mental model is stale. Cloudflare, DataDome, Kasada, and Akamai ship detection updates weekly.


The detection stack, in order of execution

Checks fire roughly in connection order, from the network and TLS handshake up to in-page JavaScript and behavior. Fail an early check and later layers never run. That is why curl sometimes gets a cleaner response than a misconfigured headless Chrome.

Layer | What it inspects | Who does this well
Network | IP reputation, ASN classification, proxy/VPN detection | Cloudflare IP Reputation, Akamai Client Reputation
TLS | JA3/JA4 hash, cipher suite order, ALPN, extensions | Cloudflare, Akamai Bot Manager, F5 Shape
HTTP/2 | SETTINGS frame order, HPACK dynamic table, pseudo-header order (Akamai's akamai_fingerprint) | Akamai, Cloudflare
HTTP/1.1 | Header casing, order, Sec-Fetch-*, Sec-Ch-Ua-* client hints | DataDome, PerimeterX (now HUMAN)
JS environment | navigator.webdriver, CDP leaks, canvas/WebGL/audio hashes, font enumeration | Cloudflare Turnstile, DataDome, Kasada
Behavioral | Mouse deltas, scroll cadence, keystroke timing, session depth | Kasada, HUMAN, Shape Security

The practical consequence: you cannot fix a TLS fingerprint problem with a better proxy, and you cannot fix a behavioral problem with a better browser. Diagnose the layer first.
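
To make "diagnose the layer first" concrete, here is a minimal sketch of a short-circuiting check pipeline. The request fields (asn_type, tls_family, claimed_browser, webdriver) and the three rules are hypothetical stand-ins for illustration, not any vendor's actual logic.

```python
from typing import Callable, Optional

# Hypothetical request summary: a plain dict of signals a detector might see.
Request = dict


def check_network(req: Request) -> bool:
    # Illustrative rule: datacenter ASNs fail the network layer.
    return req.get("asn_type") != "datacenter"


def check_tls(req: Request) -> bool:
    # Illustrative rule: the TLS fingerprint family must match the claimed browser.
    return req.get("tls_family") == req.get("claimed_browser")


def check_js(req: Request) -> bool:
    # Illustrative rule: the automation flag must be off.
    return not req.get("webdriver", False)


# Layers are evaluated in this order; the list is an assumption for the sketch.
LAYERS: list[tuple[str, Callable[[Request], bool]]] = [
    ("network", check_network),
    ("tls", check_tls),
    ("js", check_js),
]


def first_failing_layer(req: Request) -> Optional[str]:
    """Return the name of the first layer that rejects the request, or None."""
    for name, check in LAYERS:
        if not check(req):
            return name  # later layers never run
    return None
```

A request with a residential IP but a Python TLS stack fails at "tls" here, and no amount of work on the JS layer changes that result.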


Detection technique, signal, and mitigation at a glance

Use this as a triage table. Match the symptom you see (block page, silent failure, CAPTCHA) to the technique, then jump to the section that explains the responsible countermeasure.

Detection technique | Signal it watches for | Responsible mitigation
Rate limiting | Request volume and frequency per IP, burst patterns | Conservative concurrency, randomized delays, respect Retry-After and robots.txt crawl limits
IP reputation / ASN blocking | Datacenter ASN, known proxy ranges, threat-intel scores | Residential or mobile IPs via Apify Proxy, sticky sessions, geo-matched exits
CAPTCHA / invisible challenge | Ambiguous risk score on a request or session | Clean browser session, realistic pacing, solver or managed unblocker as last resort
TLS fingerprinting (JA3/JA4) | ClientHello cipher and extension order that no real browser emits | curl-cffi, got-scraping, or real browser automation
HTTP/2 and header fingerprinting | Frame settings, pseudo-header order, header casing and order | Browser automation or impersonation libraries, never hand-reordered headers
Browser fingerprinting | navigator.webdriver, canvas/WebGL/audio hashes, CDP leaks | Patchright, Nodriver, Camoufox, Crawlee fingerprint injection
Behavioral analysis | Mouse path linearity, scroll cadence, time-to-first-click | Bezier mouse motion, human pacing, one coherent session per profile
Honeypots and traps | Following hidden links and filling hidden fields | Visibility-aware crawling, target real user-visible selectors only

1. IP reputation and rate limiting

What it catches. Request volume per IP, ASN classification (AWS/GCP/DigitalOcean are instantly datacenter-flagged), and threat-intel reputation scores. A single IP hitting /products/?page=1…500 in sequence is trivial to block.
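
Per-IP volume checks are usually some form of sliding-window counter. The sketch below shows the idea from the defender's side; the class name, the limits, and the explicit now argument are assumptions made for the example, and real systems combine this with reputation scores.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # burst detected for this IP
        hits.append(now)
        return True
```

A crawler walking pages 1 to 500 from one IP fills the window almost immediately, which is why sequential pagination from a single address is so easy to block.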

What works in 2026:

  • Residential or mobile IPs for any target with serious protection. Cloudflare, DataDome, and Akamai all maintain ASN allow/deny lists; datacenter IPs are pre-classified and cannot be rehabilitated by slowing down.
  • Sticky sessions for flows that span multiple requests (login, cart, checkout). Session-consistency checks flag sudden IP changes mid-session as proxy hopping.
  • Concurrency that matches the target, not your rig. High-traffic e-commerce can absorb 10–20 parallel sessions per IP tier; a lightly trafficked directory site blocks at two.
  • Apify Proxy with Crawlee's SessionPool handles rotation, session binding, and automatic retiring of burned sessions without hand-rolling state machines.
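
On the client side, conservative pacing mostly comes down to honoring Retry-After and adding jitter. This is a minimal standard-library sketch; the function names and the base/cap defaults are assumptions, not part of any library mentioned above.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After value given as delta-seconds or as an HTTP-date."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: let the caller fall back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(attempt: int, retry_after: Optional[str] = None,
               base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Seconds to wait before the next try (attempt starts at 0)."""
    server_wait = retry_after_seconds(retry_after)
    if server_wait is not None:
        # Never wait less than the server asked for; add a little jitter on top.
        return server_wait + rng.uniform(0.0, 1.0)
    # Exponential backoff with jitter, capped so delays stay bounded.
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(ceiling / 2, ceiling)
```

Call next_delay(attempt, response_headers.get("Retry-After")) after a 429 or 503 and sleep for the result before retrying on the same session.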

For choosing between datacenter, residential, and mobile pools and sizing your rotation, see the proxy strategy learning path.


2. CAPTCHAs and invisible challenges

Visual CAPTCHAs are now the fallback, not the first line. The major systems score you invisibly first and only escalate to a puzzle when the invisible signals are ambiguous.

The 2026 landscape:

  • Cloudflare Turnstile: non-interactive in most cases. Runs a chain of proof-of-work puzzles, proof-of-space, web-API probing, and browser-quirk checks before deciding whether to surface a checkbox (per Cloudflare's own docs). Turnstile can be embedded on non-Cloudflare sites, which is why you now see it on targets that aren't behind the CDN.
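  • reCAPTCHA v3: no puzzle, ever. Scores each request 0.0–1.0 (higher means more likely human) based on behavioral telemetry accumulated across Google properties. The site chooses the threshold and the response; below it, requests are typically rejected or escalated to another check.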
  • hCaptcha Enterprise: behavioral risk score plus optional puzzle, with machine-learning risk classifiers.
  • DataDome: server-side scoring of every request, with a challenge page ("Please verify you are a human") when the score crosses the threshold. Ships new detection rules on roughly a weekly cadence.
  • Kasada: client-side JS payload that is deliberately obfuscated and re-minted. It targets anti-bot tooling directly by detecting known bypass libraries.

What works:

  • Avoid triggering them in the first place. Every one of these systems prefers to let traffic through, because false positives hurt the protected site too. Real browser TLS, real headers, realistic pacing, and a clean IP get you past the invisible scoring layer on most sites.
  • Solver services (2Captcha, CapSolver, Anti-Captcha) for reCAPTCHA v2 image challenges and hCaptcha puzzles. Solvers for Turnstile exist but are a moving target. Cloudflare ships mitigations, and services scramble to keep up.
  • Managed unblockers (Bright Data Web Unlocker, Zyte API, Apify's pre-built Store Actors) when the above doesn't cut it. You pay per successful request instead of managing the arms race yourself.

For architecture-level patterns, see Engineering Bypass Architecture for Deep-Packet WAFs.


3. TLS fingerprinting (JA3/JA4)

This is the single most common reason a request works in curl but fails in Python.

How it works. JA3 hashes the ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, elliptic curve point formats) in the order they appear. JA4 (FoxIO's 2023 successor, now widely deployed) adds ALPN and SNI presence, and distinguishes TLS from QUIC in the signature. Every HTTP library has a distinctive fingerprint:

  • python-requests + urllib3 → a widely recognized JA3 that is flagged instantly.
  • aiohttp, httpx → distinct fingerprints, also classified.
  • Node's built-in http/https → classified.
  • Real Chrome 120+, Firefox 120+ → what you want to look like.
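
The JA3 computation itself is small enough to sketch. The version below assumes the ClientHello has already been parsed into integer lists (the parsing is out of scope here) and follows the published JA3 recipe: decimal values joined with "-", fields joined with ",", GREASE values skipped, then MD5. The function names are my own.

```python
import hashlib


def is_grease(value: int) -> bool:
    # GREASE values (RFC 8701) look like 0x0A0A, 0x1A1A, ..., 0xFAFA.
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Build the JA3 input string from already-parsed ClientHello fields."""
    def join(values):
        # Order is preserved on purpose: the ordering is part of the fingerprint.
        return "-".join(str(v) for v in values if not is_grease(v))

    return ",".join([str(version), join(ciphers), join(extensions),
                     join(curves), join(point_formats)])


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

Because order is preserved, two clients offering the same cipher suites in a different order hash differently, which is exactly what separates one TLS library from another.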

Cloudflare and Akamai Bot Manager both inspect TLS fingerprints at the edge, before your request reaches origin. If your JA3/JA4 doesn't match your claimed User-Agent (Chrome UA + Python JA3), you are blocked before JavaScript runs.

What works:

  • curl-cffi (Python): binds to a patched curl that emulates Chrome, Firefox, and Safari TLS stacks. The current gold standard for HTTP-only scraping of TLS-fingerprinted targets.
  • got-scraping (Node): what Crawlee uses under the hood. Ships Chrome-like TLS and HTTP/2 fingerprints.
  • Real browser automation (Playwright, Patchright, Camoufox, Nodriver): uses Chromium/Firefox's actual TLS stack, so the fingerprint matches by default.
  • Do not use requests, urllib3, or aiohttp against TLS-fingerprinted targets. There is no config that fixes it without swapping the TLS layer.
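
As a minimal sketch of the curl-cffi route: the helper below is hypothetical, and it assumes a recent curl_cffi release in which the generic "chrome" impersonation target is available (older releases expect a versioned name such as "chrome110"). The import sits inside the function so the module loads even where curl_cffi is not installed.

```python
def fetch_as_chrome(url: str, timeout: float = 30.0):
    """Fetch a URL with a Chrome-like TLS and HTTP/2 fingerprint via curl_cffi."""
    # Third-party dependency (pip install curl_cffi), imported lazily.
    from curl_cffi import requests as cffi_requests

    # impersonate= selects which browser's handshake to emulate; "chrome" is an
    # assumption about the installed version, swap in a versioned target if needed.
    return cffi_requests.get(url, impersonate="chrome", timeout=timeout)
```

The returned object exposes status_code and text much like a requests response, so existing parsing code usually needs little change. Keep the User-Agent that the impersonation profile sends rather than overriding it, or the JA3/User-Agent mismatch comes straight back.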

4. HTTP/2 and header fingerprinting

Beyond TLS, the HTTP/2 connection setup itself is fingerprintable. Akamai's akamai_fingerprint combines the SETTINGS frame values, WINDOW_UPDATE increment, priority frame structure, and pseudo-header order (Chrome sends :method, :authority, :scheme, :path; other clients typically order them differently). Hand-written HTTP/2 clients rarely replicate Chrome's exact sequence.
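
A rough sketch of how those four parts combine into one fingerprint string, following the pipe-separated layout commonly used for this fingerprint (SETTINGS | WINDOW_UPDATE | PRIORITY | pseudo-header order). The function name is an assumption, and the numeric values in the usage note are illustrative of a Chromium-like client, not a guaranteed current Chrome fingerprint.

```python
def akamai_h2_fingerprint(settings, window_update, priorities, pseudo_header_order):
    """Compose an Akamai-style HTTP/2 fingerprint string.

    settings: list of (id, value) pairs in the order the client sent them
    window_update: increment from the first WINDOW_UPDATE frame
    priorities: list of (stream_id, exclusive, depends_on, weight) tuples
    pseudo_header_order: e.g. [":method", ":authority", ":scheme", ":path"]
    """
    settings_part = ";".join(f"{sid}:{value}" for sid, value in settings)
    # "0" stands in for "no PRIORITY frames were sent".
    priority_part = ",".join(
        f"{stream}:{exclusive}:{depends_on}:{weight}"
        for stream, exclusive, depends_on, weight in priorities
    ) or "0"
    # Each pseudo-header is abbreviated to its first letter: m, a, s, p.
    pseudo_part = ",".join(name.lstrip(":")[0] for name in pseudo_header_order)
    return "|".join([settings_part, str(window_update), priority_part, pseudo_part])
```

For example, akamai_h2_fingerprint([(1, 65536), (2, 0), (4, 6291456), (6, 262144)], 15663105, [], [":method", ":authority", ":scheme", ":path"]) yields a string ending in "m,a,s,p", while a client that sends :path before :authority ends in "m,p,a,s" and no longer matches.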

On HTTP/1.1, the giveaways are:

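  • Header casing. Over HTTP/1.1, real browsers send User-Agent, not user-agent; clients that lowercase or otherwise re-case header names stand out. (HTTP/2 header names are always lowercase, so casing is only a signal on HTTP/1.1.)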
  • Header order. Chrome sends Host, Connection, sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform, Upgrade-Insecure-Requests, User-Agent, Accept, Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-User, Sec-Fetch-Dest, Accept-Encoding, Accept-Language in that order. Most libraries emit a different order.
  • Missing client hints. Chromium-based browsers send Sec-Ch-Ua, Sec-Ch-Ua-Mobile, and Sec-Ch-Ua-Platform by default on HTTPS requests; their absence with a Chrome UA string is a contradiction.
  • Sec-Fetch-* semantics. These describe the request context (navigation vs. subresource, cross-site vs. same-origin). Sending Sec-Fetch-Site: none on what claims to be a navigation following a link is a contradiction.
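
The contradictions above are cheap to check, which is why they are checked. Here is a small lint you can run over your own outgoing headers before blaming the proxy; the function name and the three rules are a simplified assumption of what a detector might test, not a complete model.

```python
CLIENT_HINTS = ("sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform")


def header_contradictions(headers: dict) -> list:
    """Return human-readable contradictions found in a request's headers."""
    h = {name.lower(): value for name, value in headers.items()}
    problems = []

    # Rule 1: a Chrome User-Agent without the default client hints.
    if "Chrome/" in h.get("user-agent", ""):
        missing = [name for name in CLIENT_HINTS if name not in h]
        if missing:
            problems.append("Chrome User-Agent without client hints: " + ", ".join(missing))

    # Rule 2: Sec-Fetch-Site: none means no initiating page, so no Referer.
    if h.get("sec-fetch-site") == "none" and "referer" in h:
        problems.append("Sec-Fetch-Site: none together with a Referer header")

    # Rule 3: Sec-Fetch-User only accompanies user-activated navigations.
    if "sec-fetch-user" in h and h.get("sec-fetch-mode") != "navigate":
        problems.append("Sec-Fetch-User on a non-navigation request")

    return problems
```

An empty list does not mean the request looks like a browser; it only means these particular tells are absent.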

What works: full browser automation or libraries specifically built for impersonation (curl-cffi, got-scraping). Do not rely on manually reordering headers in requests: the underlying http.client/urllib3 stack controls parts of the header block itself, and the TLS fingerprint underneath still identifies the client.


5. Browser and environment fingerprinting

Once JavaScript runs, the site collects a signature from the runtime. The canonical checks in 2026:

  • navigator.webdriver: true in unpatched Playwright/Puppeteer, so it has to report false, as it does in a normal browser session.
  • CDP leaks: Playwright and Puppeteer talk to Chrome via the Chrome DevTools Protocol, which leaves detectable traces (automation globals such as __playwright__, Runtime.enable side-effects; Selenium's ChromeDriver leaks its cdc_* variables in the same way). Patchright and Nodriver patch these; stock Playwright does not.
  • Canvas hash: rendering text with specific fonts produces a pixel pattern that varies by OS/GPU. Headless Chrome on Linux produces a fingerprint that's shared across millions of scrapers.
  • WebGL renderer string: SwiftShader in headless mode vs. ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0, D3D11) on a real Windows machine.
  • Audio context fingerprint: OfflineAudioContext produces a deterministic sample that's OS/browser-specific.
  • Font enumeration: headless Chrome ships with a limited font set, while real desktops have 200+ installed fonts.
  • Screen, viewport, and devicePixelRatio: headless defaults (1280×720, DPR 1) are rare on real users.
  • Timezone/locale mismatch: Intl.DateTimeFormat().resolvedOptions().timeZone must match your proxy's geo-IP.
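
Several of these checks can be run against your own setup before a target runs them for you. The sketch assumes you have already collected the values from the page into a dict (for example via a page.evaluate call in your automation tool); the dict keys, the function name, and the font-count threshold are all hypothetical.

```python
def fingerprint_red_flags(profile: dict, proxy_timezone: str, min_fonts: int = 50) -> list:
    """List obvious headless or inconsistency tells in a collected browser profile."""
    flags = []

    if profile.get("webdriver"):
        flags.append("navigator.webdriver is true")

    # SwiftShader is the software renderer typical of headless Chrome.
    if "SwiftShader" in profile.get("webgl_renderer", ""):
        flags.append("WebGL renderer is SwiftShader")

    # The browser timezone should agree with the proxy exit's geo-IP timezone.
    if profile.get("timezone") != proxy_timezone:
        flags.append("timezone does not match proxy location")

    # Headless defaults that are rare on real desktops.
    if profile.get("viewport") == (1280, 720) and profile.get("device_pixel_ratio") == 1:
        flags.append("default headless viewport and devicePixelRatio")

    fonts = profile.get("font_count")
    if fonts is not None and fonts < min_fonts:  # threshold is an arbitrary assumption
        flags.append("suspiciously small font set")

    return flags
```

The point is internal consistency: a Windows User-Agent, a SwiftShader renderer, and a UTC timezone behind a New York proxy each look odd alone and damning together.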

What works:

  • Patchright or Nodriver: Patchright is a patched Playwright, and Nodriver is a Python library that drives Chrome directly over CDP without WebDriver. Both remove CDP leaks and webdriver flags below the page, not via JS injection. JS-based stealth plugins are increasingly detected because the patching itself is observable.
  • Camoufox: patched Firefox for scraping, less commonly fingerprinted than Chromium forks.
  • Managed browser services like Bright Data Scraping Browser and Apify's browser pool handle fingerprint rotation and patching as a service.
  • Crawlee's fingerprint injection: integrates with fingerprint-suite to rotate realistic, internally consistent fingerprints across sessions.

Test your own browser at browserleaks.com or creepjs before deploying at scale.

If your target also renders content client-side, the browser you use to clear fingerprinting doubles as your rendering engine. See scraping dynamic websites for handling JavaScript-loaded data.


6. Behavioral analysis

This is where Kasada, HUMAN (formerly PerimeterX), and Shape Security earn their keep. Instead of a single-request check, they observe a session over time: mouse trajectory curves (real mouse movement has jitter and acceleration; page.mouse.move() is linear), scroll velocity, time-to-first-click, keystroke inter-arrival times, and focus/blur patterns.

Static fingerprint randomization does not help here. The signal is in the dynamics.
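
One way to see why linear moves stand out is a simple straightness ratio over the recorded pointer positions. This metric is an illustrative assumption, not a documented vendor feature; real systems use many more dynamics (velocity, acceleration, timing).

```python
import math


def straightness(points) -> float:
    """Straight-line distance divided by travelled path length (1.0 = perfectly straight)."""
    if len(points) < 2:
        return 1.0
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled
```

A linearly interpolated move scores 1.0 every time, on every session. Human paths curve and wobble, so their score sits below 1.0 and varies from one movement to the next.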

What works:

  • Human-like motion: libraries like playwright-mouse-helper or ghost-cursor generate Bezier-curve mouse paths with jitter.
  • Realistic pacing: don't click within 50ms of DOM ready. Real users take 500–3000ms to orient before the first interaction.
  • Scroll before click: most elements are scrolled into view by a real user before being clicked.
  • Session discipline: one session = one coherent behavioral profile. Mixing rapid-fire requests with human-like ones in the same cookie jar is itself a signal.
  • Accept the cost: against Kasada-tier protection, managed services (Apify Store Actors, Bright Data Web Unlocker, Zyte API) are often cheaper than the engineering hours needed to stay ahead.
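
For a sense of what the motion libraries do, here is a minimal standard-library sketch of a cubic Bezier path with jitter and eased timing, plus an orientation pause. The function names, control-point placement, and jitter size are assumptions for illustration, not the algorithm of any specific library.

```python
import random


def bezier_mouse_path(start, end, steps=30, jitter=1.5, rng=None):
    """Return steps+1 (x, y) points along a randomly bowed curve from start to end."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    dx, dy = x3 - x0, y3 - y0
    spread = 0.3 * max(abs(dx), abs(dy), 1.0)
    # Two control points pulled off the straight line so the path curves.
    c1 = (x0 + dx * 0.3 + rng.uniform(-spread, spread),
          y0 + dy * 0.3 + rng.uniform(-spread, spread))
    c2 = (x0 + dx * 0.7 + rng.uniform(-spread, spread),
          y0 + dy * 0.7 + rng.uniform(-spread, spread))
    path = []
    for i in range(steps + 1):
        t = i / steps
        t = t * t * (3 - 2 * t)  # ease in and out: slow start, slow finish
        mt = 1 - t
        x = mt**3 * x0 + 3 * mt**2 * t * c1[0] + 3 * mt * t**2 * c2[0] + t**3 * x3
        y = mt**3 * y0 + 3 * mt**2 * t * c1[1] + 3 * mt * t**2 * c2[1] + t**3 * y3
        if 0 < i < steps:  # keep the endpoints exact, wobble everything between
            x += rng.gauss(0, jitter)
            y += rng.gauss(0, jitter)
        path.append((x, y))
    return path


def orientation_delay(rng=None) -> float:
    """Seconds to wait before the first interaction (0.5 to 3.0, per the pacing advice above)."""
    rng = rng or random.Random()
    return rng.uniform(0.5, 3.0)
```

In a browser automation script you would sleep for orientation_delay(), then feed each point to the tool's pointer-move call (page.mouse.move(x, y) in Playwright) with a few milliseconds between points.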

7. Honeypots and trap requests

Hidden links styled with display:none, visibility:hidden, zero opacity, or moved off-screen catch crawlers that follow every <a href>. Some sites go further: hidden form fields that should remain empty, or API endpoints advertised only in robots.txt as Disallow entries.

What works: visibility-aware crawling. With a browser crawler, check each link's computed visibility before enqueueing it, and verify whether your framework's link helper (such as Crawlee's enqueueLinks) does this for you rather than assuming it. For HTTP-only crawlers, computed styles are unavailable without a rendering engine, so stick to selectors that describe real user-visible navigation (nav a, main a[href^="/product/"]) rather than blanket a[href].
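
For the HTTP-only case, a partial defense is still possible with the standard library. The sketch below only sees the hidden attribute and inline style attributes (it cannot see rules from stylesheets, which is the limitation described above) and only collects links inside nav and main. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs: dict) -> bool:
    """True if the element hides itself via the hidden attribute or inline style."""
    if "hidden" in attrs:
        return True
    for declaration in (attrs.get("style") or "").split(";"):
        prop, _, value = declaration.partition(":")
        prop, value = prop.strip().lower(), value.strip().lower()
        if (prop, value) in {("display", "none"), ("visibility", "hidden")}:
            return True
        if prop == "opacity":
            try:
                if float(value) == 0:
                    return True
            except ValueError:
                pass
    return False


class VisibleLinkExtractor(HTMLParser):
    def __init__(self, containers=("nav", "main")):
        super().__init__()
        self.containers = set(containers)
        self.links = []
        self._stack = []  # (tag, hidden, inside_container) for each open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        _, parent_hidden, parent_inside = self._stack[-1] if self._stack else ("", False, False)
        hidden = parent_hidden or _is_hidden(attrs)  # hiding is inherited
        inside = parent_inside or tag in self.containers
        if tag == "a" and attrs.get("href") and inside and not hidden:
            self.links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self._stack.append((tag, hidden, inside))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def visible_links(html: str, containers=("nav", "main")) -> list:
    parser = VisibleLinkExtractor(containers)
    parser.feed(html)
    parser.close()
    return parser.links
```

Treat this as a filter of last resort: a trap hidden by a CSS class passes straight through, which is why narrow, intent-revealing selectors remain the main defense.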


Choosing a countermeasure stack by target difficulty

Target difficulty | Typical signals in play | Stack that usually works
Easy: small sites, open APIs | Rate limiting only | httpx or got-scraping + datacenter proxies, random delays
Medium: mainstream e-commerce, SaaS marketing sites | Rate limiting + TLS fingerprinting + basic Cloudflare | curl-cffi or Crawlee's HttpCrawler + residential proxies + sticky sessions
Hard: LinkedIn, Amazon product pages, Instagram, Cloudflare Managed Challenge | Everything above + full browser fingerprinting + behavioral scoring | Patchright/Nodriver + residential proxies + Crawlee's PlaywrightCrawler with fingerprint injection, or a managed Scraping Browser
Very hard: Kasada/HUMAN/DataDome-protected, banking, ticketing | Behavioral detection, constant rule updates | Apify Store Actor (maintained against updates), Bright Data Web Unlocker, Zyte API, or commercial solver-as-a-service

The economic decision at the "very hard" tier is almost always: buy, don't build. The engineering hours to reverse-engineer a Kasada payload cost more than a year of Bright Data Web Unlocker at any reasonable volume.


A note on ethics and compliance

Every technique on this page can be used against targets that actively don't want to be scraped. That's a legal and ethical question, not a technical one. Before deploying: check the target's ToS, respect robots.txt where it applies, rate-limit conservatively, scrape public data only, and consult counsel for anything near PII, copyrighted content, or CFAA-relevant jurisdictions. Apify's Store Actors for major platforms are built with these considerations in mind; rolling your own means you own the risk.


Frequently Asked Questions

What is the most common reason a scraper gets blocked?

TLS fingerprinting, by a wide margin. A Python requests or Node http client produces a JA3/JA4 signature that no real browser emits, and Cloudflare and Akamai block at the edge before any HTTP logic runs. Switching to curl-cffi (Python) or got-scraping (Node), or using real browser automation, fixes the majority of 'it worked yesterday' regressions.

Does Cloudflare Turnstile always show a visible CAPTCHA?

No. Per Cloudflare's documentation, Turnstile runs non-interactive challenges (proof-of-work, proof-of-space, web-API probes, browser-quirk checks) and only surfaces a visible checkbox when the invisible signals are ambiguous. A clean, real browser session with a residential IP usually passes without any user interaction.

Are residential proxies alone enough to get past DataDome?

Rarely. DataDome scores server-side on every request using device fingerprint, behavioral telemetry, and session depth. Residential IPs clear the ASN check but don't address fingerprinting or behavioral signals. The working stack is Patchright or Nodriver (for clean fingerprints) plus residential proxies plus realistic interaction timing. Even then, DataDome ships weekly detection updates, so expect to revisit.

What is the difference between JA3 and JA4?

JA3 (2017) hashes the TLS ClientHello: version, ciphers, extensions, elliptic curves, EC point formats. JA4 (FoxIO, 2023) is a structured fingerprint that separates TLS from QUIC, includes ALPN and SNI presence, and produces a readable prefix plus hash. JA4 is now widely deployed at Cloudflare and Akamai; if you're building impersonation logic in 2026, target JA4.

Do stealth plugins still work?

The JS-injection stealth plugins (puppeteer-extra-plugin-stealth, playwright-stealth) are increasingly detected because the act of patching is itself observable, and DataDome and Kasada look for the override. Tools that patch below the page's JavaScript, such as Patchright (a patched Playwright), Nodriver, and Camoufox (a patched Firefox), are the current effective choice for browser-based scraping against serious protection.

What are the main anti-scraping techniques?

The core set is rate limiting and IP reputation scoring at the network layer, TLS fingerprinting (JA3/JA4) and HTTP/2 header fingerprinting at the protocol layer, CAPTCHAs and invisible challenges, JavaScript browser fingerprinting (canvas, WebGL, audio, navigator.webdriver), behavioral analysis of mouse and scroll dynamics, and honeypot traps. Serious anti-bot vendors run all of these in parallel, so passing one check is not enough.

How do I scrape responsibly?

Scrape only public data, read and respect robots.txt and the site's Terms of Service, rate-limit conservatively and honor Retry-After headers, identify a real contact where possible, and avoid PII or copyrighted content without permission. Pacing requests to a level the site can comfortably serve is both the most ethical approach and the most reliable way to stay below detection thresholds. For anything legally sensitive, consult counsel before deploying.
