Is Web Scraping Legal?
Last reviewed: 18 May 2026 by Yassine El Haddad. Reflects post-settlement hiQ v. LinkedIn record, Meta v. Bright Data (N.D. Cal. 2024), the EDPB Opinion 28/2024 on AI models, and the EU AI Act GPAI provisions enforceable from 2 August 2026.
Yes. Scraping publicly accessible data is generally legal in the US and EU when you stay logged-out, do not damage the site, and minimize personal data. Three rules matter most in 2026: (1) robots.txt is best practice, not a statute (it does not by itself create liability); (2) bypassing a login or technical block can trigger CFAA-style and breach-of-contract claims (hiQ v. LinkedIn final consent judgment, 2022); (3) personal data stays regulated under GDPR/CCPA and EDPB Opinion 28/2024 even when it is public. Apify itself is legal software: your targets, fields, and method decide the risk.
Web scraping sits at the intersection of contract law (terms of use), computer access laws, copyright, and privacy regimes. The practical question is rarely "is scraping universally legal?" It is whether your specific method, data fields, and use case stay inside defensible boundaries. This guide frames the decision; it is not legal advice.
We are not lawyers. This page is educational only. For product launches, regulated industries, or personal data at scale, hire counsel in your jurisdictions. See our legal compliance framework for scraping teams for an operational checklist.
Is web scraping legal?
There is no global yes/no. Courts and regulators look at:
- How you access data (credentials, circumvention, rate of requests).
- What you collect (public facts vs. personal data vs. creative works).
- Why you process it (research, analytics, republication, model training).
Public, unauthenticated pages (no login wall) are generally scrapable for internal analytics, price monitoring, lead enrichment, or aggregation, provided you avoid harassment-scale traffic and respect applicable privacy and IP rules. Logins, paywalls, and "private" APIs raise contract and CFAA-style risk in the U.S. and analogous laws elsewhere. The 2024 ruling in Meta v. Bright Data, where Judge Edward Chen held that Meta's terms "do not bar logged-off scraping of public data," is the clearest US contract-law signal so far that the logged-out / logged-in line matters far more than the URL alone.
Robots.txt and Terms of Service (ToS)
| Signal | Why it matters |
|---|---|
| robots.txt | A machine-readable preference file. Honoring it is best practice and reduces dispute risk; it is not a statute by itself. The EU AI Act and the GPAI Code of Practice now treat robots.txt (and emerging ai.txt) as one acceptable machine-readable rights-reservation signal. |
| Terms of Service | Sites may contractually restrict automated access even to public pages. Breach of contract claims can exist independent of "public visibility." After Meta v. Bright Data, courts read these terms narrowly when the scraper is logged-out. |
| Copyright / TDM | Copying large bodies of creative text, images, or layout for republication or model training differs legally from extracting facts (e.g., prices). In the EU, the Article 4 DSM TDM exception applies unless rightsholders opt out via machine-readable means. See the EU TDM playbook. |
Operational habit: archive the ToS version date, your robots.txt snapshot, and logs showing conservative concurrency. Our web scraping challenges guide walks through the matching engineering controls.
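The archiving habit above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a compliance tool: the function name `build_access_record`, the record fields, the `example-bot` user agent, and the sample robots.txt content are all assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone
from urllib.robotparser import RobotFileParser


def build_access_record(robots_txt: str, user_agent: str, url: str, tos_version_date: str) -> dict:
    """Evaluate robots.txt for one URL and return an archivable evidence record."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "user_agent": user_agent,
        "allowed_by_robots": parser.can_fetch(user_agent, url),
        # A hash of the exact robots.txt text pins down which version was honored.
        "robots_sha256": hashlib.sha256(robots_txt.encode("utf-8")).hexdigest(),
        # The ToS version date is recorded by hand; it is not parsed from the site.
        "tos_version_date": tos_version_date,
    }


# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

record = build_access_record(
    SAMPLE_ROBOTS, "example-bot", "https://example.com/private/page", "2026-01-15"
)
print(json.dumps(record, indent=2))
```

In a real pipeline you would fetch the live robots.txt, store the raw text next to the record, and skip any URL where `allowed_by_robots` is false.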
GDPR, CCPA, and personal data
Personal data does not stop being regulated because a profile is "public." Under the EU/UK GDPR and California's CCPA/CPRA, you may need:
- A lawful basis (often legitimate interest, sometimes consent).
- Data minimization: only fields you can justify.
- Retention limits, deletion workflows, and DSAR handling where required.
The EDPB Opinion 28/2024 on AI models, adopted in December 2024, tightened the analysis for anyone scraping personal data to train or improve AI. The Board endorsed legitimate interest as a possible basis but only under a strict three-step test, and explicitly listed respecting robots.txt or ai.txt protocols as a mitigating measure businesses should adopt. France's CNIL followed in mid-2025 with practical guidance on lawful AI training data collection.
Apify provides infrastructure; you remain the data controller/processor for the datasets you build. Map each field (name, email, job title, avatar URL) to purpose and retention before scaling. If your pipeline targets people in the EU for lead generation, document your legitimate-interest balancing test.
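One way to make the field mapping concrete is a small inventory that doubles as a minimization filter. The sketch below is illustrative only: `FieldPolicy`, `FIELD_POLICIES`, `minimize`, and the purposes and retention periods shown are hypothetical values, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldPolicy:
    purpose: str         # why the field is collected
    retention_days: int  # how long it may be kept
    personal: bool       # whether it counts as personal data


# Hypothetical inventory: only fields listed here are allowed into the dataset.
FIELD_POLICIES = {
    "job_title": FieldPolicy("lead qualification", 180, True),
    "company": FieldPolicy("lead qualification", 365, False),
    "email": FieldPolicy("outreach with documented legitimate interest", 90, True),
}


def minimize(record: dict) -> dict:
    """Keep only fields that have a documented purpose; drop everything else."""
    return {key: value for key, value in record.items() if key in FIELD_POLICIES}


raw = {
    "name": "Jane Doe",
    "job_title": "CTO",
    "company": "Acme",
    "avatar_url": "https://example.com/a.png",
}
print(minimize(raw))  # prints {'job_title': 'CTO', 'company': 'Acme'}
```

Because unlisted fields are dropped by default, adding a new field forces someone to write down its purpose and retention first.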
hiQ Labs v. LinkedIn: the full record
The hiQ v. LinkedIn line of cases is the canonical US web-scraping precedent, but its outcome is often misread. Two distinct rulings matter:
- 9th Circuit, April 2022: affirmed the preliminary injunction and held that the CFAA's "without authorization" clause likely does not reach scraping of public LinkedIn profiles. That holding survived the settlement.
- N.D. Cal., November–December 2022: the district court found hiQ had breached LinkedIn's user agreement and used fake accounts/turkers to bypass authentication. The parties settled with a $500,000 judgment, a permanent injunction barring hiQ from scraping LinkedIn, and destruction of the scraped corpus.
Reading: the CFAA shield for logged-out public scraping is intact, but state tort claims (trespass to chattels, misappropriation) and breach of contract still apply when you circumvent login walls or violate plain-language terms. Treat hiQ as one strong data point, not a blanket license.
Meta v. Bright Data: the 2024 sequel
In January 2024, Judge Edward Chen of the Northern District of California granted summary judgment to Bright Data, holding that Meta's Facebook and Instagram terms "do not bar logged-off scraping of public data; perforce it does not prohibit the sale of such public data." Meta dropped the rest of the case in February 2024 and waived appeal. The decision reinforces three points for 2026 scrapers:
- The logged-out posture is now the most legally defensible mode for public-page scraping.
- Once a scraper terminates its account, post-termination logged-out scraping is generally outside the contract.
- Operators who want to block scraping must use technical controls and rights-reservation signals, not just buried ToS clauses.
EU AI Act and the TDM opt-out (from August 2026)
The EU AI Act's general-purpose AI obligations have applied since 2 August 2025, and the AI Office's enforcement powers, including fines, apply from 2 August 2026. Providers must comply with the Article 4(3) DSM TDM opt-out when training on scraped content, and the AI Office can fine non-compliance up to 3% of global turnover or €15M, whichever is higher. The European Commission consultation is standardizing machine-readable protocols (robots.txt directives, ai.txt, and similar), and recent German case law has accepted natural-language opt-outs in terms of use as "machine-readable." If you are building a corpus for model training, robots.txt now has direct copyright weight, not just etiquette weight.
What Apify does for compliance-minded scraping
Apify is legitimate software: a cloud runtime for automation, datasets, and scheduling. It does not decide whether your target or your fields are lawful. That remains your responsibility.
Platform-side alignment you can rely on operationally:
- Documentation on acceptable use: read Apify's Acceptable Use Policy alongside your counsel's guidance.
- Engineering controls in tools like Crawlee for queues, rate limits, backoff, and session hygiene. These reduce harm to target sites and support proportionate access patterns.
- Transparency: identify bots clearly where appropriate; avoid credential stuffing or circumventing technical barriers you are not authorized to bypass.
- Compliant lead-gen workflows: see our lead generation use cases for examples that respect GDPR minimization.
For Apify's own commentary, see Is web scraping legal? on the Apify blog.
Best practices for legally safer scraping in 2026
- Stay logged-out unless you have written permission or a clear legal opinion for authenticated scraping. Meta v. Bright Data made this the safest US posture.
- Respect robots.txt and ai.txt: under the EU AI Act and EDPB Opinion 28/2024 they now carry copyright and GDPR weight.
- Throttle concurrency; backoff on errors; do not degrade third-party services. State trespass-to-chattels claims (the hiQ settlement basis) hinge on harm.
- Minimize personal data; pseudonymize where possible; delete on schedule. Build a legitimate-interest balancing record before processing EU/UK identities.
- Do not republish copyrighted articles, images, or long creative excerpts without rights, especially for model training under EU TDM rules.
- Document purpose, data inventory, and legal review checkpoints, especially before reselling data or training models on scraped corpora.
When in doubt, assume "public HTML" ≠ "free to reuse for any purpose."
Use Apify to enforce rate limits, retries, and repeatable jobs, then pair that discipline with legal review for targets and fields.
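For readers who want to see what throttling and backoff look like outside a framework, here is a minimal standard-library sketch. It is not Crawlee's or Apify's API: `polite_fetch`, `TransientError`, and the default delays are assumptions chosen for illustration, and the `fetch` callable stands in for whatever HTTP client you use.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Raised by the caller's fetch function for retryable failures (e.g. HTTP 429 or 503)."""


def polite_fetch(
    fetch: Callable[[str], str],
    url: str,
    *,
    min_interval: float = 2.0,
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url) with a fixed pause before each request and exponential backoff on errors."""
    for attempt in range(max_retries + 1):
        sleep(min_interval)  # throttle: pause before every request, including retries
        try:
            return fetch(url)
        except TransientError:
            if attempt == max_retries:
                raise  # give up instead of hammering a struggling site
            # Exponential backoff with a little jitter: about 1s, 2s, 4s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError("unreachable")


# Demo with a fake fetcher that fails twice; delays are recorded instead of slept.
attempts = {"count": 0}


def flaky_fetch(url: str) -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError()
    return "<html>ok</html>"


delays = []
body = polite_fetch(flaky_fetch, "https://example.com/products", sleep=delays.append)
print(body, [round(d, 1) for d in delays])
```

Injecting `sleep` keeps the sketch testable; in production you would leave the default `time.sleep` and also log each delay as evidence of conservative request rates.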
Web scraping is generally legal for publicly accessible data accessed in a logged-out state, but terms of service, GDPR/CCPA, copyright, and computer-access laws still apply. The 2022 hiQ v. LinkedIn 9th Circuit holding shields logged-out scraping of public pages from CFAA liability, and the 2024 Meta v. Bright Data ruling held that Meta's terms do not bar it, but breach-of-contract and state-tort claims remain live. This is not legal advice; consult counsel for your use case.
Yes. Apify is a legitimate automation and data extraction platform used by enterprises and developers worldwide. Legality depends on how you use it: your targets, credentials, data fields, and retention must comply with applicable law and site rules.
No platform can grant blanket permission to ignore third-party rights. Apify publishes an Acceptable Use Policy and expects customers to follow the law and site terms. Prohibited or abusive activity can violate platform policy even when a target page is public.
GDPR regulates personal data, not scraping as a whole. The EDPB Opinion 28/2024 confirms legitimate interest can be a lawful basis for AI-related scraping under a strict three-step test, with robots.txt/ai.txt compliance listed as a mitigation. You still need data minimization, retention limits, and DSAR handling for personal data.
Respecting robots.txt is best practice, and increasingly a legal signal. The EU AI Act and EDPB guidance now treat robots.txt (and the emerging ai.txt) as machine-readable rights-reservation signals for copyright opt-outs and as a mitigation factor under GDPR legitimate-interest analysis. Ignoring it weakens both copyright and privacy defenses.
hiQ v. LinkedIn produced two rulings. The 9th Circuit (2022) held the CFAA's 'without authorization' clause does not reach scraping of public LinkedIn profiles. The N.D. Cal. district court (Nov 2022) then found hiQ liable for breach of contract and California state torts because hiQ used fake accounts to bypass authentication; the parties settled with a $500K judgment and a permanent scraping injunction against hiQ.
In January 2024, Judge Chen of the Northern District of California granted summary judgment to Bright Data, holding that Meta's Facebook and Instagram terms do not bar logged-off scraping of public data, nor the sale of that data. Meta dropped the case in February 2024 and waived appeal. It is now the leading US contract-law authority for logged-out public scraping.
Scraping behind a login is higher risk: you are bound by account contracts, may be circumventing a technical barrier (a CFAA red flag after hiQ), and the data may be non-public or more clearly personal under GDPR. Many teams restrict production scrapers to unauthenticated pages unless they have explicit permission or counsel-approved grounds.
Sources
- Apify: Is web scraping legal?
- Apify Acceptable Use Policy
- Apify Legal hub
- hiQ Labs v. LinkedIn: Wikipedia case file
- hiQ v. LinkedIn proposed consent judgment (Privacy World)
- Meta v. Bright Data summary judgment analysis (Eric Goldman)
- EDPB Opinion 28/2024 on AI models and GDPR
- CNIL guidance on web scraping for AI development (2025)
- EU AI Act TDM and transparency playbook (IAPP)
- European Commission consultation on TDM rights-reservation protocols
Common mistakes and fixes
I do not know if my target dataset contains personal data.
Classify each field first and remove any data that is not essential to your objective.
Terms of service appear ambiguous about automation.
Escalate to legal review and keep logs proving conservative request rates and public access.
Compliance asks for deletion and retention controls.
Store source metadata, define retention windows, and implement deletion workflows by key.
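The last fix can be sketched as two small functions over an in-memory dataset. This is an assumption-laden illustration, not a storage design: the record keys (`subject_id`, `source_url`, `collected_at`), the function names, and the 90-day window are hypothetical, and a real system would run the same logic against its database.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(records: list[dict], retention_days: int, now: datetime) -> list[dict]:
    """Return only records collected within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if r["collected_at"] >= cutoff]


def delete_by_key(records: list[dict], key: str, value: str) -> list[dict]:
    """Return the records with every entry where record[key] == value removed."""
    return [r for r in records if r.get(key) != value]


now = datetime(2026, 5, 18, tzinfo=timezone.utc)
dataset = [
    {"subject_id": "a1", "source_url": "https://example.com/p/1", "collected_at": now - timedelta(days=10)},
    {"subject_id": "b2", "source_url": "https://example.com/p/2", "collected_at": now - timedelta(days=200)},
    {"subject_id": "c3", "source_url": "https://example.com/p/3", "collected_at": now - timedelta(days=30)},
]

kept = purge_expired(dataset, retention_days=90, now=now)  # drops b2 (200 days old)
kept = delete_by_key(kept, "subject_id", "c3")             # handles a deletion request for c3
print([r["subject_id"] for r in kept])  # prints ['a1']
```

Keeping `source_url` and `collected_at` on every record is what makes both the retention sweep and a keyed deletion request answerable later.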



