use-apify.com

Web scraping: guides & tutorials

Extract structured data from websites with code: crawling, parsing, and anti-bot handling for engineers building datasets and automations on Apify.

117 articles · Page 1 of 12


Web scraping turns public web pages into structured datasets you can analyze, monitor, or feed into AI. These guides cover the full workflow: choosing between HTTP requests and headless browsers, parsing HTML with CSS or XPath selectors, handling pagination and infinite scroll, and getting past rate limits and bot detection without breaking sites or laws.
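The parsing and pagination steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a recommended stack: `html.parser` stands in for a real CSS or XPath selector library, and the `title` class, the `rel="next"` link, and the sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect the text of <h2 class="title"> elements and the rel="next" link."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.next_url = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h2" and "title" in classes:
            # Rough equivalent of the CSS selector "h2.title".
            self._in_title = True
            self.titles.append("")
        elif tag == "a" and "next" in (attrs.get("rel") or "").split():
            # Rough equivalent of the CSS selector 'a[rel~="next"]'.
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def parse_listing(html):
    """Return (titles, next_page_url) for one listing page."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles], parser.next_url


# Hypothetical markup, used only to exercise the parser.
SAMPLE_PAGE = """
<html><body>
  <h2 class="title">First item</h2>
  <h2 class="title featured">Second item</h2>
  <h2 class="other">Not a title</h2>
  <a rel="next" href="/page/2">Next</a>
</body></html>
"""

titles, next_url = parse_listing(SAMPLE_PAGE)
print(titles, next_url)
```

A crawl loop would keep calling `parse_listing` on each fetched page until `next_url` comes back as `None`.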

Whether you write your own scraper in Python or JavaScript or run a ready-made actor from the Apify Store, the patterns are the same. Start with a small, well-behaved crawl, add proxies and retries as targets get stricter, and export clean JSON or CSV your pipeline can trust. The tutorials below take you from a first script to production crawls running on a schedule.
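To make "clean JSON or CSV your pipeline can trust" concrete, here is a small sketch of the export step. The field names (`title`, `price`, `url`) and the dollar-formatted price are assumptions chosen for the example; a real scraper would normalize whatever fields its target exposes.

```python
import csv
import io
import json

FIELDS = ["title", "price", "url"]  # hypothetical schema for this example


def clean_record(raw):
    """Trim whitespace and turn a price string like "$1,299.00" into a float."""
    price_text = raw["price"].strip().replace("$", "").replace(",", "")
    return {
        "title": raw["title"].strip(),
        "price": float(price_text),
        "url": raw["url"].strip(),
    }


def to_json(records):
    """Serialize records as stable, human-readable JSON."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records):
    """Serialize records as CSV with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up scraped rows, including the stray whitespace scrapers often pick up.
raw_rows = [
    {"title": "  Widget ", "price": "$1,299.00", "url": "https://example.com/widget "},
]
records = [clean_record(row) for row in raw_rows]
print(to_json(records))
print(to_csv(records))
```

Cleaning before export means a malformed price fails loudly in the scraper (as a `ValueError`) instead of quietly reaching whatever consumes the file.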

Related topics

Bright Data

Bright Data vs ScraperAPI: Which Proxy Platform Wins?

· 12 min read
Achraf Bizyane
Software Engineer

Bright Data and ScraperAPI both solve the same core problem: getting your scraper past anti-bot systems. But they solve it very differently.

ScraperAPI is a lightweight proxy pass-through API. Send a URL, get HTML back. Simple, cheap at small scale, and you own the scraper logic.

Bright Data is an enterprise proxy network plus managed datasets and a cloud browser. More powerful unblocking, more features, higher price tag.

This is a split-decision comparison. Neither is universally "better" — it depends on your volume, target sites, and budget.

Alternatives

Octoparse Alternatives: 7 Tools for Web Scraping Without the Limits

· 9 min read
Achraf Bizyane
Software Engineer

Octoparse is a solid no-code scraper. But if you're hitting its limits—Windows-only builder, JavaScript ceiling, API locked behind Professional tier, or $75/month pricing for light use—you have options. This guide compares seven alternatives, ranked by what matters: free tier, JavaScript support, pricing, and where each tool actually shines.

Quick Answer

Apify for developers and APIs. ParseHub for visual scraping with better JavaScript. WebScraper.io for lightweight, extension-based work. Browse.ai for monitoring. Bardeen for automation workflows. PhantomBuster for LinkedIn and social. Octoparse if pure no-code and templates still fit your workflow.

ScraperAPI

ScraperAPI Review: Honest Verdict on Pricing, Pros & Cons

· 9 min read
Achraf Bizyane
Software Engineer

ScraperAPI is a proxy rendering API. You send it a URL, and it returns the page's HTML, fetched through its managed proxy pool. Simple, focused, and useful for a specific job: adding a reliable proxy layer to code you already have.

But "simple" doesn't mean "best for everything." This review covers what ScraperAPI actually does, what it costs, where it shines, and where it falls short.

Proxy-Seller

Proxy-Seller Review (2026): All 5 Proxy Types, Real Pricing, and When It Fits

· 21 min read
Yassine El Haddad
Software Developer & Automation Specialist

Most proxy vendors make you choose: raw IPs at good prices from a scrappy provider, or a polished dashboard from an enterprise vendor charging three times more. Proxy-Seller has spent a decade building a third option: all five proxy classes (datacenter IPv4, datacenter IPv6, ISP, residential, and mobile) under one account, at prices that stay competitive even against single-category specialists, with compliance certs that survive legal review.

They've been running since 2014. 500,000+ clients. Every IP exclusively yours, never shared. The residential pool spans 20M+ IPs across 220+ countries.

Is it the right pick for your stack? That depends on whether you want raw IPs you control, or a managed extraction layer someone else runs. This review walks through verified pricing for all five proxy types, the three discount mechanisms that stack, where Proxy-Seller is genuinely strong, and two scenarios where a different provider will serve you better.

AI agents

OpenClaw Ecosystem Analysis 2026: Growth, Signals, and Local AI Stacks

· 8 min read
Yassine El Haddad
Software Developer & Automation Specialist

OpenClaw is a self-hosted AI assistant gateway: it connects chat channels (Telegram, Discord, web, and more) and tools to an LLM you choose—often Ollama or vLLM on your own hardware, or a cloud API when you accept that tradeoff. It is not a foundation model; it is orchestration you run yourself.

In March 2026 the project drew unusual attention—including a milestone our editors cited in the weekly roundup (Top 10 AI and tech stories this week). This is time-stamped commentary, not a substitute for upstream docs: channel lists, defaults, and feature names change; confirm behavior, licensing, and security advisories in the official project before production. The piece separates what that attention reflects from what still depends on your own ops discipline, and shows where OpenClaw sits next to local inference, workflow automation, and data collection layers.

Apify

Apify + Clay: Use Web Scraping to Enrich Your Personal CRM

· 6 min read
Yassine El Haddad
Software Developer & Automation Specialist

Clay (now Mesh) does a lot of the heavy lifting when you connect email, calendar, LinkedIn, and Twitter. What it won’t do on its own is keep polling the open web forever: enrichment tends to reflect what was true when the contact landed in your book, not every headline or title change afterward.

Apify is where scheduled scraping helps — job moves, company news, fresh posts, GitHub activity — then you fold those findings back into Mesh as notes or updates.

Here are three workflows that combine the two without pretending there’s a single “native” button for it.

AI agents

LangGraph vs AutoGen vs CrewAI 2026: Which One Ships

· 8 min read
Yassine El Haddad
Software Developer & Automation Specialist

Production “agents” are mostly orchestration: LLM calls, tools, memory/state, retries, and guardrails. Three ecosystems lead in 2026—LangGraph, AutoGen, and CrewAI—each with different ergonomics for web data workloads.

Quick Answer

Pick LangGraph 1.0 for production agents that need stateful graphs, retries, and resumable checkpoints — it now powers agents at Uber, LinkedIn, and Klarna. Pick AutoGen 0.4 AgentChat when multi-agent debate is the product. Pick CrewAI for role-based workflows (researcher → editor → analyst) that map to org charts. For web data inside any of them, expose Apify Actors via REST, langchain-apify, or the Apify MCP server.
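For the REST route, an agent tool usually boils down to one HTTP call. The sketch below only builds the request, without sending it. It assumes the Apify API v2 `run-sync-get-dataset-items` endpoint, the `username~actor-name` path form, and Bearer-token auth, so confirm all three against the Apify API reference. The Actor name, input, and token are placeholders.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"  # assumed base URL of the Apify API


def build_actor_run_request(actor_id, actor_input, token):
    """Build (but do not send) a POST that runs an Actor and returns its dataset items."""
    # Assumption: the path uses "username~actor-name" rather than "username/actor-name".
    path_id = urllib.parse.quote(actor_id.replace("/", "~"), safe="~")
    url = f"{API_BASE}/acts/{path_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder Actor, input, and token: none of these refer to a real Actor or account.
request = build_actor_run_request(
    "someuser/some-actor",
    {"startUrls": [{"url": "https://example.com"}]},
    "YOUR_APIFY_TOKEN",
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would perform the call. An agent framework would wrap that in a tool function and parse the JSON response.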

Amazon

How to Scrape Amazon Product Data with Apify 2026: ASINs, Prices, and Reviews

· 4 min read
Yassine El Haddad
Software Developer & Automation Specialist

Amazon is a primary source for product pricing, review sentiment, and competitive research. Scraping it yourself is notoriously difficult: Amazon deploys heavy bot protection, JavaScript rendering, and geo-pricing.

Apify's Amazon scrapers handle all of this with residential proxies, CAPTCHA solving, and structured output. No code required.

Legal note: Amazon ToS prohibits unauthorized scraping. Only scrape publicly displayed pricing data for research, price comparison, and competitive intelligence. Never create accounts programmatically or access private data.

Actors

Build and Deploy Your First Apify Actor: Step-by-Step Tutorial (2026)

· 4 min read
Yassine El Haddad
Software Developer & Automation Specialist

An Apify Actor is a serverless scraper or automation packaged for cloud execution. You write standard Node.js code, push it to Apify, and it runs on demand — with built-in proxies, storage, scheduling, and API access included.

This tutorial takes you from an empty folder to a deployed, runnable Actor in about 20 minutes.

Freshness note: Steps verified with Apify CLI 3.x and Apify SDK 3.x (March 2026).

Apify

Apify vs Scrapy 2026: Which Web Scraping Tool Should You Use?

· 4 min read
Yassine El Haddad
Software Developer & Automation Specialist

Scrapy is the mature Python web crawling framework. Apify is a cloud platform (with Crawlee as its open-source framework) that handles infrastructure, scaling, and storage on top of Node.js.

They're not direct competitors — Scrapy is a code framework, Apify is a full platform — but teams frequently choose between them. This comparison covers where each excels.


Frequently asked questions


Web scraping automatically extracts publicly available data from websites — no manual copy-paste. Common business uses: monitoring competitor prices in real time, building lead lists from directories and LinkedIn, tracking product reviews and brand mentions, fueling AI models with fresh training data, and keeping internal databases synced to live sources. If a website shows it, scraping can collect it on a schedule.

The fastest path is an Apify Store actor. Search for your target site (e.g. "Google Maps Scraper"), fill in the inputs, run it, and download your data. For sites without ready-made scrapers, Apify lets you write a simple script — dozens of tutorials on this blog cover common starting points. If you need a custom scraper built and deployed for you, the deployment services on this site cover that too.

Scraping publicly available, non-personal data is generally accepted in most jurisdictions, but platform terms of service, GDPR, CCPA, and copyright law create real constraints. The short rule: scraping public facts — prices, addresses, product names — for research or business intelligence is widely accepted; scraping personal data or bypassing a paywall is not. Always review the target site's ToS and consult a lawyer for commercial pipelines that handle personal information.

Sites detect and block automated traffic using CAPTCHAs, IP rate limits, JavaScript challenges, and browser fingerprinting. The fix is usually rotating residential proxies, using a full browser instead of a plain HTTP request, and pacing requests like a human would. Apify handles most of this automatically via its proxy network and browser-based actors. For persistent blocking on a specific site, a specialist can diagnose the detection mechanism.
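If you run your own scraper rather than a managed actor, the proxy rotation and pacing described above might look roughly like this. It is a sketch under stated assumptions: `fetch` is a caller-supplied function, `Blocked` is a hypothetical exception that function raises on a block such as HTTP 429, and the retry count and delays are illustrative defaults, not tuned values.

```python
import itertools
import random
import time


class Blocked(Exception):
    """Hypothetical error: raised by fetch() when the target blocks the request."""


def fetch_with_retries(fetch, url, proxies, max_attempts=4, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url, proxy), switching proxy and backing off after each block."""
    proxy_cycle = itertools.cycle(proxies)
    for attempt in range(max_attempts):
        proxy = next(proxy_cycle)  # rotate to the next proxy on every attempt
        try:
            return fetch(url, proxy)
        except Blocked:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the block
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 50% random jitter,
            # so retries do not arrive on a fixed, machine-like schedule.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so the function can be exercised without real waiting. In production you would leave it at the default and pass a `fetch` that sends the request through the given proxy.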