
Complete Guide to Web Scraping with JavaScript and Node.js in 2026

· 8 min read
Yassine El Haddad
Software Developer & Automation Specialist

I build production AI agents, web scrapers, and automation pipelines. Most of what I publish here comes from the actual problems they run into: proxies that get banned, anti-bot stacks that fingerprint your client, RAG that drifts when the underlying data moves. Stack: Python, TypeScript, Go, FastAPI, LangChain, Crawlee, Playwright, deployed on AWS, GCP, and Cloudflare.

JavaScript and Node.js power some of the most capable web scrapers in 2026. The ecosystem spans Axios and node-fetch for HTTP, Cheerio for HTML parsing, Playwright and Puppeteer for browser automation, and Crawlee as the full framework that powers Apify Actors. This guide covers the JS scraping stack, comparison tables, TypeScript patterns, data output options, and a complete Crawlee TypeScript Actor example. Try Apify to run Crawlee Actors in the cloud.

JavaScript Scraping Ecosystem 2026

| Library | Browser Support | TypeScript | Anti-Detect | Best For |
|---|---|---|---|---|
| Axios + Cheerio | ❌ | ✅ | Limited | Fast static HTML, APIs |
| node-fetch + Cheerio | ❌ | ✅ | Limited | Same as Axios, native fetch |
| Playwright | ✅ | ✅ | Via stealth add-ons | SPAs, anti-bot |
| Puppeteer | ✅ | ✅ | Via puppeteer-extra | Chrome-focused automation |
| Crawlee | ✅ (Playwright/Puppeteer) or ❌ (Cheerio) | ✅ | Built-in proxy, request queue | Full crawlers, Apify Actors |

For static pages, Axios + Cheerio is the fastest path. For JavaScript-heavy sites, use Playwright or Crawlee's PlaywrightCrawler. Crawlee bundles a request queue, retries, and Apify Dataset integration, which makes it a good fit for production and Apify deployment. See Playwright vs Puppeteer vs Selenium 2026 for a browser tool comparison.

Axios + Cheerio: Fast HTML Scraping

Cheerio provides a jQuery-like API for parsing HTML. Combine with Axios for simple requests:

const axios = require('axios');
const cheerio = require('cheerio');

// Fetch a page and collect every absolute http(s) link on it.
async function scrapePage(url) {
  const { data } = await axios.get(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; Bot/1.0)' },
    timeout: 15000, // milliseconds
  });
  const $ = cheerio.load(data);
  const links = [];
  // Only hrefs starting with "http" match; relative links are skipped.
  $('a[href^="http"]').each((_, el) => {
    links.push($(el).attr('href'));
  });
  return { url, links };
}

For TypeScript, add types for the scraped shape. Cheerio 1.x ships its own type definitions; the separate @types/cheerio package is only needed for older 0.x releases.
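As a rough illustration of typing the scraped shape, the sketch below defines a `PageLinks` interface matching the `{ url, links }` object returned by `scrapePage` above. The `normalizeLinks` helper is hypothetical (it is not part of Cheerio or Axios); it resolves raw hrefs against the page URL with the standard `URL` class and removes duplicates.

```typescript
// Shape of the object returned by scrapePage() in the example above.
interface PageLinks {
  url: string;
  links: string[];
}

// Hypothetical helper: resolve raw href values against the page URL,
// keep only http(s) links, and drop duplicates while preserving order.
function normalizeLinks(pageUrl: string, hrefs: Array<string | undefined>): PageLinks {
  const seen = new Set<string>();
  for (const href of hrefs) {
    if (!href) continue; // attr('href') can return undefined
    let absolute: URL;
    try {
      absolute = new URL(href, pageUrl);
    } catch {
      continue; // skip malformed hrefs
    }
    if (absolute.protocol !== 'http:' && absolute.protocol !== 'https:') continue;
    seen.add(absolute.href);
  }
  return { url: pageUrl, links: Array.from(seen) };
}
```

Because the return type is explicit, a later change to the scraped shape shows up as a compile error rather than a silent runtime difference.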

Playwright for Node.js: Browser Automation

Playwright drives Chromium, Firefox, or WebKit. Use it when the target requires JavaScript execution:

const { chromium } = require('playwright');

async function scrapeSPA(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    // Wait for the listing container before reading the cards.
    await page.waitForSelector('.product-listing', { timeout: 15000 });

    const products = await page.locator('.product-card').evaluateAll((nodes) =>
      nodes.map((el) => ({
        title: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
      }))
    );

    return { url, products };
  } finally {
    // Close the browser even if navigation or extraction throws.
    await browser.close();
  }
}

Playwright supports network interception and route handlers for blocking ads or mocking responses. For anti-bot targets, pair with Bright Data Scraping Browser or residential proxies.
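One way to use route handlers for blocking is sketched below. The blocked resource types and the host suffix list are assumptions chosen for illustration, and `RouteLike` is a minimal stand-in interface covering only the `request()`, `abort()`, and `continue()` methods of Playwright's `Route`, so the snippet compiles without Playwright installed.

```typescript
// Resource types that rarely matter for data extraction (an assumption;
// adjust per target, since some sites need more assets to render correctly).
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font']);

// Hypothetical list of ad host suffixes, for illustration only.
const BLOCKED_HOST_SUFFIXES = ['doubleclick.net', 'googlesyndication.com'];

// Pure decision function, so it can be unit-tested without a browser.
function shouldBlockRequest(resourceType: string, requestUrl: string): boolean {
  if (BLOCKED_RESOURCE_TYPES.has(resourceType)) return true;
  let host: string;
  try {
    host = new URL(requestUrl).hostname;
  } catch {
    return false; // leave unparseable URLs alone
  }
  return BLOCKED_HOST_SUFFIXES.some((suffix) => host.endsWith(suffix));
}

// Minimal structural type covering the parts of Playwright's Route used here.
interface RouteLike {
  request(): { resourceType(): string; url(): string };
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Abort blocked requests and let everything else through.
async function blockHandler(route: RouteLike): Promise<void> {
  const req = route.request();
  if (shouldBlockRequest(req.resourceType(), req.url())) {
    await route.abort();
  } else {
    await route.continue();
  }
}
```

With Playwright installed, the handler would be registered before navigation, for example `await page.route('**/*', blockHandler)` ahead of `page.goto(url)`.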

Crawlee TypeScript: CheerioCrawler and PlaywrightCrawler

Crawlee provides type-safe crawlers with a built-in request queue, automatic retries, and Apify storage. Two main crawler types:

  • CheerioCrawler — static HTML, fast, low memory
  • PlaywrightCrawler — JavaScript-rendered pages, full browser

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 100,
  // pushData from the handler context writes to the default dataset.
  async requestHandler({ $, pushData }) {
    const items = $('.item').map((_, el) => ({
      title: $(el).find('.title').text().trim(),
      link: $(el).find('a').attr('href'),
    })).get();
    // pushData accepts a single object or an array of objects.
    await pushData(items);
  },
});

await crawler.run(['https://example.com']);

For Playwright, swap to PlaywrightCrawler and use context.page instead of $. Crawlee's pushData writes to the default storage (local or Apify Dataset when run on Apify).

Async Patterns: Promise.all, Async Iteration

Parallel requests with Promise.all:

const urls = ['https://a.com', 'https://b.com', 'https://c.com'];
const results = await Promise.all(urls.map(url => scrapePage(url)));
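Note that `Promise.all` rejects as soon as any single request fails, which discards the results that did succeed. A minimal sketch of a more forgiving variant, using the standard `Promise.allSettled`, might look like this; `settleAll` is a hypothetical helper name, and `task` stands in for a function such as `scrapePage`.

```typescript
interface SettledBatch<T> {
  ok: T[];
  failed: { input: string; reason: string }[];
}

// Run all tasks in parallel and separate successes from failures
// instead of rejecting on the first error.
async function settleAll<T>(
  inputs: string[],
  task: (input: string) => Promise<T>,
): Promise<SettledBatch<T>> {
  const settled = await Promise.allSettled(inputs.map((input) => task(input)));
  const batch: SettledBatch<T> = { ok: [], failed: [] };
  settled.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      batch.ok.push(result.value);
    } else {
      // Keep the failing input so it can be retried or logged later.
      batch.failed.push({ input: inputs[i], reason: String(result.reason) });
    }
  });
  return batch;
}
```

The `failed` list can then feed a retry pass or a log entry, while the successful results are stored immediately.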

Bounded concurrency with p-limit or a manual semaphore:

import pLimit from 'p-limit';
const limit = pLimit(5);
const results = await Promise.all(
  urls.map(url => limit(() => scrapePage(url)))
);
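For the manual route, one dependency-free approach is a small worker pool rather than a literal semaphore: a fixed number of workers pull the next index from a shared counter until the input is exhausted. The sketch below is an assumption about how such a helper could look; `mapWithConcurrency` is a hypothetical name, not a library function.

```typescript
// Run `task` over `inputs` with at most `concurrency` tasks in flight.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  inputs: T[],
  concurrency: number,
  task: (input: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(inputs.length);
  let next = 0; // index of the next unclaimed input

  async function worker(): Promise<void> {
    while (next < inputs.length) {
      const index = next++; // claimed synchronously, before any await
      results[index] = await task(inputs[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, inputs.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

As with `Promise.all`, the first rejected task rejects the whole call, so wrap `task` in a try/catch if partial results should survive.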

Async iteration for streaming through large URL lists:

// processUrlsInBatches and saveToDb are placeholders for your own helpers.
for await (const result of processUrlsInBatches(urls, 10)) {
  await saveToDb(result);
}
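The snippet above leaves `processUrlsInBatches` undefined. A minimal sketch of one possible implementation as an async generator follows; the third `scrape` parameter is an assumption added here so the helper does not depend on a specific scraping function.

```typescript
// Hypothetical implementation: scrape URLs in fixed-size batches and
// yield each result once its batch has completed.
async function* processUrlsInBatches<T>(
  urls: string[],
  batchSize: number,
  scrape: (url: string) => Promise<T>,
): AsyncGenerator<T> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  for (let start = 0; start < urls.length; start += batchSize) {
    const batch = urls.slice(start, start + batchSize);
    // Requests inside one batch run in parallel; batches run sequentially.
    const results = await Promise.all(batch.map((url) => scrape(url)));
    for (const result of results) {
      yield result; // hand results to the consumer one at a time
    }
  }
}
```

The consumer controls the pace: the next batch is only requested when the `for await` loop asks for more results, which keeps memory use bounded for long URL lists.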

See Apify error handling for retry and fallback patterns when requests fail.

TypeScript Types and Zod Validation

Define interfaces for scraped data and validate with Zod:

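import { z } from 'zod';

const ProductSchema = z.object({
  title: z.string(),
  // Matches prices such as "$19.99"; values with thousands separators like "$1,299" fail.
  price: z.string().regex(/^\$[\d.]+$/),
  inStock: z.boolean().optional(),
});

type Product = z.infer<typeof ProductSchema>;

// Throws a ZodError when raw does not match the schema.
function validateProduct(raw: unknown): Product {
  return ProductSchema.parse(raw);
}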

Use ProductSchema.safeParse() for non-throwing validation. Typed data improves downstream pipelines and catches schema drift early.

Data Output: JSON Lines, CSV, SQLite, Apify Dataset

JSON lines — stream-friendly, one object per line:

import { createWriteStream } from 'fs';
const out = createWriteStream('output.jsonl', { flags: 'a' });
out.write(JSON.stringify(item) + '\n');
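To make the format concrete, here is a small sketch of serializing and parsing JSON Lines with nothing but `JSON.stringify` and `JSON.parse`. The helper names `toJsonLines` and `parseJsonLines` are hypothetical.

```typescript
// Serialize records to JSON Lines: one JSON document per line.
// JSON.stringify escapes embedded newlines, so each record stays on one line.
function toJsonLines(records: Array<Record<string, unknown>>): string {
  return records.map((record) => JSON.stringify(record) + '\n').join('');
}

// Parse JSON Lines back into values, skipping blank lines.
function parseJsonLines(content: string): unknown[] {
  return content
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}
```

Because every line is a complete JSON document, a crashed run leaves at most one truncated line at the end of the file, and the rest remains readable.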

CSV — use json2csv or manual serialization for flat structures.
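For the manual route, the main pitfall is quoting. The sketch below follows the common RFC 4180 conventions (fields containing commas, quotes, or line breaks are wrapped in double quotes, and embedded quotes are doubled). `csvField` and `toCsv` are hypothetical helper names, and the column order is passed explicitly.

```typescript
// Quote a single CSV field: wrap it in double quotes when it contains a
// comma, quote, or line break, and double any embedded quotes.
function csvField(value: unknown): string {
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Serialize flat records to CSV using the given column order.
function toCsv(rows: Array<Record<string, unknown>>, columns: string[]): string {
  const header = columns.map((col) => csvField(col)).join(',');
  const lines = rows.map((row) => columns.map((col) => csvField(row[col])).join(','));
  return [header, ...lines].join('\r\n') + '\r\n';
}
```

Nested objects are stringified as `[object Object]` by this sketch, so flatten records first or fall back to JSON Lines for nested data.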

SQLite — with better-sqlite3:

import Database from 'better-sqlite3';
const db = new Database('scraped.db');
db.exec(`CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, title TEXT, price TEXT)`);
const insert = db.prepare('INSERT OR REPLACE INTO products (url, title, price) VALUES (?, ?, ?)');
// Wrap the batch in one transaction instead of committing once per row.
const insertMany = db.transaction((rows) => {
  for (const row of rows) {
    insert.run(row.url, row.title, row.price);
  }
});
insertMany(items);

Apify Dataset — when running on Apify, Actor.pushData() writes to the run's dataset. Fetch via API or export to JSON/CSV from the Apify Console.
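Fetching via the API could look roughly like the sketch below, which targets the "Get dataset items" endpoint of Apify API v2. The helper names are hypothetical, only three of the supported export formats are listed, and the token is sent in the `Authorization` header so it does not end up in URLs or logs. It assumes Node 18+ for the global `fetch`.

```typescript
type DatasetFormat = 'json' | 'jsonl' | 'csv';

// Build the dataset items URL. datasetId comes from the run
// (for example, its defaultDatasetId field).
function datasetItemsUrl(
  datasetId: string,
  options: { format?: DatasetFormat; limit?: number; offset?: number } = {},
): string {
  const endpoint = new URL(
    `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`,
  );
  endpoint.searchParams.set('format', options.format ?? 'json');
  if (options.limit !== undefined) endpoint.searchParams.set('limit', String(options.limit));
  if (options.offset !== undefined) endpoint.searchParams.set('offset', String(options.offset));
  return endpoint.toString();
}

// Fetch items as JSON, authenticating with a bearer token.
async function fetchDatasetItems(datasetId: string, token: string): Promise<unknown> {
  const response = await fetch(datasetItemsUrl(datasetId, { format: 'json' }), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!response.ok) {
    throw new Error(`Apify API responded with status ${response.status}`);
  }
  return response.json();
}
```

For large datasets, `limit` and `offset` allow paging through items instead of downloading everything in one response.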

Complete Crawlee TypeScript Actor Example

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

interface Input {
  startUrls?: { url: string }[];
  maxItems?: number;
}

await Actor.init();

// getInput() resolves to null when no input is provided, so fall back to {}.
const input: Input = (await Actor.getInput<Input>()) ?? {};
const { startUrls = [{ url: 'https://example.com' }], maxItems = 50 } = input;

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  // Each request yields one item here, so the request cap doubles as the item cap.
  maxRequestsPerCrawl: maxItems,
  async requestHandler({ request, page }) {
    await page.waitForSelector('main', { timeout: 15_000 });
    const title = await page.title();
    const text = await page.locator('main').first().innerText();
    await Actor.pushData({
      url: request.loadedUrl ?? request.url,
      title,
      excerpt: text.slice(0, 500),
    });
  },
});

await crawler.run(startUrls);
await Actor.exit();

This Actor runs on Apify with input startUrls and maxItems. Proxy configuration enables IP rotation for tougher targets. For scaffolding new Actors, use apify create and select the Crawlee + Playwright (TypeScript) template—see building an Apify Actor with TypeScript.

Comparison: DIY vs Crawlee vs Apify

| Approach | Setup | Scaling | Proxy | Best For |
|---|---|---|---|---|
| Axios + Cheerio | Quick | Manual | Add manually | One-off, static pages |
| Crawlee (self-host) | Moderate | Your infra | Configure per crawler | Full control, custom pipelines |
| Apify + Crawlee | Minimal | Automatic | Built-in | Managed runs, scheduling, API |

For production pipelines, Crawlee on Apify handles queues, storage, and scaling. Add Bright Data as a custom proxy for anti-bot targets.

Testing and Debugging

Use requestHandler logging to trace failures. For PlaywrightCrawler, enable headless: false locally to watch the browser. Crawlee's Configuration object lets you set purgeOnStart to clear storage between runs during development. Add failedRequestHandler to log and optionally re-queue failed URLs. For integration tests, mock HTTP responses with nock or msw so tests don't hit live targets. See building an Apify Actor with TypeScript for Actor-specific testing patterns.

Environment and Configuration

Store APIFY_TOKEN, proxy URLs, and target base URLs in environment variables. Use process.env in Node or a .env file with dotenv. For Apify Actors, input comes from Actor.getInput(); env vars are for secrets. Configure maxConcurrency and maxRequestsPerCrawl based on target tolerance—aggressive defaults can trigger blocks. Start conservative (e.g., concurrency 2–5) and increase after validating success rates.
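A minimal sketch of reading this configuration is shown below. `APIFY_TOKEN` comes from the text above, while `PROXY_URL` and `MAX_CONCURRENCY` are hypothetical variable names chosen for illustration. The default of 2 and the upper bound of 50 are likewise assumptions that reflect the "start conservative" advice.

```typescript
interface ScraperConfig {
  apifyToken: string | undefined;
  proxyUrl: string | undefined;
  maxConcurrency: number;
}

// Read settings from an environment map (pass process.env in Node).
// PROXY_URL and MAX_CONCURRENCY are hypothetical variable names.
function loadScraperConfig(env: Record<string, string | undefined>): ScraperConfig {
  const parsed = Number.parseInt(env['MAX_CONCURRENCY'] ?? '', 10);
  // Fall back to a conservative default and clamp to the range 1..50.
  const maxConcurrency = Number.isNaN(parsed) ? 2 : Math.min(Math.max(parsed, 1), 50);
  return {
    apifyToken: env['APIFY_TOKEN'],
    proxyUrl: env['PROXY_URL'],
    maxConcurrency,
  };
}
```

Taking the environment as a parameter instead of reading `process.env` directly keeps the function easy to test; in an application the call would be `loadScraperConfig(process.env)`, and the resulting `maxConcurrency` can be passed to the crawler options.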

Deployment: Local vs Apify

Local — Run with node src/main.js or npx crawlee run. Storage goes to ./storage by default. Good for development and small runs.

Apify — Use apify push to deploy. Apify provides input UI, scheduling, webhooks, and dataset export. Your Crawlee code runs unchanged; only the execution environment and storage backend differ. For team collaboration and recurring pipelines, Apify reduces operational burden. Get started with Apify.

Quick Start: Crawlee in 60 Seconds

apify create my-scraper
cd my-scraper
# Edit src/main.ts, then:
npm start

Use apify create (Apify CLI) and choose Crawlee + Playwright or Cheerio. The default template uses CheerioCrawler. Swap to PlaywrightCrawler in main.ts if you need browser rendering. Add Actor.pushData() and proxy configuration for Apify deployment. The Playwright vs Puppeteer vs Selenium comparison helps choose the right browser driver when moving from Cheerio to full browser automation.

Start with the right crawler

Static HTML → CheerioCrawler. JavaScript-heavy → PlaywrightCrawler. Need Apify scheduling and storage → deploy as Actor with Apify CLI.




Frequently Asked Questions

When should I use Cheerio and when Playwright?

Use Cheerio (with Axios/fetch) for static HTML, where it is faster and lighter. Use Playwright when the page loads content via JavaScript or requires cookies/sessions.

How do I type scraped data in TypeScript?

Define interfaces or Zod schemas for your output shape. Use z.infer for type inference. Validate raw objects with ProductSchema.parse() before storing.

What does Crawlee add over raw Playwright?

Crawlee adds a request queue, automatic retries, proxy rotation, and Apify storage integration. Use Crawlee when building crawlers; use raw Playwright for simple automation or scripts.

How do I deploy a scraper as an Apify Actor?

Use apify create to scaffold, add Actor.pushData() for output, define INPUT_SCHEMA, then apify push. See building an Apify Actor with TypeScript for the full tutorial.

Can I use Puppeteer with Crawlee?

Yes. Crawlee has PuppeteerCrawler. Playwright, which is maintained by Microsoft, is generally preferred for its cross-browser support, but Puppeteer works well for Chrome-only use cases.

Common mistakes and fixes

Cheerio returns empty on SPA

The target renders its content via JavaScript. Use PlaywrightCrawler or PuppeteerCrawler instead of CheerioCrawler.

Too many open files / connection limits

Reduce maxConcurrency. Set a request handler timeout (requestHandlerTimeoutSecs in Crawlee) so stuck pages are abandoned. Monitor memory; Playwright browsers are heavy.

Dataset export fails or times out

Push data in batches. For large runs, stream to external storage (S3, Postgres) from the handler.
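Batching can be sketched with two small helpers. Both names are hypothetical, and the `sink` parameter stands in for whatever storage call is used; on Apify it could be `(batch) => Actor.pushData(batch)`, since `pushData` accepts an array of items.

```typescript
// Split items into batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Push items batch by batch through any async sink and
// return how many items were written.
async function pushInBatches<T>(
  items: T[],
  size: number,
  sink: (batch: T[]) => Promise<void>,
): Promise<number> {
  let pushed = 0;
  for (const batch of toBatches(items, size)) {
    await sink(batch); // sequential, so a failure stops at a known offset
    pushed += batch.length;
  }
  return pushed;
}
```

The right batch size depends on item size and the storage backend, so treat any specific number as something to tune rather than a recommendation.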