How to scrape website content (no-code tutorial)

Crawl sites and export clean page text for LLMs, retrieval systems, and research, without maintaining your own headless browser farm.

Quick Answer

Apify's Website Content Crawler extracts full-text content from websites in Markdown or HTML format. It handles JavaScript-rendered pages and is optimized for LLM and RAG use cases.

You run the Website Content Crawler inside an Apify account. The free plan includes monthly credits for small and medium tests; check the Actor Pricing tab before large crawls.

Time: ~5 min

Cost: ~$0.20–$5 / 1K pages

Difficulty: Beginner

Output: Markdown / HTML / Text / JSON

The Website Content Crawler is one of Apify's most-used Actors for full-page content (not single-field product scraping). Store stats and README pricing hints were last checked in May 2026; verify live numbers on the Actor page.

For a code-first alternative, Apify’s team publishes a beginner website scraping walkthrough on their blog: How to scrape any website for beginners.

In this guide you will:

Configure a crawl with safe depth and page caps
Pick the right output format for RAG, LLM ingestion, or research
Download structured rows for spreadsheets, notebooks, or vector pipelines

What you need

An Apify account. The free plan includes $5 in monthly credits, enough for many documentation-sized crawls.

Step 1: Open the Website Content Crawler

In the Apify Store, search for Website Content Crawler, or open the Actor page directly.

Choose Try me for free to open the input form.

Step 2: Configure start URLs and scope

Start URLs: e.g. https://docs.example.com (add multiple roots if needed).
Max crawl depth: how many link hops to follow from each start URL (0 = only the start URLs themselves).
Max pages: hard cap to control cost (e.g. 100 for a tutorial).
Optional: include/exclude patterns if the Actor input supports globs or regex. Tighten them for large sites.
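
If you later run the Actor from code instead of the form, the same settings become a JSON-style input. Here is a minimal sketch in Python, assuming the input keys are named `startUrls`, `maxCrawlDepth`, `maxCrawlPages`, `includeUrlGlobs`, and `excludeUrlGlobs`. This guide does not confirm those names, so check them against the Actor's Input tab. `build_crawl_input` is a hypothetical helper.

```python
from urllib.parse import urlparse


def build_crawl_input(start_urls, max_depth=1, max_pages=100,
                      include_globs=None, exclude_globs=None):
    """Build a run-input dict with explicit depth and page caps.

    The key names below are assumptions; confirm them on the Input tab.
    """
    if max_depth < 0:
        raise ValueError("max_depth must be 0 or greater")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    # Reject relative or scheme-less URLs before spending any credits.
    for url in start_urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")

    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "maxCrawlDepth": max_depth,   # 0 = only the start URLs
        "maxCrawlPages": max_pages,   # hard cap to control cost
    }
    # Optional patterns are only added when provided.
    if include_globs:
        run_input["includeUrlGlobs"] = [{"glob": g} for g in include_globs]
    if exclude_globs:
        run_input["excludeUrlGlobs"] = [{"glob": g} for g in exclude_globs]
    return run_input
```

For a tutorial-sized crawl, `build_crawl_input(["https://docs.example.com"], max_depth=2, max_pages=100)` keeps both caps explicit.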

Step 3: Choose your output format

Format	Best for
Markdown	Chunking + embeddings, chat context, notebooks
HTML	Layout-aware processing or downstream HTML tools
Plain text	Simple NLP, keyword stats, lightweight pipelines

Set the format in the Actor configuration before you start the run.
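
If you script the configuration, the format choice can be reduced to a couple of input flags. This is a rough sketch, assuming boolean inputs named `saveMarkdown` and `saveHtml`, and assuming plain text is stored regardless. This guide confirms neither assumption, so verify both on the Input tab. `with_output_format` is a hypothetical helper.

```python
# Hypothetical mapping from the format names in the table to input flags.
FORMAT_FLAGS = {
    "markdown": {"saveMarkdown": True, "saveHtml": False},
    "html": {"saveMarkdown": False, "saveHtml": True},
    "text": {"saveMarkdown": False, "saveHtml": False},
    "plain text": {"saveMarkdown": False, "saveHtml": False},
}


def with_output_format(run_input, fmt):
    """Return a copy of run_input with the flags for one output format."""
    key = fmt.strip().lower()
    if key not in FORMAT_FLAGS:
        raise ValueError(f"unknown format: {fmt!r}")
    # Copy instead of mutating, so the base input can be reused per format.
    return {**run_input, **FORMAT_FLAGS[key]}
```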

Step 4: Start the crawl

Click Start. Monitor the log for rate limits, auth walls, or bot challenges. Long runs may finish with partial data if they hit your max pages. That is expected when you cap scope deliberately.
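
The Start button has a scripted equivalent if you want repeatable runs. The sketch below uses the `apify-client` Python package and assumes the Actor ID `apify/website-content-crawler`, an `APIFY_TOKEN` environment variable, and a client that returns the run record as a dict. The input keys in `main` are also assumptions, and the function names are hypothetical.

```python
import os


def run_crawl(client, run_input, actor_id="apify/website-content-crawler"):
    """Start the Actor, wait for it to finish, and return the run record."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return a result")
    return run


def summarize_run(run):
    """Pick out the fields needed to judge the run and locate its results."""
    return {
        "status": run.get("status"),
        "dataset_id": run.get("defaultDatasetId"),
    }


def main():
    # Imported here so the helpers above work without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = run_crawl(client, {
        "startUrls": [{"url": "https://docs.example.com"}],  # assumed key
        "maxCrawlPages": 100,                                # assumed key
    })
    print(summarize_run(run))
```

Call `main()` once your token is set. A run that stops at the page cap can still report success, so check the row count as well as the status.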

Step 5: Download your content

Open Output or Storage → Datasets. Each row is typically one URL with extracted body content and metadata.

Export JSON, CSV, or Excel, or read the dataset with the Apify API for automated pipelines.
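
For automated pipelines, dataset rows can be fetched over HTTP. This is a minimal sketch using only the standard library, assuming the Apify API v2 dataset items endpoint and a bearer token. The dataset ID comes from your own run, and the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, fmt="json", limit=None, offset=None):
    """Build the items URL for a dataset, with optional paging."""
    params = {"format": fmt, "clean": "true"}  # clean skips empty items
    if limit is not None:
        params["limit"] = str(limit)
    if offset is not None:
        params["offset"] = str(offset)
    safe_id = quote(dataset_id, safe="")
    return f"{API_BASE}/datasets/{safe_id}/items?{urlencode(params)}"


def fetch_items(dataset_id, token, limit=None):
    """Download dataset rows as a list of dicts (performs a network call)."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id, limit=limit),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Fetching a small `limit` first is a cheap way to inspect row shape before pulling the whole dataset.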

Output Data Fields

Field	Description	Example
url	Page URL	https://docs.example.com/getting-started
title	Page title	Getting Started Guide
text	Extracted content (Markdown/text/HTML)	# Getting Started\n\nWelcome to our platform...
metadata.description	Meta description	Learn how to get started with our platform
metadata.languageCode	Language code	en
crawl.depth	Crawl depth from start URL	2
crawl.loadedAt	When the page was crawled	2026-02-10T14:30:00.000Z
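
The dotted names in the table imply nested objects, which spreadsheets handle poorly. The sketch below flattens rows into CSV, assuming items are nested dicts whose keys match the table exactly. Verify the names against a real dataset item, since live field names may differ. `get_path` and `rows_to_csv` are hypothetical helpers.

```python
import csv
import io

# Column names copied from the table above; treat them as assumptions.
COLUMNS = [
    "url", "title", "text",
    "metadata.description", "metadata.languageCode",
    "crawl.depth", "crawl.loadedAt",
]


def get_path(row, dotted):
    """Follow a dotted path through nested dicts; return None if missing."""
    value = row
    for key in dotted.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def rows_to_csv(rows, columns=COLUMNS):
    """Render rows as CSV text with one column per dotted field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        values = [get_path(row, column) for column in columns]
        # Missing fields become empty cells rather than the string "None".
        writer.writerow(["" if value is None else value for value in values])
    return buffer.getvalue()
```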

Actor highlights

Topic	Details
Output formats	Markdown, HTML, plain text
JavaScript	Adaptive rendering: fast HTTP where possible, browser where needed
Link discovery	Follows internal links with configurable depth and caps
Cleaning	Strips boilerplate such as nav and chrome where the Actor allows
AI-oriented Markdown	Structured headings and text friendly for chunking
Scale	Suitable for large crawls when you set limits and monitor cost

Use cases

RAG: Crawl docs or help centers, chunk Markdown, and embed into Pinecone, Qdrant, Milvus, or Weaviate. Pair this with Data for AI & RAG.
LLM training / fine-tuning prep: Export consistent plain text or Markdown corpora (respect licensing and robots rules).
Research & competitive intel: Archive readable article text for analysis without copying unrelated chrome.
Migration: Move legacy HTML sites into Markdown for a new CMS.
SEO content QA: Inventory titles, descriptions, and thin pages when combined with your own scoring.
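
For the RAG use case, the chunking step can be sketched in a few lines. This minimal example splits Markdown at headings and then breaks oversized sections on paragraph boundaries. The 1,000-character default is an arbitrary assumption, and `chunk_markdown` is a hypothetical helper rather than part of the Actor.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown(markdown, max_chars=1000):
    """Split at headings, then split oversized sections on blank lines."""
    sections, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Pack paragraphs greedily until the next one would exceed the cap.
        piece = ""
        for paragraph in section.split("\n\n"):
            candidate = f"{piece}\n\n{paragraph}" if piece else paragraph
            if len(candidate) > max_chars and piece:
                chunks.append(piece)
                piece = paragraph
            else:
                piece = candidate
        if piece:
            chunks.append(piece)
    return chunks
```

Two limits of this sketch: a single paragraph longer than `max_chars` is kept whole, and a `#` comment line inside a fenced code block is treated as a heading. A production splitter would need to handle both.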

Pricing notes

The Website Content Crawler bills on a pay-per-usage model (you pay for the platform Compute Units a run consumes). README examples often cite ~$0.20–$5.00 per 1,000 pages depending on rendering and site weight: static pages skew cheaper than heavy JS. Reconcile with the live Pricing tab before scaling.
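
The range above translates into a quick budget check. At the quoted rates, a 100-page test would cost roughly $0.02 to $0.50, and $5 of credit covers about 1,000 pages at the worst-case rate. The sketch below does the same arithmetic. It treats the README range as a given, so these are estimates and not billing guarantees, and the function names are hypothetical.

```python
LOW_USD_PER_1K = 0.20   # cheaper end of the cited range (static pages)
HIGH_USD_PER_1K = 5.00  # pricier end of the cited range (heavy JS)


def estimate_cost(pages, low=LOW_USD_PER_1K, high=HIGH_USD_PER_1K):
    """Return (best_case, worst_case) cost in USD for a page count."""
    if pages < 0:
        raise ValueError("pages must be 0 or greater")
    return (round(pages / 1000 * low, 4), round(pages / 1000 * high, 4))


def pages_within_budget(budget_usd, rate_per_1k=HIGH_USD_PER_1K):
    """Pages affordable at a given rate; defaults to the worst case."""
    return int(budget_usd / rate_per_1k * 1000)
```

Sizing `Max pages` from the worst-case rate keeps a first run inside the budget even if the site turns out to need browser rendering.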

Run your first crawl

Start with low max pages, confirm row quality, then widen scope.
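
One way to confirm row quality after a small run is to count empty and very short pages before widening scope. The sketch below assumes each row is a dict with a `text` field. The 200-character threshold is an arbitrary assumption to tune per site, and `quality_report` is a hypothetical helper.

```python
def quality_report(rows, min_chars=200):
    """Count rows whose extracted text is empty or suspiciously short."""
    empty = thin = 0
    for row in rows:
        length = len((row.get("text") or "").strip())
        if length == 0:
            empty += 1   # likely an auth wall, bot challenge, or bad selector
        elif length < min_chars:
            thin += 1    # may be a stub page or mostly boilerplate
    total = len(rows)
    return {"total": total, "empty": empty, "thin": thin,
            "ok": total - empty - thin}
```

A high share of empty or thin rows suggests fixing the configuration first, since a larger crawl would only multiply the problem.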

Open the Website Content Crawler on Apify →

Frequently Asked Questions

What is the Website Content Crawler?

It is a Store Actor that crawls websites within the limits you set and writes one dataset row per page with full-text content (Markdown, HTML, or plain text) plus metadata. It is optimized for content extraction rather than scraping individual price or SKU fields.

Does it handle JavaScript-rendered pages?

Yes. It uses adaptive rendering: lightweight HTTP fetching when sufficient and browser-based rendering when the page needs JavaScript to expose the main content.

How much does it cost?

The Website Content Crawler uses pay-per-usage billing, so you pay for the Apify platform Compute Units a run consumes. README examples often show roughly $0.20 to $5.00 per thousand pages depending on complexity. Always read the Actor Pricing tab and your workspace billing settings before large jobs.

Can I use the output for RAG and LLM pipelines?

Yes. Markdown output is especially convenient for chunking and embeddings. Combine exported rows with vector databases and orchestration tools such as LangChain or LlamaIndex.

When should I use the Website Content Crawler instead of the Web Scraper?

Use the Website Content Crawler when you need readable article or documentation text across many URLs. Use the Web Scraper (or targeted Actors) when you need structured fields like prices, SKUs, or ratings.

Is web scraping legal?

Legality depends on jurisdiction, site terms, and how you use the data. This page is not legal advice. Review your compliance obligations and see our Is Apify legal? overview for general guidance.

What is the difference between crawling and scraping?

Crawling discovers URLs by following links (within your depth and domain rules). Scraping extracts data from each visited page. This Actor does both: it crawls according to your settings and scrapes main content from each page it visits.
