How to scrape website content (no-code tutorial)
Crawl sites and export clean page text for LLMs, retrieval systems, and research, without maintaining your own headless browser farm.
Quick Answer
Apify's Website Content Crawler extracts full-text content from websites in Markdown or HTML format. It handles JavaScript-rendered pages and is optimized for LLM and RAG use cases.
You run the Website Content Crawler inside an Apify account. The free plan includes monthly credits for small and medium tests; check the Actor Pricing tab before large crawls.
The Website Content Crawler is one of Apify's most-used Actors for full-page content (not single-field product scraping). Store stats and README pricing hints were last checked in May 2026; verify live numbers on the Actor page.
For a code-first alternative, Apify’s team publishes a beginner website scraping walkthrough on their blog: How to scrape any website for beginners.
In this guide you will:
- Configure a crawl with safe depth and page caps
- Pick the right output format for RAG, LLM ingestion, or research
- Download structured rows for spreadsheets, notebooks, or vector pipelines
What you need
- An Apify account. The free plan includes $5 in monthly credits, enough for many documentation-sized crawls.
Step 1: Open the Website Content Crawler
In the Apify Store, search for Website Content Crawler, or open the Actor page directly.
Choose Try me for free to open the input form.
Step 2: Configure start URLs and scope
- Start URLs: e.g. https://docs.example.com (add multiple roots if needed).
- Max crawl depth: how many link hops to follow from each start URL (0 = only the seed URLs).
- Max pages: hard cap to control cost (e.g. 100 for a tutorial).
- Optional: include/exclude patterns if the Actor input supports globs or regex. Tighten them for large sites.
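If you later run the same crawl through the API instead of the form, the settings above map to a JSON-style input object. The sketch below shows one way to build it in Python. The field names (`startUrls`, `maxCrawlDepth`, `maxCrawlPages`, `excludeUrlGlobs`) and the helper `build_crawl_input` are assumptions for illustration; confirm the real names on the Actor's Input tab before relying on them.

```python
import json


def build_crawl_input(start_urls, max_depth=1, max_pages=100, exclude_globs=None):
    """Build a crawl input dict with conservative depth and page caps.

    Field names are assumed for illustration; check the Actor's input schema.
    """
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if max_depth < 0 or max_pages < 1:
        raise ValueError("max_depth must be >= 0 and max_pages must be >= 1")

    crawl_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "maxCrawlDepth": max_depth,  # 0 = only the seed URLs
        "maxCrawlPages": max_pages,  # hard cap to control cost
    }
    if exclude_globs:
        # Optional scope tightening for large sites (assumed field name).
        crawl_input["excludeUrlGlobs"] = [{"glob": g} for g in exclude_globs]
    return crawl_input


example_input = build_crawl_input(
    ["https://docs.example.com"], max_depth=2, max_pages=100
)
print(json.dumps(example_input, indent=2))
```

Starting with small caps as defaults keeps a mistyped start URL from turning into an expensive run.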
Step 3: Choose your output format
| Format | Best for |
|---|---|
| Markdown | Chunking + embeddings, chat context, notebooks |
| HTML | Layout-aware processing or downstream HTML tools |
| Plain text | Simple NLP, keyword stats, lightweight pipelines |
Set the format in the Actor configuration before you start the run.
Step 4: Start the crawl
Click Start. Monitor the log for rate limits, auth walls, or bot challenges. Long runs may finish with partial data if they hit your max pages. That is expected when you cap scope deliberately.
Step 5: Download your content
Open Output or Storage → Datasets. Each row is typically one URL with extracted body content and metadata.
Export JSON, CSV, or Excel, or read the dataset with the Apify API for automated pipelines.
| Field | Description | Example |
|---|---|---|
| url | Page URL | https://docs.example.com/getting-started |
| title | Page title | Getting Started Guide |
| text | Extracted content (Markdown/text/HTML) | # Getting Started\n\nWelcome to our platform... |
| metadata.description | Meta description | Learn how to get started with our platform |
| metadata.languageCode | Language code | en |
| crawl.depth | Crawl depth from start URL | 2 |
| crawl.loadedAt | When the page was crawled | 2026-02-10T14:30:00.000Z |
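For automated pipelines, the rows can be read from the dataset items endpoint of the Apify API and flattened for spreadsheets or notebooks. The following is a minimal sketch using only the Python standard library. It assumes the row layout shown in the table above (your actual rows may use different field names), and the helpers `dataset_items_url`, `fetch_rows`, and `flatten_row` are hypothetical names introduced here.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, fmt="json"):
    """Build the dataset items URL; clean=true skips empty and hidden fields."""
    query = urllib.parse.urlencode({"format": fmt, "clean": "true"})
    safe_id = urllib.parse.quote(dataset_id, safe="")
    return f"{API_BASE}/datasets/{safe_id}/items?{query}"


def fetch_rows(dataset_id, token):
    """Download all rows as a list of dicts (needs network access and a valid token)."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def flatten_row(row):
    """Flatten nested metadata/crawl fields into one level for CSV-style tools."""
    metadata = row.get("metadata") or {}
    crawl = row.get("crawl") or {}
    return {
        "url": row.get("url"),
        "title": row.get("title"),
        "text": row.get("text", ""),
        "description": metadata.get("description"),
        "language": metadata.get("languageCode"),
        "depth": crawl.get("depth"),
    }
```

A typical call would be `rows = [flatten_row(r) for r in fetch_rows("<your dataset ID>", "<your API token>")]`, with both placeholders taken from your own Apify account.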
Actor highlights
| Topic | Details |
|---|---|
| Output formats | Markdown, HTML, plain text |
| JavaScript | Adaptive rendering: fast HTTP where possible, browser where needed |
| Link discovery | Follows internal links with configurable depth and caps |
| Cleaning | Strips boilerplate such as nav and chrome where the Actor allows |
| AI-oriented Markdown | Structured headings and text friendly for chunking |
| Scale | Suitable for large crawls when you set limits and monitor cost |
Use cases
- RAG: Crawl docs or help centers, chunk Markdown, and embed into Pinecone, Qdrant, Milvus, or Weaviate. Pair this with Data for AI & RAG.
- LLM training / fine-tuning prep: Export consistent plain text or Markdown corpora (respect licensing and robots rules).
- Research & competitive intel: Archive readable article text for analysis without copying unrelated chrome.
- Migration: Move legacy HTML sites into Markdown for a new CMS.
- SEO content QA: Inventory titles, descriptions, and thin pages when combined with your own scoring.
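For the RAG use case, "chunk Markdown" usually means splitting each row's text at headings and capping chunk size before embedding. The sketch below is one simple way to do that, not the Actor's own behavior; `chunk_markdown` and the `max_chars` limit are illustrative choices, and the splitter does not special-case `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown(markdown, max_chars=1000):
    """Split Markdown into chunks at headings, also splitting oversized sections."""
    chunks = []
    current = []
    size = 0
    for line in markdown.splitlines():
        starts_section = bool(HEADING.match(line))
        too_big = size + len(line) + 1 > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current).strip())
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 for the newline that joins lines
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [chunk for chunk in chunks if chunk]


sample = (
    "# Getting Started\n\nWelcome to our platform.\n\n"
    "## Install\n\nRun the installer."
)
for chunk in chunk_markdown(sample):
    print(repr(chunk))
```

Keeping each heading with the text under it tends to give embeddings more context than fixed-width splitting, at the cost of uneven chunk sizes.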
Pricing notes
The Website Content Crawler bills on a pay-per-usage model (you pay for the platform Compute Units a run consumes). README examples often cite ~$0.20–$5.00 per 1,000 pages depending on rendering and site weight: static pages skew cheaper than heavy JS. Reconcile with the live Pricing tab before scaling.
Start with low max pages, confirm row quality, then widen scope.
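To sanity-check a page cap before a run, the per-1,000-page range quoted above can be turned into a rough cost bracket. This is only arithmetic on the README-style figures cited in this section, not a quote from the live Pricing tab; `estimate_cost_range` is a hypothetical helper.

```python
def estimate_cost_range(pages, low_per_1k=0.20, high_per_1k=5.00):
    """Return (low, high) estimated cost in USD for a crawl of `pages` pages.

    Defaults mirror the ~$0.20 to $5.00 per 1,000 pages range cited above.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    thousands = pages / 1000
    return thousands * low_per_1k, thousands * high_per_1k


for page_cap in (100, 1_000, 10_000):
    low, high = estimate_cost_range(page_cap)
    print(f"{page_cap:>6} pages: ${low:.2f} to ${high:.2f}")
```

The wide bracket is the point: whether a site lands near the low or high end depends mostly on how much browser rendering it needs, which a small test crawl reveals.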
FAQ
What is the Website Content Crawler?
It is a Store Actor that crawls websites within the limits you set and writes one dataset row per page with full-text content (Markdown, HTML, or plain text) plus metadata. It is optimized for content extraction rather than scraping individual price or SKU fields.
Does it handle JavaScript-rendered pages?
Yes. It uses adaptive rendering: lightweight HTTP fetching when sufficient and browser-based rendering when the page needs JavaScript to expose the main content.
How much does it cost?
The Website Content Crawler uses pay-per-usage billing, so you pay for the Apify platform Compute Units a run consumes. README examples often show roughly $0.20 to $5.00 per thousand pages depending on complexity. Always read the Actor Pricing tab and your workspace billing settings before large jobs.
Can I use the output for LLM and RAG pipelines?
Yes. Markdown output is especially convenient for chunking and embeddings. Combine exported rows with vector databases and orchestration tools such as LangChain or LlamaIndex.
When should I use the Website Content Crawler instead of the Web Scraper?
Use the Website Content Crawler when you need readable article or documentation text across many URLs. Use the Web Scraper (or targeted Actors) when you need structured fields like prices, SKUs, or ratings.
Is it legal to scrape website content?
Legality depends on jurisdiction, site terms, and how you use the data. This page is not legal advice. Review your compliance obligations and see our Is Apify legal? overview for general guidance.
What is the difference between crawling and scraping?
Crawling discovers URLs by following links (within your depth and domain rules). Scraping extracts data from each visited page. This Actor does both: it crawls according to your settings and scrapes main content from each page it visits.
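The crawl/scrape split, and the effect of the depth and page caps, can be illustrated with a toy breadth-first crawl over an in-memory link graph. This is a conceptual sketch only, not how the Actor is implemented; `SITE` and `crawl` are made-up names, and a real crawler would fetch pages over HTTP where the comment marks the scraping step.

```python
from collections import deque

# Toy site: each page maps to the internal links found on it.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/a", "/"],
    "/blog": ["/blog/1"],
}


def crawl(link_graph, start_urls, max_depth, max_pages):
    """Visit pages breadth-first, honoring a depth limit and a hard page cap."""
    unique_starts = list(dict.fromkeys(start_urls))  # dedupe, keep order
    seen = set(unique_starts)
    queue = deque((url, 0) for url in unique_starts)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))  # scraping: extract content from this page
        if depth >= max_depth:
            continue  # depth cap reached: do not follow links from this page
        for link in link_graph.get(url, []):  # crawling: discover new URLs
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl(SITE, ["/"], max_depth=0, max_pages=10))
print(crawl(SITE, ["/"], max_depth=2, max_pages=3))
```

With `max_depth=0` only the seed page is visited; with a page cap of 3 the crawl stops early even though deeper links were discovered, which mirrors the "partial data" behavior described in Step 4.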





