
How to scrape website content (no-code tutorial)

Crawl sites and export clean page text for LLMs, retrieval systems, and research, without maintaining your own headless browser farm.

Quick Answer

Apify's Website Content Crawler extracts full-text content from websites in Markdown or HTML format. It handles JavaScript-rendered pages and is optimized for LLM and RAG use cases.

You run the Website Content Crawler inside an Apify account. The free plan includes monthly credits for small and medium tests; check the Actor Pricing tab before large crawls.

Time: ~5 min
Cost: ~$0.20–$5.00 per 1,000 pages
Difficulty: Beginner
Output: Markdown / HTML / Text / JSON

The Website Content Crawler is one of Apify's most-used Actors for full-page content (not single-field product scraping). Store stats and README pricing hints were last checked in May 2026; verify live numbers on the Actor page.

For a code-first alternative, Apify’s team publishes a beginner website scraping walkthrough on their blog: How to scrape any website for beginners.

In this guide you will:

  • Configure a crawl with safe depth and page caps
  • Pick the right output format for RAG, LLM ingestion, or research
  • Download structured rows for spreadsheets, notebooks, or vector pipelines

What you need

  • An Apify account. The free plan includes $5 in monthly credits, enough for many documentation-sized crawls.

Step 1: Open the Website Content Crawler

In the Apify Store, search for Website Content Crawler, or open the Actor directly: Website Content Crawler →.

Choose Try me for free to open the input form.

Step 2: Configure start URLs and scope

  1. Start URLs: e.g. https://docs.example.com (add multiple roots if needed).
  2. Max crawl depth: how many link hops to follow from each start URL (0 = only the start URLs themselves).
  3. Max pages: hard cap to control cost (e.g. 100 for a tutorial).
  4. Optional: include/exclude patterns if the Actor input supports globs or regex. Tighten them for large sites.
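
The settings above can also be written down as a run input for API-driven runs. The sketch below is a minimal illustration, not the Actor's documented schema: the key names (`startUrls`, `maxCrawlDepth`, `maxCrawlPages`, `includeUrlGlobs`, `excludeUrlGlobs`) are assumptions that should be checked against the Actor's Input tab, and `build_crawl_input` is a hypothetical helper.

```python
import json


def build_crawl_input(start_urls, max_depth=1, max_pages=100,
                      include_globs=None, exclude_globs=None):
    """Build a run-input dict with explicit depth and page caps.

    NOTE: the key names below are assumptions for illustration; confirm
    them on the Actor's Input tab before relying on them.
    """
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if max_depth < 0 or max_pages < 1:
        raise ValueError("max_depth must be >= 0 and max_pages >= 1")

    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "maxCrawlDepth": max_depth,   # 0 = only the start URLs
        "maxCrawlPages": max_pages,   # hard cap to control cost
    }
    # Optional scope patterns are only added when provided.
    if include_globs:
        run_input["includeUrlGlobs"] = [{"glob": g} for g in include_globs]
    if exclude_globs:
        run_input["excludeUrlGlobs"] = [{"glob": g} for g in exclude_globs]
    return run_input


if __name__ == "__main__":
    example = build_crawl_input(
        ["https://docs.example.com"],
        max_depth=2,
        max_pages=100,
        exclude_globs=["https://docs.example.com/changelog/**"],
    )
    print(json.dumps(example, indent=2))
```

Validating the caps before a run starts is a cheap way to avoid launching an unbounded crawl by accident.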

Step 3: Choose your output format

| Format | Best for |
|---|---|
| Markdown | Chunking + embeddings, chat context, notebooks |
| HTML | Layout-aware processing or downstream HTML tools |
| Plain text | Simple NLP, keyword stats, lightweight pipelines |

Set the format in the Actor configuration before you start the run.

Step 4: Start the crawl

Click Start. Monitor the log for rate limits, auth walls, or bot challenges. Long runs may finish with partial data if they hit your max pages. That is expected when you cap scope deliberately.

Step 5: Download your content


Open Output or Storage → Datasets. Each row is typically one URL with extracted body content and metadata.

Export JSON, CSV, or Excel, or read the dataset with the Apify API for automated pipelines.
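
For automated pipelines, reading the dataset over HTTP might look like the sketch below. It assumes the Apify API v2 dataset items endpoint (`/v2/datasets/{datasetId}/items`) with `format`, `clean`, `limit`, and `offset` query parameters and bearer-token authentication; verify these against the current API reference. The dataset ID and token are placeholders, and both helper functions are hypothetical names.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, fmt="json", limit=None, offset=0):
    """Build the URL for a dataset's items (parameter names assumed, see lead-in)."""
    query = {"format": fmt, "clean": "true", "offset": offset}
    if limit is not None:
        query["limit"] = limit
    quoted_id = urllib.parse.quote(dataset_id, safe="")
    return f"{API_BASE}/datasets/{quoted_id}/items?{urllib.parse.urlencode(query)}"


def fetch_items(dataset_id, token, limit=None):
    """Download dataset rows as a list of dicts. Requires network access."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id, limit=limit),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder ID: replace with the dataset ID shown under Storage -> Datasets.
    print(dataset_items_url("YOUR_DATASET_ID", limit=10))
```

Keeping URL construction separate from the network call makes the request easy to inspect before spending any credits.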

Output Data Fields

| Field | Description | Example |
|---|---|---|
| url | Page URL | https://docs.example.com/getting-started |
| title | Page title | Getting Started Guide |
| text | Extracted content (Markdown/text/HTML) | # Getting Started\n\nWelcome to our platform... |
| metadata.description | Meta description | Learn how to get started with our platform |
| metadata.languageCode | Language code | en |
| crawl.depth | Crawl depth from start URL | 2 |
| crawl.loadedAt | When the page was crawled | 2026-02-10T14:30:00.000Z |
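
When rows go into spreadsheets or notebooks, the dotted field names above usually have to be pulled out of nested JSON. The sketch below assumes that dotted names such as `metadata.description` correspond to nested objects in the exported row; the field list is copied from the table, the sample row is built from its example values, and `get_path` and `flatten_row` are hypothetical helpers. Field names in a live dataset may differ, so check one real row first.

```python
FIELDS = [
    "url",
    "title",
    "text",
    "metadata.description",
    "metadata.languageCode",
    "crawl.depth",
    "crawl.loadedAt",
]


def get_path(row, dotted, default=None):
    """Follow a dotted path through nested dicts, returning default if absent."""
    current = row
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def flatten_row(row, fields=FIELDS):
    """Turn one nested dataset row into a flat dict keyed by dotted field name."""
    return {field: get_path(row, field) for field in fields}


# Sample row assembled from the example column of the table (illustrative only).
sample = {
    "url": "https://docs.example.com/getting-started",
    "title": "Getting Started Guide",
    "text": "# Getting Started\n\nWelcome to our platform...",
    "metadata": {
        "description": "Learn how to get started with our platform",
        "languageCode": "en",
    },
    "crawl": {"depth": 2, "loadedAt": "2026-02-10T14:30:00.000Z"},
}

if __name__ == "__main__":
    for name, value in flatten_row(sample).items():
        print(f"{name}: {value!r}")
```

Missing fields come back as `None` instead of raising, which keeps a long export from failing on one sparse page.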

Actor highlights

| Topic | Details |
|---|---|
| Output formats | Markdown, HTML, plain text |
| JavaScript | Adaptive rendering: fast HTTP where possible, browser where needed |
| Link discovery | Follows internal links with configurable depth and caps |
| Cleaning | Strips boilerplate such as nav and chrome where the Actor allows |
| AI-oriented Markdown | Structured headings and text friendly for chunking |
| Scale | Suitable for large crawls when you set limits and monitor cost |

Use cases

  • RAG: Crawl docs or help centers, chunk Markdown, and embed into Pinecone, Qdrant, Milvus, or Weaviate. Pair this with Data for AI & RAG.
  • LLM training / fine-tuning prep: Export consistent plain text or Markdown corpora (respect licensing and robots rules).
  • Research & competitive intel: Archive readable article text for analysis without copying unrelated chrome.
  • Migration: Move legacy HTML sites into Markdown for a new CMS.
  • SEO content QA: Inventory titles, descriptions, and thin pages when combined with your own scoring.
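
For the RAG use case, "chunk Markdown" can be as simple as splitting on headings before embedding. The sketch below is one possible approach under stated assumptions, not the Actor's or any vector database's own method: `chunk_markdown` is a hypothetical helper, `max_chars` is an arbitrary size limit, and lines starting with `#` inside code fences would be treated as headings.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then some text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown(markdown, max_chars=1000):
    """Split Markdown into heading-led chunks, kept under max_chars where possible."""
    sections, current = [], []
    for line in markdown.splitlines():
        # A new heading closes the section collected so far.
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized sections are split further on paragraph breaks.
        # A single paragraph longer than max_chars is kept whole.
        buffer = ""
        for paragraph in section.split("\n\n"):
            candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
            if len(candidate) > max_chars and buffer:
                chunks.append(buffer)
                buffer = paragraph
            else:
                buffer = candidate
        if buffer:
            chunks.append(buffer)
    return chunks


if __name__ == "__main__":
    doc = "# Getting Started\n\nWelcome to our platform.\n\n## Install\n\nRun the installer."
    for i, chunk in enumerate(chunk_markdown(doc)):
        print(i, repr(chunk))
```

Each chunk can then be stored alongside its source `url` so retrieved passages can be traced back to the page they came from.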

Pricing notes

The Website Content Crawler bills on a pay-per-usage model (you pay for the platform Compute Units a run consumes). README examples often cite ~$0.20–$5.00 per 1,000 pages depending on rendering and site weight: static pages skew cheaper than heavy JS. Reconcile with the live Pricing tab before scaling.
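
A rough budget check before scaling can be done with simple arithmetic on the range quoted above. The sketch below treats the ~$0.20–$5.00 per 1,000 pages figures as inputs, not guarantees; both function names are hypothetical, and real spend depends on the live Pricing tab and on whatever else your credits cover.

```python
def estimate_cost_usd(pages, low_per_1k=0.20, high_per_1k=5.00):
    """Return a (low, high) cost estimate in USD for a crawl of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    thousands = pages / 1000
    return thousands * low_per_1k, thousands * high_per_1k


def pages_within_budget(budget_usd, rate_per_1k):
    """Return how many whole pages a budget covers at a given rate per 1,000 pages."""
    if rate_per_1k <= 0:
        raise ValueError("rate_per_1k must be positive")
    return int(budget_usd / rate_per_1k * 1000)


if __name__ == "__main__":
    low, high = estimate_cost_usd(100)
    print(f"100 pages: ${low:.2f} to ${high:.2f}")
    print("Pages covered by $5 at $5.00 per 1,000:", pages_within_budget(5.0, 5.00))
```

Planning against the high end of the range is the safer default for sites that need browser rendering.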

Run your first crawl

Start with low max pages, confirm row quality, then widen scope.



Open the Website Content Crawler on Apify →

Frequently Asked Questions

What is the Website Content Crawler?

It is a Store Actor that crawls websites within the limits you set and writes one dataset row per page with full-text content (Markdown, HTML, or plain text) plus metadata. It is optimized for content extraction rather than scraping individual price or SKU fields.

Does it handle JavaScript-rendered pages?

Yes. It uses adaptive rendering: lightweight HTTP fetching when sufficient and browser-based rendering when the page needs JavaScript to expose the main content.

How much does it cost?

The Website Content Crawler uses pay-per-usage billing, so you pay for the Apify platform Compute Units a run consumes. README examples often show roughly $0.20 to $5.00 per thousand pages depending on complexity. Always read the Actor Pricing tab and your workspace billing settings before large jobs.

Can I use the output for RAG and LLM pipelines?

Yes. Markdown output is especially convenient for chunking and embeddings. Combine exported rows with vector databases and orchestration tools such as LangChain or LlamaIndex.

When should I use the Website Content Crawler instead of the Web Scraper?

Use the Website Content Crawler when you need readable article or documentation text across many URLs. Use the Web Scraper (or targeted Actors) when you need structured fields like prices, SKUs, or ratings.

Is scraping website content legal?

Legality depends on jurisdiction, site terms, and how you use the data. This page is not legal advice. Review your compliance obligations and see our Is Apify legal? overview for general guidance.

What is the difference between crawling and scraping?

Crawling discovers URLs by following links (within your depth and domain rules). Scraping extracts data from each visited page. This Actor does both: it crawls according to your settings and scrapes main content from each page it visits.
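
To make the crawl/scrape distinction concrete, here is a toy breadth-first crawl over an in-memory link graph with a depth limit and a page cap. It is an illustration of the general technique only, not the Actor's implementation: the `SITE` graph is invented, `crawl` is a hypothetical function, and no network requests are made.

```python
from collections import deque

# Invented link graph standing in for a small site: page -> outgoing links.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/a", "/"],
    "/blog": [],
    "/docs/a": [],
}


def crawl(start_urls, get_links, max_depth=1, max_pages=100):
    """Visit pages breadth-first, honoring a depth limit and a page cap.

    Returns a list of (url, depth) tuples in visit order.
    """
    start_urls = list(dict.fromkeys(start_urls))  # drop duplicates, keep order
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        # Scraping step: a real crawler would extract page content here.
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        # Crawling step: discover new URLs by following links.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    for url, depth in crawl(["/"], lambda u: SITE.get(u, []), max_depth=2):
        print(depth, url)
```

With `max_depth=0` only the start URL is visited, which matches the depth setting described in Step 2; lowering `max_pages` cuts the visit list short even when more links were discovered.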
