
Self-Hosting Guide

Self-hosting earns its keep in three cases: compliance or residency constraints that rule out managed platforms, stable high-volume workloads where compute-unit (CU) billing exceeds infrastructure cost, and integrations with internal data stores that can't be exposed externally. For most teams below roughly $500/mo of Apify usage, self-hosting is more expensive once you factor in engineering time. The cheapest line on your invoice is the one you don't have to maintain.

If one of the three cases applies, this guide covers the stack.

When self-hosting makes sense

  • Compliance or residency: HIPAA, GDPR Articles 44–49 transfer restrictions, public-sector procurement rules, or a security review that won't clear a US-based SaaS.
  • Stable high-volume workloads: sustained 100+ concurrent runs or 1,000+ CU/day, where managed billing is dominated by compute you could run yourself at a fraction of the price.
  • Internal-system integration: direct database access, VPC-only services, or pipelines that must never leave your private network.

Outside these cases, Apify's managed platform is usually the cheaper total-cost option once you include ops time, security patching, and incident response.

Reference architecture

A production self-hosted scraping stack has five layers. Build them in order: ingress and execution first, observability before you add any real load.

1. Ingress layer

  • Scheduler: cron inside a container, Kubernetes CronJob, Temporal, or your own workflow engine. n8n (self-hosted Community Edition) also works as a scheduler front-end.
  • API gateway / job submission: authenticated endpoint that validates input schema before enqueuing. Fail fast at the edge so bad input never consumes worker capacity.
  • Rate limiting and quotas per tenant if you're exposing this to multiple teams.
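
A minimal sketch of the fail-fast validation described above, using only the standard library. The field names (tenant, start_url, max_pages) and the 10,000-page ceiling are hypothetical, chosen for illustration; a real gateway would validate against your own input schema.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


class InvalidJob(ValueError):
    """Raised when a submitted job fails validation at the gateway."""


@dataclass(frozen=True)
class ScrapeJob:
    # Hypothetical job shape; replace with your own input schema.
    tenant: str
    start_url: str
    max_pages: int


def validate_job(payload: dict) -> ScrapeJob:
    """Validate raw JSON input and return a typed job, or raise InvalidJob."""
    tenant = payload.get("tenant")
    if not isinstance(tenant, str) or not tenant:
        raise InvalidJob("tenant must be a non-empty string")

    start_url = payload.get("start_url")
    if not isinstance(start_url, str):
        raise InvalidJob("start_url must be a string")
    try:
        parts = urlsplit(start_url)
    except ValueError as exc:  # e.g. malformed IPv6 literal
        raise InvalidJob(f"start_url is malformed: {exc}") from exc
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise InvalidJob("start_url must be an absolute http(s) URL")

    max_pages = payload.get("max_pages", 100)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise InvalidJob("max_pages must be an integer")
    if not 1 <= max_pages <= 10_000:
        raise InvalidJob("max_pages must be between 1 and 10000")

    return ScrapeJob(tenant=tenant, start_url=start_url, max_pages=max_pages)
```

Only a job that passes validate_job would be enqueued; anything else is rejected with a 4xx response before it touches a worker.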

2. Execution layer

  • Docker containers as the unit of work. One crawler image per scraper family (HTTP-only, browser-based), parameterized by environment variables or JSON input.
  • Resource limits per container: CPU shares, memory limits (critical for Playwright, since headless Chromium will happily OOM a 2 GB container), and wall-clock timeouts.
  • Orchestrator: Docker Compose for small deployments, Kubernetes or Nomad once you're running concurrent workers across multiple hosts. Swarm is viable but has lost momentum since 2023.
  • Crawlee (Node or Python) and Scrapy are the two mature open-source frameworks. For browser automation, Playwright, Patchright (a Playwright drop-in patched against CDP leaks, available for Python and Node.js), and Nodriver are current. Apify publishes base Docker images under apify/actor-node-playwright-chrome and similar tags, which ship with the browser preinstalled.
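
As a rough illustration of per-container limits on a single host, the sketch below builds a docker run command with a memory cap and CPU limit and enforces a wall-clock timeout from the calling process. The image name, container name, and default limits are assumptions; on Kubernetes or Nomad the same limits would live in the job spec instead.

```python
import subprocess


def build_docker_cmd(image, name, env, memory="2g", cpus=1.0):
    """Build a `docker run` command with memory and CPU limits."""
    cmd = [
        "docker", "run", "--rm",
        "--name", name,
        "--memory", memory,    # hard memory cap for the container
        "--cpus", str(cpus),   # CPU limit, in cores
    ]
    for key, value in sorted(env.items()):
        cmd += ["--env", f"{key}={value}"]
    cmd.append(image)
    return cmd


def run_with_timeout(cmd, name, timeout_s):
    """Run the container; return its exit code, or None if it timed out."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # The timeout only kills the docker CLI process; the container
        # keeps running under the daemon, so stop it explicitly.
        subprocess.run(["docker", "kill", name], check=False)
        return None


if __name__ == "__main__":
    # Hypothetical image and input; requires a local Docker daemon.
    cmd = build_docker_cmd(
        "registry.example.com/http-crawler:1.0.0",
        name="crawl-job-1",
        env={"START_URL": "https://example.com/"},
    )
    print(run_with_timeout(cmd, "crawl-job-1", timeout_s=900))
```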

3. State layer

  • Request queue: Redis (with RQ, BullMQ, or a Crawlee RequestQueue backed by Redis), PostgreSQL with SELECT ... FOR UPDATE SKIP LOCKED, or SQS. Redis is fastest; Postgres is easiest to back up and audit.
  • Datasets / artifact storage: S3 or S3-compatible (Cloudflare R2, MinIO) for JSON/Parquet outputs. PostgreSQL for structured outputs you'll query directly. Managed databases are worth it here: failover and PITR are not projects you want to own.
  • Key-value state / session store: Redis for ephemeral session pools, PostgreSQL or a managed KV for persistent cross-run state.
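
The PostgreSQL option can be sketched as a single claim statement: each worker locks one pending row and skips rows other workers already hold. The request_queue table and its columns (id, url, status, claimed_at) are hypothetical, and conn is assumed to be a DB-API connection to PostgreSQL, for example one opened with psycopg.

```python
# Hypothetical table: request_queue(id, url, status, claimed_at).
CLAIM_SQL = """
UPDATE request_queue
SET status = 'in_progress', claimed_at = now()
WHERE id = (
    SELECT id
    FROM request_queue
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, url;
"""


def claim_next(conn):
    """Claim one pending request; return (id, url), or None if the queue is empty.

    `conn` is assumed to be a DB-API connection to PostgreSQL.
    """
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
    # Committing releases the row lock; the status change keeps the
    # request from being claimed again.
    conn.commit()
    return row
```

Because locked rows are skipped rather than waited on, concurrent workers never block each other on the same request. A production queue would also need a way to reclaim rows whose worker died mid-run, which this sketch leaves out.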

4. Observability layer

Don't defer this. It is far cheaper to ship it on day one than to retrofit during an incident.

  • Logs: structured JSON to stdout, shipped via Vector or Fluent Bit to Loki, Elasticsearch, or a managed log store (Datadog, Better Stack, Axiom).
  • Metrics: Prometheus + Grafana for infra and per-scraper counters (requests/min, success rate, ban rate, bytes per run).
  • Traces: OpenTelemetry if your workflows span services. Not strictly required for a single-host deployment.
  • Alerts: success-rate drops, ban-rate spikes, queue depth, per-target 4xx/5xx ratios. Alerting should be target-aware, because a Cloudflare update can degrade one scraper without affecting others.
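
A small sketch of two pieces from the list above: structured JSON logs on stdout, and a target-aware success-rate check. The field names, the 90% threshold, and the 50-request minimum are assumptions for illustration; in practice the alert rule would usually live in Prometheus or your log store rather than in application code.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def degraded_targets(counts, min_success_rate=0.9, min_requests=50):
    """Return targets whose success rate fell below the threshold.

    `counts` maps target -> (successful_requests, total_requests).
    Targets with too few requests are skipped to avoid noisy alerts.
    """
    return sorted(
        target
        for target, (ok, total) in counts.items()
        if total >= min_requests and ok / total < min_success_rate
    )


if __name__ == "__main__":
    log = make_logger("scraper")
    log.info("run finished", extra={"fields": {"target": "example.com", "ok": 412, "total": 500}})
    print(degraded_targets({"a.example": (480, 500), "b.example": (300, 500)}))
```

Evaluating the rate per target is what lets one degraded scraper raise an alert without being averaged away by the healthy ones.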

5. Security layer

  • Secrets: HashiCorp Vault, AWS Secrets Manager, or Doppler. Never bake credentials into images or env files checked into git.
  • Network policies: egress allowlists per scraper (you almost never need unrestricted outbound), private subnets for state stores.
  • Access audit: who deployed what, who ran which job against which target. SIEM integration if you have one.
  • Proxy credentials belong in the secrets store, not the image. Rotate them.
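
Egress allowlists are enforced at the network layer, but a matching check inside the scraper is a cheap second line of defense. The sketch below is an assumption about how such a check might look, not a replacement for network policies; the host names are placeholders.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_hosts: set[str]) -> bool:
    """Return True if the URL's host is an allowed host or a subdomain of one."""
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:
        # Malformed URLs are never allowed.
        return False
    if not host:
        return False
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


if __name__ == "__main__":
    allow = {"example.com"}
    print(is_allowed("https://shop.example.com/p/1", allow))
    print(is_allowed("https://example.com.evil.test/", allow))
```

Matching on the full host or a dot-prefixed suffix avoids the common bug where a plain substring or endswith check also accepts look-alike domains.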

Hardware sizing (starting points)

These are working starting points, not minimums. Actual needs vary by target and concurrency.

Workload | CPU | RAM | Disk | Notes
HTTP-only scraping, single worker | 2 vCPU | 2 GB | 20 GB | httpx/curl-cffi + Crawlee/Scrapy
Browser scraping, single worker | 4 vCPU | 8 GB | 40 GB | One Chromium process ≈ 1 GB steady-state
Browser scraping, 4 concurrent workers | 8 vCPU | 16 GB | 80 GB | Headroom matters; Chromium spikes under load
n8n self-hosted (small team) | 2 vCPU | 2 GB | 20 GB | Queue mode requires Redis and a worker
PostgreSQL for scraped data | 2 vCPU | 4 GB | 100+ GB SSD | Managed database recommended

Browser scraping memory is the usual surprise: plan for 1 GB per concurrent Chromium instance plus overhead for the Node/Python process, your queue client, and OS. An 8 GB box realistically runs 4 to 5 concurrent Playwright workers, not 8.
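
The arithmetic behind that estimate can be sketched as follows. The 1 GB per Chromium instance comes from the text; the OS reserve and per-worker process overhead are assumed values you should replace with measurements from your own hosts.

```python
def max_browser_workers(
    total_mb: int,
    os_reserve_mb: int = 1536,        # assumed: OS, queue client, monitoring agents
    chromium_mb: int = 1024,          # ~1 GB per Chromium instance (from the text)
    process_overhead_mb: int = 384,   # assumed: Node/Python process per worker
) -> int:
    """Estimate how many concurrent browser workers fit in a host's RAM."""
    usable = total_mb - os_reserve_mb
    per_worker = chromium_mb + process_overhead_mb
    return max(usable // per_worker, 0)


if __name__ == "__main__":
    print(max_browser_workers(8192))  # 8 GB host, default assumptions
    print(max_browser_workers(8192, os_reserve_mb=1024, process_overhead_mb=256))
```

With the default assumptions an 8 GB host fits 4 workers; with the leaner reserve and overhead in the second call it fits 5, which matches the 4 to 5 range above.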

Hosting providers

  • Liquid Web: managed VPS and dedicated servers with guaranteed resources, root access, and 24/7 support. Fit for teams that want hosting reliability without a DevOps hire.
  • Hetzner: the cheapest raw compute in the EU (CX/CPX VPS tiers, dedicated AX servers). No managed layer; you run the full stack.
  • AWS / GCP / Azure: global regions, spot instances, deep service integration. Pricing varies by region and instance type; data egress charges (EC2 egress on AWS, for example) are the common gotcha for scraping workloads.
  • DigitalOcean / Linode (Akamai Cloud) / Vultr: middle ground on price and operational simplicity.

Self-hosting n8n (concrete steps)

Most teams that self-host a scraping stack also self-host n8n as the workflow layer. The path of least resistance:

  1. Provision a VPS (2 vCPU / 2 GB RAM / 40 GB SSD minimum for a small team).
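  2. Install Docker Engine and Docker Compose. The official convenience script is the quickest route; Docker's own documentation recommends the distribution package repositories for production hosts.
  3. Clone n8n's official Docker Compose setup or use the n8n-docker-caddy pattern, where Caddy provides automatic TLS.
  4. Set N8N_HOST, WEBHOOK_URL, and N8N_ENCRYPTION_KEY, and protect the editor with a strong login. N8N_BASIC_AUTH_PASSWORD applies only to pre-1.0 releases; n8n 1.0 and later dropped basic auth in favor of the built-in owner account created on first launch. The encryption key must be persisted, since losing it invalidates every stored credential.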
  5. Back up the .n8n volume (SQLite workflows, encrypted credentials) nightly to S3 or a managed backup.
  6. Enable queue mode with Redis once you need more than a handful of concurrent executions. Single-main mode hits concurrency limits quickly.
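
One way to handle the encryption key from step 4 is to generate it once and reuse it on every later deploy. The sketch below assumes a hypothetical key file path and domain; the environment variable names are the ones listed above.

```python
import secrets
from pathlib import Path


def load_or_create_key(path: Path) -> str:
    """Return the persisted encryption key, generating it on first use."""
    if path.exists():
        return path.read_text().strip()
    key = secrets.token_hex(32)  # 64 hex characters
    path.write_text(key + "\n")
    path.chmod(0o600)  # readable by the owner only
    return key


def build_n8n_env(domain: str, encryption_key: str) -> dict[str, str]:
    """Build the core n8n environment variables for a given domain."""
    return {
        "N8N_HOST": domain,
        "WEBHOOK_URL": f"https://{domain}/",
        "N8N_ENCRYPTION_KEY": encryption_key,
    }


def render_env_file(env: dict[str, str]) -> str:
    """Render variables in KEY=VALUE form for a Compose env file."""
    return "".join(f"{key}={value}\n" for key, value in env.items())


if __name__ == "__main__":
    # Hypothetical path and domain; keep the key file out of git.
    key = load_or_create_key(Path("n8n_encryption.key"))
    print(render_env_file(build_n8n_env("n8n.example.com", key)))
```

Back up the key file alongside the .n8n volume: a restored database is unreadable without the key that encrypted its credentials.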

Community Edition has no execution cap: you pay for your VPS, not per run. Business features (SSO, version control, multi-environment) require a paid license starting at €667/mo.

Apify SDK and Crawlee: running "Actors" on your own infra

An Apify Actor is, at runtime, a Docker container with a standardized input schema and a dataset/KV-store interface. You can run Actors outside Apify's platform with minimal changes:

  • Crawlee handles the crawler logic and ships identically in managed and self-hosted environments.
  • Apify SDK (published as the apify package on both PyPI and npm) talks to Apify's API by default but can be configured to use local storage (APIFY_LOCAL_STORAGE_DIR) for datasets and key-value stores.
  • Request queues fall back to local SQLite by default; swap to Redis for multi-worker deployments.

This gives you a migration path in both directions: develop against local storage, deploy to Apify for convenience, or deploy to your own Kubernetes cluster when compliance demands it. The Actor code is the same.
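
To make the runtime contract concrete, here is a standard-library sketch of what an Actor does with local storage: read a JSON input, then append records to a dataset directory. It is not the Apify SDK. The directory layout (key_value_stores/default/INPUT.json, datasets/default/NNNNNNNNN.json) and the ./storage fallback are assumptions modeled on the SDK's local storage, so check them against the SDK version you run.

```python
import json
import os
from pathlib import Path


def storage_dir() -> Path:
    """Resolve the local storage root from APIFY_LOCAL_STORAGE_DIR."""
    return Path(os.environ.get("APIFY_LOCAL_STORAGE_DIR", "./storage"))


def read_input(root: Path) -> dict:
    """Read the run input from the default key-value store (assumed layout)."""
    path = root / "key_value_stores" / "default" / "INPUT.json"
    return json.loads(path.read_text()) if path.exists() else {}


def push_item(root: Path, item: dict) -> Path:
    """Append one record to the default dataset as a numbered JSON file."""
    dataset = root / "datasets" / "default"
    dataset.mkdir(parents=True, exist_ok=True)
    index = len(list(dataset.glob("*.json"))) + 1
    path = dataset / f"{index:09d}.json"
    path.write_text(json.dumps(item))
    return path


if __name__ == "__main__":
    root = storage_dir()
    run_input = read_input(root)
    push_item(root, {"url": run_input.get("start_url"), "status": "ok"})
```

In real Actor code the SDK provides these operations; the sketch only shows why the same container can run unchanged wherever that storage interface is available.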

Docker fluency is the single highest-leverage skill for self-hosted scraping. If you're not already comfortable with images, volumes, networks, and Compose files, this is where to invest first.

Operational checklist

  • Standardize deployment with versioned Docker images and a CI pipeline (GitHub Actions, GitLab CI) that pushes to your own registry.
  • Enforce per-job limits (CPU shares, memory cap, wall-clock timeout, max retries) at the orchestrator level.
  • Track unit economics per scraper: cost per successful run, cost per extracted record, ban rate over time.
  • Write runbooks for: target block spike, source HTML change, proxy provider outage, queue backlog. The first time you need them is the worst time to write them.
  • Test backups by restoring, quarterly. An untested backup is a hope.
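
The unit-economics item reduces to a few ratios per scraper. This sketch assumes you already collect the counters and costs shown; the field names and the sample figures are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ScraperStats:
    # Hypothetical per-period counters and costs for one scraper.
    successful_runs: int
    records: int
    blocked_requests: int
    total_requests: int
    infra_cost: float
    proxy_cost: float


def unit_economics(s: ScraperStats) -> dict:
    """Compute cost per successful run, cost per record, and ban rate."""
    total_cost = s.infra_cost + s.proxy_cost
    return {
        "cost_per_successful_run": total_cost / s.successful_runs if s.successful_runs else None,
        "cost_per_record": total_cost / s.records if s.records else None,
        "ban_rate": s.blocked_requests / s.total_requests if s.total_requests else None,
    }


if __name__ == "__main__":
    stats = ScraperStats(
        successful_runs=80, records=40_000,
        blocked_requests=500, total_requests=10_000,
        infra_cost=30.0, proxy_cost=50.0,
    )
    print(unit_economics(stats))
```

Dividing by successful runs rather than all runs is deliberate: failed runs still cost money, so they should raise the unit cost instead of diluting it.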

Frequently Asked Questions

When should I self-host instead of using a managed platform?

Self-host when one of three conditions applies: compliance or data residency rules prevent using a managed US-based platform, you have stable high-volume workloads where compute unit billing exceeds the cost of running the same compute yourself, or your pipeline needs direct access to internal systems that can't be exposed externally. Below roughly $500/mo of Apify usage, managed is usually cheaper once you include engineering time.

Which hosting provider should I choose?

Depends on what you're optimizing. Liquid Web is the strongest choice when you want managed security, guaranteed resources, and 24/7 support without a DevOps team. Hetzner is the cheapest raw compute in Europe. AWS, GCP, and Azure are best when you need global regions, spot pricing, or deep integration with their service ecosystems, but watch egress costs on scraping workloads.

How do I self-host n8n?

Provision a 2 vCPU / 2 GB VPS, install Docker, and deploy n8n's official docker-compose stack behind Caddy (which handles TLS automatically). Set the encryption key, secure the editor login (the owner account on n8n 1.0 and later, basic auth on older releases), and back up the .n8n volume nightly. Once you outgrow single-main mode, enable queue mode with Redis. Expect about 20 minutes for a first deploy plus another hour for TLS, backups, and credential hygiene.

Can I run Apify Actors on my own infrastructure?

Yes. An Actor is a Docker container with a standardized input schema and storage interface. Set APIFY_LOCAL_STORAGE_DIR for local dataset storage, use Redis for the request queue in multi-worker setups, and the same code runs on your own infrastructure. This gives you a clean migration path: develop locally, deploy to Apify for convenience, or move to your own cluster when compliance requires it.

How much memory does browser scraping need?

Plan for approximately 1 GB per concurrent Chromium instance at steady state, plus overhead for your Python/Node process, queue client, and OS. An 8 GB host realistically runs 4 to 5 concurrent browser workers, not 8. Memory is the most common self-hosting surprise, and also the cheapest problem to fix (larger instance).

Common mistakes and fixes

Self-hosted runs are unstable under load.

Separate queue, worker, and storage layers, then scale each independently.

Ops overhead is higher than expected.

Automate deploys, backups, and health checks before adding more workloads.
