
Web Scraping with Python for Beginners (2026 Guide)

· 7 min read
Yassine El Haddad
Software Developer & Automation Specialist

I build production AI agents, web scrapers, and automation pipelines. Most of what I publish here comes from the actual problems they run into: proxies that get banned, anti-bot stacks that fingerprint your client, RAG that drifts when the underlying data moves. Stack: Python, TypeScript, Go, FastAPI, LangChain, Crawlee, Playwright, deployed on AWS, GCP, and Cloudflare.

Python is the most popular language for collecting web data because it pairs readable syntax with strong libraries for HTTP, HTML parsing, browsers, and large-scale crawling. This guide explains when to use each approach, gives copy-paste examples, and shows where Apify fits if you want no-code or fully managed scraping.

Quick Answer

Python web scraping uses requests + BeautifulSoup for static pages, Playwright or Selenium for JavaScript-rendered pages, and Scrapy for large-scale crawling. For no-code scraping, use Apify.

If you already know you want cloud runs, proxies, and scheduling without maintaining code, open the Apify Store and pick a ready-made Actor for your target site.

Comparison table

| Approach | JavaScript rendering | Scale | Learning curve | Best for |
| --- | --- | --- | --- | --- |
| requests + BeautifulSoup | No | Low–medium | Easiest | Static HTML, APIs, prototypes |
| httpx + BeautifulSoup (async) | No | Medium–high | Medium | Many static URLs in parallel |
| Playwright | Yes | Medium | Medium | SPAs, logins, heavy dynamic UI |
| Selenium | Yes | Medium | Medium | Legacy stacks, teams already on Selenium |
| Scrapy | Via extensions | Very high | Steeper | Broad crawls, pipelines, scheduling |
| Apify (no-code / Actors) | Yes (Actor-dependent) | High (cloud) | Easiest for delivery | Production without owning infra |

When to use each library

requests + BeautifulSoup — static HTML

Use when: the page shows the data you need with JavaScript turned off (view source matches what you see), or the site offers a simple JSON API you can call directly.

Avoid when: the DOM is filled in by React/Vue after load, or the site uses complex bot detection—pure HTTP clients are easy to fingerprint.

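import requests
from bs4 import BeautifulSoup

URL = "https://example.com"

# Identify the client and set a timeout so a stalled server cannot hang the script.
resp = requests.get(
    URL,
    headers={"User-Agent": "Mozilla/5.0 (compatible; ResearchBot/1.0)"},
    timeout=30,
)
resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page

soup = BeautifulSoup(resp.text, "html.parser")
# get_text() also copes with an empty <title>, where .string would be None
title = soup.title.get_text(strip=True) if soup.title else None
# Collect absolute links only (href starts with "http")
links = [a["href"] for a in soup.select('a[href^="http"]')]
print(title, len(links))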

Add retries with backoff and a robots.txt check before production. For many URLs, switch to async httpx or Scrapy so the script does not wait on the network one page at a time.
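The retry, backoff, and robots.txt advice above can be sketched with the standard library alone. The helper names `allowed_by_robots` and `with_retries` and the delay values are illustrative assumptions, not part of requests or any other library. `fetch` stands for any zero-argument callable that performs the request and raises on failure.

```python
import random
import time
from urllib import robotparser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def with_retries(fetch, attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus a little random jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

With requests, the callable would wrap both the GET and `raise_for_status()`, so HTTP error responses are retried as well. In real code, narrow the `except` clause to the client's own error type rather than catching every exception.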

Playwright — JavaScript and real browsers

Use when: content appears only after JS runs, you need clicks/scrolls, or you want to capture network responses (XHR) instead of scraping hydrated HTML.

Avoid when: static parsing is enough—browsers cost more CPU and complexity.

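from playwright.sync_api import sync_playwright


def scrape_dynamic(url: str) -> str:
    """Load the page in headless Chromium and return the rendered body text."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity settles; timeout is in milliseconds
            page.goto(url, wait_until="networkidle", timeout=60_000)
            return page.inner_text("body")
        finally:
            browser.close()  # release the browser even if navigation fails


print(scrape_dynamic("https://example.com")[:500])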

Prefer Playwright over Selenium for new projects in 2026: faster lifecycle, modern APIs, and strong trace tooling. Selenium is still valid if your org standardizes on it or you rely on Grid-based infrastructure.

Selenium — browser automation (legacy-friendly)

Use when: your team already invested in Selenium Grid, or you must reuse existing Selenium tests as scrapers.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless=new")  # Chrome's current headless mode
driver = webdriver.Chrome(options=opts)
try:
    driver.get("https://example.com")
    el = driver.find_element(By.TAG_NAME, "h1")
    print(el.text)
finally:
    driver.quit()  # always close the browser, even if the lookup fails

Use explicit waits (WebDriverWait with expected conditions) instead of scattering time.sleep calls, and pin browser and driver versions in CI.

Scrapy — crawling at scale

Use when: you need to follow links across a domain, deduplicate URLs, throttle politely, and push items through item pipelines (cleaning, storage).

Avoid when: you only need a handful of URLs—Scrapy's project structure is heavier than a single script.

# Run with: scrapy runspider quotes.py -o out.jsonl
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # One item per quote block on the page
        for q in response.css("div.quote"):
            yield {
                "text": q.css("span.text::text").get(),
                "author": q.css("small.author::text").get(),
            }
        # Follow pagination until there is no "next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

For JavaScript sites, use Scrapy + Playwright integrations or offload the worst pages to a browser Actor.
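The item pipelines mentioned in this section are plain Python classes with a `process_item` method. Below is a minimal sketch of a cleaning step for the quotes spider, assuming items are dicts with `text` and `author` keys. The class name `CleanQuotePipeline` is hypothetical, and the class would still need to be registered under `ITEM_PIPELINES` in the project settings before Scrapy calls it.

```python
class CleanQuotePipeline:
    """Hypothetical cleaning step: trims whitespace and decorative quote marks."""

    def process_item(self, item, spider=None):
        # Strip surrounding whitespace first, then curly or straight quote marks
        text = (item.get("text") or "").strip().strip("\u201c\u201d\"")
        author = (item.get("author") or "").strip()
        item["text"] = text
        item["author"] = author
        return item  # returning the item hands it to the next pipeline
```

Keeping cleanup in a pipeline rather than in `parse` leaves the spider focused on selectors, so the same cleaning can be reused across spiders.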

Async static scraping (one step past beginners)

When you outgrow sequential requests, httpx async mode keeps you in Python without jumping to Scrapy:

import asyncio
import httpx
from bs4 import BeautifulSoup


async def fetch_title(client: httpx.AsyncClient, url: str) -> dict:
    """Fetch one page and return its title, or the error message on failure."""
    try:
        r = await client.get(url, timeout=20.0)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "html.parser")
        t = soup.title.get_text(strip=True) if soup.title else None
        return {"url": url, "title": t}
    except Exception as e:
        # Report the failure for this URL instead of aborting the whole batch
        return {"url": url, "error": str(e)}


async def main(urls: list[str]):
    # One shared client reuses connections across all requests
    async with httpx.AsyncClient(headers={"User-Agent": "Mozilla/5.0"}) as client:
        return await asyncio.gather(*[fetch_title(client, u) for u in urls])


urls = ["https://example.com", "https://www.python.org"]
print(asyncio.run(main(urls)))

Cap concurrency with a semaphore when you target real sites so you do not overwhelm servers or trigger blocks.
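One way to add that cap is a small wrapper around `asyncio.Semaphore`. This is a sketch under assumptions: `gather_limited` is a hypothetical helper name, `fetch_one` is any async callable that takes a URL, and the default limit of 5 is arbitrary.

```python
import asyncio


async def gather_limited(urls, fetch_one, limit: int = 5):
    """Run fetch_one(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:  # waits here once `limit` calls are running
            return await fetch_one(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(guarded(u) for u in urls))
```

Inside `main`, the `asyncio.gather(...)` call could become `await gather_limited(urls, lambda u: fetch_title(client, u), limit=5)`.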

Apify as an alternative to DIY scraping

Apify hosts thousands of Actors—prebuilt scrapers and workflows—for sites that are painful to maintain in raw Python (anti-bot, CAPTCHAs, rotating proxies). You can:

  • Run scrapers from the browser with no local Chrome install.
  • Schedule jobs, store datasets, and connect webhooks to n8n, Make, or Zapier.
  • Call everything through the REST API if you still want Python orchestration.

This does not replace learning Python for custom extraction logic, but it often replaces months of proxy and selector maintenance for well-known domains.
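For the REST API option, a rough sketch using only the standard library is shown below. It assumes Apify's v2 endpoint that runs an Actor synchronously and returns its dataset items; check the path against the current API reference before relying on it. The Actor ID `someuser~some-actor`, the input fields, and the helper names are placeholders, not real resources.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) a request that runs an Actor and returns its items."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items",
        data=json.dumps(run_input).encode("utf-8"),  # Actor input as a JSON body
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and parse the JSON array of dataset items."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder call; the input shape depends entirely on the Actor you pick:
# items = run_actor("someuser~some-actor", "YOUR_TOKEN", {"startUrls": [{"url": "https://example.com"}]})
```

Splitting request construction from sending keeps the URL and headers easy to inspect without spending platform credits.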

Try Apify if you want scraping without browser ops

Sign up at Apify, pick a Store Actor for your target, and export JSON—Python optional.

Frequently Asked Questions

Which Python library should I start with?

Start with requests and BeautifulSoup to learn HTML structure and HTTP. Move to Playwright the first time you see blank or incomplete HTML because the site renders with JavaScript.

Is Selenium obsolete?

No, but Playwright is the default recommendation for new browser automation. Selenium remains widely used in enterprises and is fine if your team already standardized on it.

When should I use Scrapy?

When you crawl many pages or whole domains and need politeness settings, deduplication, and pipelines. For dozens of URLs, a single script or an Apify Actor is usually faster to ship.

Can I scrape any website with Python?

Technically many sites can be reached with enough effort, but legal and contractual limits apply. Terms of service, copyright, and data protection laws still bind you regardless of language.

Should I use Python or Apify?

Python gives maximum flexibility on your machine. Apify gives hosted runs, storage, proxies, and maintained Actors for popular targets, which is often faster for production lead lists or monitoring.

Do I always need proxies?

No. Respectful low-volume scraping of static sites often needs no proxies. Proxies matter for geo-targeting, high volume, or strict anti-bot systems.

Common mistakes and fixes

BeautifulSoup returns empty page

The HTML may be rendered by JavaScript. Open DevTools, disable JS, and reload—if content disappears, use Playwright (or a hosted Actor) instead of requests alone.
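The same check can be scripted: fetch the raw HTML, then look for a piece of text you can see in the browser. This is a heuristic sketch using only the standard library; `looks_js_rendered` is a hypothetical helper name, and `expected_text` is whatever visible string you choose to test for.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_js_rendered(html: str, expected_text: str) -> bool:
    """Return True when text seen in the browser is absent from the raw HTML."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return expected_text.lower() not in " ".join(parser.chunks).lower()
```

If it returns True for `resp.text`, the data is most likely injected after load, so a browser-based approach is the safer choice.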

Playwright install errors

Run playwright install after pip install playwright to download browser binaries. On Linux servers, playwright install --with-deps also installs the required OS packages; see the Playwright docs for other platforms.

Scrapy gets 403 on every request

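Scrapy's default User-Agent identifies the client as Scrapy, and many sites block it outright. Set a realistic User-Agent or enable user-agent rotation, reduce concurrency, and add reputable proxies. For hard targets, consider a browser-based or managed scraping service.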
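The concurrency and user-agent advice above maps onto a handful of built-in Scrapy settings. The values below are a starting-point sketch, not tuned recommendations, and the user-agent string with its contact URL is a placeholder. User-agent rotation and proxies are not built in and need extra middleware, so they are left out here.

```python
POLITE_SETTINGS = {
    # Placeholder identity; replace with your own bot name and contact page
    "USER_AGENT": "ResearchBot/1.0 (+https://example.com/contact)",
    "ROBOTSTXT_OBEY": True,               # skip URLs disallowed by robots.txt
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # few parallel requests per site
    "DOWNLOAD_DELAY": 1.0,                # seconds between requests to one site
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
}
```

Assigning the dict to a spider's `custom_settings` class attribute applies it to that spider only; the same keys can also go into a project's settings.py.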