Web Scraping with Python for Beginners (2026 Guide)
Python is one of the most popular languages for collecting web data because it pairs readable syntax with mature libraries for HTTP requests, HTML parsing, browser automation, and large-scale crawling. This guide explains when to use each approach, gives copy-paste examples, and shows where Apify fits if you want no-code or fully managed scraping.
Quick Answer
Python web scraping uses requests + BeautifulSoup for static pages, Playwright or Selenium for JavaScript-rendered pages, and Scrapy for large-scale crawling. For no-code scraping, use Apify.
If you already know you want cloud runs, proxies, and scheduling without maintaining code, open the Apify Store and pick a ready-made Actor for your target site.
Comparison table
| Approach | JavaScript rendering | Scale | Learning curve | Best for |
|---|---|---|---|---|
| requests + BeautifulSoup | No | Low–medium | Easiest | Static HTML, APIs, prototypes |
| httpx + BeautifulSoup (async) | No | Medium–high | Medium | Many static URLs in parallel |
| Playwright | Yes | Medium | Medium | SPAs, logins, heavy dynamic UI |
| Selenium | Yes | Medium | Medium | Legacy stacks, teams already on Selenium |
| Scrapy | Via extensions | Very high | Steeper | Broad crawls, pipelines, scheduling |
| Apify (no-code / Actors) | Yes (Actor-dependent) | High (cloud) | Easiest for delivery | Production without owning infra |
When to use each library
requests + BeautifulSoup — static HTML
Use when: the page shows the data you need with JavaScript turned off (view source matches what you see), or the site offers a simple JSON API you can call directly.
Avoid when: the DOM is filled in by React/Vue after load, or the site uses complex bot detection—pure HTTP clients are easy to fingerprint.
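One rough way to test the "view source matches what you see" condition is to check whether a value you can see in the browser also appears in the visible text of the raw HTML. The sketch below uses only the standard library's html.parser rather than BeautifulSoup; the helper names `visible_text` and `is_in_raw_html` are illustrative, not part of any library.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script/style block
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a reader could see without running JavaScript."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_in_raw_html(html: str, expected: str) -> bool:
    """True if `expected` appears in the raw HTML's visible text."""
    return expected.lower() in visible_text(html).lower()
```

Pass it the response body from a plain HTTP request plus a string you saw in the browser. If the result is False, the content is probably injected by JavaScript and a browser-based tool is the safer choice.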
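import requests
from bs4 import BeautifulSoup

URL = "https://example.com"

# Identify the client and set a timeout so a stalled server cannot hang the script.
resp = requests.get(
    URL,
    headers={"User-Agent": "Mozilla/5.0 (compatible; ResearchBot/1.0)"},
    timeout=30,
)
resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page

soup = BeautifulSoup(resp.text, "html.parser")
# get_text() also handles an empty or nested <title>, where .string would be None
title = soup.title.get_text(strip=True) if soup.title else None
# Collect absolute links only (href values starting with "http")
links = [a["href"] for a in soup.select('a[href^="http"]')]
print(title, len(links))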
Add retries with backoff and a robots.txt check before production. For many URLs, switch to async httpx or Scrapy so you are not waiting on the network one page at a time.
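As a minimal sketch of those production steps, the helpers below check a robots.txt body with the standard library's `urllib.robotparser` and retry a fetch with exponential backoff. The function names, the default delays, and the idea of passing the fetch function in as a parameter are assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential delays: base, 2*base, 4*base, ... limited to `cap` seconds."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def fetch_with_retries(fetch, url, retries: int = 3, base: float = 1.0, sleep=time.sleep):
    """Call fetch(url); on an exception, wait and try again up to `retries` times."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

With requests, `fetch` could be a small function that calls `requests.get(url, timeout=30)` followed by `raise_for_status()`, so that HTTP error codes also trigger a retry. In real use you would likely narrow the caught exceptions to network and 5xx errors.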
Playwright — JavaScript and real browsers
Use when: content appears only after JS runs, you need clicks/scrolls, or you want to capture network responses (XHR) instead of scraping hydrated HTML.
Avoid when: static parsing is enough—browsers cost more CPU and complexity.
from playwright.sync_api import sync_playwright

def scrape_dynamic(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so JS-rendered content is present
        page.goto(url, wait_until="networkidle", timeout=60_000)
        text = page.inner_text("body")
        browser.close()
    return text

print(scrape_dynamic("https://example.com")[:500])
Prefer Playwright over Selenium for new projects in 2026: faster lifecycle, modern APIs, and strong trace tooling. Selenium is still valid if your org standardizes on it or you rely on Grid-based infrastructure.
Selenium — browser automation (legacy-friendly)
Use when: your team already invested in Selenium Grid, or you must reuse existing Selenium tests as scrapers.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless=new")  # Chrome's current headless mode
driver = webdriver.Chrome(options=opts)
try:
    driver.get("https://example.com")
    el = driver.find_element(By.TAG_NAME, "h1")
    print(el.text)
finally:
    driver.quit()  # always release the browser, even if the lookup fails
Pair with explicit waits (not time.sleep everywhere) and pin browser + driver versions in CI.
Scrapy — crawling at scale
Use when: you need to follow links across a domain, deduplicate URLs, throttle politely, and push items through item pipelines (cleaning, storage).
Avoid when: you only need a handful of URLs—Scrapy's project structure is heavier than a single script.
# Run with: scrapy runspider quotes.py -o out.jsonl
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # One item per quote block on the page
        for q in response.css("div.quote"):
            yield {
                "text": q.css("span.text::text").get(),
                "author": q.css("small.author::text").get(),
            }
        # Follow pagination until there is no "next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
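Polite throttling is usually configured through Scrapy settings rather than code. The dictionary below is a sketch of one conservative setup; the setting names are Scrapy's own, while the values and the name `POLITE_SETTINGS` are assumptions to tune per site.

```python
# Illustrative values only; adjust them to the target site's tolerance.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,                  # skip URLs disallowed by robots.txt
    "DOWNLOAD_DELAY": 1.0,                   # base delay in seconds between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,     # cap parallel requests per domain
    "AUTOTHROTTLE_ENABLED": True,            # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}
```

To apply it to a single spider, assign it as a class attribute, for example `custom_settings = POLITE_SETTINGS` inside the spider class. In a full project the same keys can go in settings.py.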
For JavaScript sites, use Scrapy + Playwright integrations or offload the worst pages to a browser Actor.
Async static scraping (one step past beginners)
When you outgrow sequential requests, httpx async mode keeps you in Python without jumping to Scrapy:
import asyncio
import httpx
from bs4 import BeautifulSoup

async def fetch_title(client: httpx.AsyncClient, url: str) -> dict:
    try:
        r = await client.get(url, timeout=20.0)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "html.parser")
        t = soup.title.get_text(strip=True) if soup.title else None
        return {"url": url, "title": t}
    except Exception as e:
        # Report the failure per URL so one bad page does not abort the batch
        return {"url": url, "error": str(e)}

async def main(urls: list[str]):
    # One shared client reuses connections across all requests
    async with httpx.AsyncClient(headers={"User-Agent": "Mozilla/5.0"}) as client:
        return await asyncio.gather(*[fetch_title(client, u) for u in urls])

urls = ["https://example.com", "https://www.python.org"]
print(asyncio.run(main(urls)))
Cap concurrency with a semaphore when you target real sites so you do not overwhelm servers or trigger blocks.
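A minimal sketch of that cap, using only asyncio: the helper wraps any async fetch function so that at most `limit` calls run at once. The name `gather_limited` and the default limit of 5 are assumptions.

```python
import asyncio


async def gather_limited(urls, fetch, limit: int = 5):
    """Run fetch(url) for every URL with at most `limit` calls in flight.
    Results come back in the same order as `urls`."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        # Each task waits here until one of the `limit` slots is free
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(guarded(u) for u in urls))
```

Inside `main` from the example above, the `asyncio.gather(...)` call could become `await gather_limited(urls, lambda u: fetch_title(client, u), limit=5)`.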
Apify as an alternative to DIY scraping
Apify hosts thousands of Actors—prebuilt scrapers and workflows—for sites that are painful to maintain in raw Python (anti-bot, CAPTCHAs, rotating proxies). You can:
- Run scrapers from the browser with no local Chrome install.
- Schedule jobs, store datasets, and connect webhooks to n8n, Make, or Zapier.
- Call everything through the REST API if you still want Python orchestration.
This does not replace learning Python for custom extraction logic, but it often replaces months of proxy and selector maintenance for well-known domains.
Sign up at Apify, pick a Store Actor for your target, and export JSON—Python optional.
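If you do want Python orchestration, the REST route could look roughly like the sketch below. It uses only the standard library and assumes Apify's v2 `run-sync-get-dataset-items` endpoint with a Bearer token header; verify both against the current Apify API reference before relying on them. The Actor ID and input shown in the usage note are hypothetical placeholders.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs an Actor and returns its dataset items."""
    # Assumption: the API path uses "username~actor-name" instead of a slash
    path_id = actor_id.replace("/", "~")
    url = f"{API_BASE}/acts/{path_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300.0):
    """Send the request and parse the JSON response (needs network and a token)."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `run_actor("someuser/some-actor", token, {"startUrls": [{"url": "https://example.com"}]})` would block until the run finishes. The input schema differs per Actor, so check the Actor's own documentation for the fields it expects.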
Start with requests and BeautifulSoup to learn HTML structure and HTTP. Move to Playwright the first time you see blank or incomplete HTML because the site renders with JavaScript.
Selenium is not obsolete, but Playwright is the default recommendation for new browser automation. Selenium remains widely used in enterprises and is fine if your team has already standardized on it.
Use Scrapy when you crawl many pages or whole domains and need politeness settings, deduplication, and pipelines. For dozens of URLs, a single script or an Apify Actor is usually faster to ship.
Technically many sites can be reached with enough effort, but legal and contractual limits apply. Terms of service, copyright, and data protection laws still bind you regardless of language.
Python gives maximum flexibility on your machine. Apify gives hosted runs, storage, proxies, and maintained Actors for popular targets—often faster for production lead lists or monitoring.
You do not always need proxies. Respectful low-volume scraping of static sites often works without them. Proxies matter for geo-targeting, high volume, or strict anti-bot systems.




