browser-use: Architecting AI-Powered Web Agents (2026)

March 8, 2026 · 6 min read

Software Developer & Automation Specialist

I build production AI agents, web scrapers, and automation pipelines. Most of what I publish here comes from the actual problems they run into: proxies that get banned, anti-bot stacks that fingerprint your client, RAG that drifts when the underlying data moves. Stack: Python, TypeScript, Go, FastAPI, LangChain, Crawlee, Playwright, deployed on AWS, GCP, and Cloudflare.

Quick Answer

browser-use is an open-source Python library that gives an LLM control of a real browser (via Playwright). It runs a perceive–act loop: the page DOM is pruned and tagged, the model chooses actions like click or type, and the loop repeats until the task finishes. It shines when layouts change often and fixed selectors break; it costs more in tokens and time than traditional scrapers, and it is a poor fit for hard WAFs or huge deterministic crawls. For production, run it in containers on Apify with proxy rotation.

Traditional automation (Playwright or Puppeteer) depends on stable selectors. If a team hardcodes .submit-btn and the site renames classes, the job fails.

browser-use inverts that: you describe the goal in natural language, the library feeds a sanitized view of the page to an LLM, and the model plans clicks, typing, and extractions through Playwright.

This guide covers the architecture, where it breaks in production, and how to pair it with Apify Actors when you need cloud browsers and proxies.

How the perceive–act loop works

Sending raw HTML to the model would overflow the context window on most real pages. Instead, browser-use runs a repeating two-phase cycle.

1. Perception

On each step, an injection script prepares the DOM:

Pruning — strips <script>, <style>, and hidden nodes.
Node tagging — adds numeric tags (e.g. [12]) next to actionable elements (buttons, inputs, links).
Vision (optional) — if the model supports images, a viewport screenshot can supplement the text view.
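The pruning and tagging steps above can be sketched with nothing but the standard library. This is a toy illustration, not browser-use's actual injection script (which runs as JavaScript inside the page): `PageSimplifier`, `simplify`, and the set of tags treated as actionable are all hypothetical names and choices made for this example, and it assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

SKIPPED = {"script", "style"}  # dropped together with their content
ACTIONABLE = {"a", "button", "input", "select", "textarea"}
VOID = {"input", "br", "img", "hr", "meta", "link"}  # tags with no closing tag


class PageSimplifier(HTMLParser):
    """Toy pruner/tagger: keeps visible text and numbers actionable elements."""

    def __init__(self):
        super().__init__()
        self.lines = []       # simplified view that would be sent to the model
        self.index_map = {}   # numeric tag -> (tag name, attributes)
        self._skip_depth = 0  # greater than 0 while inside a pruned subtree

    @staticmethod
    def _is_hidden(attrs):
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return "hidden" in attrs or "display:none" in style

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._skip_depth:
            if tag not in VOID:
                self._skip_depth += 1  # track nesting inside the pruned subtree
            return
        if tag in SKIPPED or self._is_hidden(attrs):
            if tag not in VOID:
                self._skip_depth = 1
            return
        if tag in ACTIONABLE:
            idx = len(self.index_map) + 1
            self.index_map[idx] = (tag, attrs)
            label = attrs.get("placeholder") or attrs.get("name") or ""
            self.lines.append(f"[{idx}] <{tag}> {label}".rstrip())

    def handle_endtag(self, tag):
        if self._skip_depth and tag not in VOID:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.lines.append(text)


def simplify(html: str):
    """Return (text view, index map) for one HTML document."""
    parser = PageSimplifier()
    parser.feed(html)
    return "\n".join(parser.lines), parser.index_map


sample = (
    "<body><script>track()</script><h1>Jobs</h1>"
    '<div style="display: none"><button>Ghost</button></div>'
    '<input name="q" placeholder="Search roles"><button>Search</button></body>'
)
view, index_map = simplify(sample)
print(view)
```

The index map is what makes an action such as click(element_index=2) resolvable later: the model only ever sees the number, and the runtime looks up the real element behind it.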

2. Action

The simplified representation (often markdown-like) plus your objective goes to the LLM. The model returns structured tool calls, for example:

click(element_index=12)
type(element_index=15, text="Data Engineer")
extract_structured against a schema
go_back(), done(), and similar control actions
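To make the shape of these tool calls concrete, here is a minimal sketch of how a runtime might route them. It is not browser-use's internal dispatcher: the JSON layout (`action` plus `args`), the `RecordingBrowser` stand-in, and the `dispatch` function are assumptions made for illustration, and only the click and type actions from the list above are covered.

```python
import json


class RecordingBrowser:
    """Stand-in for a real browser: records what would have been done."""

    def __init__(self):
        self.log = []

    def click(self, element_index: int) -> str:
        self.log.append(("click", element_index))
        return f"clicked [{element_index}]"

    def type(self, element_index: int, text: str) -> str:
        self.log.append(("type", element_index, text))
        return f"typed {text!r} into [{element_index}]"


def dispatch(browser: RecordingBrowser, raw_call: str) -> str:
    """Parse one JSON tool call and run it; unknown actions are rejected."""
    call = json.loads(raw_call)
    handlers = {"click": browser.click, "type": browser.type}
    handler = handlers.get(call.get("action"))
    if handler is None:
        return f"error: unknown action {call.get('action')!r}"
    try:
        return handler(**call.get("args", {}))
    except TypeError as exc:  # missing or unexpected arguments from the model
        return f"error: {exc}"


browser = RecordingBrowser()
print(dispatch(browser, '{"action": "click", "args": {"element_index": 12}}'))
print(dispatch(browser, '{"action": "type", "args": {"element_index": 15, "text": "Data Engineer"}}'))
```

Returning an error string instead of raising lets the loop feed the failure back to the model on the next step. In practice the library does all of this for you, as the example below shows.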

import asyncio
from langchain_openai import ChatOpenAI
from browser_use import Agent

async def execute_agentic_workflow():
    agent = Agent(
        task="Navigate to hackerone.com, locate the public bug bounty directory, and return the top 5 highest-paying telecom programs as JSON.",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result.final_result())

asyncio.run(execute_agentic_workflow())

Need structured data at the end of a run? Bind Pydantic models so the agent is nudged toward typed output instead of free-form prose.

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from browser_use import Agent

class CompetitorPrice(BaseModel):
    sku_name: str
    price_usd: float = Field(description="Float only; strip currency symbols")
    in_stock: bool

# Wire CompetitorPrice into extraction per browser-use docs for your version (structured output APIs evolve).
agent = Agent(
    task="Extract Databricks tier pricing as structured fields (sku_name, price_usd, in_stock).",
    llm=ChatOpenAI(model="gpt-4o-mini"),
)

Where browser-use struggles in production

Token cost and latency

Every loop step is at least one LLM call. A checkout flow with many clicks and scrolls can mean dozens of steps. With vision enabled, screenshots plus large DOM snapshots get expensive fast. A workflow that finishes in a few seconds under plain Playwright can take tens of seconds with an agent.
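A back-of-the-envelope estimate makes the cost side easier to reason about. The sketch below is plain arithmetic: `estimate_run_cost` is a hypothetical helper, and the token counts and per-million-token prices in the example are placeholders, not any provider's real rates. It also assumes a constant prompt size per step, which understates cost when the action history grows.

```python
def estimate_run_cost(
    steps: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Rough cost of one agent run, assuming a constant prompt size per step."""
    input_cost = steps * input_tokens_per_step * usd_per_million_input / 1_000_000
    output_cost = steps * output_tokens_per_step * usd_per_million_output / 1_000_000
    return round(input_cost + output_cost, 4)


# Placeholder numbers: 30 steps, 8,000 prompt tokens and 200 completion tokens
# per step, at hypothetical prices of $2.50 and $10.00 per million tokens.
print(estimate_run_cost(30, 8_000, 200, 2.50, 10.00))  # 0.66
```

Multiply the result by runs per day before committing to an agent for a recurring job; a per-run figure that looks negligible rarely stays negligible at volume.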

Hallucination loops

On CAPTCHAs, sliders, or heavy anti-bot flows, the model may repeat useless actions until max_steps stops the run. Treat max_steps as a billing and safety rail, not an optional knob.
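Beyond the step cap itself, a run can be stopped early when the model keeps issuing the same action. The sketch below shows one way to combine both checks. `StepGuard` and its stop reasons are hypothetical and not part of browser-use; how you hook it into your loop (for example from a per-step callback, if your version offers one) is left open.

```python
from collections import deque
from typing import Optional


class StepGuard:
    """Stops a run on a step budget or on N identical actions in a row."""

    def __init__(self, max_steps: int, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self._recent = deque(maxlen=max_repeats)  # sliding window of actions

    def check(self, action: str) -> Optional[str]:
        """Record one action; return a stop reason, or None to continue."""
        self.steps += 1
        self._recent.append(action)
        window_full = len(self._recent) == self.max_repeats
        if window_full and len(set(self._recent)) == 1:
            return "repeated_action"
        if self.steps >= self.max_steps:
            return "max_steps"
        return None


guard = StepGuard(max_steps=20, max_repeats=3)
for action in ["click(12)", "click(12)", "click(12)", "type(15)"]:
    reason = guard.check(action)
    if reason:
        print(f"stopping after {guard.steps} steps: {reason}")
        break
```

Exact string comparison is deliberately crude. A model stuck on a slider may alternate between two actions, so a production guard would probably also look for short cycles.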

Non-deterministic extraction

Unlike a fixed Crawlee spider, LLM extraction can drift: wrong field associations, misread “Sponsored” blocks, or inconsistent JSON. For high-volume, schema-critical pipelines, prefer deterministic scrapers or pre-built Apify Actors when they exist for your target.
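Whatever produces the rows, a deterministic check before they reach storage catches much of this drift. The sketch below validates one row against the CompetitorPrice fields from the earlier example using only the standard library so that it runs anywhere; in a real pipeline the Pydantic model itself could do this job. `validate_price_row` and the sample rows are hypothetical.

```python
EXPECTED = {"sku_name": str, "price_usd": (int, float), "in_stock": bool}


def validate_price_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row is usable."""
    problems = []
    for field, kind in EXPECTED.items():
        if field not in row:
            problems.append(f"missing {field}")
            continue
        value = row[field]
        # bool is a subclass of int, so reject it explicitly for the price.
        wrong_type = not isinstance(value, kind) or (
            field == "price_usd" and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{field} has wrong type: {type(value).__name__}")
    extra = sorted(set(row) - set(EXPECTED))
    if extra:
        problems.append(f"unexpected fields: {extra}")
    return problems


good = {"sku_name": "Example tier", "price_usd": 0.55, "in_stock": True}
bad = {"sku_name": "Example tier", "price_usd": "$0.55"}
print(validate_price_row(good))
print(validate_price_row(bad))
```

Rows that fail can be retried with a narrower task or routed to manual review rather than silently written to the dataset.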

Custom controllers

Use a Controller to register Python side effects the LLM can invoke (write to storage, call an API, enqueue work):

import io

import polars as pl
from browser_use import Agent, Controller
from langchain_openai import ChatOpenAI

controller = Controller()

@controller.action("Save extraction to local Parquet file")
def write_to_parquet(data: str) -> str:
    # `data` is expected to be a JSON string produced by the agent.
    df = pl.read_json(io.BytesIO(data.encode()))
    # Overwrites extraction_log.parquet on every call.
    df.write_parquet("extraction_log.parquet")
    return "Write successful."

agent = Agent(
    task="Find the top 10 HN posts, output to Parquet.",
    llm=ChatOpenAI(model="gpt-4o"),
    controller=controller,
)

Running browser-use on Apify

A local laptop is a poor host for long agent runs, residential or rotating proxies, and parallel jobs. The usual pattern is to package your script as an Apify Actor:

Isolated Chromium — consistent browser runtime in the cloud.
Proxies — wire Apify Proxy (and groups such as residential where appropriate) into your browser config so requests look like normal user traffic.
Automation — trigger runs from n8n, Zapier, Make, or the API for scheduled and event-driven pipelines.
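For the proxy item above, the sketch below builds a Playwright-style proxy dictionary (`server`, `username`, `password`) for Apify Proxy. The host `proxy.apify.com:8000` and the `groups-...,country-...` username format reflect my understanding of Apify's documented connection settings, so verify them against the current docs. `apify_proxy_settings` is a hypothetical helper, and how the resulting dict is handed to browser-use depends on the library version you run.

```python
import os
from typing import Optional, Sequence


def apify_proxy_settings(
    password: str,
    groups: Optional[Sequence[str]] = None,
    country: Optional[str] = None,
) -> dict:
    """Build a Playwright-style proxy dict for Apify Proxy (assumed format)."""
    parts = []
    if groups:
        parts.append("groups-" + "+".join(groups))  # e.g. groups-RESIDENTIAL
    if country:
        parts.append(f"country-{country}")
    return {
        "server": "http://proxy.apify.com:8000",
        "username": ",".join(parts) or "auto",  # "auto" lets Apify choose
        "password": password,
    }


# Read the secret from the environment rather than hardcoding it.
settings = apify_proxy_settings(
    os.environ.get("APIFY_PROXY_PASSWORD", "<your-proxy-password>"),
    groups=["RESIDENTIAL"],
    country="US",
)
print(settings["username"])
```

Keeping the builder separate from the browser setup makes it easy to switch proxy groups per target without touching the agent code.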

Good fits for agentic browsing: exploratory audits, one-off internal tools, legacy portals without APIs, and tasks where maintaining selectors is too costly.

Poor fits: high-frequency price tracking, strict Tier-1 WAF bypass as the primary strategy, or extracting millions of rows where specialized scrapers win on cost and reliability.

Explore Apify for serverless browsers and workflows →

Frequently Asked Questions

Usually no. The session still looks like automation to advanced bot stacks. You may need dedicated unlock or proxy products, or avoid agent-only strategies on those pages. Apify offers separate tools and Actors aimed at difficult targets—evaluate them for your specific site.

For simple 'go here and read text' tasks, sometimes. For nested menus, multi-page flows, or precise spatial reasoning, smaller models fail more often. Budget for retries, shorter tasks, or a stronger model when reliability matters.

Set a low max_steps, cap task scope, avoid vision unless necessary, and log each step in development. In production, add timeouts and alerts on run duration and token usage.
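A wall-clock deadline is the simplest of these rails to add, because it needs nothing from the agent library. The sketch below wraps any awaitable in `asyncio.wait_for`. `run_with_deadline` is a hypothetical helper, and `fake_agent_run` merely stands in for a call such as agent.run() so the example runs without a browser or API key.

```python
import asyncio


async def run_with_deadline(coro, seconds: float):
    """Await `coro`; return (result, None) or (None, "timeout")."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds), None
    except asyncio.TimeoutError:
        # wait_for cancels the wrapped task before raising.
        return None, "timeout"


async def fake_agent_run(delay: float) -> str:
    """Stand-in for a real agent run that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return "done"


print(asyncio.run(run_with_deadline(fake_agent_run(0.01), seconds=1)))
print(asyncio.run(run_with_deadline(fake_agent_run(1), seconds=0.05)))
```

The timeout branch is the natural place to emit an alert. Note that cancelling the task does not by itself close a browser the agent opened, so pair it with explicit cleanup.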

If an Actor already exists for your site (Maps, Amazon, job boards, social, etc.), start there—you get maintained selectors, storage, and exports without LLM cost. Use browser-use when you need flexible reasoning on unfamiliar or highly dynamic UIs.

No. It builds on Playwright. You still benefit from understanding browsers, waits, and sessions; the LLM only chooses high-level actions.
