Computer-Use & Browser Agents
Agents that click, type, and browse like a person
- Explain how a computer-use agent perceives a screen and turns that perception into mouse and keyboard actions in a loop
- Compare the three perception approaches — screenshots, accessibility trees, and DOM — and their trade-offs
- Place the 2026 state of the art (Anthropic Computer Use, OpenAI Operator/agent mode, Gemini 2.5 Computer Use) and the open-source browser-agent stack on a map
- Read benchmark numbers like OSWorld and WebVoyager critically and translate them into realistic reliability expectations
- Apply the core safety pattern — sandbox, scope, and confirm — and pick tasks that work reliably today
Most software has no API — but it all has a screen. Computer-use and browser agents close that gap by perceiving a display through screenshots or accessibility trees and acting through mouse, keyboard, and scroll commands, just like a person. This lesson explains how they perceive and act, surveys the 2026 state of the art across Anthropic, OpenAI, and Google, and is honest about where they shine and where they still fail.
1. The software that has no API
2. The perceive–act loop
3. Three ways to see a screen
4. The 2026 state of the art
5. Browser agents and the open-source stack
6. Reading the benchmarks honestly
7. Safety and what works today
The software that has no API
Here is the simple problem these agents solve. An agent with tools can only reach software that offers an API — a clean, programmatic door to call. But think about the apps a real worker uses all day: a hospital's legacy scheduling system, a supplier portal, a desktop accounting suite, an internal CRM from 2009. Most of them have no usable API at all. The only door in is the same one a human uses: a screen to look at, and a mouse and keyboard to act with.
A computer-use agent (also called a GUI agent) is built to walk through that human door. Instead of calling functions, it does what you do: it looks at the screen, decides where to click and what to type, performs the action, looks again, and repeats. A browser agent is the most common and most reliable special case — the same idea scoped to a web browser rather than the whole desktop.
This is a genuinely different capability from classic tool use. It means an agent can, in principle, drive any application — no integration work, no vendor cooperation, no waiting for an API that will never ship. That generality is the whole appeal. The cost, as we'll see, is reliability: pixels are a far messier interface than a typed function signature.
The perceive–act loop
If you already know the agent loop, you already know this. A computer-use agent is that same loop with one specialized tool bolted on: a computer the model can see and control. Picture a person handed a screenshot, told to describe the next click, then handed a fresh screenshot — that is exactly the rhythm. Each iteration looks like this:
- Perceive — capture the current screen (a screenshot, plus optionally structured UI data) and send it to a multimodal model.
- Reason — the model decides the next action: click at (x, y), type "invoice 4471", scroll down, press Enter.
- Act — your code executes that action against the real environment (a VM, container, or live browser).
- Observe — take a fresh screenshot showing the result, feed it back, and loop until the goal is met.
The model never touches the machine directly. It only emits intended actions; your harness executes them and returns the next screenshot. That separation is why the model can be cloud-hosted while the desktop runs locally or in your own cloud.
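The four steps above can be sketched as a small harness. This is a minimal illustration, not any vendor's implementation: `FakeDesktop`, `scripted_model`, and the `Action` type are hypothetical stand-ins, the "screenshot" is a string instead of image bytes, and the scripted policy takes the place of a real multimodal model call.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class FakeDesktop:
    """Hypothetical stand-in for a VM or browser: one text field, one button."""

    def __init__(self) -> None:
        self.field = ""
        self.submitted = False

    def screenshot(self) -> str:
        # A real harness would return image bytes; a string keeps this runnable.
        return f"field={self.field!r} submitted={self.submitted}"

    def execute(self, action: Action) -> None:
        if action.kind == "type":
            self.field += action.text
        elif action.kind == "click":
            self.submitted = True


def scripted_model(goal: str, observation: str) -> Action:
    """Placeholder policy standing in for a multimodal model call."""
    if "submitted=True" in observation:
        return Action("done")
    if "field=''" in observation:
        return Action("type", text="invoice 4471")
    return Action("click", x=612, y=344)


def run_loop(goal, env, model, max_steps=10):
    history = []
    for _ in range(max_steps):              # the step cap is the stopping backstop
        observation = env.screenshot()      # perceive
        action = model(goal, observation)   # reason
        history.append(action)
        if action.kind == "done":
            break
        env.execute(action)                 # act; the next iteration observes
    return history


desktop = FakeDesktop()
trace = run_loop("Submit invoice 4471", desktop, scripted_model)
print([a.kind for a in trace])  # ['type', 'click', 'done']
```

Note that `scripted_model` only returns an intended action; only `run_loop` touches the environment. Swapping in a real model changes one function, not the shape of the loop.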
Crucially, the action space is coordinate-based. The model literally outputs "left-click at pixel (612, 344)". This demands strong visual grounding — mapping what a button means to exactly where it is — which is the single hardest sub-skill and the main thing benchmarks like ScreenSpot measure.
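One practical consequence of coordinate-based actions: if a harness downsizes screenshots before sending them to the model, the returned coordinates are in screenshot space and have to be mapped back to the real display before clicking. The sketch below assumes that setup; `scale_point` and `in_bounds` are hypothetical helper names, and the resolutions are made up for illustration.

```python
def scale_point(x, y, shot_size, screen_size):
    """Map a point from screenshot coordinates to real screen coordinates."""
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    return round(x * screen_w / shot_w), round(y * screen_h / shot_h)


def in_bounds(x, y, screen_size):
    """Reject clicks that fall outside the display before executing them."""
    width, height = screen_size
    return 0 <= x < width and 0 <= y < height


model_click = (612, 344)     # model output, in screenshot coordinates
shot = (1024, 768)           # size of the image the model saw (assumed)
screen = (2048, 1536)        # size of the real display (assumed)

target = scale_point(*model_click, shot, screen)
print(target, in_bounds(*target, screen))  # (1224, 688) True
```

A bounds check like this is a cheap form of action validation: a grounding miss that lands off-screen is caught in the harness instead of being sent to the environment.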
Key insight
It's still just LLM + tools + a loop
A computer-use agent is not a new architecture. It is the standard agent loop with one tool whose inputs are screenshots and whose outputs are clicks and keystrokes. Everything you know about loops, stopping conditions, and observations still applies.
Three ways to see a screen
Before the agent can act, it has to see. How it perceives the UI is the central design decision, and there are three main ways to do it, each just a different format for describing what's on screen. Think of one human reading raw pixels, another reading the labels a screen reader announces, and a third reading the page's underlying HTML. Production systems usually combine more than one.
| Approach | What it is | Strengths | Weaknesses |
|---|---|---|---|
| Screenshot / vision | Raw pixels fed to a multimodal model | Works on any UI, including custom-rendered or canvas apps | Token-heavy (thousands of tokens/image); grounding is hard |
| Accessibility tree (AXTree) | Structured UI metadata used by screen readers | Token-efficient (~200–400 tokens); precise element identity | Missing/incomplete on many apps; doesn't capture visual state |
| DOM | The web page's HTML structure | Rich, precise, web-native | Web-only; can be huge and noisy |
A common misconception is that screenshots are simply better because they're how humans see. In reality each has trade-offs. Screenshots handle visually unusual UIs that have no semantic structure; accessibility trees and DOM are far cheaper and more precise when they exist. That's why most production agents fuse them: use the screenshot for grounding and visual state, and the AXTree/DOM for reliable element identity and cheaper context.
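To make the token-efficiency point concrete, here is a rough sketch of how a harness might flatten an accessibility tree into a few lines of text for the model. The nested-dict shape (`id`, `role`, `name`, `children`) is an assumed, simplified format for illustration; real accessibility APIs expose richer and differently shaped data.

```python
def flatten_axtree(node, depth=0, lines=None):
    """Render a nested accessibility-tree dict as indented '[id] role "name"' lines."""
    if lines is None:
        lines = []
    name = node.get("name", "")
    label = f'[{node["id"]}] {node["role"]}' + (f' "{name}"' if name else "")
    lines.append("  " * depth + label)
    for child in node.get("children", []):
        flatten_axtree(child, depth + 1, lines)
    return lines


# Hypothetical tree for a small sign-in form.
tree = {
    "id": 1, "role": "form", "name": "Sign in",
    "children": [
        {"id": 2, "role": "textbox", "name": "Email"},
        {"id": 3, "role": "textbox", "name": "Password"},
        {"id": 4, "role": "button", "name": "Submit"},
    ],
}

axtree_lines = flatten_axtree(tree)
print("\n".join(axtree_lines))
```

With a listing like this in context, the model can answer "click element 4" and the harness resolves the id to a concrete element, while the screenshot is kept for visual state the tree does not capture.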
This is also the key distinction from classic browser automation like Selenium or Playwright, which uses deterministic, scripted selectors (#submit-button). Computer use reasons over the live screen instead — slower and fuzzier, but it doesn't break the instant a layout changes.
The 2026 state of the art
In plain terms: as of 2026, all three big AI labs now sell a model that can drive a computer, and the main thing separating them is where the computer lives. Each ships a first-party computer-use model with a different deployment philosophy.
- Anthropic Computer Use. First shipped in beta in October 2024 with Claude 3.5 Sonnet (earlier than most people remember) and steadily expanded since. The current tool (`computer_20251124`, beta header `computer-use-2025-11-24`) supports Claude Opus 4.8/4.7/4.6, Sonnet 4.6, and Opus 4.5. Actions are tiered: basic (screenshot, click, type, key), enhanced-2025-01 (scroll, drag, multi-click, wait), and enhanced-2025-11 (adds `zoom` for magnifying a sub-region). Anthropic provides an official Docker reference implementation. You run the environment; the model is cloud-hosted.
- OpenAI CUA / Operator / ChatGPT agent. OpenAI's Computer-Using Agent model (GPT-4o vision + RL-trained reasoning) launched January 2025, powering the Operator product. On July 17, 2025 OpenAI unified Operator, deep research, and code execution into ChatGPT agent, accessible via "agent mode" in the ChatGPT composer, and Operator as a standalone product was retired. It drives a cloud-hosted virtual browser/computer.
- Google Gemini 2.5 Computer Use (`gemini-2.5-computer-use-preview-10-2025`). A preview model exposed via a `computer_use` tool in the Gemini API and Vertex AI, using the same screenshot loop. Google's Project Mariner takes a Chrome-extension approach against your live, authenticated browser.
One design fork runs through all three, and it is the choice that matters most for you: a sandboxed container (isolation, a fresh session, nothing of yours at risk) versus a live browser (instant access to your existing logins, but far less isolation).
Browser agents and the open-source stack
You don't need a frontier lab's first-party model to build here. Picture a spectrum: scripted automation (rigid, exact) on one end, full pixel-level computer use (flexible, fuzzy) on the other. A mature open-source ecosystem now fills the middle, and most of it standardizes on Playwright for the actual browser control.
- Browser-Use (Python, MIT, ~96k GitHub stars) is the leading framework: an `Agent` class wrapping Playwright that pairs any vision-capable LLM with browser actions. It reached 89.1% on the WebVoyager benchmark.
- Stagehand (TypeScript, by Browserbase) exposes clean primitives: `act`, `extract`, `observe`, `agent`. Its v3 (Feb 2026) rebuilt the engine on the Chrome DevTools Protocol for a ~44% speedup.
- Skyvern (Python) is vision-first (LLMs plus computer vision, no XPaths or selectors), with millions of executed workflows.
- Microsoft's Playwright MCP server (March 2025) exposes Playwright automation over the Model Context Protocol, so any MCP-compatible agent can drive a browser without bespoke glue.
A simple way to choose: if the target is a standard website, a Playwright-backed browser agent (Browser-Use, Stagehand) is faster, cheaper, and more reliable than full pixel-level computer use. Reach for full computer use when you must drive a desktop app or a UI with no usable DOM.
Example
A minimal browser agent
```python
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic  # import path may differ by browser-use version


async def main() -> None:
    agent = Agent(
        task="Go to the docs site, find the pricing page, "
             "and report the price of the Team tier.",
        llm=ChatAnthropic(model="claude-opus-4-8"),
    )
    result = await agent.run(max_steps=25)  # cap the loop
    print(result.final_result())


asyncio.run(main())  # agent.run is a coroutine, so it needs an event loop
```

The library handles screenshots, the perceive–act loop, and Playwright execution; you supply the goal, the model, and a step cap.
Reading the benchmarks honestly
Computer-use agents are improving fast — but a benchmark score is a grade on a curated exam, not a promise about your real workload. Read every headline number with that gap in mind. Four benchmarks anchor the field in 2026:
- OSWorld — full-OS tasks across real applications. Humans score ~72%. In early 2025 agent SOTA was in the 28–38% range; by late 2025 systems like OSAgent pushed past 76%, and by mid-2026 Agent S3 reached ~66–73% (with best-of-N sampling). The field has closed — and in some configurations surpassed — the human baseline on this benchmark, but performance on curated evals does not equal reliability on arbitrary real-world tasks.
- WebArena — web navigation; single-agent SOTA has risen above 60% and continues to improve.
- WebVoyager — live web browsing; Browser-Use reached 89.1% (vs. OpenAI Operator's 87%).
- ScreenSpot / ScreenSpot-Pro — pure UI grounding ("click the right pixel").
Two cautions matter. First, high benchmark scores do not mean production-ready. Curated tasks lack the unexpected modals, cookie banners, A/B-tested layouts, rate limits, and timeouts of real workloads. Second, roughly half of the historic performance gap was scaffolding and prompting, not raw model capability — meaning the harness you build (retries, screenshot cadence, action validation, replanning) materially changes results. Progress on OSWorld is real and remarkable, but treat any eval score as a directional signal, never as a guarantee for your specific task on your specific environment.
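One way to turn a headline number into a reliability expectation is simple compounding arithmetic. The sketch below rests on a strong, stated assumption: every step succeeds independently with the same probability, which real tasks do not strictly satisfy. The 98% and 60% figures are illustrative inputs, not measurements.

```python
def task_success(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return per_step ** steps


def best_of_n(single_run: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1 - (1 - single_run) ** n


# A 98%-reliable step compounds quickly over long tasks.
for steps in (5, 20, 50):
    print(steps, round(task_success(0.98, steps), 3))

# Best-of-N lifts a 60% single-run score, if something can pick the winning run.
print(round(best_of_n(0.60, 3), 3))
```

Under these assumptions a 20-step task finishes cleanly only about two times in three, and a 50-step task a little over one time in three. The second function shows why scores reported with best-of-N sampling read higher than a single unattended run would: the formula is an upper bound that presumes a perfect way to select the successful attempt.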
Safety and what works today
Here is the uncomfortable truth: an agent that can click anything can also click the wrong thing — submit a payment, delete a record, leak a credential. Computer use concentrates the genuine risks of agency into one place, so treat the environment as hostile from the start.
The core safety pattern is sandbox, scope, confirm:
- Sandbox. Run in a dedicated VM or Docker container with minimal privileges — never on a machine with access to anything you can't afford to lose.
- Scope. Restrict network access to an allowlist of domains; don't expose password managers or real credentials.
- Confirm. Require human approval for consequential or irreversible actions (purchases, sends, deletions).
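The scope and confirm steps can live in a small gate that every proposed action passes through before the harness executes it. This is a sketch under assumptions: the action dictionaries, the `ALLOWED_DOMAINS` set, and the `IRREVERSIBLE` labels are hypothetical, and the sandbox itself (the VM or container) sits outside the code.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "docs.example.com"}   # hypothetical allowlist
IRREVERSIBLE = {"purchase", "send", "delete"}            # hypothetical action labels


def domain_allowed(url: str) -> bool:
    """Exact-host match, so 'example.com.evil.test' does not slip through."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_DOMAINS


def gate(action: dict, approve) -> str:
    """Return 'run', 'blocked', or 'denied' for a proposed action."""
    if action["kind"] == "navigate" and not domain_allowed(action["url"]):
        return "blocked"
    if action["kind"] in IRREVERSIBLE and not approve(action):
        return "denied"
    return "run"


def deny_all(action: dict) -> bool:
    """Stand-in for a human reviewer who declines every request."""
    return False


print(gate({"kind": "navigate", "url": "https://evil.test/login"}, deny_all))           # blocked
print(gate({"kind": "navigate", "url": "https://docs.example.com/pricing"}, deny_all))  # run
print(gate({"kind": "delete", "target": "record 17"}, deny_all))                        # denied
```

In a real harness `approve` would pause and ask a person. The useful property is that the decision is made by your code, not by the model, so on-screen text cannot argue its way past it.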
The sharpest threat is prompt injection: malicious text or images on the screen itself (a web page, an email) hijacking the agent's instructions. Anthropic added classifier-based injection detection, but mitigations are partial — isolation and human gates remain essential. CAPTCHA solving and fake-account creation are explicit misuse vectors to block.
What works reliably today: web form filling, data entry across legacy apps lacking APIs, automated UI testing, structured web data extraction, and multi-step browser research. What still fails often: building slideshows, calendar management, and multi-app workflows that hit unexpected UI states. Scope to the former; keep a human on the loop for the rest.
Watch out
Never point a computer-use agent at your real machine
It can read the screen and act on whatever is visible — including a logged-in bank tab or a password manager. Always isolate it in a sandboxed VM or container, allowlist its network, and require confirmation for irreversible actions. A single prompt-injection from an on-screen page can turn a research task into a data-exfiltration event.
Try it: Drive a real form with a sandboxed browser agent
Stand up the Anthropic computer-use Docker reference implementation (or install browser-use with a vision-capable model) inside a container or VM — never on your host machine. Give it one scoped task: "Open this public demo form, fill in the sample fields, and report back the confirmation text — but do NOT submit." Then observe the loop: log each screenshot and each action the model emits, and watch how it grounds a click to a pixel.
Afterwards, answer three questions in two sentences each: (1) Where did its visual grounding succeed or miss? (2) What single unexpected UI element (a cookie banner, a modal) would have derailed it, and how would you handle that in your harness? (3) Which one consequential action would you gate behind human confirmation before letting this run unattended? This trains the instinct that matters most here: scoping the task and the environment so capability never outruns safety.
Key takeaways
1. Computer-use agents operate software through its screen: they perceive via screenshots or accessibility trees and act via coordinate-based clicks and keystrokes, which lets them drive any app that lacks an API.
2. Perception has three flavors (screenshot, accessibility tree, DOM) with real trade-offs; production agents fuse vision for grounding with structured data for precision and token efficiency.
3. By 2026 all three frontier labs ship computer use: Anthropic (since Oct 2024), OpenAI (Operator, superseded by ChatGPT agent in July 2025), and Google (Gemini 2.5 Computer Use), alongside a Playwright-based open-source stack (Browser-Use, Stagehand, Skyvern).
4. OSWorld agent SOTA has risen dramatically, from ~28–38% in early 2025 to matching or exceeding the ~72% human baseline by mid-2026, but benchmark gains don't automatically translate to production reliability; roughly half the improvement comes from better scaffolding, and real tasks add unexpected modals, timeouts, and state the benchmark doesn't cover.
5. Computer use concentrates the risks of agency: always sandbox, scope to an allowlist, and require human confirmation for irreversible actions. Prompt injection from on-screen content is the defining threat.
Quiz
Lock in what you learned
Check your understanding
1. What fundamentally distinguishes computer use from classic browser automation like Selenium or Playwright scripts?
2. Which statement about screen-perception approaches is correct?
3. When did Anthropic first release computer use, and what is its current deployment shape?
4. An agent scores 89% on WebVoyager. What is the right conclusion?