# The Race to Build AI That Can Actually Use a Computer

Anthropic's computer use, OpenAI's CUA, Google's Project Mariner. Three different approaches to the same wild idea: an AI that can see your screen, move your mouse, and click buttons. Here's who's ahead and what actually works.

There's a demo that every AI company has shown in the past year. The AI looks at a computer screen. It reads what's on it. It moves the mouse. It clicks a button. It types something into a text field. It navigates a website, fills out a form, books a flight, orders food, files a bug report.
Every time this demo runs, the audience gets excited. An AI that can use a computer the way you do. No APIs required. No integrations to build. Just point it at any software and let it go.
The demo always works. The product almost never does.
But that hasn't stopped the three biggest AI companies from pouring resources into it. Anthropic shipped computer use in October 2024. OpenAI launched its Computer-Using Agent (CUA) with Operator in January 2025. Google's Project Mariner has been in limited testing since late 2024. Each takes a fundamentally different approach. Each has significant limitations. And each reveals something important about how AI agents will actually interact with the world.
## Anthropic's Approach: Raw Access
Anthropic was first to market with a public API for computer use, released in October 2024 as part of the Claude 3.5 Sonnet update.
Their approach is the most raw and the most powerful. Anthropic gives developers direct access to three primitives: screenshots (the model sees what's on screen), mouse control (the model can move the cursor and click), and keyboard input (the model can type). That's it. No abstraction layer. No "actions" framework. Just see, click, type.
The model works by taking a screenshot, analyzing what's visible, deciding what action to take, executing that action, taking another screenshot, and repeating. It's essentially a control loop: observe, decide, act, observe, decide, act.
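To make that loop concrete, here is a minimal sketch in Python. The `ask_model_for_action` function is a hypothetical placeholder for the model call, and `pyautogui` stands in for whatever screenshot and input layer a real harness would use. This is the shape of the loop, not Anthropic's reference implementation.

```python
from dataclasses import dataclass

import pyautogui  # screenshots plus mouse/keyboard control


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


def ask_model_for_action(goal: str, screenshot, history: list) -> Action:
    """Hypothetical stand-in for the model call: send the goal, the
    current screenshot, and prior steps, get back one action."""
    raise NotImplementedError("wire this to your model provider of choice")


def run_agent(goal: str, max_steps: int = 50) -> bool:
    history: list = []
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()                        # observe
        action = ask_model_for_action(goal, screenshot, history)   # decide
        if action.kind == "done":
            return True
        if action.kind == "click":                                 # act
            pyautogui.click(action.x, action.y)
        elif action.kind == "type":
            pyautogui.write(action.text)
        history.append(action)
    return False  # step budget exhausted without finishing the task
```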
This simplicity is a feature. Because Anthropic doesn't try to understand the structure of the application being controlled, their approach works on any software. Desktop apps, web apps, terminal windows, even games. If a human can see it and click on it, Claude can try.
With the release of Claude 3.5 Sonnet's updated version for computer use, and then improvements through Opus 4 and Opus 4.6, the capability has gotten meaningfully better. Benchmarks on OSWorld, a standard evaluation for computer use agents, showed steady improvement. Anthropic's models consistently lead on agentic tasks including computer use.
The developer-facing API means Anthropic isn't trying to be the consumer product. They're providing the capability and letting others build the product around it. This is the same infrastructure-first approach they took with MCP, and it's working. Companies like Replit, Vercel, and smaller startups have built computer use features on top of Claude's API.
The tradeoff: raw access means raw risk. A model with mouse and keyboard control can do anything. Click "delete all." Send an email to the wrong person. Accept a terms of service that commits you to something you didn't read. Anthropic publishes safety guidelines and requires developers to implement their own guardrails, but there's no built-in safety net.
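In practice, that means developers bolt on their own checks. Here is a sketch of one common pattern: gate risky actions behind explicit human confirmation. The `intent` string is an assumption about how a harness might describe what the agent is about to do; it is not part of Anthropic's API.

```python
# Gate risky actions behind a human. The keyword list and the idea of
# an "intent" string are illustrative assumptions, not a vendor API.
RISKY_KEYWORDS = {"delete", "send", "purchase", "accept", "pay"}


def needs_confirmation(intent: str) -> bool:
    """Flag any action whose stated intent mentions a risky keyword."""
    lowered = intent.lower()
    return any(keyword in lowered for keyword in RISKY_KEYWORDS)


def guarded_execute(intent: str, execute) -> bool:
    """Run the action only if it is safe or a human explicitly approves it."""
    if needs_confirmation(intent):
        answer = input(f"Agent wants to: {intent!r}. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return False  # refused; the agent has to plan around it
    execute()
    return True
```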
## OpenAI's Approach: The Sandboxed Browser
OpenAI took the opposite approach. When they launched Operator in January 2025, it was a consumer product, not a developer API. And it ran in a sandboxed browser, not on the user's actual computer.
Operator used a Computer-Using Agent (CUA) model built on GPT-4o with reinforcement learning for browser interaction. The idea was straightforward: you tell Operator what you want (book a dinner reservation, order groceries, fill out a form), and it does it in a controlled browser environment.
The sandboxing was a deliberate safety choice. By running in a dedicated browser window, Operator couldn't access the user's real files, applications, or system settings. It could only interact with websites. This dramatically reduced the risk surface compared to Anthropic's full-computer approach.
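OpenAI hasn't published Operator's internals, but the general pattern is easy to sketch with an off-the-shelf browser automation library like Playwright: launch a throwaway browser context with none of the user's cookies, logins, or files, and let the agent act only inside it. This illustrates the sandboxing idea, not OpenAI's implementation.

```python
# Sandboxing sketch using Playwright (not OpenAI's actual stack): a
# fresh, disposable browser context isolated from the user's profile.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()   # clean profile: no cookies, no logins
    page = context.new_page()
    page.goto("https://example.com")
    png = page.screenshot()           # what the agent "sees" at each step
    # ... hand `png` to the model, execute the action it proposes, repeat ...
    context.close()
    browser.close()
```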
But the sandboxing also limited usefulness. Most computer tasks involve more than a web browser. People need to edit spreadsheets, manage files, use desktop applications, switch between tools. A browser-only agent can't do any of that.
Operator also struggled with the fundamental brittleness of web automation. Websites change their layouts constantly. CAPTCHAs block automated browsing. Payment flows require multi-factor authentication that an AI can't handle. Users reported Operator getting stuck on simple tasks that a human could complete in seconds. A restaurant reservation that takes you 30 seconds took Operator three minutes and still failed half the time.
By July 2025, OpenAI had folded Operator into ChatGPT as "agent mode" and shut down the standalone product. The pivot was telling. OpenAI realized that a separate agent app wasn't what users wanted. They wanted their existing chat interface to occasionally take actions on their behalf. Lower ambition, higher utility.
Today, OpenAI's computer use capability lives inside ChatGPT rather than as a standalone product. The CUA technology still exists, but it's one tool among many in ChatGPT's expanding toolkit, alongside browsing, code execution, image generation, and file analysis.
## Google's Approach: The Browser Extension
Google's Project Mariner took a third path: a Chrome extension that can browse the web and take actions on your behalf, within your existing browser session.
Announced in late 2024 and powered by Gemini, Mariner works by observing the active Chrome tab and suggesting or executing actions based on user instructions. Unlike Operator's sandboxed approach, Mariner runs in your real browser, with your real cookies, your real login sessions, and your real data.
This gives Mariner a natural advantage for authenticated tasks. If you're already logged into Amazon, Mariner can browse Amazon as you. If you're already logged into your company's internal tools, Mariner can interact with them. No separate login flow required.
The downside is that Mariner has been in limited testing for over a year without a broad public release. Google has been cautious, which is unusual for a company that often ships first and fixes later. The caution suggests they're hitting reliability issues that they don't want to expose at scale.
Google also has a second computer use effort: Project Astra, which is broader than just browsers and encompasses multimodal AI that can see and interact with the physical world through phone cameras and smart glasses. Astra is further from consumer readiness than Mariner, but it represents Google's longer-term vision: AI that doesn't just use your computer but sees and understands everything around you.
The integration advantage Google holds shouldn't be underestimated. They control Chrome (65% browser market share), Android (72% mobile market share), and one of the world's most widely used productivity suites (Google Workspace). If computer use agents become mainstream, Google has more surfaces to embed them than anyone else.
## What Actually Works Today
Let me be blunt about the current state of things: computer use agents are impressive demos and mediocre products.
The core problem is reliability. A human using a computer has dozens of microskills that we take for granted. We know that a loading spinner means "wait." We know that a cookie consent banner needs to be dismissed. We know that if a button doesn't respond, we should try clicking it again, or scrolling to make sure it's fully visible, or checking if a popup is blocking it. We handle unexpected states gracefully because we've been using computers for years and have built intuition about how they behave.
AI models don't have that intuition. They handle the happy path well: if the website looks exactly like expected and every element is in the right place, the model can handle it. But the web is messy. Layouts shift. Elements overlap. Pop-ups appear. Pages take different amounts of time to load. Each unexpected state is a potential failure point.
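Agent builders end up hard-coding some of that intuition. Below is a hedged sketch of the kind of recovery wrapper that grows around every click. The helper callables (`click`, `page_changed`, `scroll_to`, `dismiss_banners`) are hypothetical hooks a harness might provide, not any vendor's API; the point is that every branch is something a human does without thinking.

```python
import time


def click_with_recovery(click, page_changed, scroll_to, dismiss_banners,
                        x: int, y: int, attempts: int = 3) -> bool:
    """Try a click, then apply the obvious human recoveries before giving up."""
    for attempt in range(attempts):
        click(x, y)
        time.sleep(1.0)               # crude stand-in for "wait out the spinner"
        if page_changed():
            return True
        if attempt == 0:
            scroll_to(x, y)           # maybe the target wasn't fully visible
        elif attempt == 1:
            dismiss_banners()         # maybe a cookie banner is covering it
    return False                      # out of ideas; escalate back to the model
```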
Benchmark numbers tell the story. On OSWorld, which tests realistic computer use tasks across different operating systems and applications, the best models complete somewhere around 30-40% of tasks successfully. That means a 60-70% failure rate. For comparison, human testers complete over 70% of the same tasks, and essentially all of the ones they'd attempt in their daily work.
A 60% failure rate is fine for a research demo. It's not fine for a product that people depend on. If your AI assistant fails to book your flight two out of three times, you'll stop using it after the first week.
## The Technical Bottlenecks
Three technical problems are holding computer use agents back.
The first is screen understanding. Models process screenshots as images, but computer interfaces are dense with information. A typical web page might have dozens of clickable elements, text fields, dropdowns, and navigation items. The model needs to identify the right element, determine its exact coordinates, and generate the correct action. Current vision models are good at this but not great. Small UI elements, low-contrast text, and overlapping elements cause errors.
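The grounding step is where many of these errors surface. Here is a sketch of what it might look like, assuming the model returns a bounding box in normalized coordinates (an assumption; output formats vary by provider): convert to pixels, click the center, and refuse to act on boxes too small to trust.

```python
# Grounding sketch: normalized bounding box -> pixel click point.
# The box format is an illustrative assumption, not a provider spec.

def to_click_point(box: tuple[float, float, float, float],
                   screen_w: int, screen_h: int,
                   min_px: int = 8) -> tuple[int, int] | None:
    left, top, right, bottom = box
    x0, y0 = left * screen_w, top * screen_h
    x1, y1 = right * screen_w, bottom * screen_h
    if (x1 - x0) < min_px or (y1 - y0) < min_px:
        return None                      # target too small; ask for a better fix
    return int((x0 + x1) / 2), int((y0 + y1) / 2)


# Example: a 1920x1080 screen, a box around a small "Save" button.
print(to_click_point((0.42, 0.90, 0.46, 0.94), 1920, 1080))  # -> (844, 993)
```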
The second is state management. Computer tasks involve multiple steps across multiple pages. The model needs to remember what it's already done, what the current goal is, and what to do next. If it loses track (a common failure mode), it starts repeating actions or gets stuck in loops. This is fundamentally a context window and planning problem, and it's improving as models get better at long-horizon reasoning.
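One cheap mitigation is to detect the loops directly. A sketch, assuming the harness can hash the current screen and describe each proposed action: if the same action keeps coming back without the screen changing, stop and replan rather than burn the step budget.

```python
from collections import deque


class LoopDetector:
    """Flag an agent that repeats the same action on an unchanged screen.
    The action/screen-hash pair is an assumed harness convention."""

    def __init__(self, window: int = 6, max_repeats: int = 3):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, action, screen_hash: str) -> bool:
        """Return True if the agent appears to be stuck in a loop."""
        self.recent.append((action, screen_hash))
        repeats = sum(1 for item in self.recent if item == (action, screen_hash))
        return repeats >= self.max_repeats
```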
The third is speed. A model that takes 3-5 seconds per action is painfully slow compared to a human who clicks and types at several actions per second. Each action requires taking a screenshot, sending it to the model, receiving a response, executing the action, and taking another screenshot. The latency adds up. A task that takes a human 30 seconds takes the model several minutes. For simple tasks, this makes the AI slower than doing it yourself.
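The arithmetic is unforgiving. A back-of-the-envelope budget, with illustrative numbers that are assumptions rather than measurements of any particular model:

```python
# Rough per-step latency budget (assumed figures, not benchmarks).
screenshot_s = 0.3      # capture and encode the screen
model_s = 3.0           # round trip to the model
execute_s = 0.5         # perform the click or keystrokes

steps = 40              # a modest multi-page task
total_s = steps * (screenshot_s + model_s + execute_s)
print(f"{total_s / 60:.1f} minutes")   # -> 2.5 minutes for a 30-second human task
```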
## Where This Is Actually Heading
Despite the current limitations, I think computer use agents will become mainstream. Not this year. Probably not next year. But within three to five years.
The path to mainstream looks like this: models get better at screen understanding (Gemini and Claude are both improving rapidly on vision tasks). Action latency drops as inference gets faster and models get more efficient. And most importantly, the use cases narrow to domains where 80% reliability is good enough.
What does 80% reliability look like? Internal tools where the user can monitor and correct. Testing and QA workflows where the goal is to find bugs, not to be perfect. Data entry tasks where a human reviews the output. Accessibility tools that help users with disabilities control software with voice commands instead of mouse clicks.
The vision of an autonomous agent that books your flights, manages your email, and runs your entire computer while you sleep? That's 95%+ reliability territory, and we're nowhere close. But there's a massive market between "research demo" and "fully autonomous," and that's where the practical money is.
The company that figures out the right scope, ambitious enough to be useful, narrow enough to be reliable, will win this race. Right now, Anthropic's infrastructure-first approach gives them the best position for developers. Google's Chrome integration gives them the best position for consumers. OpenAI's retreat from Operator suggests they're regrouping.
All three will keep pushing. The question isn't whether AI will eventually learn to use a computer well. It's which company gets to "good enough" first, and what "good enough" actually means for users who are judging the technology against their own hands on the keyboard.