OpenAI has spent the better part of the last year positioning itself not just as an AI lab but as the infrastructure layer for autonomous software agents. The pitch is seductive: an AI that doesn’t just answer questions but acts - browsing the web, filling out forms, managing files, coordinating tasks across apps. Operator-style agents, in OpenAI’s framing, are the next product frontier.

The problem is that the underlying models still fail at the exact things that make agentic behavior useful.

The Demo Is Not the Product

Every major AI lab has shown impressive agent demos. A model books a restaurant reservation. Another schedules a calendar event by reading an email thread. These work in controlled conditions, on predictable interfaces, with forgiving failure states. Real-world agentic tasks are none of those things.

Hand an AI agent a multi-step workflow and what you actually get is a compounding-error problem. Language models make probabilistic decisions at each step. A small misread of a UI element, a subtly wrong assumption about which account to use, an ambiguous instruction that the model resolves in the wrong direction - these don’t cancel each other out. They stack. By step eight of a twelve-step task, the model can be confidently operating on a completely false premise.
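To put rough numbers on that stacking, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification (real agent errors are correlated, and one bad step can poison the rest). The 95% per-step figure and the function name are illustrative assumptions, not measured rates for any real system.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming steps fail independently."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent steps multiply: p * p * ... * p, once per step.
    return per_step_success ** steps


# Illustrative only: 0.95 is an assumed per-step success rate.
for steps in (1, 4, 8, 12):
    p = task_success_probability(0.95, steps)
    print(f"{steps:>2} steps: {p:.0%} chance of a clean run")
```

Under these assumptions, a model that gets each step right 95% of the time finishes a twelve-step task cleanly only about 54% of the time.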

This isn’t a new critique. Andrej Karpathy and others have noted publicly that reliability at scale remains the core unsolved problem for agents. OpenAI’s own documentation quietly acknowledges that its computer-use implementations require human oversight for anything consequential.

The Business Model Pressures the Roadmap

OpenAI is no longer purely a research organization - it’s a company with investor expectations, a pending for-profit conversion, and competitors on every side. That commercial pressure visibly shapes the gap between how agent capabilities get announced and how they actually ship. Features get named, marketed, and positioned as products before their reliability is anywhere near what enterprise use requires.

This matters because enterprises are the buyers OpenAI most wants. And enterprises, unlike consumers, don’t tolerate a 15% error rate on automated workflows. A chatbot giving a slightly wrong answer is annoying. An agent mistakenly canceling a vendor contract or sending a draft email prematurely is a liability event.

What Would Actually Help

The honest version of the agent product right now is narrower: well-scoped, single-domain tasks with explicit checkpoints and easy human override. That’s genuinely useful. It’s just not as exciting as “AI that runs your life.”
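As a rough illustration of what "explicit checkpoints and easy human override" could look like in practice, here is a minimal sketch. Every name in it (Step, consequential, run_with_checkpoints, approve) is hypothetical and invented for this example; it is not drawn from any OpenAI API or product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], str]
    consequential: bool = False  # hypothetical flag: needs human sign-off first


def run_with_checkpoints(
    steps: List[Step], approve: Callable[[Step], bool]
) -> List[str]:
    """Run steps in order; halt at the first consequential step a human declines."""
    log: List[str] = []
    for step in steps:
        # Checkpoint: consequential steps run only with explicit approval.
        if step.consequential and not approve(step):
            log.append(f"halted before: {step.description}")
            break
        log.append(step.action())
    return log


steps = [
    Step("draft reply", lambda: "drafted reply"),
    Step("send reply", lambda: "sent reply", consequential=True),
]

# A reviewer who declines everything: the draft is written but never sent.
print(run_with_checkpoints(steps, approve=lambda step: False))
# prints ['drafted reply', 'halted before: send reply']
```

The design choice here is that the agent does the low-stakes work unattended and stops, rather than guesses, at the one step that would be a liability event if it went wrong.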

The gap between the marketing and the capability isn’t necessarily dishonest - it may reflect genuine internal optimism about how fast reliability improves. But the timeline keeps slipping, and at some point the gap becomes a credibility problem OpenAI can’t demo its way out of.