OpenAI’s GPT-5 Hits a Wall: Why AI Agents Are Proving Nearly Impossible to Properly Evaluate

Submitted by Anonymous (not verified) on Mon, 02/23/2026 - 18:15

OpenAI’s latest and most ambitious model, GPT-5, has encountered an unexpected problem — not in its raw intelligence, but in how the company and its customers can actually measure whether it works. As the San Francisco–based AI giant pushes deeper into autonomous AI agents capable of performing multi-step tasks on behalf of users, the traditional benchmarks that once served as reliable yardsticks for model quality are proving woefully inadequate. The result is a growing crisis of confidence among enterprise buyers and a scramble inside OpenAI to figure out how to demonstrate the value of its most advanced technology.
According to a report from The Information, GPT-5 — which OpenAI has been developing under intense internal pressure — showed disappointing results on certain established benchmarks when compared to its predecessor models. But the real story is more nuanced than a simple regression. The issue is that the benchmarks themselves were designed for an earlier era of AI, one in which models answered questions and generated text rather than autonomously browsing the web, writing and executing code, or managing complex workflows across multiple applications.
The Benchmark Problem: Measuring Today’s AI With Yesterday’s Ruler
For years, the AI industry relied on standardized tests such as MMLU, HumanEval, and GSM8K to rank models against one another. These benchmarks measured factual knowledge, coding ability, and mathematical reasoning in controlled, single-turn interactions. A user poses a question; the model answers. The answer is scored. Simple. But the new generation of AI agents doesn’t operate this way. An agent might be asked to research a competitor’s pricing, compile a spreadsheet, draft an email summarizing the findings, and send it to a distribution list, all without human intervention between steps. None of these established single-turn benchmarks captures this kind of extended, real-world task execution.
The challenge, as The Information reported, is that when GPT-5’s agentic capabilities were tested internally, the results were difficult to interpret. In some cases, the model performed tasks more effectively than GPT-4o or earlier models, but the improvements didn’t show up clearly on traditional metrics. In other cases, the model’s tendency to take autonomous actions introduced new categories of errors — not wrong answers per se, but wrong decisions made during multi-step processes that compounded over time.
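A rough way to see why errors in multi-step processes compound is to multiply per-step reliability across the length of a task. The sketch below assumes, purely for illustration, that every step succeeds independently with the same probability. Real agent steps are neither independent nor equally hard, and the 95% figure is hypothetical rather than drawn from the reporting.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance of finishing every step, assuming steps succeed independently."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Hypothetical per-step reliability, chosen only to illustrate the effect.
    for steps in (1, 5, 10, 20):
        p = task_success_probability(0.95, steps)
        print(f"{steps:>2} steps at 95% per-step reliability: {p:.1%} end-to-end")
```

Under these simplified assumptions, a model that gets each step right 95% of the time completes a ten-step task only about 60% of the time. A single-turn benchmark would score that model at 95% and miss the gap entirely.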
Inside OpenAI’s Internal Debate Over What ‘Better’ Means
Sources familiar with the matter told The Information that OpenAI’s research and product teams have been engaged in vigorous internal discussions about how to present GPT-5’s capabilities to the market. The concern is straightforward: if the company releases a model that doesn’t clearly outperform its predecessor on the benchmarks that customers and analysts have come to rely on, it risks a narrative of stagnation — even if the model is genuinely more capable in the ways that matter most for the agentic future OpenAI is betting on.
This tension reflects a broader industry-wide reckoning. Google DeepMind, Anthropic, and Meta have all invested heavily in agent-style AI systems, and all face similar measurement challenges. Anthropic’s Claude, for instance, has been positioned as particularly strong in agentic use cases through its “computer use” capability, which allows the model to interact directly with desktop applications. But Anthropic has also struggled to produce benchmarks that convincingly demonstrate the superiority of this approach over simpler, non-agentic interactions. As Reuters has reported in its ongoing coverage of the AI arms race, enterprise customers increasingly want proof of ROI before committing to expensive API contracts — and that proof requires measurement tools that don’t yet exist in mature form.
Enterprise Customers Are Growing Impatient
The evaluation gap has real commercial consequences. Large enterprise customers — banks, consulting firms, pharmaceutical companies — have been piloting AI agents for tasks ranging from document review to customer service automation. But without reliable ways to measure agent performance, many of these pilots have stalled in the proof-of-concept phase. Procurement teams want quantifiable improvements: How many hours does the agent save? How often does it make errors? How do those errors compare to human mistakes? These questions are surprisingly hard to answer when the agent is performing open-ended, multi-step tasks in unpredictable environments.
OpenAI has attempted to address this with internal evaluation frameworks that simulate real-world agent tasks, but these frameworks are proprietary and not yet standardized across the industry. According to The Information, some OpenAI employees have pushed for the company to release these evaluation tools publicly, arguing that an industry-wide standard would benefit everyone — including OpenAI, whose models they believe would perform well under more realistic testing conditions. Others inside the company have resisted, viewing the evaluation methodology as a competitive advantage.
The ‘Dip’ That Spooked the Team
Perhaps the most striking detail in The Information’s reporting is the acknowledgment of a performance “dip” — a period during GPT-5’s development when the model appeared to regress on certain capabilities. While performance dips during training are not unusual in AI development (models often get worse before they get better as training data and techniques are adjusted), the GPT-5 dip was reportedly significant enough to cause concern among senior leadership. The worry was not just about the technical setback but about what it signaled for OpenAI’s ability to maintain the rapid pace of improvement that has defined the company’s trajectory since the release of ChatGPT in late 2022.
The dip also raised questions about the scaling laws that have governed AI progress for the past several years. The prevailing assumption — that making models bigger and training them on more data will reliably produce better results — has come under increasing scrutiny. Researchers at multiple organizations have noted diminishing returns from simply scaling up, and GPT-5’s uneven performance has added fuel to this debate. If raw scale no longer guarantees improvement, the industry will need new approaches to advance AI capabilities — and new ways to measure that advancement.
Competitors Smell Opportunity
OpenAI’s evaluation struggles have not gone unnoticed by rivals. Anthropic has been aggressively marketing Claude’s agentic abilities to enterprise customers, emphasizing reliability and safety over raw benchmark performance. Google has integrated its Gemini models into Workspace products, positioning agent capabilities as productivity features rather than standalone AI achievements — a strategy that sidesteps the benchmark question by embedding AI into tools people already use and can evaluate through their own experience.
Meanwhile, a growing number of startups are attempting to build the evaluation infrastructure that the major labs have so far failed to standardize. Companies like Braintrust, Patronus AI, and Galileo are developing tools specifically designed to assess agent performance in realistic settings, measuring not just accuracy but also cost-efficiency, latency, and error recovery. These startups are betting that as AI agents become more prevalent, the demand for trustworthy evaluation will become a significant market in its own right.
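To make those four dimensions concrete, here is a minimal sketch of how logged agent runs might be rolled up into a scorecard. The `AgentRun` schema and the `summarize` function are hypothetical names invented for this illustration; they do not represent the API of Braintrust, Patronus AI, Galileo, or any other vendor.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AgentRun:
    """One recorded attempt at a task (hypothetical schema)."""
    succeeded: bool        # did the final outcome meet the task's goal?
    cost_usd: float        # total spend on model calls for this run
    latency_s: float       # wall-clock time from start to finish
    errors_hit: int        # errors encountered along the way
    errors_recovered: int  # of those, how many the agent fixed on its own


def summarize(runs: list[AgentRun]) -> dict[str, float]:
    """Aggregate runs into accuracy, cost, latency, and recovery metrics."""
    if not runs:
        raise ValueError("need at least one run")
    successes = sum(r.succeeded for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    errors = sum(r.errors_hit for r in runs)
    recovered = sum(r.errors_recovered for r in runs)
    return {
        "success_rate": successes / len(runs),
        # Cost is charged against successes, so failed runs still count.
        "cost_per_success_usd": total_cost / successes if successes else float("inf"),
        "mean_latency_s": mean(r.latency_s for r in runs),
        # With no errors observed there is nothing to recover from.
        "error_recovery_rate": recovered / errors if errors else 1.0,
    }
```

Dividing total cost by successful runs, rather than by all runs, is one possible design choice: it makes an agent that fails cheaply look no better than one that succeeds at a higher price.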
What Happens When You Can’t Prove Your Product Works
The stakes for OpenAI are enormous. The company, which reportedly reached an annualized revenue run rate exceeding $5 billion earlier this year, is in the process of raising additional capital at a valuation that could exceed $300 billion. That valuation is predicated in large part on the assumption that OpenAI will continue to release progressively more capable models that unlock new commercial applications — particularly in the enterprise agent space. If GPT-5 launches and the market perceives it as a lateral move rather than a leap forward, it could undermine investor confidence at a critical moment.
OpenAI CEO Sam Altman has publicly acknowledged that the era of simply releasing bigger models and watching benchmark scores climb is coming to an end. In recent public appearances, Altman has emphasized the importance of “usefulness” over raw performance metrics, signaling that the company is preparing the market for a shift in how AI progress is communicated. But the question remains whether customers and investors will accept a more subjective definition of improvement — especially when they’re being asked to pay premium prices for access to the latest models.
The Industry Faces a Measurement Reckoning
The GPT-5 evaluation challenge is, in many ways, a microcosm of a much larger problem facing the AI industry. As models transition from answering questions to performing actions, the entire framework for assessing AI capability needs to be rebuilt from the ground up. This is not merely a technical challenge — it is a commercial, regulatory, and philosophical one. Regulators in the European Union and the United States are developing AI governance frameworks that will require companies to demonstrate the safety and reliability of autonomous AI systems. Without standardized evaluation methods, compliance will be difficult and enforcement will be inconsistent.
For now, OpenAI appears to be pressing forward with GPT-5’s development and refining its internal evaluation approach. The company has not announced a public release date, and the timeline may depend in part on whether the team can resolve the measurement problem to its satisfaction. What is clear is that the AI industry has entered a new phase — one in which building a powerful model is only half the battle. The other half is proving it.