TelcoNews Australia - Telecommunications news for ICT decision-makers

Datadog says AI failures stem mainly from capacity limits

Wed, 22nd Apr 2026

Datadog's latest AI engineering report found that capacity limits account for nearly 60 per cent of AI failures in production. The study is based on anonymised usage data from thousands of customers running large language models in production environments.

Around 5 per cent of AI model requests fail in production, with infrastructure constraints rather than model performance responsible for most of those failures. The report suggests operational complexity is becoming a bigger obstacle as businesses move AI systems from trials into live use.

Datadog's figures show that 69 per cent of companies now use three or more models. OpenAI remains the most widely used provider with a 63 per cent share, while adoption of Google Gemini and Anthropic Claude rose by 20 and 23 percentage points respectively.

At the same time, agent framework adoption doubled year on year, adding more components to production systems. The amount of data sent to models in each request is also rising, with average token counts more than doubling for the median team and quadrupling for the heaviest users.

Operational strain

The report argues that these trends are making AI systems harder to run reliably. Failures are increasingly linked to fragmented workflows, repeated retries and inefficient routing across different models and tools.

Yanbing Li, Chief Product Officer at Datadog, drew a parallel with the early growth of cloud computing.

"AI is starting to look a lot like the early days of cloud," said Li. "The cloud made systems programmable but much more complex to manage. AI is now doing the same thing to the application layer. The companies that win won't just build better models - they'll build operational control around them. In this new era, AI observability becomes as essential as cloud observability was a decade ago."

The findings suggest organisations are no longer dealing only with model selection or output quality. They must also manage uptime, routing, resource use and the interaction between models, data pipelines and agent frameworks.

This is especially visible in environments where systems must respond consistently under load. A failure rate of 1 in 20 requests can cause visible disruption in customer-facing applications, while repeated infrastructure bottlenecks can increase costs and slow development teams.
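To make that scale concrete, here is a back-of-the-envelope sketch (not from the report) of what a 5 per cent failure rate implies for a customer-facing workload, and how retries trade user-visible failures against extra load and cost. It assumes failures are independent, which is a simplification; the traffic figure is illustrative.

```python
# Back-of-the-envelope arithmetic for a 5% per-request failure rate.
# Assumes failures are independent (a simplifying assumption).

failure_rate = 0.05  # roughly 1 in 20 requests, per the report

# Illustrative daily traffic for a customer-facing application.
requests_per_day = 100_000
expected_failures = requests_per_day * failure_rate
print(f"Expected failed requests/day: {expected_failures:.0f}")  # 5000

# Retries hide failures from users but multiply load and spend:
# with n retries, a request is only user-visible as failed if all
# n + 1 attempts fail, while average attempts per request grow.
for retries in range(3):
    p_fail = failure_rate ** (retries + 1)
    avg_attempts = sum(failure_rate ** k for k in range(retries + 1))
    print(f"{retries} retries: user-visible failure rate {p_fail:.4%}, "
          f"avg attempts per request {avg_attempts:.4f}")
```

The trade-off in the last loop is the cost point the report raises: each retry cuts the visible failure rate, but every retried attempt is another billed model call.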

Regional view

Datadog also pointed to similar pressures in Australia and New Zealand, where companies are moving towards multi-model and agent-based deployments.

"In A/NZ, the focus has firmly shifted to running AI reliably in production and multi-model architectures and agentic workflows are becoming standard, but that maturity is exposing significant gaps. A failure rate sitting at around five per cent, largely driven by capacity constraints, is a material concern in industries where uptime and trust are non-negotiable. AI systems are increasingly resembling distributed systems, yet many teams are still not managing them with the operational discipline that demands," said Yadi Narayana, Chief Technology Officer for APJ, Datadog.

Narayana said the operational challenge is being matched by a growing cost issue. As token usage rises, companies face higher spending on model calls and supporting infrastructure, particularly when requests are poorly optimised or duplicated by retries.

"There is also a cost problem hiding in plain sight. Token consumption is climbing fast, while optimisation techniques, like prompt caching and smarter context design, remain largely untapped. The next phase is about closing the gap between how sophisticated these systems have become and how rigorously they're being operated. Organisations will prioritise foundational capabilities like observability, governance, and cost control, over accelerating deployment speed, building AI systems that are reliable, scalable, and accountable," said Narayana.

Broader shift

The report adds to a wider debate in the technology sector about what limits AI adoption once early experimentation gives way to scaled deployment. Many businesses have focused on access to the latest models, but Datadog's data suggests the main bottlenecks increasingly sit in surrounding infrastructure and operational processes.

That view was echoed by Guillermo Rauch, Chief Executive Officer at Vercel, which develops web application tools including Next.js.

"The next wave of agent failures won't be about what agents can't do but what teams can't observe," said Rauch. "We built agentic infrastructure at Vercel because agents need the same production feedback loops as great software. Unlike traditional software, agents have control flow driven by the LLM itself, making observability not just useful, but essential."

The analysis covered customers across industries and geographies using anonymised production usage data. Its central argument is that AI deployment is shifting from a model problem to a systems problem, with reliability, visibility and cost management becoming defining issues as adoption broadens.