VentureBeat

Researchers trained an open source AI search agent, Harness-1, that outperforms GPT-5.4 on recalling relevant information

carl.franzen@venturebeat.com (Carl Franzen) — Mon, 08 Jun 2026 22:19:00 GMT

A joint research collaboration between researchers at the University of Illinois at Urbana-Champaign (UIUC), UC Berkeley, and the open source AI-native vector database platform Chroma unveiled Harness-1, a 20-billion parameter open-source search agent built atop OpenAI's gpt-oss-20B open source model that fundamentally redesigns how AI executes complex retrieval tasks.

Harness-1 achieves a massive leap in performance, scoring an average of 73% on its ability to correctly recall relevant information from a curated dataset, outperforming even GPT-5.4 (70.9%) and beating the next most accurate open source search agent, Tongyi DeepResearch 30B, by 11.4 percentage points. (While GPT-5.5 has also been out for more than a month, the researchers didn't test against this model as it wasn't available when they were building theirs.)

Crucially for developers, the model and its environment are available immediately under the highly permissive Apache 2.0 license, with the model code and weights hosted on Hugging Face.

Harness-1 also serves as proof-of-efficacy of another effort, Tinker, the distributed, web-based AI model training and fine-tuning API developed by Thinking Machines. Tinker was used specifically to train and run inference for Harness-1, highlighting how interactive infrastructure is actively enabling the next generation of autonomous models.

So how did the researchers do it?

Benchmarks Decoded (and Why Harness-1 Could Help Enterprises Tremendously)

To actually put these models to the test, the researchers evaluated Harness-1 and its competitors across eight highly complex search benchmarks. Rather than asking simple trivia questions, these tests required the AI to act like a real researcher sifting through diverse, dense data sources.

The benchmarks spanned several different domains, including open web searches, complex financial filings from the SEC, technical patent databases from the USPTO, and "multi-hop" question-answering tasks where the AI had to logically piece together scattered clues from multiple different documents to arrive at the correct answer.

When the results came in, Harness-1 dominated the open-source competition in its ability to successfully find and curate the right facts. Even more impressively, this relatively small 20-billion parameter model went toe-to-toe with massive, expensive proprietary AI systems. It actually outperformed heavyweights like GPT-5.4, Sonnet-4.6, and Kimi-K2.5, models thought to have hundreds of billions or trillions of parameters. Only one giant frontier model, Opus-4.6, managed to narrowly edge it out in overall average performance.

Harness-1 achieves its performance gains by offloading the exhaustive "bookkeeping" of a search session out of the model's working memory and into a structured software environment.

As enterprise use cases grow more sophisticated, demanding that models autonomously sift through thousands of corporate documents or financial filings, these systems frequently succumb to "search amnesia"—forgetting their original queries, looping over rejected documents, or losing track of the specific claims they are trying to verify.

Until now, the prevailing solution to this amnesia has been brute force. Engineers typically force models to constantly reread an ever-expanding, append-only transcript of their own actions, piling every search, read, and thought back into a massive context window.

Harness-1 introduces a paradigm shift away from this method, proving that the bottleneck for true artificial autonomy isn't necessarily the size of the model, but how efficiently its working environment manages state. It highlights once more, as Anthropic's Claude Code has also done, that the raw model is arguably less important than the harness — or set of conditions — through which it runs.

Technology: Doing the Paperwork in the Environment

To understand the technical leap of Harness-1, consider a real-world analogy.

Imagine hiring a brilliant research assistant and placing them in an empty room without a desk, notepads, or filing cabinets. You ask them to write a comprehensive report on a highly complex topic, which requires them to read dozens of books while keeping every single quote, citation, and dead-end search perfectly memorized in their own head. Eventually, no matter how intelligent the assistant is, their cognitive load will max out, and they will start dropping facts or losing the thread of the assignment.

This is exactly how traditional search agents operate today. They are trained as policies over growing transcripts, meaning the model searches, reads, searches again, and appends everything into its own context window.

As lead researcher Patrick (Pengcheng) Jiang of the University of Illinois noted on X: "At some point the model is not just 'searching' anymore. It is also being asked to be a memory system, a note taker, a verifier, and a librarian."

Harness-1 solves this by giving the AI a desk and a filing cabinet—what the research team calls a "state-externalizing harness."

This harness is an active, surrounding environment that takes over the routine bookkeeping, maintaining a recoverable working memory that includes a candidate pool of documents, an importance-tagged curated evidence set, compact evidence links, and verification records.

By separating semantic choices from structural state management, the AI is freed up to do what it does best.

The policy still decides what to search, determines which documents to keep, and knows when to stop, while the environment simply holds the state.
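To make the division of labor concrete, here is a minimal sketch of what environment-held state could look like. The class and method names (HarnessState, curate, verify, reject) and the importance tag values are illustrative assumptions, not the actual Harness-1 interface.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    doc_id: str
    importance: str          # hypothetical tag values: "high", "medium", "low"
    verified: bool = False


@dataclass
class HarnessState:
    """State held by the environment rather than in the model's context."""
    candidates: dict[str, str] = field(default_factory=dict)   # doc_id -> snippet
    curated: dict[str, Evidence] = field(default_factory=dict)
    rejected: set[str] = field(default_factory=set)

    def add_candidates(self, docs: dict[str, str]) -> None:
        # Skip documents the policy already rejected, so it cannot loop on them.
        for doc_id, snippet in docs.items():
            if doc_id not in self.rejected:
                self.candidates[doc_id] = snippet

    def curate(self, doc_id: str, importance: str) -> None:
        if doc_id not in self.candidates:
            raise KeyError(f"{doc_id} is not in the candidate pool")
        self.curated[doc_id] = Evidence(doc_id, importance)

    def verify(self, doc_id: str) -> None:
        self.curated[doc_id].verified = True

    def reject(self, doc_id: str) -> None:
        self.candidates.pop(doc_id, None)
        self.curated.pop(doc_id, None)
        self.rejected.add(doc_id)

    def summary(self) -> str:
        # Compact view handed back to the policy instead of a full transcript.
        verified = sum(e.verified for e in self.curated.values())
        return (f"candidates={len(self.candidates)} curated={len(self.curated)} "
                f"verified={verified} rejected={len(self.rejected)}")


state = HarnessState()
state.add_candidates({"d1": "10-K excerpt", "d2": "patent abstract"})
state.curate("d1", "high")
state.verify("d1")
state.reject("d2")
state.add_candidates({"d2": "patent abstract"})  # ignored: already rejected
print(state.summary())
```

In this sketch the policy makes every semantic decision (what to curate, verify, or reject), while the environment only records the outcome and refuses to resurface rejected documents.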


Training Harness-1: A Masterclass in Data Efficiency

The training pipeline for Harness-1 represents a fundamental shift in how the AI industry approaches agentic learning.

Historically, developers have treated search agents as policies operating over massive, ever-growing transcripts, forcing reinforcement learning (RL) algorithms to simultaneously optimize both semantic reasoning and the raw memorization of a search state.

Harness-1’s creators took a radically different approach: because their custom "harness" handles all the routine bookkeeping—like maintaining evidence links, candidate pools, and verification records—the training process only needed to teach the model how to operate this structured interface.

This division of labor drastically simplified what the underlying 20-billion parameter model actually needed to learn.

The process began with a remarkably narrow Supervised Fine-Tuning (SFT) stage. Rather than scraping petabytes of new behavioral data, the team generated just 899 filtered trajectories using a GPT-5.4 teacher agent that was plugged into the exact same harness environment the student model would eventually use.

The goal of this SFT phase was not to inject vast amounts of domain knowledge into the model, but simply to teach it the mechanical rhythms of a good researcher: how to format tool calls, how to tag documents by importance, and the discipline of verifying a claim before promoting it to the final curated set.

Following SFT, the model underwent Reinforcement Learning (RL) using an algorithm called CISPO, applied over full search episodes capping at 40 turns.

The team designed a highly specific terminal reward function that explicitly separated discovery from selection. The model was rewarded not just for finding a relevant document, but for successfully promoting it into the final answer set, while being penalized if it found the answer but failed to curate it.

The researchers also instituted a "tool diversity" bonus; without this specific incentive, they found the policy would quickly collapse into a lazy, search-heavy strategy where it spammed queries but bypassed the harder work of reading and verifying the text.
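A reward with this shape can be sketched in a few lines. The function below is a hypothetical illustration only: the weights, the normalization, and the way the diversity bonus is counted are assumptions, not the formula the researchers used.

```python
def terminal_reward(relevant, found, curated, tools_used,
                    w_select=1.0, w_missed=0.5, w_diversity=0.1):
    """Hypothetical terminal reward; all weights are illustrative assumptions."""
    relevant, found, curated = set(relevant), set(found), set(curated)
    if not relevant:
        return 0.0
    selected = relevant & curated              # relevant AND promoted to the final set
    missed = (relevant & found) - curated      # discovered but never curated
    reward = w_select * len(selected) / len(relevant)
    reward -= w_missed * len(missed) / len(relevant)
    reward += w_diversity * len(set(tools_used))   # bonus per distinct tool type
    return reward


# Same discovery and curation outcome, different tool usage.
diverse = terminal_reward({"a", "b"}, {"a", "b"}, {"a"},
                          ["search", "search", "read", "verify"])
search_only = terminal_reward({"a", "b"}, {"a", "b"}, {"a"},
                              ["search", "search", "search", "search"])
print(diverse > search_only)
```

Under these assumed weights, an episode that only spams search queries earns less than one that also reads and verifies, even when both end with the same curated set.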

What makes Harness-1 truly innovative compared to prior work is its unprecedented data efficiency. The entire model was trained on roughly 4,400 unique items—899 SFT trajectories and 3,453 RL queries.

In stark contrast, competing open-source models required vastly larger datasets to achieve worse results: Context-1 utilized over 17,200 training items, while Search-R1 relied on a staggering 221,300 items to learn search behaviors.
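The scale of that gap is easy to check with the figures quoted above. The short calculation below uses only the reported counts; since the Context-1 figure is described as "over 17,200," its ratio is a lower bound.

```python
sft_trajectories = 899
rl_queries = 3_453
harness_items = sft_trajectories + rl_queries   # 4,352, i.e. "roughly 4,400"

reported = {"Context-1": 17_200, "Search-R1": 221_300}
for name, items in reported.items():
    print(f"{name}: {items / harness_items:.1f}x the Harness-1 training set")
```

On these numbers, Context-1 used about four times as many training items as Harness-1, and Search-R1 about 51 times as many.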

By proving that a smarter external cognitive architecture can replace brute-force data scaling, Harness-1 suggests that the future of agentic AI lies in building better environments for models to work within, rather than just training larger models on more data.

Product: Enterprise Applicability and Generalization

From a product perspective, Harness-1 is delivered as a highly capable 20B agent merged into the openai/gpt-oss-20b base architecture.

For enterprise tech stacks, the applicability is massive because businesses need AI to execute multi-step research across proprietary databases without hallucinating or running up exorbitant compute bills.

Harness-1 manages its frontier-level performance at what the creators describe as "Context-1-level cost and latency." Because the context window is strictly managed by the budget-aware harness rather than continuously expanding, enterprises can deploy this agent autonomously without incurring the steeply compounding token costs typically associated with long-horizon AI tasks.
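A back-of-the-envelope model shows why a managed context matters for cost. The 40-turn cap comes from the training setup described earlier; the 2,000 tokens added per turn and the 8,000-token context budget are purely illustrative assumptions, not measured Harness-1 figures.

```python
def append_only_tokens(turns, tokens_per_turn):
    # Turn k re-reads everything appended so far: k * tokens_per_turn input tokens.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def bounded_tokens(turns, tokens_per_turn, budget):
    # The harness caps what the model sees each turn at a fixed budget.
    return sum(min(k * tokens_per_turn, budget) for k in range(1, turns + 1))


TURNS = 40               # episode cap reported for Harness-1 training
TOKENS_PER_TURN = 2_000  # assumption for illustration
BUDGET = 8_000           # assumption for illustration

transcript_cost = append_only_tokens(TURNS, TOKENS_PER_TURN)
harness_cost = bounded_tokens(TURNS, TOKENS_PER_TURN, BUDGET)
print(transcript_cost, harness_cost)
```

In this simplified model, cumulative input tokens for the append-only transcript grow roughly with the square of the number of turns, while the bounded version grows linearly once the budget is reached.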

Even more impressively, Harness-1 proves it can generalize well beyond its training data. According to the research team, it was incredibly cheap to train, utilizing just 899 filtered supervised fine-tuning (SFT) trajectories and a mere 3,453 reinforcement learning (RL) queries.

"Instead of training the model to survive a giant append-only transcript, we train it to use a structured search interface: search, curate, revisit, verify, and submit," Jiang explained.

This leanness proves a critical point for the AI industry: developers do not necessarily need petabytes of new behavioral data if they build a better cognitive framework for the model to operate within.

Licensing: The Power of Apache 2.0

One of the most significant aspects of the Harness-1 release is its licensing. In plain language, Apache 2.0 is a highly permissive, enterprise-friendly software license that fundamentally enables commercialization.

Unlike "copyleft" licenses (such as the GPL) that can force companies to open-source their own proprietary software if they integrate the code, or "research-only" licenses that ban commercial use entirely, Apache 2.0 gives businesses the green light to freely build, modify, and monetize the technology.

For developers and startups, this means Harness-1 can be seamlessly integrated into commercial enterprise search products, internal data retrieval tools, or customer-facing AI applications without fear of legal reprisal.

The only major requirement is that users must include the original copyright notice and explicitly state any significant modifications they make to the source code, positioning Harness-1 as a highly viable foundational building block for the enterprise.

Community Reactions: A Resounding Validation

The announcement has clearly struck a nerve within the developer community, validating the very real pain points engineers face when building agentic systems. Jiang’s multi-part announcement thread on X quickly garnered massive traction, pulling in over 256.1K views, 3.7K likes, 2.9K bookmarks, and nearly 300 reposts within a matter of days.

This high engagement underscores a growing consensus in the AI space that brute-forcing context windows is a losing battle.

When Jiang posted on X, "I’ve been wondering: maybe search agents are bad at search partly because we make them do all the paperwork in their head," the resonance was immediate.

For developers who have spent the last year wrestling with AI agents that confidently forget their primary instructions halfway through a database search, the Harness-1 approach feels like a desperately needed course correction.

Ultimately, the community sentiment highlights a shift in industry priorities. Developers are moving away from asking how large an AI model's context window can get, and instead asking how efficiently an AI model's environment can manage that context for it. By offloading the paperwork, Harness-1 is proving that smaller, smarter systems can outmaneuver the giants—provided they have the right desk to work at.

The Agentic Reckoning: Enterprise AI organizations have a runtime problem, not a model problem — and most are building the wrong solution

Mon, 08 Jun 2026 21:01:00 GMT

In Q1 2026, VentureBeat's Pulse Research surfaced the “Governance Mirage”: the gap between the governance org charts enterprises had drawn and the control layers they had actually built. Forty-three percent said a central team owned AI governance; 23% couldn't agree on who owned it at all; and 31% named vendor opacity as the single biggest obstacle.

This new wave of research asks the next question: Once you've admitted the governance problem, what breaks first when you try to fix it? The answer from our respondents is unambiguous. The failure point is not the model. It's the runtime.

Enterprises are discovering that AI agents built on stateless infrastructure — Python scripts, LangChain chains, ad hoc orchestration — cannot survive the operational realities of production. Container restarts erase context. Token costs breach business cases. Hallucinations in Step 3 compound into catastrophic failures by Step 12. And the majority of engineering teams are spending more time managing this "plumbing" than building the intelligence that was supposed to justify the investment.
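The "container restarts erase context" failure is easiest to see next to its simplest remedy. The sketch below checkpoints a multi-step workflow to disk after every step so a restarted process resumes where it left off. The file layout and function names are hypothetical, and production durable-execution frameworks do considerably more than this.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    # Write to a temp file, then rename, so a crash never leaves a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path) as handle:
        return json.load(handle)


def run_workflow(path, steps):
    state = load_checkpoint(path)
    for index in range(state["next_step"], len(steps)):
        state["results"].append(steps[index]())
        state["next_step"] = index + 1
        save_checkpoint(path, state)   # persist after every completed step
    return state["results"]


# Simulate a restart: steps 0 and 1 finished before the container died.
calls = []
steps = [lambda i=i: calls.append(i) or i * 10 for i in range(3)]
with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "agent.json")
    save_checkpoint(ckpt, {"next_step": 2, "results": [0, 10]})
    resumed = run_workflow(ckpt, steps)
print(resumed, calls)
```

Only the unfinished step runs after the simulated restart. This is the kind of "plumbing" respondents describe hand-building when their runtime does not provide it.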

What emerges from this survey is a picture of an industry at a critical fork. The organizations that survive the Agentic Reckoning will be those that treat runtime durability as a first-class engineering concern — not an afterthought to be patched with retries and prompting. The ones that don't will find themselves back where RPA left enterprises a decade ago: a graveyard of clever pilots that couldn't survive Day Two.

Methodology

VentureBeat conducted this survey in May 2026 as part of its ongoing Pulse Research series on agentic AI adoption in the enterprise. Respondents were filtered to organizations with 100 or more employees. The final qualified sample consists of 132 verified, highly qualified technology leaders at the forefront of enterprise AI agent deployment.

They span:

Directors of AI/Analytics (8%)
Directors of Engineering/IT (16%)
VP of Data/AI/Analytics (5%)
VP of Engineering/IT (5%)
CIOs/CTOs/CISOs (15%)
Product and Program Managers (13%)
Consultants (9%)
Software and ML Engineers (9%)
Enterprise Architects (8%)
Other (12%)

Industries represented include Technology/Software (42%), Financial Services (20%), Professional Services (8%), Healthcare/Life Sciences (7%), Retail/Consumer (6%), Education (4%), and others.

Given our strict filtering criteria, this cohort provides a robust and authoritative look at emerging agentic infrastructure trends.

Respondent demographics by company size:

Large enterprise (10,000+ employees): 35% of the sample
Mid-to-large enterprise (500–9,999 employees): 48% of the sample
Growth enterprise (100–499 employees): 17% of the sample

These quantitative findings capture a critical moment in infrastructure evolution and are best synthesized alongside VentureBeat’s Q1 2026 governance reports and our deep-dive practitioner conversations conducted throughout the quarter.

Finding 1: The runtime is the problem

The "spine vs. brain" debate is over

The foundational question of enterprise AI in 2026 is whether agent failures trace back to the model's reasoning capability — the Brain — or to the runtime infrastructure's inability to manage state, survive failures, and coordinate execution — the Spine. We asked our respondents directly.

Integration/governance challenges were the biggest problem. But Spine issues were close behind.

However, 17% still say the Brain is the primary failure mode. That’s not a rounding error — it’s a signal. The organizations in this cohort are not disputing the infrastructure problem; they are telling us that the models themselves are not yet reliable enough for the edge cases their workflows are generating. The model-versus-runtime debate is genuinely three-sided. Read together, these three answers are not fully in conflict. The Spine and Gap camps are struggling with infrastructure and governance respectively. The Brain cohort is struggling with something upstream: reasoning reliability at scale.

This is a significant finding. The frontier model wars — GPT-5 vs. Claude 4.7 vs. Grok — are consuming enormous mindshare in the enterprise technology press. Our respondents are telling us that war is, for now, beside the point. The models are smart enough, but the infrastructure around them is not.

"The models are smart enough, but our stateless infrastructure is too fragile to manage long-running, multi-step agentic processes." — Director of Engineering / IT, Financial Services, 10,000–49,999 employees

Finding 2: The DIY tax is eating teams alive

Engineering capacity is being consumed by plumbing, not intelligence

If the Spine is a primary failure mode, what does that cost in practice? We asked respondents what percentage of their team's weekly engineering capacity is consumed by building and maintaining custom "plumbing" — manual retries, state-persistence, checkpointing — rather than actual agentic logic.

The results reveal a market in two distinct camps, with a dangerous middle.

The arithmetic is stark. Seventy-seven percent of respondents are spending meaningful engineering time on infrastructure overhead. Just 23%, those whose frameworks are handling reliability, have escaped the tax. The distribution is notably flat: the Crisis and Efficiency poles are the same size as the middle categories (Trap and Maintenance Tax). This is the signature of a market that has partially addressed the worst failures but has not yet escaped the structural overhead.

The Efficiency Zone respondents are not necessarily in a more sophisticated position. In many cases, they may be on managed platforms that abstract away the durability problem — or they may simply not yet have hit the scale at which stateless architectures begin to fail. The Complexity Trap is often where the Efficiency Zone ends.

There’s a direct business consequence for organizations in the Crisis zone. Every engineering hour spent writing retry logic or debugging a "ghost failure" — a silent API timeout that leaves an agent hanging without a traceback — is an hour not spent on the differentiated logic that was supposed to justify the AI investment in the first place.

Finding 3: State amnesia is the production killer

The No. 1 technical obstacle has shifted: Cost and hallucination now lead state failures

When AI agents fail to reach production or scale, what is the primary technical obstacle? We named five candidates, ranging from model hallucination to cost overruns to latency failures.

Hallucination Propagation at 24% compounds silently — reasoning errors in early steps become catastrophic by Step 10. Ghost Failures at 20% are invisible by definition, which means their real prevalence is likely higher than this number suggests.
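The compounding effect can be illustrated with simple arithmetic. The 95% per-step accuracy below is an assumed figure for illustration, not a survey result, and the model assumes steps are independent and that any single error spoils the final output.

```python
def chain_success(per_step_accuracy, steps):
    # Probability that every step in the chain is correct, assuming independence.
    return per_step_accuracy ** steps


for steps in (3, 10):
    print(steps, round(chain_success(0.95, steps), 3))
```

Under these assumptions, a workflow that is right about 86% of the time at three steps is right only about 60% of the time at 10 steps.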

Finding 4: The observability tax falls heaviest on Microsoft

Platform visibility costs are not equally distributed

Our Q1 2026 research identified vendor opacity as the single biggest obstacle to AI governance — ahead of talent gaps, tooling, and budget. That finding pointed to this question: Which vendor ecosystem, in practice, imposes the highest cost to achieve basic production visibility?

We asked respondents which platform requires the most custom telemetry, manual instrumentation, and "logging glue" to achieve visibility into agentic failures.

Microsoft's position at the top of this ranking is not noise. It is a structural characteristic of the Microsoft agentic ecosystem — the same Azure/Copilot stack that dominates enterprise AI adoption requires the most instrumentation overhead to see inside.

It also reinforces the warning that Brian Gracely, Senior Director at Red Hat, made at VentureBeat’s Boston event in March: that building your control system entirely inside one cloud provider's toolset means "renting a cage." The organizations paying the highest observability tax are precisely those most locked into provider-native tooling.

The implication for teams currently evaluating orchestration architecture is direct: observability cost is a real budget item that should appear in any build-vs-buy analysis. A platform that appears cheaper at the API layer may impose substantially higher engineering costs at the telemetry layer.

Finding 5: The hype-reality gap belongs to OpenAI and Microsoft

Agentic coding marketing is significantly ahead of production reliability.

We asked respondents a pointed question: Which major platform's Agentic Coding marketing is the most disconnected from the actual technical reliability and fault-tolerance of its product? Thirty-two percent said they didn't know, a figure that has held roughly constant across all three waves, suggesting persistent uncertainty is structural, not a sample artifact. Cursor also registered 6% in this wave.

Among those with enough production experience to have a view, Microsoft leads at 45%; OpenAI is second at 22%. The gap is too large to attribute solely to deployment footprint. It suggests that GitHub Copilot Workspaces and AutoGen are generating a specific category of disappointment (probably around the reliability of multi-agent orchestration in production) that accumulates with use. A platform that fewer enterprises are running in production will accumulate fewer credible disappointed practitioners.

The more significant observation is what this gap means for decision-makers evaluating new agentic tooling. The marketing around all major platforms describes agentic autonomy and reliability at a level that production deployments are not yet delivering. The organizations in our survey who have moved beyond pilots are encountering the difference firsthand.

Finding 6: The security mesh is being built from first principles

Enterprises are not waiting for vendors to solve agent security

How are enterprises protecting proprietary research data from AI leakage and prompt-driven exfiltration? The security architecture question is one of the most consequential in agentic AI, because agents — unlike static models — can actively call APIs, traverse file systems, and execute code. The blast radius of a security failure is qualitatively different.

Policy-as-Code is a leading security mechanism, but not by much.

The NHI and Policy-as-Code approaches are meaningfully different in their security philosophy. NHI is identity-centric: The question it answers is "who is this agent and what is it allowed to touch?" Policy-as-Code is rule-centric: The question it answers is "regardless of what the model decides to do, what hard stops exist at the infrastructure level?"
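The two philosophies can be shown side by side as two independent checks. Every name in this sketch (the agent ID, tool names, blocked paths, and allowed host) is a hypothetical placeholder rather than any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    agent_id: str
    tool: str
    target: str


# Identity-centric (NHI-style): what each agent identity is allowed to touch.
AGENT_SCOPES = {"report-agent": {"read_file", "http_get"}}

# Rule-centric (policy-as-code style): hard stops that apply to every agent.
BLOCKED_PATH_PREFIXES = ("/etc/", "/secrets/")
ALLOWED_HOSTS = {"api.internal.example"}


def identity_allows(action):
    return action.tool in AGENT_SCOPES.get(action.agent_id, set())


def policy_allows(action):
    if action.tool == "read_file":
        return not action.target.startswith(BLOCKED_PATH_PREFIXES)
    if action.tool == "http_get":
        return action.target in ALLOWED_HOSTS
    return False  # default deny for tools the policy does not know


def authorize(action):
    # Both layers must agree before the runtime executes the action.
    return identity_allows(action) and policy_allows(action)


print(authorize(Action("report-agent", "read_file", "/data/q1.csv")))
print(authorize(Action("report-agent", "read_file", "/secrets/key")))
```

The first call passes both checks. The second is stopped by the policy layer even though the agent's identity is permitted to read files, which is the "hard stop" behavior described above.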

Rough parity across all four mechanisms is the headline finding. This is what market convergence looks like in early motion: No dominant pattern has emerged. Notably, though, Egress-Locked Sandboxing is a relatively new trend in agentic AI deployments, yet it’s already at 22%. As more agents gain terminal-level access to enterprise systems, the cost-benefit of sandboxing is improving. This is notable given the maturity of the identity management and policy-as-code disciplines in traditional IT security. The AI security layer is, for now, being built largely from scratch.

The Egress-Locked Sandboxing number deserves attention despite its smaller share. Sandboxing untrusted code execution is the most technically intensive of the four approaches, but it is also the most direct defense against prompt injection attacks that try to execute malicious code through agent tooling. As agentic systems gain more terminal-level access — a trend our survey confirms is accelerating — this approach may prove more important than its current adoption rate suggests.

"How do we audit agentic tools that have terminal-level access to our proprietary repos?"
— Composite concern expressed by multiple respondents

Finding 7: The complexity cliff is real, and most are climbing it

The migration away from stateless architectures is underway — but fragmented

The central thesis of the Agentic Reckoning is that stateless Python/LangChain architectures cannot survive the complexity cliff — the point at which multi-step, long-running agent workflows begin failing at rates that make production deployment untenable. We asked respondents directly: are you migrating toward durable execution frameworks to solve for state loss?

The answers reveal a market in transition, with meaningful disagreement about the right destination.

The 20% committed to stateless architectures — attempting to solve a structural durability problem through better prompting — are the cohort most likely to encounter State Amnesia and Ghost Failures as their workloads scale. It’s essentially the same trap that RPA teams fell into a decade ago, when brittle process automations were patched with increasingly elaborate rule sets rather than re-architected on more resilient foundations.

The Stateless Commitment cohort deserves a reinterpretation. These teams are not all naive: some are building on managed platforms that genuinely abstract state management. But a portion is patching structural fragility with prompting improvements, and the Ghost Failures data in Finding 3 suggests this approach may be encountering its ceiling.

The combined 59% who are either in Active Migration or in Governance-First Evaluation represent the market's leading edge — organizations that have recognized the architectural problem and are investing to solve it structurally.

Finding 8: The “polyglot orchestration” lead is narrow — the field is fragmented

Architectural conviction is spread across multiple bets

What is the long-term architectural philosophy winning enterprises' strategic investment? We offered four options representing the major bets available in the current market.

The Polyglot Bet's lead suggests that enterprises see advantages in a flexible approach: using model-driven architectures where non-deterministic reasoning works well, but deterministic structures and pipelines where accuracy and mission-critical execution are at stake.

This has direct competitive implications for the frontier labs and cloud providers. The cohort saying they use a Cloud-Native Managed Stack is significant. This likely reflects the enterprise reality that Azure OpenAI Service and AWS Bedrock deployments come with built-in organizational gravity: procurement relationships, security approvals, and existing data pipelines. The Independent Durable Runtime bet at 16% signals that a cohort of teams has rejected both cloud lock-in and frontier lab dependency in favor of full architectural sovereignty.

The Polyglot result also helps explain why the observability and governance problems described in this survey are so persistent. When your architecture deliberately spans multiple orchestration layers and multiple providers, no single vendor's telemetry gives you the full picture. The "Dynatrace for AI" — the unified observability platform called for by Mass General Brigham's CTO Nallan Sriraman at the VentureBeat Boston event — becomes not just desirable but structurally necessary.

"Enterprises trust no single provider enough to give them full control, yet they lack the engineering capacity to build entirely from scratch."
— Survey respondent

Finding 9: User acceptance rate is the emerging production standard

The market is settling on a human-trust metric as its primary A-SLA

What metrics are enterprises actually using to determine whether an AI agent is ready for production? We asked respondents to identify their primary Agentic SLA (A-SLA) indicator — the number that, above all others, tells them whether an agent can ship.

User Acceptance Rate as the dominant production metric is significant because it is a human-trust measure, not a technical performance measure. It does not ask whether the agent ran fast or maintained state. It asks whether a human who reviewed its output chose to accept it. This is, in effect, a field-level Turing test applied at the action level.

The persistence of UAR as the leading metric reflects the reality of where most enterprise agentic deployments still sit: in a human-in-the-loop posture, where agent actions require human review before execution. That is a rational response to the Hallucination Propagation and Ghost Failures described earlier in this survey. Organizations that have not yet solved runtime durability are, sensibly, keeping humans in the loop — and at 132 respondents, there is no evidence this is changing.

Context Fidelity's position at 30% is the most significant finding. It tracks directly with the Active Migration data in Finding 7: As more teams move into durable execution frameworks, the 48-hour+ memory problem becomes their primary production concern. Teams that have solved State Amnesia are now focused on whether their agent can remember what it was doing yesterday. Latency Jitter's collapse from 25% to 11% tells the complementary story: raw speed is no longer the primary anxiety. Correctness and durability have taken its place.

The bottom line: The reckoning is runtime, not reasoning

The data tells a consistent story: There’s a runtime deficit for agents. Enterprises are spending more time on infrastructure plumbing than on agent intelligence, and State Amnesia is still claiming production deployments. But fault lines are visible. The ROI Ceiling has overtaken State Amnesia as the leading production killer — which means the infrastructure problem is no longer purely a technical one. Token economics and orchestration overhead are now consuming enough business value that project sponsors are making the kill decision before engineering teams can solve the durability problem. Hallucination Propagation remains a big problem. The Brain vote in Finding 1 remains significant. And the Polyglot lead is fragile, with varied architectures well represented.

The models are, by most respondents' own assessment, smart enough — but 17% disagree. What is not yet smart enough is the infrastructure surrounding them: the state management, the fault-tolerance, the observability, the identity governance, and the deterministic execution layer that turns a model's judgment into something an enterprise can stake its operations on.

The 39% making the Polyglot Bet represent the current leading edge of enterprise architectural thinking. They are building systems where the model's intelligence is preserved and leveraged, but where the execution layer — the Spine — is deterministic, auditable, and durable by design. They are not waiting for a frontier lab to solve this for them. They are not betting that better prompting will patch infrastructure fragility. They are building the control plane.

The organizations still committed to stateless architectures — still trusting that manual retries and clever prompting can substitute for durable execution — are the ones most likely to contribute to the next wave of this data. Ghost Failures are a primary obstacle. The pattern is familiar: Early adopters diagnose the problem architecturally, migrate to durable runtimes, and escape the failure mode. Late movers inherit it. The Complexity Cliff is not theoretical. It is the wall that most current agentic architectures are already climbing toward.

The reckoning is runtime and economics, not reasoning.

Based on survey responses from 132 qualified enterprise respondents (100+ employees). Sample size is small; data should be treated as directional. Respondents include Directors, VPs, CIOs, CTOs, and Enterprise Architects across Technology, Financial Services, Retail, Healthcare, and other sectors.

When Claude changed, everything changed: Managing AI blast radius in production

Mon, 08 Jun 2026 01:02:33 GMT

Our system did one thing, and it did it well: It turned natural-language questions into API calls.

The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like "Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city" was translated into an API call that the system could act on:

json

{
  "description": "User requested sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}

The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a large language model (LLM)-generated JSON query to filter and shape the response, and delivered it via email, as a Drive document, or as a chart rendered in the browser.

By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. It had become the default way most teams pulled ad-hoc data.

The contract between the LLM and the rest of the system was a structured JSON object as described in the above example.

We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and to 4.0 without incident. By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the specific API being called, the backend either returned unfiltered sales volume (all time, all regions) or returned a 500 error.

Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.
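
A minimal sketch of what such a path could look like, assuming the reply arrives as a raw string. The names route_model_reply, Routed, and REQUIRED_FILTERS are hypothetical and introduced only for illustration; the endpoint and field names are taken from the example request above. The idea is to classify every reply before anything is dispatched, and to fail closed when filters are missing.

```python
import json
from dataclasses import dataclass

# Hypothetical table of filters each endpoint must receive (an assumption;
# the article shows only /api/sales_volume and these three keys).
REQUIRED_FILTERS = {"/api/sales_volume": {"start_date", "end_date", "region"}}


@dataclass
class Routed:
    kind: str    # "dispatch", "clarify", or "reject"
    detail: str  # endpoint to call, question to relay, or rejection reason


def route_model_reply(raw: str) -> Routed:
    """Classify a raw model reply before anything reaches a backend."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON at all: most likely a clarifying question in prose.
        return Routed("clarify", raw.strip())

    if not isinstance(reply, dict):
        return Routed("reject", "reply is not a JSON object")

    endpoint = reply.get("api_call")
    body = reply.get("post_body")
    if not isinstance(endpoint, str) or endpoint not in REQUIRED_FILTERS:
        return Routed("reject", f"unknown endpoint: {endpoint!r}")
    if not isinstance(body, dict):
        return Routed("reject", "post_body is missing or not an object")

    missing = REQUIRED_FILTERS[endpoint] - set(body)
    if missing:
        # Fail closed instead of querying all time and all regions.
        return Routed("reject", f"missing filters: {sorted(missing)}")
    return Routed("dispatch", endpoint)
```

With a router like this, a prose question is handed to whatever clarification flow exists, and an empty post_body stops at the router rather than becoming an unfiltered backend query.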

We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved. You can leverage the following property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.

LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.

This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.

Anatomy of the failure

The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.

Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being "helpful" in its formatting choices, decided that asking for clarification or putting the request body in the description made the response more useful. From the model's perspective, this was a reasonable interpretation of an ambiguous instruction. It nonetheless violated the assumptions under which our system was built.

The bug was not in the model. The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe.

Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We weren't using them for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question shouldn't appear in a system with no path for clarification, or that a date range should never silently default to all-time. Schemas solve the easier half of the problem.
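
The gap between syntax and semantics can be illustrated with a short sketch. It assumes a deliberately simple type-level check, not a full JSON Schema validator, and the function names and the sample reply are hypothetical. The reply below has the right shape, yet it breaks two invariants that depend on meaning.

```python
from datetime import date


def passes_type_check(reply: dict) -> bool:
    """Structural check only: the three fields exist with the right types."""
    return (
        isinstance(reply.get("description"), str)
        and isinstance(reply.get("api_call"), str)
        and isinstance(reply.get("post_body"), dict)
    )


def semantic_violations(reply: dict) -> list:
    """Illustrative invariants that depend on meaning, not shape."""
    problems = []
    if reply["description"].rstrip().endswith("?"):
        problems.append("description asks a question the system cannot answer")
    body = reply["post_body"]
    start, end = body.get("start_date"), body.get("end_date")
    if start is None or end is None:
        problems.append("no date range: the query would default to all time")
    elif date.fromisoformat(start) > date.fromisoformat(end):
        problems.append("start_date is after end_date")
    return problems


# Hypothetical reply: well-formed, but semantically unusable.
reply = {
    "description": "Did you mean calendar Q1 or fiscal Q1?",
    "api_call": "/api/sales_volume",
    "post_body": {"region": "northeast"},
}

print(passes_type_check(reply))
print(semantic_violations(reply))
```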

The evals-first architecture

The discipline that closes this gap is to treat the evaluation suite — not the prompt — as the formal specification of the system. The prompt is an implementation of the spec. The model is an interpreter. The evals are the spec itself, and any model or prompt change is valid if and only if it passes them.

In practice, an eval is a triple: An input, a property the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:

python

def test_description_contains_no_serialized_payload(response):
    # The description must be plain prose, never a serialized request.
    desc = response["description"].lower()
    forbidden = ["curl", "post_body", "{", "http://", "https://"]
    assert not any(token in desc for token in forbidden), \
        f"description leaked structured content: {response['description']}"

A few hundred such properties, some written by hand for known-important invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier qualities like tone, become a gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.
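
As a rough sketch of how the triple and the gate fit together, assuming a model can be wrapped as a function from a request string to a response dict: the Eval class, run_gate, and both stub models are hypothetical names, and the two properties are simplified stand-ins for a real suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Eval:
    name: str
    input_text: str                         # the input
    prop: Callable[[dict], bool]            # property the output must satisfy
    score: Callable[[bool], float] = float  # scoring: True -> 1.0, False -> 0.0


def run_gate(evals, model, threshold=1.0):
    """Return (passed, failed_names) for a model over the whole suite."""
    if not evals:
        raise ValueError("an empty suite cannot gate anything")
    failures, total = [], 0.0
    for ev in evals:
        s = ev.score(ev.prop(model(ev.input_text)))
        total += s
        if s < 1.0:
            failures.append(ev.name)
    return total / len(evals) >= threshold, failures


def no_payload_in_description(resp):
    desc = resp["description"].lower()
    return not any(tok in desc for tok in ("post_body", "{", "curl"))


def has_date_range(resp):
    return {"start_date", "end_date"} <= set(resp["post_body"])


REQUEST = "Sales volume for January through March 2026, Northeast region"
SUITE = [
    Eval("description_is_prose", REQUEST, no_payload_in_description),
    Eval("date_range_present", REQUEST, has_date_range),
]


# Stub models standing in for two model versions (hypothetical outputs).
def stub_clean(request):
    return {"description": "Sales volume for Q1", "api_call": "/api/sales_volume",
            "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31"}}


def stub_regressed(request):
    return {"description": 'post_body: {"start_date": "2026-01-01"}',
            "api_call": "/api/sales_volume", "post_body": {}}
```

Calling run_gate(SUITE, stub_clean) passes, while run_gate(SUITE, stub_regressed) fails and names both broken properties, which is the signal a merge gate needs.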

Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance in outcomes. And the suite can only catch failure modes you have thought to specify — you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: Nobody on our team had ever written an assertion that said "the description field should not contain a curl command," because nobody had thought the model would put one there.

Evals are not a silver bullet. But they give you the ability to bound the blast radius of a change in the only way available when the underlying function is a black box: By densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior moves.

The roadmap

The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what 'coverage' means in natural language input spaces. CI/CD systems were not built to gate probabilistic test outcomes. As agents take on more autonomous work — writing code, moving money, scheduling infrastructure changes — the gap between "the model passed our smoke tests" and "we know what this system will do in production" becomes the central engineering problem of the next several years.
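
One possible way to gate a probabilistic outcome, offered as a sketch and not as an accepted standard: run each eval many times and require that a conservative lower bound on the pass rate clears a threshold. The example uses the Wilson score interval; the function names, the 95% confidence level, and the 0.95 required pass rate are all assumptions chosen for illustration.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an observed pass rate."""
    if trials == 0:
        return 0.0
    p = passes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def gate(passes: int, trials: int, required: float = 0.95) -> bool:
    """Pass only when the evidence supports the required pass rate."""
    return wilson_lower_bound(passes, trials) >= required


for passes, trials in [(20, 20), (100, 100), (990, 1000)]:
    print(passes, trials, round(wilson_lower_bound(passes, trials), 3), gate(passes, trials))
```

Under these assumptions, 20 passes out of 20 does not clear the gate, because a perfect record on a small sample is weak evidence, while 100 out of 100 does.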

The teams that close that gap will be the ones who stop treating evals as a quality-assurance afterthought and start treating them as the actual specification of what their system is.

Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor.

Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.

Agentic AI solved coding — and exposed every other problem in software engineering

Sun, 07 Jun 2026 16:00:20 GMT

Agentic AI is now a core part of the engineering process, driving massive execution leverage and helping us generate more code than ever before. Yet, a difficult question I’ve increasingly heard from business leaders is: if we’re shipping code faster than ever, why aren’t our products improving at the same rate?

The reason is that writing code was never the rate limiter. Defining the right requirements, integrating with complex systems, and maintaining software under real-world conditions have always been the hard part. And when agents flood an organization with lots of new code, the hard part only gets harder. Agents compress execution time. They do not compress ambiguity, accountability, or operational complexity.

As AI-generated code scales, human review is becoming a massive new bottleneck, and engineers are losing the context needed to catch agent mistakes. The companies that understand this will move forward deliberately and even create new roles because of AI. The ones that don’t will default to a simpler, far more destructive conclusion: Reduce headcount and increase AI spend.

The playbook

Irreversible structural decisions demand caution, precisely because the technology is moving so fast. Enterprise engineering leaders need a deliberate playbook to navigate the chaos. Here's how to start:

Phase 1: Financial and risk governance

Protect the downside — secure the infrastructure and cap the financial bleeding.

Treat governance as a tier-one risk: The pressure to integrate AI is real, but giving teams the freedom to experiment without a centralized structure creates fragmented processes, duplicated work, and runaway costs. Organizations will need to establish shared standards while still allowing teams to adapt and explore within defined boundaries. This means treating agent configuration like production infrastructure — versioning, reviewing, and testing prompts and skills before rolling them out gradually.
Enforce least privilege for non-human actors: Never allow an agent to simply inherit the full permissions of its human operator. Human engineers are granted broad access because they possess contextual judgment and bear ultimate accountability. Deploying agents with human-level access without careful consideration introduces an accountability gap into your systems. Implement strict separation between read and write/execute access, and mandate human-in-the-loop approval gates for destructive or production-altering actions. As agents transition from suggesting code to autonomously executing tasks, they must be rigorously incorporated into your security model.
Watch your wallet: Protect your overall AI budget by enforcing quotas and rate limits for both engineering and production. Cautionary tales are increasingly common: Uber capped its AI spend after burning its 2026 budget by April, and, according to Axios, an unnamed company incurred a staggering $500 million Anthropic bill in a single month due to runaway agentic loops.
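
The least-privilege and budget items above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a security product: AgentPolicy, its action names, and the dollar figures are all hypothetical. It grants the agent its own explicit set of actions, holds destructive actions until a named human approves them, and enforces a hard spend cap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentPolicy:
    """Hypothetical per-agent policy; nothing is inherited from a human user."""
    allowed: frozenset         # actions the agent may take at all
    needs_approval: frozenset  # destructive actions gated on a human
    budget_usd: float          # hard spend cap for the period
    spent_usd: float = 0.0

    def authorize(self, action: str, cost_usd: float,
                  approved_by: Optional[str] = None) -> str:
        if action not in self.allowed:
            return "deny: action not granted to this agent"
        if action in self.needs_approval and approved_by is None:
            return "hold: human approval required"
        if self.spent_usd + cost_usd > self.budget_usd:
            return "deny: budget exhausted"
        self.spent_usd += cost_usd  # record spend only for allowed actions
        return "allow"


policy = AgentPolicy(
    allowed=frozenset({"read_repo", "open_pr", "deploy_prod"}),
    needs_approval=frozenset({"deploy_prod"}),
    budget_usd=1.0,
)
```

Checking the grant first, the approval second, and the budget last means a denied or held action never consumes budget.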

Phase 2: Technical strategy

Build the engine: Choose the right models and measure their success.

Go multi-model and multi-vendor: No single model excels at every task. It's important to precisely characterize the behavior and performance boundaries across models to understand where each excels, routing specific tasks to the systems best equipped to handle them. Standardizing on a single vendor or model sacrifices capabilities and introduces a critical single point of failure. No organization should absorb that level of concentration risk in its core engineering function.
Pay for the frontier: Treat AI as engineering leverage, not just another SaaS expense. Pay for premium frontier models that deliver the highest quality output and reduce costly rework. Ultimately, the cheapest model isn't the one with the lowest token price — it’s the one that maximizes efficiency while minimizing your downstream risk.
Measure what actually matters: Deployments, lines of code, and pull requests were never good metrics for productivity, and with AI, they are actively misleading. Instead, aim for metrics that are attached to business outcomes (feature adoption, retention) and engineering durability (change failure rate, escaped defects, code survival over time). For AI efficiency, measure task success per dollar and rework time. Token counts are convenient for leaderboards but they cannot tell you if the tokens were well spent.
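
To make "task success per dollar" and rework time concrete, here is a small sketch. Every number in it is made up for illustration, including the assumed cost of an engineer's time, and TaskRun and success_per_dollar are hypothetical names. It shows how a model that looks cheaper on token spend alone can look more expensive once rework is priced in.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    succeeded: bool        # did the output pass its acceptance check?
    token_cost_usd: float  # model spend attributed to the task
    rework_minutes: float  # human time spent repairing the output


def success_per_dollar(runs, rework_usd_per_minute=0.0):
    """Successful tasks per dollar, optionally pricing in human rework."""
    cost = sum(r.token_cost_usd + r.rework_minutes * rework_usd_per_minute
               for r in runs)
    wins = sum(1 for r in runs if r.succeeded)
    return wins / cost if cost else 0.0


# Invented data: ten tasks per model, 30 minutes of rework per failed task.
small = [TaskRun(i < 4, 0.10, 0 if i < 4 else 30) for i in range(10)]
frontier = [TaskRun(i < 9, 0.50, 0 if i < 9 else 30) for i in range(10)]

for name, runs in [("small", small), ("frontier", frontier)]:
    print(name, success_per_dollar(runs), success_per_dollar(runs, 2.0))
```

On token spend alone, the smaller model wins in this invented data. Once rework is priced at an assumed 2 dollars per minute, the ordering flips.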

Phase 3: Talent and organization

Realign your human capital to manage the new bottleneck.

Shift engineers from syntax to systems: As agents handle the bulk of code generation, human review and architectural alignment are the new bottlenecks. Organizations must deliberately upskill their workforce to transition from syntax-writers to systems-thinkers and agent-managers. Engineers need the training and mandate to guide agentic processes, manage complex cross-system integrations, and hold the overarching architectural vision that agents can struggle to maintain.
Redefine performance and incentives: When an individual engineer can generate the output of a former squad, traditional metrics like story points or sprint velocity can become ineffective overhead. Consider realigning your evaluation frameworks to better reward expanded business impact, cross-system reliability, and effective agent orchestration. If you want systems-thinkers who cover more strategic surface area, are willing to explore and take risks, and build products in a durable way, you must reward them for higher level impact, not sheer volume of output.
Don’t cut headcount before your strategy adapts: If you haven't integrated agentic workflows, measured augmented output in production, and reworked your roadmap around faster execution, you do not actually know whether your needs and capabilities align. Cutting headcount before establishing that baseline isn't discipline — it’s blindness. The goal is not simply smaller teams, but teams capable of covering more strategic surface area.

Enterprise AI adoption requires human elasticity

AI is not a replacement for engineering judgment; it is a force multiplier for it. In well-structured systems, it safely accelerates delivery. In poorly understood systems, it accelerates failure. We are already seeing the fallout: Outages, rising technical debt, and unexpected cost spikes driven by poorly governed adoption. These are operational failures, not theoretical risks.

The mistake organizations are now making isn’t adopting AI too slowly — it’s adopting it without understanding where it breaks.

For the C-suite, understanding this dynamic is no longer optional — it is the determining factor in how a business navigates this era. The challenge is that execution velocity is outpacing the industry's ability to manage the consequences. We have handed engineering teams the ultimate power tool. The old adage demands that you measure twice and cut once. Instead, too many firms are opting to just cut.

Joe Bertolami is CTO and co-founder of Clifton AI.

Microsoft AI chief says company was “set free” from OpenAI to pursue superintelligence

michael.nunez@venturebeat.com (Michael Nuñez) — Fri, 05 Jun 2026 22:55:38 GMT

For three years, Microsoft's artificial intelligence story has been inseparable from OpenAI. The partnership — cemented by a cumulative investment exceeding $13 billion — gave Microsoft early access to the most advanced AI models on the planet, catapulting its Copilot products into the enterprise mainstream and adding hundreds of billions of dollars to its market capitalization. To the outside world, Microsoft's AI strategy was OpenAI.

Mustafa Suleyman wants to change that narrative.

In an exclusive sit-down interview with VentureBeat at Microsoft Build 2026, the CEO of Microsoft AI disclosed that a contractual change with OpenAI roughly six months ago granted his division the formal authority to pursue what he openly calls "superintelligence" — using Microsoft's own researchers, its own data pipelines, and its own custom silicon.

"We were only sort of set free from our contract with OpenAI about six months ago to formally pursue superintelligence," Suleyman said. "So this is very early days."

The comment, delivered matter-of-factly backstage at the Fort Mason Center here, offers the clearest signal yet of a strategic inflection point unfolding inside the world's most valuable public company. Microsoft is not abandoning OpenAI. But it is building something alongside it — and, eventually, something that could stand entirely on its own.

Microsoft's first in-house model family signals a new level of AI ambition

The most tangible evidence of that shift arrived the same day. Microsoft announced a family of seven new AI models developed entirely in-house by its AI Superintelligence Team, spanning reasoning, code generation, image creation, transcription, and voice synthesis. The models — branded under the "MAI" family name — are Microsoft's most ambitious first-party AI release to date.

The flagship, MAI-Thinking-1, is a 35-billion-active-parameter reasoning model that Microsoft says matches leading models in its weight class on key software engineering benchmarks and demonstrates advanced mathematical reasoning. Suleyman emphasized one point repeatedly: the model was trained from scratch on clean, commercially licensed data, without distillation from third-party frontier models — a direct, if unstated, contrast to the widespread industry practice of using outputs from competitors' systems to train cheaper alternatives.

"We train our reasoning models from scratch," Suleyman wrote in a blog post accompanying the announcement. "We don't distill from other labs and we don't rely on unlicensed or opaque data."

The rest of the family fills out a multimodal portfolio designed for enterprise deployment: MAI-Code-1-Flash, a lightweight coding model built specifically for GitHub Copilot and VS Code; MAI-Image-2.5, which supports both text-to-image and image editing; MAI-Transcribe-1.5, which Microsoft claims is the most accurate transcription model available, operating across 43 languages; and MAI-Voice-2, a multilingual speech-generation system. All of the models ship through Microsoft Foundry, the company's model-hosting and deployment infrastructure, and for the first time, developers can tune model weights themselves through third-party platforms including OpenRouter, Fireworks, and Baseten.

But Suleyman made clear in the interview that the seven models are a proof of concept, not a finished product. The real project is the lab itself.

"Our job is to make sure that when we look out to 2030 and beyond, we have the capacity not just to buy models from third parties, but to build the absolute frontier, the best models in the world," he said. "That's a long transition."

What "set free" from OpenAI actually means for Microsoft's AI future

To understand what Suleyman means by "set free," you need to understand the unusual contractual architecture that has governed Microsoft's AI efforts for years.

When Microsoft invested billions into OpenAI beginning in 2019, the partnership came with a specific arrangement: OpenAI would build the frontier models, and Microsoft would serve as the exclusive cloud provider, integrating those models into its products and reselling them through Azure. The deal gave Microsoft extraordinary commercial leverage — access to the world's most advanced AI without having to build it — but it also created a dependency. Microsoft was explicitly barred from pursuing its own AGI research, and the agreement even capped how large a model the company could train, restricting it from building systems beyond a certain computing threshold measured in FLOPS.

That arrangement was formally renegotiated. As Fortune and Axios reported in November, a revised deal with OpenAI removed those restrictions, clearing the way for Suleyman to launch the MAI Superintelligence Team and pursue what he calls "humanist superintelligence." The result, in Suleyman's telling at the time, was a "best-of-both environment, where we're free to pursue our own superintelligence and also work closely with them."

By the time he sat down with VentureBeat at Build 2026, roughly six months had passed since that self-sufficiency effort formally began. Microsoft had already started shipping in-house models — including MAI-Image-2-Efficient, a lighter-weight image generation model released in April — but the seven MAI models announced at Build are the team's most ambitious release yet: a full multimodal family spanning reasoning, code, image generation, transcription, and voice.

Even so, Suleyman does not view the shift as a rupture with OpenAI. He described Microsoft's current position as one of abundance, not scarcity.

"There's no immediate urgent need to fill a gap in three months' time or six months' time," he said. "We have OpenAI, we have Anthropic, we have thousands of models inside Foundry. So there's already a huge amount of optionality available to us."

The framing is telling. Microsoft's push into first-party frontier models is not born out of a crisis in the OpenAI relationship but out of a strategic calculation: as AI becomes the most consequential technology layer in enterprise computing, the company cannot afford to depend entirely on partners for the foundational capability. "Over the next five years, we have to be able to produce state-of-the-art frontier-scale models," Suleyman said. "That's our mission."

Suleyman says the shift from chatbots to autonomous AI agents has already begun

If the seven MAI models represent the technical ambition, a new capability called Frontier Tuning represents the commercial logic. Announced alongside the models at Build, Frontier Tuning allows enterprise customers to customize MAI models using their own proprietary data, workflows, and domain terminology, all within their own secure compliance boundary. The system uses reinforcement learning environments — what Microsoft calls "training gyms for AI" — that let agents learn directly from real workplace tasks without affecting production systems.

The results Microsoft shared are striking. An MAI model tuned for Excel reportedly matches GPT-5.4 performance while operating at up to ten times greater efficiency. Early enterprise adopters are seeing similar gains: when tuned for one unnamed organization's exacting standards, the MAI model achieved the highest win rate of any model tested at roughly one-tenth the cost.

Suleyman framed Frontier Tuning as part of a broader evolutionary stage — a move from intelligence to action. "We've basically moved beyond just conversation," he told VentureBeat. "Now we're moving to action."

He introduced a new framework for thinking about that progression: the shift from IQ (factual intelligence) to EQ (emotional intelligence, or the ability to follow tone and style instructions) to what he calls AQ — the "Actions Quotient."

Future AI agents, in Suleyman's telling, won't just answer questions. They will log into enterprise software, navigate complex multi-application workflows, and execute tasks across Excel, Word, Teams, Jira, Adobe InDesign, and customer relationship management systems — just as a human employee would.

"You should be able to show up on day one and almost provision credentials to a new AI agent," he said. "The model needs to be able to move across all of these different environments, and that's actually the great strength of Microsoft."

The Build 2026 announcements bore this out in concrete product terms. Microsoft Scout, the company's first "Autopilot" agent, operates as an always-on background assistant built on the open-source OpenClaw technology. It runs with its own governed identity inside Microsoft Entra, so its actions are auditable and attributable. Windows 365 for Agents gives AI agents their own managed Cloud PCs, allowing them to interact directly with applications and browsers inside enterprise environments. And the Foundry platform received major updates — including hosted agents with sub-100-millisecond cold starts, a new Microsoft Agent Framework, and one-click publishing to Teams and Microsoft 365 Copilot.

Why Microsoft believes enterprise data is the next AI training frontier

Suleyman also articulated why he believes Microsoft's position is uniquely defensible — and the argument has less to do with model architecture than with where work actually happens.

"We've sort of hoovered up all of the obvious pools of training data," he said, referring to the industry's early scramble to ingest the open web. "In the next phase, we actually want to be able to give these agents to companies to train on their specific tasks with the data that they have inside of their own big workflows."

The claim is subtle but consequential. The first wave of generative AI was trained on publicly available text — books, websites, Reddit posts, code repositories. That data is now largely exhausted, and its use is increasingly contested in court.

The next wave, Suleyman argues, will be trained on enterprise-specific data: the internal workflows, decision traces, and institutional knowledge that define how real organizations operate. Microsoft, which serves 493 of the Fortune 500 through Azure according to Suleyman, is already embedded inside those workflows through Microsoft 365, Teams, Dynamics 365, and the broader Azure ecosystem. Frontier Tuning is the mechanism that converts that positional advantage into model performance.

"People underappreciate that that's going to be the next domain," Suleyman said.

The early partner list for Frontier Tuning reflects the ambition: Mayo Clinic, where Microsoft is co-creating a frontier AI model for healthcare using de-identified clinical data; EY, which is tuning a tax-advisory agent for deployment to 75,000 professionals globally; Land O'Lakes, where Frontier Tuning delivered what the company's product development scientist called "meaningful improvements in grounded outputs and style compliance"; and Pearson, which is using tuned models to provide learning-science-aligned feedback in its Communication Coach product.

The Mayo Clinic partnership may be the most significant. Microsoft and Mayo Clinic are collaborating to build a healthcare-specific frontier model that combines Mayo's clinical expertise and longitudinal patient insights with Microsoft's AI capabilities. The model will be owned by Mayo Clinic and deployed first within Mayo's own environment before being made available to other organizations through Foundry.

Microsoft's custom AI chips and GPU buying spree reveal the scale of its compute advantage

None of this works without an industrial-scale compute infrastructure, and Suleyman was unusually candid about the hardware economics underlying Microsoft's strategy.

"We are the largest buyer of GPUs on the planet," he said. "We're the largest buyer of GB200s and GB300s in the world."

Microsoft will continue purchasing Nvidia accelerators "for many, many years to come," Suleyman said. But the company is simultaneously building its own custom silicon. Maia 200, Microsoft's second-generation AI accelerator, is already running in production across data centers in Iowa and Arizona, with deployments planned for Italy, Australia, and South Korea. According to Microsoft, Maia 200 delivers the best tokens-per-dollar-per-watt in the company’s fleet.

Suleyman put a finer point on the economics in the interview: Maia 200 is 30 percent more cost-efficient than Nvidia's GB200, he said. And when Microsoft co-optimizes its own MAI models to run natively on Maia silicon, the company sees an additional 1.4x improvement in performance per watt. "It is going to be cheaper in years to come to build on MAI models with Maia 200 and Maia 300 inside of Azure," he said.

That claim — if it holds at scale — has profound implications for the competitive landscape. It means Microsoft is not merely buying its way to AI dominance through Nvidia; it is building a vertically integrated stack in which its own models, running on its own chips, inside its own cloud, tuned on its customers' own data, could offer performance and cost characteristics that no competitor can replicate.

Suleyman rejects the idea that AI models are becoming commodities

Suleyman also pushed back sharply against one of the most popular narratives in Silicon Valley: that AI models are rapidly commoditizing.

"A lot of people are saying models are commoditizing," he said. "I don't think that's true."

His argument hinges on what he calls "quality tokens" — the proposition that the composition, curation, licensing, and deduplication of training data matter at least as much as raw scale. Microsoft's new MAI models, he said, were trained on a pre-training mix composed of approximately 50 percent high-quality code, with the remainder drawn from commercially licensed and carefully curated sources.

The result, he argued, is a distinct "lineage" of models optimized for coding, reasoning, and agentic behavior — fundamentally different from models optimized for consumer chat, cultural content, or multilingual breadth.

"We're going to see very distinct lineages that reflect different training objectives of different companies," he said. "Quality tokens matter more than just brute-force scale."

This is a strategically important argument for Microsoft to make. If models are commodities — if any lab can match the frontier within months using cheaper compute and distilled training data — then the model layer becomes a race to the bottom, and Microsoft's billions in compute investment offer no durable advantage. But if model quality is a function of data discipline, research depth, and institutional patience, then the lab-building approach Suleyman is pursuing becomes a genuine competitive moat.

He used a specific metaphor to describe that approach, one borrowed from optimization theory: the "hill-climbing machine." The phrase describes a system that continuously improves — cycle after cycle — by applying more compute, better data, and sharper evaluation. "The goal here is to build what we think of as a hill-climbing machine," he wrote in his blog post. "An organization that can continuously improve, cycle after cycle." The metaphor is revealing because it describes a process, not a destination. Suleyman is not promising that Microsoft will build the world's best model next quarter. He is arguing that Microsoft is building the system — the research culture, the data pipelines, the silicon co-optimization, the evaluation infrastructure — that will produce progressively better models over years.

Inside Microsoft's five-year plan to become a self-sufficient AI superpower

The strategic picture that emerges from Suleyman's comments — and from the full scope of the Build 2026 announcements — is of a company preparing for a future in which AI capability is not rented from a partner but generated internally, at scale, across every layer of the stack.

Microsoft still needs OpenAI. The partnership continues to power Copilot, Azure AI services, and ChatGPT's infrastructure. Suleyman acknowledged as much, describing Microsoft's portfolio of model providers as a source of strength, not a problem to be solved.

But the direction of travel is unmistakable. With its own frontier models, its own custom silicon, its own reinforcement learning environments for enterprise tuning, and its own autonomous agent infrastructure, Microsoft is constructing a parallel path — one that, by 2030, could make the company a fully self-sufficient frontier AI lab embedded inside the world's largest enterprise software platform.

"Our ultimate goal is what we call Humanist Superintelligence," Suleyman wrote in his blog post. "That means advanced AI systems designed to serve people and organizations, not replace them."

Whether that goal is achievable — or even clearly definable — remains one of the great open questions in technology. And Suleyman expressed more confidence than caution when asked about the trajectory of progress. "I really think we're at the tip of the iceberg," he said. "The models are so much more powerful than we know how to extract intelligence from them."

But confidence and execution are different things. Building a frontier lab is not an announcement; it is a decade-long commitment that requires retaining elite researchers, maintaining scientific rigor under commercial pressure, and producing results that justify the staggering capital expenditure.

Google learned this with DeepMind — which Suleyman himself co-founded in 2010, before joining Microsoft — and even that lab, widely regarded as one of the best in the world, spent years navigating the tension between pure research and product delivery.

Suleyman seemed aware of the contradiction. "If you rush it, you'll screw it up," he said.

The sticker on his laptop reads: "Patience and urgency." It is a paradox that Microsoft now has five years — and several hundred billion dollars — to resolve.

Microsoft's AI Futurist explains how he uses Copilot — and the real-world problems enterprises are solving with agents

carl.franzen@venturebeat.com (Carl Franzen) — Fri, 05 Jun 2026 19:31:00 GMT

Microsoft used its Build 2026 conference this week to push a clear message: agents are rapidly moving into production throughout enterprise systems, and the winning platform will be the one that gives them reliable context, governance, identity, memory — and secure access to enterprise data.

The company announced Microsoft IQ as a context layer across GitHub Copilot, Microsoft Foundry and Copilot Studio; Work IQ APIs coming June 16; Fabric IQ for structured business data; Foundry IQ for retrieval across enterprise knowledge and the live web; and Web IQ as a new agent-facing web search stack.

Microsoft also introduced Scout, a personal work agent, and a whopping seven new in-house AI models in its growing MAI family across modalities and use cases, including MAI-Thinking-1.

Those announcements sit directly in Marco Casalaina’s lane. Casalaina is Microsoft’s VP Products, Core AI and AI Futurist. He leads Microsoft’s AI Futures team and previously led teams across Azure AI, including Azure OpenAI, Vision, Speech, Decision, Language, Responsible AI and AI Studio.

Before Microsoft, he led Salesforce’s Einstein AI team and earned a computer science degree from Cornell University. CRN reported that he joined Microsoft in early 2022 as vice president of products for Azure Cognitive Services, meaning he has now been at the company for more than four years.

VentureBeat spoke with Casalaina ahead of Build about Microsoft’s agent strategy, the company’s model-choice philosophy, how Microsoft IQ fits with MCP, and why he believes enterprises need far more than just access to powerful models. The interview below has been edited for clarity and condensed from the transcript.

VentureBeat (VB): To start, can you explain your role at Microsoft and what “AI Futurist” means in practice?

Marco Casalaina (MC): I am VP Products of what we call Core AI. Core AI is our set of tools for AI developers, and that includes Foundry, Visual Studio, VS Code, GitHub and GitHub Copilot. That’s our overall group.

My Silicon Valley title is AI Futurist, and that has a very concrete meaning here. I’ve worked with other folks who are considered futurists, like Peter Schwartz, and that can be a little bit more fuzzy. For me, what it means concretely is that I am the first person to try anything new here.

I am constantly getting things from all over Microsoft, not even just Foundry, because I work with really everybody across the company. Pretty much everybody sends me the new things at all times. Even today, I got something brand new just before this call. I’m usually the first person to try anything new here, which is pretty cool. I get to see a lot of really cool stuff.

A friend of mine, who is head of AI at Intuit, calls me an “adjacent possiblist.” I consider my futurist concept to be about a year out from now — the immediate future of what’s about to happen next. That’s what I focus on.

VB: Where are you looking at the agentic state of things, and in particular Microsoft’s position as enterprises and individuals rush to adopt agentic AI?

MC: We can look at it from bottom to top. At the very base of the stack is our commitment to model choice. All along, we’ve had the OpenAI GPT frontier models. Now we have a really solid partnership with Anthropic, where we’re offering the Claude models. We just launched Claude Opus 4.8 on Azure — on Foundry, I should say — and at Build, we are introducing our new MAI model.

The MAI models are a set of frontier models that we’re building in-house. They are made for token efficiency, optimization and customization. We are specifically making them for our customers to customize on their own data sets.

One level above that, we are announcing hosted agents in Foundry. That is our managed agent capability in Foundry. It automatically handles scaling, containerization and those kinds of things. It is an environment where you can manage agents.

One level above that is the Foundry control plane. At least for the agents you build, you want to have control over them. This gives you observability into their cost, tokens and correctness. You can do continuous evaluations and sample interactions with those agents, run evals and make sure they are continuing to work and not drifting.

The big news is going to be the GA of what we call the IQs here at Microsoft. There are currently three, and there will be four. There is Foundry IQ, which is basically for knowledge — largely unstructured knowledge. There is Fabric IQ. We have a ton of customers who have entrusted a lot of data to the Microsoft Cloud in Fabric, Power BI and related technologies. Fabric IQ is about making an agent-facing interface for this data, so agents can get to it without literally going through a Power BI report. That’s ridiculous.

Work IQ is about the Microsoft ecosystem. You can look at Work IQ as the agentic face of all the Microsoft apps: Outlook, Teams, Word, SharePoint and all those kinds of things. How does an agent interact with those things? That is Work IQ.

And finally, the fourth IQ is Web IQ. We are releasing our new agent-facing web search capability. It can search the web, search through videos and even do some kinds of browsing tasks automatically. It is super fast, and it kind of has no face. It’s headless. The interface is intended for agents.

We will also be announcing Agent Optimizer. That includes a new type of evaluation that allows you to evaluate much more granularly whether an agent is actually working and working correctly. The optimization step can go back in and make modifications to the prompt, obviously with your consent, and modify your agent so it works more correctly going forward. Effectively, it creates a feedback loop to make agents work better.

VB: Microsoft has sometimes been criticized for murky and clunky product naming. Where do these IQ products sit? Are enterprise users supposed to go to IQ first, or is IQ more for developers to connect to?

MC: All of the IQs are headless. The concept of IQ is that each one provides a different type of context to an agent specifically. Largely, it will be developers interacting with the various IQs — developers and the agents they build.

The IQ brand is really about agent context. End users largely won’t interact with the IQs. It is true that if you use Microsoft 365 Copilot today, you’ll notice a little thing that says it is using Work IQ. So it is a little bit visible, but the customer or end user doesn’t have to go find the IQ. Their system or developers hook that up.

VB: Is the IQ family essentially Microsoft’s version of MCP? Is it using MCP, or is it something different?

MC: All of the IQs are indeed exposed as MCP servers. You have correctly characterized MCP as basically an agent-facing or self-describing API. It’s not that fancy. That’s really what it is, with some authentication layers and capabilities built in, which is super useful.

Something like Work IQ — really all the IQs — have to be authenticated. In order for Work IQ to see my email, Teams messages, documents and stuff like that, I have to be able to authenticate it on behalf of me.

That gets us to another core differentiator that we will be announcing at Build, which is agent identity. We have this Entra system, and Entra is, I believe, the world’s largest used identity system for human users. For some time now, you have been able to declare an agent to have an identity in there. Now, agents will be able to have their own identity, their own Teams box, their own email inbox and stuff like that.

These agents will use Work IQ to check their own email, check their own documents and that sort of thing.
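
To make the idea of a "self-describing" agent-facing API concrete, here is a minimal Python sketch. It is not Microsoft's implementation and does not use any MCP SDK: the registry, the `search_mail` tool and the token check are hypothetical stand-ins. Only the `name`, `description` and `inputSchema` fields follow the way MCP describes tools.

```python
import json
from typing import Any, Callable, Optional

# Hypothetical in-memory registry; illustrative only, not a real SDK.
TOOLS: dict[str, dict[str, Any]] = {}


def register(name: str, description: str, input_schema: dict[str, Any],
             handler: Callable[..., Any]) -> None:
    """Store a tool together with the metadata an agent needs to call it."""
    TOOLS[name] = {
        "description": description,
        "inputSchema": input_schema,
        "handler": handler,
    }


def list_tools() -> list[dict[str, Any]]:
    """Return machine-readable descriptions (without handlers) for discovery."""
    return [
        {"name": n, "description": t["description"], "inputSchema": t["inputSchema"]}
        for n, t in TOOLS.items()
    ]


def call_tool(name: str, arguments: dict[str, Any], token: Optional[str]) -> Any:
    """Invoke a tool, refusing calls that carry no credential."""
    if not token:
        raise PermissionError("caller must authenticate on the user's behalf")
    return TOOLS[name]["handler"](**arguments)


# Hypothetical tool standing in for something like a mailbox search.
register(
    "search_mail",
    "Search the signed-in user's mailbox for a phrase.",
    {"type": "object",
     "properties": {"query": {"type": "string"}},
     "required": ["query"]},
    lambda query: [f"message mentioning {query}"],
)

print(json.dumps(list_tools(), indent=2))
print(call_tool("search_mail", {"query": "hosted agents"}, token="demo-token"))
```

An agent would first read the output of `list_tools()` to learn what it can do, then call a tool by name with arguments that match the published schema.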

VB: Enterprises are not one-size-fits-all on models. Microsoft supports many leading models through Foundry and Azure, while also building its own. Is Microsoft a model company, an infrastructure company or a connector between models and work products?

MC: The answer is yes. We are obviously the hyperscaler. We are absolutely committed to model choice, and we will continue to offer the frontier models from all of the major players: OpenAI, Anthropic, Mistral, Black Forest, xAI — you name it. They are all going to be represented in there.

At the same time, we have what is now called our Microsoft AI Superintelligence Team, formed by Mustafa Suleyman, and we are building our own frontier models as well. Like I said earlier, we are really gearing these models toward optimization — token efficiency, bang for the buck and customization.

These are things our customers have been asking for: the ability to more finely customize models, whether that is fine-tuning or continued pre-training. Continued pre-training is literally changing the weights of the model, whereas fine-tuning is adding a little layer on top.

We have these capabilities in Foundry: fine-tuning, distillation and those kinds of things. I would note, by the way, that our MAI models are not distilled. Some model providers, especially some of the less scrupulous ones, will distill other models into theirs, and that can have unusual effects. We don’t do that. The data provenance of our models is of primary importance to us.

When we come out with these models, we want our customers to know that the data provenance is clean in terms of the rights to the data, where it came from and all that kind of stuff.

The choice thing also goes above the model layer. When we talk about Foundry hosted agents, we have the Microsoft Agent Framework. You talk about agent orchestration — how you make agents work together when you have multiple agents — and Microsoft Agent Framework is an excellent framework for that.

However, I can make a LangGraph or LangChain Foundry hosted agent. I can make a CrewAI Foundry hosted agent. I can use any number of orchestration frameworks and put that up as a Foundry hosted agent, and it becomes a first-class Foundry agent.

That means I get the observability. It shows up in the Foundry control plane. I can do evaluations on it. I can do traces on it. I can get all those things from the Foundry control plane with an agent built in really any framework I choose.

VB: Some companies are interested in Chinese and open-source models. How much of Microsoft offering its own models is about giving customers an American version of that?

MC: I can’t speak to that exactly. Of course, we offer DeepSeek models and Qwen models in Foundry, so we offer all of these choices today, and our customers can make that choice.

The MAI models are really focused on token efficiency and customizability. That is what our customers are demanding, and that is the gap we are filling.

VB: As agents take on longer tasks and more specialized work, will enterprises keep expanding the number of models they use, or will there be a winnowing?

MC: I do see it expanding. We are not just focused on tokens per se. A token is not a token is not a token. One token is not necessarily equivalent across these things. It is all about what you are doing with each token and the efficiency of that. It comes back to what kind of value you are getting for the cost. That is a lot of the rationale behind why we are developing our own MAI models.

Part of my job is to travel all around the world. I’ve been all over the place. For example, I’ve been working with Bayer. One of the things we are measuring is not just token usage, but number of users — monthly active users and daily active users — because we have a lot of first-party capabilities like Microsoft 365 Copilot. Over the last year, we’ve seen a 6x increase in monthly active users. We have over 20 million users of Microsoft 365 Copilot alone.

That is on the agents you use. In terms of the agents you build, Bayer put up its own agent system on Foundry, and now it has 20,000 of its own employees on it.

A few weeks ago, I was in Sydney, Australia, hanging out with AEMO, the Australian Energy Market Operator. They operate the electrical grid of Australia. They showed me that they had built agents to manage grid operations.

This is a human-centered thing. They have grid operators sitting in centers in West Sydney, Brisbane and places like that, and they are bombarded with alerts. I wouldn’t believe it if I hadn’t seen it myself. The alerts are constant. They built a system to triage those alerts. Is this alert a super major thing, or is it just that a transformer is getting a little hot? It also says, here is when we had this problem last time, and here is how we resolved it last time. Maybe now we need to replace this component, or whatever.

Ultimately, it is the grid operators making the choice. A lot of our philosophy here is human empowerment. These human-centered agents are the ones that are working best among our customers. What I saw at AEMO and Bayer is this notion of human empowerment: taking away some of the grunt work, or in the case of AEMO, taking billions of alerts and reducing them to something much more manageable and actionable for the people involved.

We are moving past the era where agents are just answering questions. AI in general is moving past that. We are not just answering questions anymore. We are moving toward a place where AI can really meaningfully help you do your work.

VB: How do observability, tokenomics, ROI analysis and agent governance fit into Microsoft Foundry?

MC: That is what the Foundry control plane is all about. We introduced it in November of last year. If you looked at my own Foundry control plane — I’ve built a ton of these agents, and I am a developer by background — you would see all of my agents that are running and the ones that are paused.

I can see how many tokens they’ve used over the last day, week or month. I can look at trends. I can look at costs, because the cost will be different depending on what underlying model I’m using. If I’m using our model router, it can route to different models depending on the complexity of the inbound prompt.

We also have Azure cost management overall. Azure has had cost management for over a decade, before the AI thing even happened. This integrates with overall Azure cost management.

It is not just narrowly about what your AI is doing. Your AI will be using storage resources, data resources and other compute resources around that AI. You can get a complete picture of not just the cost and token usage of the AI itself, but everything around it.

When you think about governance, that also extends to evaluation. One of the things we are releasing in preview is rubric-based evaluation. Rubric-based evaluation is much more granular.

Let’s say you have built a restaurant reservation agent. The things you want to test about that agent are not really groundedness. Groundedness is the opposite of hallucination, and that is very question-answering. For a restaurant reservation agent, you want to test very granular things. If you say, “Make me a table for two tomorrow,” did it come back and ask, “What time would you like the table?” Before it gave you a table for two tomorrow at 6 p.m., did it actually check that the table was available, or did it randomly give you a table without checking first?

There are very granular things you want to test about that specific use case. You don’t just want to test whether the agent works. You want to test whether the agent works right.

That is what we are approaching with our new rubric-based evaluation system. You will see that in Satya’s keynote. I have been using it myself lately, and I’m very happy about it. I’ve been waiting for this.
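
The restaurant example Casalaina describes can be sketched as a small set of checks run over an agent's recorded steps. This is an illustrative Python sketch, not Foundry's rubric-based evaluation feature: the trace format and the step names `ask_user`, `check_availability` and `book` are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable

# A trace is the ordered list of steps the agent took (hypothetical format).
Trace = list[dict]


@dataclass
class Criterion:
    name: str
    check: Callable[[Trace], bool]


def index_of(trace: Trace, action: str) -> int:
    """Position of the first step with this action, or -1 if absent."""
    for i, step in enumerate(trace):
        if step["action"] == action:
            return i
    return -1


def asked_for_time(trace: Trace) -> bool:
    """Did the agent ask the user what time they wanted?"""
    return any(
        s["action"] == "ask_user" and "time" in s.get("text", "").lower()
        for s in trace
    )


def checked_before_booking(trace: Trace) -> bool:
    """Did an availability check happen before the booking step?"""
    check = index_of(trace, "check_availability")
    book = index_of(trace, "book")
    return 0 <= check < book


RUBRIC = [
    Criterion("asks for the missing time", asked_for_time),
    Criterion("checks availability before booking", checked_before_booking),
]


def evaluate(trace: Trace) -> dict[str, bool]:
    """Score one trace against every rubric criterion."""
    return {c.name: c.check(trace) for c in RUBRIC}


good = [
    {"action": "ask_user", "text": "What time would you like the table?"},
    {"action": "check_availability", "time": "18:00"},
    {"action": "book", "time": "18:00"},
]
careless = [{"action": "book", "time": "18:00"}]

print(evaluate(good))
print(evaluate(careless))
```

Both traces end with a booking, so a pass/fail test on the outcome alone would not tell them apart. The per-criterion results do.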

VB: Microsoft is also partnering with companies like Anthropic and allowing Claude to work with Microsoft 365. How important is Copilot to this story? Why would someone turn to Copilot over other options?

MC: Microsoft 365 Copilot is a huge advantage for us. As I mentioned, we crossed the 20 million user mark on Copilot relatively recently.

The great thing about that is that it is the face. When you go into Foundry and make an agent, there is a button that says “publish to Copilot” — actually, it says “publish to Copilot in Teams,” because you can put it in Teams too.

The idea is that you want to put these agents where your users are. A lot of people who use the Microsoft ecosystem are in Teams, or they are using Copilot. I can create a custom agent, as many of my colleagues have, and now it is in Copilot, which I use maybe 50 times a day.

Since January, Copilot has become more and more capable. I now use it to draft my email. I am not just using it for question answering. I’m starting to use it to manage my calendar and draft emails. I really do this every day now.

When I want to use a custom agent — for example, to file my expenses, because we have a custom agent for that now — I can access that agent not in some random standalone interface, but in Copilot or Teams, where I already am.

That surface area that people are already engaging with is a major advantage.

VB: As people offload more repetitive work to AI, what are they able to spend more time doing?

MC: Let’s consider something I did yesterday. I got an email from a customer named Frankie, and he asked me a question about Foundry hosted agents. I knew the answer because I had talked to my colleague Jeff Holland, who is the head of our hosted agents product management. I had asked Jeff the same question two weeks ago.

Where or how I asked him, I don’t remember. Was it in Teams? Was it email? Was it a meeting? I don’t really remember. But I knew the answer to the question Frankie was asking.

So I went into Copilot and said, “Answer Frankie’s question about how hosted agents scale, and reference the conversation I had with Jeff a couple of weeks ago on this same topic.” And it did it. It drafted the email.

Over time, I have taught Copilot my style. I don’t do the bold-print thing. I tell it: don’t use em dashes and that kind of stuff. I have a certain style in the way I write emails. It’s a little terse, to be perfectly honest, but I want it to be the way I write.

It drafted this thing. It searched through my Teams messages, my emails and the transcripts of my meetings with Jeff. It used Work IQ, as a matter of fact. It found the answer, drafted the email and provided a link to the documentation that specifically covered the question Frankie was asking.

I looked at the draft and thought, yep, that’s it.

Yes, I could have composed this email myself. I knew the answer to the question. I could have looked up the documentation. If I dug around, I’m sure I could have found the conversation I had with Jeff in whatever medium that was. I could have done that stuff. It probably would have taken me, I don’t know, an hour to find all the information and compose it.

Instead, I did it in about a minute. I had a draft, I looked at it, I was happy with it, I pressed send, and that was the end of that.

It really is about giving people time back. It is not even just grunt work. It is all this time you spend looking things up and finding things. Now, I can make it take an action. It didn’t just answer the question. It fully drafted the email and copied Jeff.

VB: Do you fear for your job? How has AI changed your own work?

MC: I don’t fear for my job. My job has changed. For one thing, I do a lot more now, both in my business life and personal life.

This weekend I was using Web IQ, the new Web IQ. I’ve been car shopping. My car’s lease is coming up, and there is a very specific car I’m trying to find, which is hard to find. It’s a Hyundai Ioniq 6, which Hyundai, for whatever reason, has stopped offering in the United States. I’m going to get one, though.

I set my agent to the task, using Web IQ, of finding all the Hyundai Ioniq 6s available in the entire Bay Area — everywhere, all the way out to Sacramento, all the way as far south as Gilroy. I set it to this task, and then I went on a hike.

When I got back, I had a big long list of all the Hyundai Ioniq 6s, at least the 2024 and 2025 models, available in the entire Bay Area. From that, I started calling down these dealers.

Even in my personal life, I’m using it constantly. It saves me a ton of time. That would have taken me hours, to go through every single dealer’s inventory like this. But Web IQ could do that, and it was super quick.

VB: Any final thought for developers around this news?

MC: Foundry is really the place. This is the place where you can build your agents, scale your agents, test your agents and improve your agents. That’s what it’s all about, and it’s happening.

AI agents are learning on the job — just not for your whole team

Fri, 05 Jun 2026 17:51:03 GMT

When someone on a team corrects an AI agent — better prompts, better feedback, better context — that improvement disappears the moment a colleague opens the same tool. The correction doesn't transfer, and the next person starts from zero.

The problem compounds in multi-agent workflows, where teams expect agents to share context across users and tasks. Without a shared memory layer, every team member effectively trains a different version of the same agent — and those versions never sync.

That gap shows up in the numbers. According to Asana's own research, 75% of knowledge workers use AI on the job, but only 5% of companies have reported productivity gains.

“Model providers are getting really, really good at improving reasoning and retry loops, but what they’re not good at is bringing the enterprise work context in a way that human beings can reason about for shared memory,” Asana Chief Product Officer Arnab Bose told VentureBeat.

Asana has been building toward an agentic platform that centers on context and shared memory. Its Agentic Work Management platform ensures that if any team member corrects an agent, that correction applies to everyone else on the team.

“That context graph is automatically provided to agents operating inside Asana’s system so you don’t have to have every human member of the team become an expert at prompt engineering or context engineering,” Bose said.

Bose said the shared memory architecture matters beyond Asana's own product; it's the design decision enterprises need to make for any multi-agent system.

Shared memory also becomes important when enterprises begin moving from simple single agents to multi-agent workflows that need to share context and behaviors.

Memories for a multi-agent, multi-platform workflow

The models powering agents are stateless by design, so memory becomes a dedicated layer outside of a context window. While this area of AI innovation is marching towards maturity, the question of what gets stored, who controls it, and how it stays consistent when different agents and users write to the same instance remains largely unsolved.

This is manageable for use cases with only one user. In enterprise agentic workflows, however, agents are meant to work with the entire team. On most platforms, agents still act for individuals, which leads to repeated tasks, inconsistent versions of reality and mistakes that spread. Agents can also end up contradicting each other.
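
The difference between personal and team-scoped memory can be illustrated with a short Python sketch. It is a toy under stated assumptions, not any vendor's design: the `MemoryStore` class, the `user:` and `team:` scope keys and the sample notes are all hypothetical.

```python
from collections import defaultdict


class MemoryStore:
    """Hypothetical memory layer kept outside the model's context window."""

    def __init__(self) -> None:
        self._notes: dict[str, list[str]] = defaultdict(list)

    def remember(self, scope: str, note: str) -> None:
        # Scope is a plain key such as "user:ana" or "team:launch".
        if note not in self._notes[scope]:
            self._notes[scope].append(note)

    def recall(self, user: str, team: str) -> list[str]:
        # Team notes come first so shared corrections reach every member.
        return self._notes[f"team:{team}"] + self._notes[f"user:{user}"]


def build_prompt(store: MemoryStore, user: str, team: str, request: str) -> str:
    """The model is stateless, so remembered notes are re-sent on each call."""
    notes = "\n".join(f"- {n}" for n in store.recall(user, team))
    return f"Known corrections:\n{notes}\n\nRequest: {request}"


store = MemoryStore()
store.remember("user:ana", "Ana prefers bullet summaries")
store.remember("team:launch", "The fiscal quarter ends on 31 January")

# Ben never gave the agent this correction, but he is on the same team.
print(build_prompt(store, "ben", "launch", "Draft the quarterly status update"))
```

Here the correction stored at team scope reaches Ben's prompt, while Ana's personal preference does not. Questions the article raises, such as who may write to the team scope and how conflicting notes are resolved, are deliberately left out of the sketch.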

Sriharsha Chintalapani, co-founder and CTO of Collate, said in an email to VentureBeat that the lack of shared memory is a major obstacle for multi-agent workflows, particularly around consistency.

"Agents are sensitive to the quality of their prompts," Chintalapani said. "Someone with a strong understanding of the task will generally get more accurate results than someone less experienced. Partly that’s because they’re able to construct more detailed prompts, but also because they’re able to give the agent better feedback. The agent remembers the corrections it’s received and applies that knowledge to successive prompts. The more accurate the feedback, the better the agent will perform for that user."

He added that organizations should stop treating shared memory solely as a prompt engineering problem and instead think about building systems that repeat context across every conversation.

Neej Gore, chief data officer at Zeta Global, said in a separate email that shared context becomes a living memory that "compounds intelligence across the enterprise."

The opportunity may lie in building AI agents that retrieve memory relationally, pulling in relevant context based on what's being asked — an approach Chintalapani says few organizations outside the largest model providers are equipped to build.
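
As a rough illustration of pulling in context based on what is being asked, the Python sketch below ranks stored notes by word overlap with the question. Word overlap is a deliberately simple stand-in chosen for this example; it is not the approach Chintalapani describes, and the sample memories are invented.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set used for a crude relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, memories: list[str], k: int = 2) -> list[str]:
    """Return up to k memories sharing the most words with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(m)), m) for m in memories]
    # Drop memories with no overlap; sorted() is stable, so ties keep order.
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: s[0], reverse=True)
    return [m for _, m in ranked[:k]]


# Hypothetical team memories.
MEMORIES = [
    "Invoices for the Berlin office go to the finance queue",
    "Status reports use the red, amber, green format",
    "Berlin office invoices need a VAT number",
]

print(retrieve("Which queue handles Berlin invoices?", MEMORIES))
```

Only the two invoice-related notes are returned, so the unrelated status-report rule never enters the prompt. A production system would likely replace the overlap score with embeddings or a graph of relationships between records.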

Personal versus team agents

AI agents have already proliferated across enterprises, but many of them operate as personal agents doing work specific to individual users. Most prompts start from one person, files are uploaded by one account, and even agents living in a company-wide system mostly learn individual user preferences.

Most enterprise AI workflow platforms recognize that memory is important but approach it through different lenses. For example, Microsoft’s Copilot takes an individual-first approach by learning a user’s role within the organization, tone preferences and working patterns, which are then stored as personal memories for the agent to apply across the different Microsoft 365 surfaces.

For engineering and orchestration teams evaluating agentic platforms, the shared memory question is now a procurement criterion — not just a technical nicety. An agent that learns only for the person using it will require ongoing individual upkeep. One connected to a team-wide memory layer builds institutional knowledge automatically.