
In late April 2026, Anthropic published the results of Project Deal: a one-week experiment in which 69 of its San Francisco employees handed full authority over their personal marketplace transactions to Claude. The agents negotiated, made offers, and closed deals — 186 of them, worth just over $4,000 — without any human intervention beyond physically exchanging the goods at the end. Snowboards, books, kitchen gear, a bag of "19 perfectly spherical orbs of possibility" (ping pong balls, $3, sold).
It is, on its face, a simple proof point: AI agents can handle real-world commercial negotiation. But buried in the experiment is a finding that should make every CTO deploying agentic systems uncomfortable.
A Craigslist run entirely by Claude
The setup was deliberately mundane. Each participant got a $100 budget, settled at the end as gift cards against their net buying or selling. In December 2025, Claude interviewed every employee about what they wanted to sell and what they wanted to buy, using a structured interview format derived from Anthropic's own internal tools. From those interviews, an agent was provisioned for each person.
Then the agents were turned loose in Slack channels. They posted listings, made offers, counter-offered, and struck deals. The humans only re-entered the loop when it was time to physically swap the goods.
The experiment built directly on Project Vend, Anthropic's earlier — and rougher — test in which Claude ran a small office store. Project Vend exposed how badly an agent operating without explicit boundaries can drift. Project Deal added structure: defined budgets, defined goals, a contained marketplace, a one-week clock.
That 186 deals went through without intervention isn't trivial. Each one is a real economic transaction, with real preferences, real counterparties, and real money. Anthropic's own assessment: "We were struck by how well Project Deal worked."
The A/B test that wasn't supposed to be the headline
Quietly, Anthropic ran a parallel experiment inside Project Deal. Some participants were assigned agents powered by Claude Opus 4.5 — the frontier model. Others got agents running on Claude Haiku 4.5 — the smallest, cheapest model in the lineup. Participants weren't told which they had.
The objective outcomes diverged predictably. Opus agents struck better deals: more transactions, better prices, more efficient negotiation. By every measurable axis, the smarter model produced the better outcome.
The interesting finding wasn't that the smarter model won. It was that users with the cheaper model didn't notice they were losing.
In post-experiment surveys, participants assigned to Haiku-powered agents reported satisfaction levels indistinguishable from those running Opus. They couldn't tell. They had no internal benchmark for what a good deal looked like; they had only the agent's own reports of what it had done. From inside the experience, Haiku felt as competent as Opus.
Why this is the headline finding
Most of agentic AI's current deployment patterns assume the user can grade the work. A sales agent that drafts a bad outreach gets corrected. A coding agent that writes a broken function gets caught at review. A research agent that misses a key paper gets a follow-up prompt.
But the moment you delegate a closed-loop task — "go negotiate this," "go book this," "go decide which vendor to use" — that supervisory feedback breaks down. The user sees only what the agent reports. They don't see the deals that didn't happen, the price they could have gotten, the counter-offer the agent didn't make. They have no oracle.
In Project Deal, the oracle existed: Anthropic could compare Opus and Haiku outcomes directly because they ran both in parallel. In production, no business has that luxury. You ship one agent. Users tell you it works. You learn nothing about the gap between what your agent does and what the best-possible agent could do.
This is the failure mode Project Deal made visible — not "AI can't negotiate," but "users can't tell when their AI agent is mediocre."
Three things builders should take from this
If you're deploying agentic systems — internally for ops, externally for customers, or in any setup where the agent acts on someone's behalf without a human in the verification loop — Project Deal sharpens three operating principles.
1. Model tier is not a procurement decision. It's an outcome variable.
The default play in production agent deployments is to spec the cheapest model that passes basic acceptance tests. If the agent answers questions correctly in QA, ship it. Project Deal punctures that logic: the same agent running on a smaller model produced systematically worse business outcomes despite passing the basic competence bar. The competence bar and the outcome bar are different bars.
For closed-loop deployments (the ones where the agent transacts, decides, or commits), the cost calculus has to include the deals you don't close, not just the API spend. Often the frontier model pays for itself on a single deal's margin.
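To make that calculus concrete, here is a back-of-the-envelope sketch. Every number in it is an invented assumption for illustration, not a figure from Project Deal or from any vendor's pricing:

```python
# Back-of-the-envelope expected value per delegated negotiation.
# All numbers are invented assumptions for illustration only.

MARGIN_PER_CLOSED_DEAL = 40.00   # assumed average margin on a closed deal ($)

frontier_api_cost = 0.50         # assumed API spend per negotiation ($)
small_api_cost = 0.05

frontier_close_rate = 0.60       # assumed fraction of negotiations that close
small_close_rate = 0.45

def expected_value(close_rate: float, api_cost: float) -> float:
    """Expected net value of one delegated negotiation."""
    return close_rate * MARGIN_PER_CLOSED_DEAL - api_cost

print(f"frontier: ${expected_value(frontier_close_rate, frontier_api_cost):.2f}")
print(f"small:    ${expected_value(small_close_rate, small_api_cost):.2f}")
# frontier: $23.50, small: $17.95. The 10x API premium is noise next to the
# outcome gap, which is the point of treating model tier as an outcome variable.
```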
2. You need an evaluation harness your users can't see.
If users can't grade an agent's quality, you have to grade it yourself, continuously. That means running a parallel evaluation track in production: shadow-running competing models against the same task and scoring outputs against ground truth where available, or against expert human review where it isn't. The eval doesn't go to users — they don't need to know which model handled their request — but you need it for your own decision-making.
This is the part of agentic infrastructure that routinely gets cut from MVPs and rediscovered painfully six months later, when a competitor's agent quietly produces 15% better outcomes and your users still report 4-out-of-5 satisfaction.
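A minimal sketch of what that hidden evaluation track can look like, assuming you have an agent runtime to call and some way to score outcomes. Every function, type, and model name below is a hypothetical stand-in, not a real SDK:

```python
import random
from dataclasses import dataclass

SHADOW_SAMPLE_RATE = 0.05  # shadow-run 5% of production traffic

@dataclass
class Task:
    id: str
    payload: str

def run_agent(task: Task, model: str) -> str:
    # Placeholder: call your actual agent runtime here.
    return f"[{model}] result for {task.id}"

def score_outcome(task: Task, result: str) -> float:
    # Placeholder: score against ground truth where it exists,
    # or queue for expert human review where it doesn't.
    return 0.0

eval_log: list[dict] = []

def handle_task(task: Task) -> str:
    primary = run_agent(task, model="production-tier")     # what the user gets
    if random.random() < SHADOW_SAMPLE_RATE:
        shadow = run_agent(task, model="challenger-tier")  # the user never sees this
        eval_log.append({
            "task_id": task.id,
            "primary_score": score_outcome(task, primary),
            "shadow_score": score_outcome(task, shadow),
        })
    return primary  # only the primary result ships
```

The design choice that matters is the last line: the challenger's output is logged and scored but never ships, so users are never exposed to an experiment they didn't sign up for.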
3. Keep humans in the supervisory loop, not the negotiation loop.
Project Deal worked because Anthropic kept its humans in the right place: setting goals, defining preferences, exchanging the physical goods. The negotiation itself — the part agents are good at — was fully automated.
The wrong mental model is "humans approve every action." That kills the throughput advantage agents bring. The right model is "humans set the bounds and review the outcomes." For AI sales agents reaching out on behalf of a sales rep, that means humans set ICP, message strategy, and acceptable concessions; the agent runs the conversation; the human reviews flagged exceptions and aggregate performance, not every email. For workflow automation more broadly, it means defining quality gates and escalation triggers, not approval steps.
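A sketch of what bounds and escalation triggers, rather than approval steps, can look like in code. The structure and thresholds are illustrative assumptions, not a prescribed policy:

```python
from dataclasses import dataclass

# Human-set bounds for an autonomous negotiation agent. The agent acts freely
# inside these limits; only exceptions are flagged for human review.
@dataclass
class Bounds:
    floor_price: float      # never accept below this
    max_rounds: int = 10    # hard stop on back-and-forth

def next_action(offer: float, round_no: int, bounds: Bounds) -> str:
    """Per counter-offer decision: close, keep negotiating, or escalate."""
    if offer >= bounds.floor_price:
        return "close"      # inside the bounds: the agent may accept on its own
    if round_no >= bounds.max_rounds:
        return "escalate"   # limit hit: route to a human, don't auto-decide
    return "counter"        # keep negotiating autonomously

# Example: a $45 offer against a $50 floor in round 3 means keep negotiating.
assert next_action(45.0, 3, Bounds(floor_price=50.0)) == "counter"
```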
Project Deal as a template
The most useful thing about Project Deal isn't its findings — it's its shape. A small, contained, time-boxed agentic deployment, with hard limits ($100 budget, one week, defined goods), measurable outcomes (deals closed, dollars transacted, participant satisfaction), and a built-in A/B for evaluation.
Most companies trying to deploy agentic systems for the first time skip this stage. They build a general-purpose agent, ship it to a broad user base, and try to evaluate it through usage telemetry alone. By the time they realise the model tier was wrong or the supervisory pattern was broken, the agent has been live for months with quiet underperformance and no way to attribute it.
A Project-Deal-shaped pilot — narrow scope, hard boundaries, measurable outcomes, parallel evaluation — gets the answers in weeks. Anthropic ran theirs in seven days. The output was a clear technical finding that would have taken a public deployment six months to surface, if it surfaced at all.
That is the pattern worth copying. Not the marketplace. The methodology.
What to do this quarter
If there's a single takeaway for anyone running a technology org right now, it's this: pick one closed-loop task you've been considering for agentic automation — vendor selection, lead qualification, contract negotiation, support escalation, internal procurement — and design a Project-Deal-shaped pilot for it. Define the scope. Define the budget. Define the success metrics. Run two model tiers in parallel for the first cycle. Compare outcomes against each other and against the human-only baseline.
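One way to force those definitions before you start is to write the pilot down as a spec. The fields below mirror Project Deal's shape; the values are placeholders to replace with your own task's numbers:

```python
from dataclasses import dataclass

# A Project-Deal-shaped pilot, written as a spec. All values are placeholders.
@dataclass(frozen=True)
class PilotSpec:
    task: str                      # one closed-loop task, narrowly scoped
    budget_per_participant: float  # hard limit, like Project Deal's $100
    duration_days: int             # time-boxed, like Project Deal's one week
    model_tiers: tuple[str, ...]   # run at least two in parallel from day one
    metrics: tuple[str, ...]       # decide up front what "better" means
    baseline: str                  # what both tiers get compared against

pilot = PilotSpec(
    task="lead qualification",
    budget_per_participant=100.0,
    duration_days=7,
    model_tiers=("frontier", "small"),
    metrics=("tasks_completed", "outcome_value", "participant_satisfaction"),
    baseline="human-only",
)
```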
The answers won't be the same as Anthropic's. But the methodology — and the discomfort it produces when you find out what your users couldn't tell you — is the part of agentic AI worth taking seriously now, not later.