Vending Bench Stress Test: 50,000 requests per minute

At this point, the idea that AI will replace jobs should no longer be treated as a conspiracy theory. AI will replace some jobs, though not all of them. More importantly, as models improve, AI will increasingly be able to run entire businesses without a human in the loop. That is the natural ceiling of agentic capability if we want to fully leverage these systems in the context of human labor.

To probe how close we actually are to that ceiling, a YC-backed startup, Andon Labs, designed Vending-Bench 2, a deliberately unglamorous but brutally revealing experiment. Instead of testing models on short reasoning tasks or synthetic planning puzzles, they asked a harder question: can an AI agent run a small business, continuously, over a long period of time?

The Experiment Setup (Brief)

In Vending-Bench 2, frontier language models are placed in control of a simulated vending-machine business for roughly one year of simulated time. The agent must make all operational decisions, including pricing, inventory restocking, supplier selection and negotiation, cost management, and cash allocation. There is no single "correct" move, only delayed economic feedback. Performance is measured by the money balance over time, making this a long-horizon, compounding decision problem.

Benchmarking and the "Perfect" Strategy

Rather than rewarding clever tricks, Andon Labs outlines in its blog post what a near-optimal strategy actually looks like:

Avoid unnecessary actions
Stabilize pricing early
Lock in reliable suppliers
Manage cash conservatively
Minimize decision churn

In other words, the winning strategy is boring, disciplined, and economically coherent. This is exactly the kind of behavior humans learn through experience and exactly the kind of behavior many agents fail to sustain.

Why Most Agents Collapse (Key Drawbacks)

This is where the benchmark becomes interesting from an agentic AI perspective:

1. Long-Horizon Drift

Many agents lose strategic coherence over time. Small early mistakes are not corrected; they compound. The agent continues acting as if outdated assumptions were still true.

2. State Mismanagement

Agents struggle to maintain an accurate internal state of the business, including cash, inventory, and supplier reliability. Persistent memory exists, but correct memory updating does not.
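One way to picture "correct memory updating" is a state that is only ever derived from a log of recorded events, rather than from whatever the agent last remembers believing. The sketch below is an illustration under stated assumptions, not how Vending-Bench 2 or any tested agent works: the BusinessState class, the replay helper, and the event fields (kind, item, qty, unit_cost, unit_price) are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessState:
    """Hypothetical ledger-backed state; field and event names are illustrative."""
    cash: float = 0.0
    inventory: dict = field(default_factory=dict)  # item name -> units in stock

    def apply(self, event: dict) -> None:
        """Update cash and inventory from one recorded event."""
        item, qty = event["item"], event["qty"]
        if event["kind"] == "restock":
            self.inventory[item] = self.inventory.get(item, 0) + qty
            self.cash -= qty * event["unit_cost"]
        elif event["kind"] == "sale":
            self.inventory[item] = self.inventory.get(item, 0) - qty
            self.cash += qty * event["unit_price"]
        else:
            raise ValueError(f"unknown event kind: {event['kind']}")


def replay(opening_cash: float, events: list) -> BusinessState:
    """Rebuild the state from scratch so stale beliefs cannot linger."""
    state = BusinessState(cash=opening_cash)
    for event in events:
        state.apply(event)
    return state
```

Because the state is recomputed from the log, a mistaken belief about cash or stock cannot survive past the next replay. The cost is that every transaction has to be written down.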

3. Over-Action Bias

Instead of learning when not to act, agents often intervene too frequently by renegotiating suppliers, changing prices, or restocking unnecessarily, slowly bleeding capital.
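"Learning when not to act" can be approximated with a simple gate in front of every intervention. The following is a minimal sketch, assuming the agent can estimate an expected gain and a cost for each action. ActionGate, min_gain, and cooldown_days are hypothetical names and thresholds, not part of the benchmark.

```python
from dataclasses import dataclass, field


@dataclass
class ActionGate:
    """Hypothetical filter that rejects low-value or too-frequent interventions."""
    min_gain: float     # smallest expected net gain worth acting on
    cooldown_days: int  # minimum days between two actions of the same kind
    last_acted: dict = field(default_factory=dict)  # action kind -> last day acted

    def should_act(self, kind: str, day: int, expected_gain: float, cost: float) -> bool:
        if expected_gain - cost < self.min_gain:
            return False  # the net benefit is too small to justify the churn
        last = self.last_acted.get(kind)
        if last is not None and day - last < self.cooldown_days:
            return False  # acted on this recently; wait for feedback to arrive
        self.last_acted[kind] = day
        return True
```

With a seven-day cooldown, a reprice on day 1 blocks another reprice on day 3 but allows one on day 8. The cooldown exists because economic feedback is delayed, so acting again before the last change has played out only adds noise.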

4. Poor Economic Abstractions

Most models reason well locally but fail to internalize concepts like burn rate, cash runway, or risk buffering, which are core primitives for real businesses.
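These primitives are simple arithmetic, which makes the failure to use them more striking. As a rough sketch, net burn is expenses minus revenue per period, and runway is cash divided by net burn. The CashSnapshot type, the can_afford check, and the min_runway threshold below are illustrative assumptions, not anything defined by Vending-Bench 2.

```python
from dataclasses import dataclass


@dataclass
class CashSnapshot:
    """Hypothetical per-period summary; the field names are illustrative."""
    cash: float      # money on hand at the end of the period
    revenue: float   # sales collected during the period
    expenses: float  # fees, restocking, and other costs during the period


def net_burn(s: CashSnapshot) -> float:
    """Net cash consumed per period; positive means the business is losing money."""
    return s.expenses - s.revenue


def runway_periods(s: CashSnapshot) -> float:
    """Periods until cash reaches zero at the current burn; infinite if not burning."""
    burn = net_burn(s)
    if burn <= 0:
        return float("inf")
    return s.cash / burn


def can_afford(s: CashSnapshot, order_cost: float, min_runway: float) -> bool:
    """Risk buffer: approve an order only if runway stays at or above min_runway."""
    after = CashSnapshot(s.cash - order_cost, s.revenue, s.expenses)
    return after.cash >= 0 and runway_periods(after) >= min_runway
```

For example, with 500 in cash, 300 in revenue, and 400 in expenses per period, net burn is 100 and runway is 5 periods. A 200 order leaves 3 periods of runway, while a 250 order leaves only 2.5 and would be rejected under a 3-period buffer.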

5. No Self-Correction Loop

Once performance degrades, many agents lack a mechanism to pause, reassess strategy, and recover. They keep executing flawed policies until failure.
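A self-correction loop needs a trigger before it needs a strategy. One simple candidate is a drawdown check on the balance history: if the current balance has fallen too far below its running peak, stop executing and reassess. The sketch below is one possible trigger under that assumption. The needs_reassessment function and the 20% default threshold are hypothetical choices, not drawn from the benchmark.

```python
from __future__ import annotations


def needs_reassessment(balances: list[float], max_drawdown: float = 0.2) -> bool:
    """Return True when the latest balance sits more than max_drawdown below the peak."""
    if not balances:
        return False  # no history yet, nothing to judge
    peak = max(balances)
    if peak <= 0:
        return True  # never had a positive balance; always worth a rethink
    drawdown = (peak - balances[-1]) / peak
    return drawdown > max_drawdown
```

A history of 500, 600, 450 is a 25% drop from the peak and would trip the check, while 500, 550, 600, 590 would not. What the agent does after the trigger fires (replanning, reverting to a known-good policy, or freezing spending) is a separate design question.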

Looking Forward

Successful autonomous organizations are likely just a better model and a better architecture away. Drawbacks such as context drift, messy state management, over-action bias, and poor economic abstractions are not fundamental blockers. They are engineering and modeling problems that stronger models and smarter system design can address.

Vending-Bench 2 was not a failed attempt in any sense. It was a signal: clear evidence that fully autonomous organizations are closer than many anticipate.