The Death of the Vibe Check
Why empirical benchmarks are the only way to survive the frontier model race
For the past year, the discourse surrounding large language models has been dominated by 'vibes'. We see a developer post a screenshot of a clever coding solution, or a writer marvel at a particularly lyrical paragraph, and we collectively decide that Model X is 'better' than Model Y. This is a dangerous way to build a business. When Anthropic released Sonnet 5, the industry was ready to repeat this cycle of shallow impressions. But impressions do not scale, and they certainly do not provide the rigorous data needed to integrate these tools into professional workflows. To know whether a model has actually improved, we must move past the anecdotal and toward the repeatable.
Building the Bench
The solution is not to trust a single, static benchmark like MMLU, which tends to reward memorisation once its questions leak into training sets. Instead, we need custom, repeatable evaluation harnesses. A testing framework called the 'How I AI Bench' was built with Claude Code in under forty-five minutes. It was not a survey of general knowledge but a targeted test of specific utility: PRD quality, prototype generation, agentic task completion, and agent personality. The methodology combined human scoring, weighted at 70% to capture the qualitative essence of a good output, with LLM-as-a-judge scoring, weighted at 30% to provide a baseline of logic and structure. This hybrid approach acknowledges that while machines can check for syntax, humans still define what constitutes a 'good' idea.
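The 70/30 blend is simple enough to sketch in a few lines. The snippet below is only an illustration of the weighting, not the bench's actual code: the function name `hybrid_score` and the shared 0 to 10 scale are assumptions made for the example.

```python
# Weights taken from the methodology described above.
HUMAN_WEIGHT = 0.7
JUDGE_WEIGHT = 0.3

# Assumption: both scorers use the same 0-10 scale.
MIN_SCORE, MAX_SCORE = 0.0, 10.0


def hybrid_score(human: float, judge: float) -> float:
    """Blend a human score and an LLM-as-a-judge score into one number."""
    for label, value in (("human", human), ("judge", judge)):
        if not MIN_SCORE <= value <= MAX_SCORE:
            raise ValueError(f"{label} score {value} is outside the 0-10 scale")
    return HUMAN_WEIGHT * human + JUDGE_WEIGHT * judge
```

With these weights, a human score of 8 and a judge score of 6 blend to 7.4. The judge can nudge the result, but it can never outvote the human reviewer.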
Vibes are for enthusiasts; benchmarks are for engineers.
When running Sonnet 5 blind against its competitors—including GPT-5.5 and Gemini 3 Pro—the results challenged the prevailing narrative. We often assume that the largest model is the best model, but the data suggests a more fragmented reality. For certain tasks, like writing a Product Requirements Document (PRD), a specific model might excel because of its adherence to structure. For others, like complex coding prototypes, a different model might win on its ability to reason through recursive errors. The 'winner' depends entirely on the job description you give the agent.
- Product Requirements: Claude Sonnet 5 for its structural discipline.
- Complex Prototyping: High-reasoning models with better error-correction loops.
- Daily Agent Interaction: Models with high 'personality' scores that don't feel robotic.
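Two pieces of that comparison are easy to make concrete: hiding model names from the human scorer, and picking a winner per task rather than overall. The sketch below is one possible way to do both. The helper names `blind_labels` and `winners_by_task` and the `(task, model, score)` row format are hypothetical, chosen for this example rather than taken from the bench.

```python
import random
from collections import defaultdict


def blind_labels(model_names, seed=None):
    """Map each real model name to an anonymous label such as 'Model A'.

    The order is shuffled so the label reveals nothing about the model.
    """
    rng = random.Random(seed)
    shuffled = list(model_names)
    rng.shuffle(shuffled)
    return {name: f"Model {chr(ord('A') + i)}" for i, name in enumerate(shuffled)}


def winners_by_task(rows):
    """Return {task: (model, mean_score)} for the best model on each task.

    rows is an iterable of (task, model, score) tuples; a model may have
    several scored outputs per task, which are averaged.
    """
    scores = defaultdict(list)
    for task, model, score in rows:
        scores[(task, model)].append(score)

    best = {}
    for (task, model), values in scores.items():
        mean_score = sum(values) / len(values)
        if task not in best or mean_score > best[task][1]:
            best[task] = (model, mean_score)
    return best
```

Scorers see only the anonymous labels while grading; the mapping is reversed afterwards, and the result is a table of winners keyed by task instead of a single leaderboard.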
The takeaway for agency owners and product leads is clear: stop asking your team if a model 'feels' better. Start building your own evaluation datasets. If you cannot measure the delta between Sonnet 4.6 and Sonnet 5 in your specific use case, you are not using AI; you are just playing with a very expensive toy. The competitive advantage of the next decade will go to those who can quantify the performance of their autonomous agents.
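Measuring that delta does not require heavy tooling. As a minimal sketch, assuming you already hold a score per eval case for each model, a paired comparison over the shared cases looks like this. The function name `paired_delta` and the dictionary layout are illustrative assumptions.

```python
from statistics import mean


def paired_delta(baseline, candidate):
    """Compare two models on the eval cases they were both scored on.

    baseline and candidate map a case id to that model's score.
    Cases scored for only one model are ignored.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("the two models share no eval cases")

    # Positive difference means the candidate beat the baseline on that case.
    diffs = [candidate[case] - baseline[case] for case in shared]
    return {
        "cases": len(shared),
        "mean_delta": mean(diffs),
        "wins": sum(d > 0 for d in diffs),
        "losses": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }
```

Scoring both models on the same cases matters: it keeps the comparison about the models rather than about which prompts each one happened to receive. A mean delta near zero with an even split of wins and losses is a signal that the upgrade may not matter for your use case.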
Replace subjective impressions with repeatable, task-specific evaluation harnesses to drive real ROI.