Wednesday, 1 July 2026

The Deep Feed

Benchmarks, Leverage, and the New Mathematical Frontier

52 min read · 6 pieces

In this issue
01 The Death of the Vibe Check 12 min
02 The Three Ladders of Leverage 10 min
03 The Mathematical Frontier 15 min
04 The Hidden Cost of Sonnet 5 5 min
05 The Agentic Demo 7 min
06 The AI Compass 3 min

Editor's Letter

Tonight, we move past the superficial hype of AI releases to examine the mechanics of real utility. From the rigorous testing of new frontier models to the shifting role of the product manager, we explore how the tools are being used by those in the trenches.

01 Lenny's Newsletter

The Death of the Vibe Check

Why rigorous benchmarking is the only way to survive the frontier model arms race

By Claire Vo · 12 min read
Editor's note: As models iterate weekly, 'gut feel' is no longer a viable strategy for professional implementation.

For months, the industry has operated on a diet of 'vibe checks'. A developer tries a new model, finds it writes a decent email or a clean function, and declares it a winner. This approach is dangerous. It ignores the statistical variance and the specific failures that emerge under pressure. When Anthropic released Sonnet 5, the temptation was to simply feel its way through a few prompts. Instead, a more disciplined approach was required: building a repeatable, objective evaluation harness.

The How I AI Bench

To move beyond anecdotal evidence, a custom evaluation harness, the How I AI Bench, was constructed. This wasn't a simple automated script, but a hybrid system. It combined human scoring (70%) with LLM-as-judge scoring (30%). The logic is sound: humans are better at detecting the subtle 'soul' or personality of an agent, while LLMs are efficient at checking technical correctness against a rubric. Running 64 generations blind against competitors like GPT-5.5 and Gemini 3 Pro makes the true performance delta visible.
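The 70/30 blend can be sketched in a few lines. This is a minimal illustration, not the harness itself: the function name, the shared 0-10 scale, and the sample scores are all assumptions, and the real bench may aggregate differently.

```python
from statistics import mean

HUMAN_WEIGHT = 0.7  # share given to human scoring in the text
JUDGE_WEIGHT = 0.3  # share given to LLM-as-judge scoring in the text


def hybrid_score(human_scores, judge_scores):
    """Blend averaged human and LLM-judge scores on a shared scale."""
    if not human_scores or not judge_scores:
        raise ValueError("need at least one score from each source")
    return HUMAN_WEIGHT * mean(human_scores) + JUDGE_WEIGHT * mean(judge_scores)


# Hypothetical scores for one blind generation (0-10 scale, invented).
example = hybrid_score(human_scores=[8, 6], judge_scores=[9])
print(round(example, 2))
```

Averaging each source before weighting keeps the 70/30 split intact even when the number of human raters and judge runs differs.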

Trusting a single scoring method is a recipe for delusion; you need the friction between human intuition and machine logic.

The results of such a test reveal that model performance is highly task-specific. A model that excels at generating a Product Requirements Document (PRD) might fail miserably at maintaining a consistent agentic personality during a long-running task. We see a divergence between 'intelligence' as a general concept and 'utility' as a functional metric. Sonnet 5, for instance, shows specific strengths in prototype generation that distinguish it from the more generalist approaches of its peers.

Model Recommendations by Task
  • Claude Sonnet 5: Best for PRD generation and rapid prototyping
  • GPT-5.5: Strong contender for general reasoning tasks
  • Gemini 3 Pro: Useful for specific multimodal or large-context needs
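Task-level recommendations like these fall out of grouping scores by task rather than averaging everything together. A rough sketch, assuming results arrive as (task, model, score) tuples; the task names, model labels, and scores below are invented placeholders, not results from the benchmark.

```python
from collections import defaultdict


def best_model_per_task(results):
    """Return {task: model}, picking the highest mean score for each task.

    `results` is an iterable of (task, model, score) tuples.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for task, model, score in results:
        scores[task][model].append(score)

    winners = {}
    for task, by_model in scores.items():
        winners[task] = max(
            by_model, key=lambda m: sum(by_model[m]) / len(by_model[m])
        )
    return winners


# Made-up scores, purely for illustration.
sample = [
    ("prd", "model_a", 8.0), ("prd", "model_b", 7.0),
    ("persona", "model_a", 5.5), ("persona", "model_b", 7.5),
    ("prd", "model_a", 9.0), ("prd", "model_b", 6.5),
]
winners = best_model_per_task(sample)
print(winners)
```

A single overall average would hide exactly the divergence the section describes: here one placeholder model wins on PRDs and the other on persona consistency.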

The era of the 'magic' model is over. We are entering the era of the 'evaluated' model. For any agency or product team, the competitive advantage will not come from knowing which model is 'better' in a general sense, but from knowing which model scores measurably higher on your specific, repeatable workflows.

Key Takeaway

Stop relying on gut feelings; build repeatable, hybrid evaluation harnesses to determine true model utility.

02 Lenny's Newsletter

The Three Ladders of Leverage

How the role of the Product Manager is being rebuilt from the ground up

By Colin Matthews · 10 min read
Editor's note: The PM role is shifting from coordination to technical execution.

The traditional Product Manager has spent years as a professional coordinator. They sit in the middle of engineering, design, and stakeholders, ensuring everyone is aligned and the roadmap is clear. This role is dying. As AI tools become capable of handling the administrative and connective tissue of product development, the PM who only coordinates will find themselves redundant. The new PM must be a builder.

Ascending the Rungs

Leverage in the AI era can be categorised into three distinct ladders: Personal, Product, and Systems. Personal leverage is about your own output—using AI to draft documents or research topics faster. Product leverage is about the speed of shipping—using AI to prototype real code or query databases without waiting for an engineer. Systems leverage is the highest form, where you build repeatable, automated workflows that allow AI to complete multi-step tasks with minimal oversight.

The most successful PMs are moving from people-management to agent-management.

To reach the top of these ladders, a PM needs more than just a ChatGPT subscription. They need technical literacy. This means understanding how to use tools like Cursor for coding, how to query data through the Model Context Protocol (MCP), and how to set up automated evaluations to ensure the quality of AI-generated work. The gap between 'non-technical' and 'technical' is being bridged by these very tools.

The Leverage Framework
  • Rung 1: Assistance (AI helps you write/research)
  • Rung 2: Delegation (You pass tasks to AI and review)
  • Rung 3: Autonomy (AI completes multi-step tasks and self-checks)
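The jump from Rung 2 to Rung 3 is essentially a loop in which the check runs automatically instead of waiting for a human reviewer. One way to picture it, as a sketch under stated assumptions: `run_with_self_check`, the step callables, and the length-based check are hypothetical stand-ins for real agent calls and real evaluations.

```python
def run_with_self_check(steps, check, max_attempts=3):
    """Run each step, retrying until `check` accepts its output.

    `steps` is a list of (name, callable) pairs; each callable receives the
    attempt number and returns an output. `check(name, output)` returns True
    when the output passes. A step that never passes raises RuntimeError so
    a human can take over.
    """
    outputs = {}
    for name, step in steps:
        for attempt in range(1, max_attempts + 1):
            result = step(attempt)
            if check(name, result):
                outputs[name] = result
                break
        else:
            raise RuntimeError(
                f"step {name!r} failed {max_attempts} checks; escalate to a human"
            )
    return outputs


# Stand-ins for agent calls: the draft only passes on its second attempt.
steps = [
    ("draft", lambda attempt: "tbd" if attempt == 1 else "a longer draft"),
    ("summary", lambda attempt: "done"),
]
outputs = run_with_self_check(steps, check=lambda name, out: len(out) >= 4)
print(outputs)
```

The escalation path matters as much as the loop: autonomy with minimal oversight still needs a defined point where the work comes back to a person.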

This shift does not mean PMs need to become software engineers. It means they need to become 'AI-native builders'. They need to know how to leverage an agent to run a test, how to use a coding assistant to verify a hypothesis, and how to build systems that scale their impact far beyond their own hours in a day.

Key Takeaway

Move up the ladders of leverage by transitioning from a coordinator to an AI-augmented builder.

03 Dwarkesh Podcast

The Mathematical Frontier

What AI's progress in formal logic tells us about the future of intelligence

By Dwarkesh Patel · 15 min read
Editor's note: Mathematics is the canary in the coal mine for AGI.

Mathematics is currently the fastest-moving frontier for AI. While LLMs have made strides in language, the formal, logical rigor required for high-level mathematics presents a different kind of challenge. Yet, we are seeing AI discover new proofs and solve complex problems. This isn't just a win for mathematicians; it is a window into the nature of intelligence itself. If an AI can master the abstract structures of math, what does that imply for the rest of the cognitive world?

The Spiky Frontier

Progress in AI is not a smooth upward curve; it is 'spiky'. There are specific domains where models perform at superhuman levels, such as certain types of geometry or combinatorics, while remaining unable to handle basic reasoning in other areas. This spikiness suggests that intelligence is not a single, monolithic capability, but a collection of specialized skills that can be mastered independently. The 'aha' moment of AGI will likely not be a single event, but a series of these spikes appearing in sequence.

AI is not just solving problems; it is revealing the fractal nature of intelligence.

One of the most significant tensions lies in the 'verification loop'. A mathematical breakthrough is only useful if humans can understand and verify it. If an AI produces a proof for the Riemann hypothesis that is ten thousand pages long and relies on logic no human can follow, has it actually contributed to human knowledge? We face a future where AI might expand the boundaries of what is true, while simultaneously shrinking the boundaries of what is understandable.

The implication for students and professionals is profound. The value of rote learning and standard problem-solving is evaporating. In a world where AI can bridge the gaps between disparate mathematical fields, the human role shifts toward curation, the formulation of the right questions, and the conceptual synthesis of what the machines discover.

Key Takeaway

AI's success in mathematics suggests that intelligence is a collection of specialized spikes rather than a single, unified capability.

04 Simon Willison

The Hidden Cost of Sonnet 5

Tokenization, pricing, and the reality of the new API economics

By Simon Willison · 5 min read
Editor's note: A technical look at why 'same price' doesn't mean 'same cost'.

When a new model is released, the headline usually focuses on performance: 'Faster, smarter, better'. But for the developers building on these APIs, the real story is in the tokenization. Anthropic's release of Sonnet 5 includes a significant change to the underlying tokenizer. While the nominal price per million tokens remains identical to Sonnet 4.6, the way text is broken down into those tokens has changed radically.

The Tokenization Tax

Testing shows that the same input text now produces approximately 30% to 40% more tokens than before. Because the price per token is unchanged, cost scales directly with token count: for English text that is a roughly 1.4x increase in effective cost, and for Python code about 1.28x. This is a classic case of hidden inflation. The 'price' is the same, but the 'purchasing power' of your dollar has been slashed. For companies running inference at massive scale, a delta of 30% or more can be the difference between a profitable product and a money-losing one.

A price freeze is not a discount if the underlying unit of measurement has shrunk.

Effective Cost Increases (Estimated)
  • English Text: ~42% increase
  • Spanish Text: ~33% increase
  • Python Code: ~28% increase
  • Simplified Mandarin: ~1% (Negligible change)
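The arithmetic behind these estimates is simple enough to run against your own traffic. The sketch below takes the multipliers from the list above; the price per million tokens and the monthly volume are hypothetical placeholders, not Anthropic's actual figures.

```python
def effective_cost(price_per_million, old_tokens, token_multiplier):
    """Cost of the same text after a tokenizer change, at an unchanged price."""
    new_tokens = old_tokens * token_multiplier
    return price_per_million * new_tokens / 1_000_000


# Multipliers implied by the estimates above (~42% more tokens -> 1.42).
MULTIPLIERS = {"english": 1.42, "spanish": 1.33, "python": 1.28, "mandarin": 1.01}

PRICE = 3.00             # hypothetical USD per million input tokens
OLD_TOKENS = 10_000_000  # hypothetical monthly volume under the old tokenizer

baseline = effective_cost(PRICE, OLD_TOKENS, 1.0)
for language, multiplier in MULTIPLIERS.items():
    cost = effective_cost(PRICE, OLD_TOKENS, multiplier)
    print(f"{language}: ${cost:.2f} vs ${baseline:.2f} before")
```

In practice the multiplier should be measured rather than assumed: count tokens for a sample of your real prompts under both tokenizers and divide.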

This change also highlights a strategic shift. By optimizing the tokenizer for certain languages or structures, providers can influence how different types of users consume their services. It also suggests that as models become more complex, the overhead of processing them grows, and the cost of that complexity is being passed directly to the consumer through the math of the tokenizer.

Key Takeaway

Always audit the tokenizer, not just the price list; new models can be significantly more expensive in real terms.

05 Simon Willison

The Agentic Demo

Closing the loop between code generation and visual verification

By Simon Willison · 7 min read
Editor's note: The next step for coding agents is the ability to show their work.

One of the greatest friction points in working with coding agents is the 'black box' problem. An agent tells you it has completed a task, but you have no way to verify the result without manually running the code and navigating the interface. This creates a cycle of constant manual checking. The solution is to force the agent to produce a visual record of its work—a video demo.

Automated Storyboarding

The new 'shot-scraper video' command enables this by using a YAML-based storyboard. This allows a developer—or more importantly, an AI agent—to define a sequence of actions: click this, wait for that, fill in this field. The tool then uses Playwright to execute these actions and record a high-quality MP4 video. This transforms the agent's output from a mere text block of code into a verifiable, visual artifact.
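To make the idea concrete, here is a sketch of how a storyboard might be replayed. It does not reproduce shot-scraper's YAML schema or internals: the step format, `run_storyboard`, and `FakePage` are assumptions for illustration. The four page methods do match Playwright's Python `Page` API, so a real page could be passed in; Playwright records video when the browser context is created with `record_video_dir`, and its native output is WebM, so any MP4 conversion is not shown here.

```python
def run_storyboard(page, steps):
    """Replay storyboard steps against a Playwright-style page object.

    Each step is a dict with an "action" key plus that action's arguments.
    Returns the number of steps executed.
    """
    for index, step in enumerate(steps):
        action = step["action"]
        if action == "goto":
            page.goto(step["url"])
        elif action == "click":
            page.click(step["selector"])
        elif action == "fill":
            page.fill(step["selector"], step["value"])
        elif action == "wait_for":
            page.wait_for_selector(step["selector"])
        else:
            raise ValueError(f"step {index}: unknown action {action!r}")
    return len(steps)


class FakePage:
    """Stand-in that logs calls, so the sketch runs without a browser."""

    def __init__(self):
        self.calls = []

    def goto(self, url):
        self.calls.append(("goto", url))

    def click(self, selector):
        self.calls.append(("click", selector))

    def fill(self, selector, value):
        self.calls.append(("fill", selector, value))

    def wait_for_selector(self, selector):
        self.calls.append(("wait_for_selector", selector))


# Hypothetical storyboard; the URL and selectors are invented.
storyboard = [
    {"action": "goto", "url": "http://localhost:8000/"},
    {"action": "fill", "selector": "#name", "value": "Ada"},
    {"action": "click", "selector": "button[type=submit]"},
    {"action": "wait_for", "selector": ".success"},
]
page = FakePage()
count = run_storyboard(page, storyboard)
print(count, page.calls)
```

Keeping the storyboard as plain data is what makes it agent-friendly: the agent only has to emit a list of steps, and unknown actions fail loudly instead of silently producing a misleading video.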

A coding agent that can record its own demo is an agent that can be trusted.

The power of this approach lies in the feedback loop. If an agent can read the tool's `--help` documentation, it can construct its own storyboards. It can write the code, run the server, execute the video recording, and then present the video alongside the code. This reduces the cognitive load on the human developer, who can now 'watch' the agent's work rather than just reading its logs.

We are moving toward a workflow where the primary interaction with software isn't typing commands, but reviewing demonstrations. The ability for agents to provide these visual proofs is a critical step in the transition from simple code completion to fully autonomous software engineering.

Key Takeaway

Visual verification via automated video demos is the key to building trust in autonomous coding agents.

06 Simon Willison

The AI Compass

Mapping the ideological landscape of the generative era

By Simon Willison · 3 min read
Editor's note: Understanding where you stand on the ethics of AI is the first step to navigating it.

As AI technologies become more integrated into the fabric of society, the debate around their use has moved from the fringes of academia to the center of public discourse. It is no longer just about 'if' we should use AI, but 'how' we should use it, who should own the models, and what the ethical implications of autonomous systems are. This complexity has created a fractured ideological landscape.

Archetypes of Thought

The 'AI Compass' is an attempt to categorise these diverse viewpoints. Rather than a simple binary of 'pro-AI' or 'anti-AI', it identifies 30 distinct archetypes. These range from the 'Garage Tinkerer', who focuses on local, open-source experimentation, to more institutional or regulatory-focused viewpoints. This taxonomy helps clarify where different players sit on the spectrum of ethics, control, and utility.

The debate is no longer a binary; it is a spectrum of 30 distinct philosophies.

Understanding your own position on this compass is not about finding a 'correct' answer, but about understanding your own biases and priorities. Are you more concerned with the decentralisation of power, or the safety of the models? Do you prioritise rapid innovation, or the preservation of human-centric data rights? Identifying your archetype allows for more productive engagement with opposing views.

Why Archetypes Matter
  • Clarifies personal ethical priorities
  • Identifies potential points of friction in policy
  • Helps navigate the fragmented AI community

Key Takeaway

AI ethics is not a binary debate; use archetypes to understand the complex spectrum of modern thought.

Endnote

Tonight's readings suggest a single theme: the transition from novelty to utility. We are seeing the end of the 'vibe check' era, where we were merely impressed by what AI could do. In its place, a more rigorous, disciplined, and technical reality is emerging. Whether it is the need for mathematical verification, the requirement for economic transparency in token pricing, or the shift in the professional identity of the product manager, the pattern is clear. The 'magic' is being replaced by mechanics. To succeed in this new environment, one cannot simply be a spectator of the AI revolution; one must become a practitioner of its underlying systems. The tools are becoming more capable, but they are also becoming more demanding of our technical competence and our ethical clarity.

Are you using AI to assist your current workflow, or are you building the systems that will replace it?

The Deep Feed · A nightly magazine · Wednesday, 1 July 2026