The Deep Feed

01 — Lenny's Newsletter

The Death of the Vibe Check

Why empirical benchmarks are the only way to survive the frontier model race

By Claire Vo · 12 min read

Editor's note: As AI models move from novelty to utility, 'gut feeling' is becoming a liability for serious builders.

For the past year, the discourse surrounding large language models has been dominated by 'vibes'. We see a developer post a screenshot of a clever coding solution, or a writer marvel at a particularly lyrical paragraph, and we collectively decide that Model X is 'better' than Model Y. This is a dangerous way to build a business. When Anthropic released Sonnet 5, the industry was ready to repeat this cycle of shallow impressions. But impressions do not scale, and they certainly do not provide the rigorous data needed to integrate these tools into professional workflows. To understand whether a model has actually improved, we must move past the anecdotal and toward the repeatable.

Building the Bench

The solution is not to trust a single, static benchmark like MMLU, which often rewards memorisation once its questions find their way into training sets. Instead, we need custom, repeatable evaluation harnesses. Using Claude Code, a testing framework called the 'How I AI Bench' was constructed in under forty-five minutes. This wasn't a survey of general knowledge, but a targeted test of specific utility: PRD quality, prototype generation, agentic task completion, and agent personality. The methodology combined human scoring (weighted at 70% to capture the qualitative essence of a good output) with LLM-as-a-judge scoring (weighted at 30% to provide a baseline of logic and structure). This hybrid approach acknowledges that while machines can check for syntax, humans still define what constitutes a 'good' idea.
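
The 70/30 blend described above can be sketched in a few lines. This is a minimal illustration, not the actual 'How I AI Bench' code: the 0-10 rating scale, the TaskScore type, and the hybrid_score and rank_models helpers are all assumptions made for the example, and the model names in the usage note are placeholders.

```python
from dataclasses import dataclass

HUMAN_WEIGHT = 0.7  # human scoring weight stated in the article
JUDGE_WEIGHT = 0.3  # LLM-as-a-judge weight stated in the article


@dataclass
class TaskScore:
    """One task result for one model (hypothetical structure)."""
    task: str     # e.g. "prd_quality"
    human: float  # human rating, assumed to be on a 0-10 scale
    judge: float  # LLM-judge rating, assumed to be on the same scale


def hybrid_score(score: TaskScore) -> float:
    """Blend the human and judge ratings using the 70/30 split."""
    return HUMAN_WEIGHT * score.human + JUDGE_WEIGHT * score.judge


def rank_models(results: dict[str, list[TaskScore]]) -> list[tuple[str, float]]:
    """Average each model's hybrid score across tasks, best first."""
    averages = {
        model: sum(hybrid_score(s) for s in scores) / len(scores)
        for model, scores in results.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

Calling rank_models({"model_a": [TaskScore("prd_quality", 8.0, 6.0)], "model_b": [TaskScore("prd_quality", 6.0, 9.0)]}) puts model_a first, because the human rating carries most of the weight even though the judge preferred model_b.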

Vibes are for enthusiasts; benchmarks are for engineers.

When running Sonnet 5 blind against its competitors—including GPT-5.5 and Gemini 3 Pro—the results challenged the prevailing narrative. We often assume that the largest model is the best model, but the data suggests a more fragmented reality. For certain tasks, like writing a Product Requirements Document (PRD), a specific model might excel because of its adherence to structure. For others, like complex coding prototypes, a different model might win on its ability to reason through recursive errors. The 'winner' depends entirely on the job description you give the agent.

Model Recommendations by Task

Product Requirements: Claude Sonnet 5 for its structural discipline.
Complex Prototyping: High-reasoning models with better error-correction loops.
Daily Agent Interaction: Models with high 'personality' scores that don't feel robotic.

The takeaway for agency owners and product leads is clear: stop asking your team if a model 'feels' better. Start building your own evaluation datasets. If you cannot measure the delta between Sonnet 4.6 and Sonnet 5 in your specific use case, you are not using AI; you are just playing with a very expensive toy. The competitive advantage of the next decade will go to those who can quantify the performance of their autonomous agents.

Key Takeaway

Replace subjective impressions with repeatable, task-specific evaluation harnesses to drive real ROI.

02 — Not Boring

The 250-Year Horizon

Looking past the quarterly report to the foundations of American innovation

By Packy McCormick · 10 min read

Editor's note: A perspective shift from the immediate noise of the news cycle to the long-term trajectory of a nation.

In 1776, the concept of the United States was an experiment in fragility. There was no real-time communication, no modern medicine, and no standardized way to measure time. A person in New York lived in a different temporal reality than someone in Virginia, separated by the speed of a horse. When we look back at that era, we aren't just looking at a different set of technologies; we are looking at a different way of being human. The absence of electricity, refrigeration, and antibiotics meant that life was local, seasonal, and incredibly precarious.

The Speed of Information

The most significant change in the last two and a half centuries isn't just the presence of machines, but the presence of a shared present. Today, information moves at the speed of light. We inhabit a global, synchronized moment. This connectivity has enabled the scale of modern industry, but it has also changed the nature of innovation. In the early days, progress was a slow accumulation of local breakthroughs. Today, it is a high-velocity, global competition for dominance in sectors like nuclear energy, AI, and biotechnology.

Information used to move only as fast as a horse; now, we live in a shared, instantaneous present.

As America enters its 250th year, the question is no longer about survival, but about ownership of the next era. The companies that will define the next century—those building the energy grids, the defense systems, and the intelligence layers—are increasingly staying private for longer. This creates a barrier to entry. The wealth and influence generated by the next great technological shifts may be concentrated in a very small circle of early investors, leaving the broader public to watch the revolution from the sidelines.

Drivers of the Next 250 Years

Energy Sovereignty: Scaling nuclear power to meet the demands of an AI-driven economy.
Private Capital Access: New investment vehicles that allow broader exposure to category leaders.
Technological Autonomy: Building the hardware and software stacks that underpin modern life.

To own the future, one must think in centuries, not quarters. The innovators of the next 250 years will be those who recognize that the foundational technologies—energy, compute, and biology—are the new geography. Just as the telegraph and the steam engine redefined the 19th century, the ability to manipulate the atom and the bit will define the 21st. The challenge for the modern citizen is to decide whether they are a participant in this transition or merely a spectator.

Key Takeaway

True innovation happens at the intersection of long-term vision and foundational technology.

03 — Dwarkesh Podcast

The Biology of Risk

Why the OpenAI Foundation must prioritize the end of airborne disease

By Dwarkesh Patel · 15 min read

Editor's note: An analysis of the most significant dual-use risk in the age of AI: the intersection of compute and biology.

The debate surrounding AI safety often splits into two camps: those worried about existential rogue agents and those focused on immediate harms like bias or misinformation. Both are valid, but both miss the most immediate and catastrophic intersection of technology: AI and biology. As we build models capable of autonomous biological discovery, we are simultaneously creating a tool that can cure every known disease and a tool that can engineer a perfect pathogen. The stakes are not theoretical; they are biological.

The Dual-Payoff Principle

When deciding where to deploy massive amounts of capital—such as the billions held by the OpenAI Foundation—we should look for 'dual-payoff' interventions. These are actions that provide massive everyday benefits while simultaneously reducing extreme tail risks. Most safety interventions are purely defensive, acting as insurance against a bad outcome. But if we focus on ending airborne transmission, we achieve both. We unlock trillions in global GDP by eliminating seasonal flu and chronic respiratory illness, and we simultaneously make the world immune to the next engineered pandemic.

The best way to make AI go well is to solve the problems that make humanity vulnerable.

The path to this goal is through physical infrastructure. While AI can design new vaccines or better air filtration systems, the actual mitigation of risk happens in the real world. We need to move from digital intelligence to physical resilience. This means using AI to automate the wet-lab processes of discovery, but coupling that with a massive deployment of defensive technologies—better sensors, advanced air purification in public spaces, and rapid-response manufacturing of therapeutics.

Strategic Priorities for AI Foundations

Autonomous Biological Discovery: Using AI to map the biological design space.
Physical Infrastructure: Investing in the hardware of defense.
Economic Resilience: Reducing the productivity loss caused by endemic disease.

If the goal of AI is to improve the human condition, then we cannot ignore the biological substrate upon which that condition rests. A world of infinite compute and zero biological security is a house built on sand. The winners of the AI era will be those who realize that intelligence is only as useful as the stability of the world it inhabits.

Key Takeaway

Solve for the intersection of AI and biology to capture both immense economic value and existential security.

04 — The Marginalian

The Composer of the Deep

Marie Tharp and the courage to see what wasn't there

By Maria Popova · 14 min read

Editor's note: A study in how scientific breakthrough requires both data and the imagination to interpret it.

In 1952, the ocean floor was considered a featureless, homogeneous void, a blue bathtub at the bottom of the world. The prevailing scientific consensus was that the Earth's crust was static and stable. But Marie Tharp, a cartographer working with fragments of sonar data, saw something else. She didn't just see depths and measurements; she saw a melody. To her, the fathograms (the jagged lines of depth data) looked like the lines of a musical staff. She was a violinist, and she began to read the ocean floor as a score.

Reading the Silence

Tharp was not permitted to join the actual oceanographic expeditions; she was a woman in a field that did not yet welcome her. Instead, she sat in an office, splicing together strips of blue linen paper, magnifying the data fortyfold. She was working with 'strobe data'—incomplete, disconnected points of information. This is where the limit of the computational mind meets the beginning of the compositional mind. A computer can plot a dot, but it cannot see the rift valley. It takes a human to look at the gaps between the dots and hypothesize the connection.

In the void of data, the compositional mind begins, demanding a virtuosity of interpretation.

What she discovered was a rift valley running through the center of the Atlantic, a jagged line that suggested the Earth was not static, but was actively pulling itself apart. This was the tectonic record of a great inhale. Her discovery, which became key evidence for continental drift and later for plate tectonics, was not just a triumph of data collection, but a triumph of pattern recognition. She saw the 'music' of the Earth's movement in the silence of the data.

The Elements of Discovery

Data Synthesis: Combining disparate, incomplete sources into a coherent whole.
Pattern Recognition: Seeing the underlying structure beneath the noise.
Intellectual Courage: Proposing a reality that contradicts the status quo.

Tharp’s story is a reminder that data alone is never enough. We are surrounded by information, but information is not insight. Insight requires the ability to bridge the gaps, to imagine the structure that must exist to produce the data we see. In an age of automated data processing, we must not lose the capacity for the 'compositional mind'—the ability to see the symphony in the noise.

Key Takeaway

Data provides the notes, but human intuition composes the meaning.

05 — The Marginalian

The Art of Surrender

Why resistance is the enemy of a life well-lived

By Maria Popova · 11 min read

Editor's note: A philosophical look at the tension between the controlling mind and the vital body.

We are a species defined by our desire to conquer. We conquer territories, we conquer diseases, and most relentlessly, we attempt to conquer ourselves. We treat our emotions, our impulses, and our anxieties as problems to be solved or obstacles to be overcome. But as Henry Miller once noted to Anaïs Nin, the attempt to conquer a problem often only serves to increase the resistance. In our drive for control, we often kill the very vitality we are trying to manage.

The Trap of Stasis

Anaïs Nin observed that many people fail in life because they attempt to elect a single state of being and remain in it. They seek a permanent state of happiness, a fixed identity, or a controlled environment. Nin argued that this is a form of death. Life is not a destination to be reached, but a process of constant becoming. To live fully is to accept the shifting imperatives of the soul—the strange, often conflicting needs that arise in different seasons of life.

Happiness is not a holiday experience; it is the result of being used by life.

D.H. Lawrence took this further, suggesting that true living requires an obedience to the 'urge' of life. This is not a passive surrender, but an active engagement with the discomfort of growth. Real happiness, in Lawrence's view, includes the ache, the sorrow, and the struggle. It is the result of being 'driven and goaded' by existence. When we fight our own internal imperatives, we bruise ourselves to death. When we obey them, we find our rhythm.

Principles of Vitality

Embrace Impermanence: Accept that states of being are temporary.
Listen to the Body: Recognize that physical impulses are the roots of true vision.
Accept Conflict: Understand that internal tension is a sign of life, not failure.

The modern world is designed to minimize friction. We have apps to smooth our social interactions, drugs to level our moods, and systems to automate our decisions. But in removing friction, we risk removing the very things that make us feel alive. A life without resistance is a life without depth. To live with maximum aliveness, we must stop trying to solve the mystery of existence and start participating in it.

Key Takeaway

Growth requires the courage to abandon control in favour of engagement.

06 — Simon Willison

Optimising the Agent

How DSPy is turning prompt engineering into a rigorous science

By Simon Willison · 6 min read

Editor's note: A technical look at the shift from manual prompting to algorithmic optimisation.

For a long time, 'prompt engineering' has felt like a dark art. Developers would tweak a few words here or there, hoping to nudge a model toward a better response. It was a process of trial and error, highly subjective and notoriously difficult to scale. But as we move toward complex agentic systems, this manual approach is hitting a wall. You cannot build a reliable production system on a foundation of 'vibes' and lucky guesses. We need a way to treat prompts like code: something that can be tested, versioned, and optimised through formal methods.

The DSPy Approach

The DSPy framework represents a fundamental shift in how we interact with LLMs. Instead of writing long, brittle instructions, we define the logic of our program and then use an optimizer to find the best prompts for the task. It treats the prompt as a parameter to be tuned, much like the weights in a neural network. This allows for a more systematic approach to improving agent performance. In a recent test on the Datasette Agent—a tool designed to answer SQL questions about data—this method revealed flaws that manual testing had missed.
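
The idea of treating a prompt as a tunable parameter can be shown without any framework. The sketch below is not DSPy's API: it is a plain-Python illustration in which evaluate, pick_best_prompt, the exact_match metric, and the toy_model stand-in for an LLM call are all hypothetical names invented for the example. A real optimiser would also generate candidate prompts rather than take a fixed list.

```python
from typing import Callable

Example = tuple[str, str]          # (question, expected answer)
Model = Callable[[str, str], str]  # (prompt, question) -> answer


def exact_match(predicted: str, expected: str) -> float:
    """Metric: 1.0 when the answers match, ignoring case and padding."""
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


def evaluate(prompt: str, model: Model, dataset: list[Example]) -> float:
    """Mean metric score of one candidate prompt over the gold dataset."""
    scores = [exact_match(model(prompt, question), answer)
              for question, answer in dataset]
    return sum(scores) / len(scores)


def pick_best_prompt(candidates: list[str], model: Model,
                     dataset: list[Example]) -> tuple[str, float]:
    """Score every candidate and return the best prompt with its score."""
    scored = [(evaluate(prompt, model, dataset), prompt) for prompt in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score


def toy_model(prompt: str, question: str) -> str:
    """Hypothetical stand-in for an LLM: terse only when told to be."""
    answer = {"2+2": "4", "capital of France": "Paris"}.get(question, "unknown")
    return answer if "one word" in prompt else f"The answer is {answer}."
```

With the dataset [("2+2", "4"), ("capital of France", "Paris")] and the candidates "Answer the question." and "Answer in one word.", pick_best_prompt selects the second prompt, because only it yields answers that pass the metric. The selection is driven by a measured score rather than by impression.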

Prompt engineering is evolving from a linguistic craft into a computational discipline.

One specific insight gained from using DSPy involved the way the agent handled database schemas. The baseline prompt instructed the agent not to call a 'describe_table' function if it already had the necessary information. However, because the schema listing only provided table names without column names, the agent frequently fell into error-retry loops, trying to guess column names like 'page_count' or 'first_name'. A systematic evaluation identified this specific tension, allowing for a targeted fix: either include the columns in the initial schema or soften the instruction.
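
The first of those fixes, putting column names into the initial schema listing, might look something like the sketch below. It assumes a SQLite database and uses Python's standard sqlite3 module; the schema_listing helper and the example tables are hypothetical and are not taken from the Datasette Agent's code.

```python
import sqlite3


def schema_listing(conn: sqlite3.Connection) -> str:
    """Return one line per table in the form table(col1, col2, ...)."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]
    # Skip SQLite's internal bookkeeping tables.
    tables = [name for name in names if not name.startswith("sqlite_")]

    lines = []
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is its name.
        quoted = table.replace('"', '""')
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{quoted}")')]
        lines.append(f"{table}({', '.join(columns)})")
    return "\n".join(lines)
```

For a database with tables created as books(id, title, pages) and authors(id, full_name), the listing shows the agent that the real columns are pages and full_name, so it has no reason to guess 'page_count' or 'first_name' and no need to call a describe function first.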

Benefits of Algorithmic Prompting

Reproducibility: Tests can be run repeatedly against a gold-standard dataset.
Scalability: Optimisations can be applied across hundreds of different tasks simultaneously.
Precision: Identifies specific logical failures that manual testing can miss.

The era of the 'prompt whisperer' is ending. The era of the 'prompt engineer'—the person who builds the systems that optimise the prompts—is beginning. For anyone building AI-driven products, the goal must be to move away from manual tweaking and toward automated, metric-driven optimisation. If you cannot measure the failure, you cannot fix the system.

Key Takeaway

Treat prompts as optimisable parameters, not as static text.