The landscape of artificial intelligence is shifting so rapidly that traditional development cycles are often overtaken before they finish. For many startups, the standard path of collecting massive datasets and fine-tuning a foundation model is a race against obsolescence. By the time a custom model is deployed, a new frontier release often renders the expensive effort redundant. Ian Fischer, co-founder of Poetic and a former researcher at Google DeepMind, proposes a more resilient architecture: building "stilts" for AI through recursive self-improvement. By creating reasoning harnesses that sit atop existing models, developers can ensure their systems remain at the state of the art, regardless of which foundation model currently leads the market.
Key Takeaways
- The Fine-Tuning Trap: Traditional fine-tuning is becoming a liability for startups, as new frontier models often outperform custom-tuned older models within months.
- Recursive Self-Improvement: Poetic uses a "meta-system" that automatically generates reasoning strategies, allowing AI to improve itself without the massive compute costs of retraining.
- Superior Benchmarking: This approach has allowed a seven-person team to outperform tech giants on rigorous tests like ARC-AGI v2 and Humanity’s Last Exam.
- Resilient Architecture: Unlike fine-tuned models, Poetic’s reasoning harnesses are compatible with future model releases, providing a hedge against the "bitter lesson" of AI development.
The Obsolescence of Traditional Fine-Tuning
For years, the gold standard for AI startups was to take an open-weights model and fine-tune it on proprietary data. However, Fischer argues that this strategy is increasingly risky. The capital expenditure required for fine-tuning is immense, and the results are often ephemeral. When a provider like OpenAI or Anthropic releases a new version of their flagship model, it frequently "blows the fine-tuned version out of the water," effectively lighting millions of dollars of investment on fire.
The "bitter lesson" in AI suggests that general methods that leverage compute eventually win out over human-engineered niches. Poetic’s approach acknowledges this reality. Instead of fighting the tide of foundation model progress, Poetic builds systems that are "frontier-agnostic." When a new model is released, the existing reasoning harness can be ported to it, often resulting in an even larger performance leap without the need for a total rebuild.
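The "frontier-agnostic" idea can be pictured as a harness that treats the underlying model as a pluggable function. The sketch below is purely illustrative (Poetic's actual interface is not public): the reasoning strategy lives in the harness, so swapping in a newly released frontier model is a one-line change.

```python
from typing import Callable

# A model is just "prompt in, text out"; any provider's API can be
# adapted to this signature. (Illustrative; not Poetic's code.)
Model = Callable[[str], str]

def harness(model: Model, task: str) -> str:
    """A reasoning harness that sits on top of any model.

    The strategy (plan, solve, verify) is independent of which
    frontier model runs underneath it.
    """
    plan = model(f"Break this task into steps:\n{task}")
    draft = model(f"Follow these steps to solve the task:\n{plan}\n\nTask: {task}")
    return model(f"Check and correct this answer:\n{draft}\n\nTask: {task}")

# Porting to a new frontier model is a one-line change:
#   answer = harness(todays_model, task)
#   answer = harness(tomorrows_model, task)
```

Because the harness never depends on provider-specific features, a new model release becomes an upgrade rather than a rebuild.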
"Whatever model comes out, you can be taller than that one with Poetic... you’re just totally vaccinated against the bitter lesson."
Defining Recursive Self-Improvement
At the heart of Poetic’s technology is the concept of recursive self-improvement. While many companies use Reinforcement Learning (RL) to improve models, Poetic operates at a meta-level. Their system, known as the Poetic Meta-System, is designed to solve hard reasoning problems by automatically generating optimized "harnesses"—a combination of specialized code, prompts, and data strategies.
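One way to make the meta-level idea concrete is to model a harness as data (a prompt plus a coded strategy) and the meta-system as a search loop that mutates and scores harnesses against a problem set. This is a deliberate simplification under invented names (`Harness`, `meta_improve`); Poetic's real meta-system is not public.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Harness:
    """One candidate: a prompt template plus a coded strategy.
    (Hypothetical structure for illustration only.)"""
    prompt: str
    strategy: Callable[[str, str], str]  # (prompt, problem) -> answer

def score(h: Harness, problems: list[tuple[str, str]]) -> float:
    """Fraction of (problem, gold answer) pairs the harness gets right."""
    hits = sum(h.strategy(h.prompt, q) == gold for q, gold in problems)
    return hits / len(problems)

def meta_improve(seed: Harness,
                 mutate: Callable[[Harness], Harness],
                 problems: list[tuple[str, str]],
                 rounds: int = 20) -> Harness:
    """Self-improvement at the meta level: generate variants of the
    current best harness and keep whichever scores higher."""
    best, best_score = seed, score(seed, problems)
    for _ in range(rounds):
        cand = mutate(best)
        s = score(cand, problems)
        if s > best_score:
            best, best_score = cand, s
    return best
```

The key point of the sketch is that the object being optimized is the harness itself, not the model's weights, so no retraining compute is needed.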
The "Stilts" Metaphor
Fischer describes their technology as stilts. Poetic doesn't attempt to build the foundational "ground" (the LLM itself); instead, it provides the mechanism to stand higher than the current baseline. This allows a relatively small team to achieve outsized results. By using a cheaper underlying model, like Gemini Pro, and applying their reasoning harness, Poetic achieved better results than more expensive, "deeper" models at half the operational cost.
Beyond Prompt Engineering
It is a mistake to view these harnesses as mere collections of clever prompts. While automated prompt optimization (like DSPy or GPTScript) provides some gains, the real breakthroughs come from reasoning strategies written in code. Fischer notes that while prompt optimization might offer incremental improvements, shifting to complex reasoning strategies can move performance from a 5% success rate to over 95% on difficult tasks.
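The gap Fischer describes can be felt even in a toy setting. Below, a stand-in "model" that is right only half the time goes from coin-flip accuracy to near-certainty once a coded strategy adds a programmatic verification loop. Names like `flaky_model` are invented for illustration; this is not Poetic's method, just a demonstration that strategies written in code can dominate single-shot prompting.

```python
import random

def flaky_model(target: int) -> str:
    """Stand-in LLM: asked for an expression equal to `target`, it is
    right only about half the time. (Purely illustrative.)"""
    return f"{target - 1} + 1" if random.random() < 0.5 else f"{target} + 1"

def prompt_only(target: int) -> str:
    # "Prompt engineering": one shot, however polished the prompt.
    return flaky_model(target)

def coded_strategy(target: int, tries: int = 10) -> str:
    # A reasoning strategy written in code: sample, then verify each
    # candidate against the checkable spec before accepting it.
    for _ in range(tries):
        expr = flaky_model(target)
        if eval(expr) == target:  # cheap programmatic verifier
            return expr
    return expr  # fall back to the last sample
```

A 50% single-shot success rate becomes roughly 99.9% per problem after ten verified samples, which mirrors the 5%-to-95% jumps Fischer describes on harder tasks.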
Proven Results on Frontier Benchmarks
The efficacy of recursive self-improvement is best demonstrated through specialized benchmarks that test "out-of-distribution" reasoning—problems the models haven't seen in their training data. Poetic has consistently topped leaderboards, often surpassing the efforts of massive research labs with significantly more resources.
- ARC-AGI v2: Poetic reached the top of the leaderboard shortly after Google DeepMind published its own results, achieving a 54% score at a cost of $32 per problem, compared with DeepMind's 45% at a much higher cost.
- Humanity’s Last Exam: This benchmark consists of 2,500 questions so difficult they challenge PhDs. Poetic reached 55% accuracy, surpassing the previous state-of-the-art set by Anthropic’s Claude 3.5 Sonnet.
- Cost Efficiency: While foundation models cost hundreds of millions to train, Poetic’s optimization runs for these benchmarks cost less than $100,000.
"We don't want to go in and monkey around with things... it's the AI's job to understand the data set and figure out where the failure modes are."
Building for the Future of AI Agents
The transition from simple chatbots to autonomous agents requires a level of reliability that current models struggle to provide out-of-the-box. For founders building agentic startups, the challenge lies in robustness. A human developer can manually optimize an agent for months, but they are limited by their own ability to anticipate every failure mode. Poetic’s meta-system outsources this "data understanding" to the AI itself.
Automated Strategy Generation
If a startup already has an agentic system, Poetic can optimize specific components—such as reasoning strategies or data extraction methods—to make them more reliable. This automation is critical for companies targeting vertical industries where a 90% success rate isn't good enough and 99% is required for commercial viability.
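As a sketch of what "optimizing specific components" might look like mechanically, the loop below scores candidate implementations of one component against a labeled evaluation set, collects the exact failure cases for the optimizer to study, and keeps only candidates that clear a reliability bar. All names here are hypothetical illustrations, not Poetic's API.

```python
from typing import Callable

def reliability(component: Callable[[str], str],
                cases: list[tuple[str, str]]) -> tuple[float, list[str]]:
    """Score one component and collect its failure cases, so the
    optimizer (not a human) can see exactly where it breaks."""
    failures = [q for q, gold in cases if component(q) != gold]
    return 1 - len(failures) / len(cases), failures

def pick_component(candidates: dict[str, Callable[[str], str]],
                   cases: list[tuple[str, str]],
                   bar: float = 0.99) -> str:
    """Keep only candidates clearing the commercial-viability bar,
    then return the name of the most reliable one."""
    scored = {name: reliability(f, cases) for name, f in candidates.items()}
    ok = {name: s for name, (s, _) in scored.items() if s >= bar}
    if not ok:
        raise ValueError(f"no candidate reaches {bar:.0%}")
    return max(ok, key=ok.get)
```

The 99% bar in `pick_component` reflects the point in the text: in many vertical markets, a component that passes 90% of cases is still commercially unusable.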
The S-Curve of Intelligence
Fischer suggests that every model has its own S-curve of capability. As Poetic's meta-system and the underlying frontier models improve, these curves shift higher, creating a compounding effect that moves the industry closer to Artificial General Intelligence (AGI) or superintelligence. By hitting the "intelligence ceiling" first through optimization, startups can capture markets before foundation models naturally evolve to meet those needs.
Conclusion: The Imperative to Build
The rapid evolution of AI can be paralyzing for developers, but Fischer argues that the best response is constant experimentation. Whether it is using the latest models to build an app in a weekend or leveraging self-improving harnesses for complex enterprise tasks, the barrier to entry has never been lower. The future belongs to those who do not limit their imagination to what a model can do today, but rather what a system can achieve when it is designed to improve itself.
For companies facing "impossible" reasoning hurdles, the path forward is no longer just about more data or bigger clusters. It is about recursive optimization—building the stilts that allow your application to stay above the rising tide of foundation model progress. As Fischer notes, the goal is to find the boundaries of what is possible and push past them every day.