
Inside AI Product Development: OpenAI vs Anthropic CPOs Reveal How They Build the Future

Discover exclusive insights from OpenAI CPO Kevin Weil and Anthropic CPO Mike Krieger on building successful AI products. This comprehensive interview reveals the hidden challenges and breakthrough strategies behind ChatGPT, Claude, and the future of AI product development.

Key Takeaways

  • The 60% Rule: AI products can still provide significant value even when their underlying models achieve only about 60% accuracy. This often requires careful product design and integrating human involvement in the workflow, rather than solely relying on perfect model performance.
  • Evaluation Skills are Critical: The most important skill for AI product managers has become writing effective evaluations of AI systems. This is so crucial that many companies are now running internal training programs to teach this capability.
  • Enterprise AI is Different: Deploying AI in enterprise environments involves distinct challenges. Deployment cycles often stretch beyond six months due to complex procurement processes, and a buyer's objectives frequently matter more to a product's success than end-user satisfaction.
  • Models Aren't Intelligence-Limited: Current AI models can perform far better than they typically do in products. Their primary limitation is the quality of the evaluations used to measure and elicit their capabilities, not inherent restrictions in their underlying intelligence.
  • Non-Deterministic Design Challenge: Building products around AI systems that don't always produce the same output (stochastic systems) demands fundamentally different design approaches. There's a strong emphasis on establishing clear feedback loops and designing for graceful handling of potential failures.
  • Prototyping Revolution: Leading product managers now use AI models themselves to rapidly generate and compare multiple user interface approaches, prototyping and testing ideas before designers even open traditional design tools.
  • Future is Proactive and Asynchronous: The next generation of AI products will proactively identify and present insights, and will be able to manage longer-term tasks over hours or even days, rather than requiring immediate user responses.
  • Universal Translation Reality: Advanced voice AI modes are already enabling real-time business conversations between people who don't share a common language. This capability fundamentally alters global interaction possibilities.

Timeline Overview

1:18 - 3:02 - Career Reactions: Both leaders share how friends and colleagues reacted when they took their current roles at AI companies

3:02 - 8:02 - Enterprise Surprises: Mike discusses his biggest surprises working with enterprise customers and the lengthy procurement cycles

8:02 - 15:48 - Development Cycles: Deep dive into how AI product teams navigate uncertain capabilities and iterate with research teams

15:48 - 18:49 - Task Performance: Discussion of the "60% success rate" threshold and when AI products become valuable despite imperfection

18:49 - 22:46 - Building Intuition: Advice on developing skills for AI product management, emphasizing evaluation writing and data analysis

22:46 - 33:13 - User Education: Strategies for teaching both consumer and enterprise users how to effectively interact with AI products

33:13 - 36:56 - Future Vision: Predictions about proactive AI, asynchronous interactions, and multimodal capabilities

36:56 - end - Kids and AI: Observations about how young people naturally interact with voice AI and surprising user behaviors

The Unpredictable Reality of AI Product Development

Building AI products feels like "peering through the mist," according to Kevin Weil. Every two months, computers gain capabilities they've never had before in human history. This creates a fundamental challenge that separates AI product management from traditional software development in ways most people don't fully grasp.

Unlike conventional products built on fixed technology foundations, AI product managers constantly adapt to emergent model capabilities that nobody—not even the research teams creating them—can predict with certainty. Research teams themselves often don't know if new features will work at 60% or 99% accuracy until training completes. The product you'd build for each scenario differs dramatically, making traditional product roadmaps nearly impossible.

Kevin describes the surreal experience of checking in with research teams: "Hey guys, how's it going? How's that model training? Any insight on this?" The response is typically: "It's research, we're working on it. We don't know either." This uncertainty extends throughout the entire organization, creating a collaborative discovery process rather than predictable development cycles.

Mike Krieger compares this to Apple's WWDC announcements potentially disrupting Instagram's roadmap. "It's like that but your own company is the one kind of disrupting you from within," he explains. This internal disruption happens continuously rather than annually, forcing product teams to maintain flexibility while still shipping meaningful features to users.

The most successful approach involves embedding designers early in research processes, focusing on learning rather than shipping perfect products every time. Anthropic and OpenAI both practice "co-design, co-research, co-finetune" workflows where product teams partner directly with researchers from the earliest stages. This collaboration produces demos and informative prototypes that spark product ideas rather than following predictable development processes.

Sometimes researchers casually mention capabilities they've had for months but didn't consider important, creating unexpected product opportunities. Kevin recalls meetings where product teams express wishes for specific capabilities, only to hear researchers respond: "Oh no, we can do that. We've had that for three months." These moments of accidental discovery highlight how much untapped potential exists within current AI systems.

The stochastic nature of this development cycle means traditional product management instincts work only about half the time. When teams approach familiar territory, like shipping the final version of Advanced Voice Mode or Canvas, conventional product skills apply. But the early phases of a new capability bear little resemblance to traditional software development, requiring entirely new frameworks for thinking about product work.

Why 60% Accuracy Can Still Win

Both leaders challenge the widespread assumption that AI needs near-perfect performance to create meaningful value. This misconception has prevented many organizations from deploying AI solutions that could dramatically improve productivity and user experiences today.

GitHub Copilot serves as the perfect counterexample. Kevin estimates it launched on GPT-2 level performance—far from perfect at coding tasks and generations behind current capabilities. Yet it became the first AI product to demonstrate substantial economic value because it saved developers significant typing time, even when the generated code needed editing or refinement.

The crucial insight lies in designing specifically for imperfection rather than hoping for perfection. Successful AI products need robust human-in-the-loop workflows, clear confidence indicators that help users understand when the AI is uncertain, and graceful failure modes that don't break user experiences when mistakes occur.

When models can effectively express uncertainty and request human assistance—saying something like "I'm not sure about this approach, can you help me with this specific part?"—the combined human-AI performance often exceeds what either could achieve alone. This collaborative approach turns AI limitations into features rather than bugs.
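
As a rough sketch of what that workflow can look like in code, the example below routes low-confidence drafts to a human reviewer with a specific question instead of failing silently. The `call_model` helper, the self-reported confidence score, and the 60% threshold are all assumptions made for illustration, not a description of either company's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model client: returns an answer plus a self-reported confidence.
# Real provider SDKs differ; this stands in for whichever one a team uses.
ModelFn = Callable[[str], tuple[str, float]]

@dataclass
class DraftResult:
    answer: str
    confidence: float
    needs_human_review: bool
    question_for_human: Optional[str] = None

def draft_with_escalation(task: str, call_model: ModelFn,
                          confidence_threshold: float = 0.6) -> DraftResult:
    """Ask the model for a draft; escalate to a human when it is unsure."""
    answer, confidence = call_model(task)

    if confidence >= confidence_threshold:
        # Confident enough to show directly, but still surface the score in the
        # UI so users know how much to trust the draft.
        return DraftResult(answer, confidence, needs_human_review=False)

    # Below the threshold: keep the partial draft, but ask the human for help
    # with the specific uncertain part instead of failing silently.
    return DraftResult(
        answer,
        confidence,
        needs_human_review=True,
        question_for_human=(
            f"The model was only {confidence:.0%} confident on: {task!r}. "
            "Please review or complete the uncertain parts."
        ),
    )
```

The design point is that low confidence produces a specific request for the reviewer rather than a generic error, which is much of what makes a roughly 60%-accurate model usable.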

Mike observes that 60% performance often proves "lumpy" in practice—models excel at some tasks while failing completely at others. This creates significant opportunities for targeted applications where the model's strengths perfectly align with specific user needs, even if overall benchmark performance seems mediocre. Smart product managers identify these sweet spots and build focused solutions rather than trying to solve everything at once.

Enterprise deployments reveal this complexity daily through real user feedback. The same model receives diametrically opposite reactions from different companies on the same day: one organization calls it completely transformative, solving problems they'd struggled with for months, while another says it's significantly worse than existing alternatives they're already using.

Success depends heavily on specific use cases, internal data sets, organizational context, and prompting approaches. Companies that invest time in understanding their specific requirements and tailoring AI implementations accordingly see dramatically better results than those expecting plug-and-play solutions.

This reality has important implications for product strategy. Rather than waiting for models to reach theoretical perfection, successful AI companies focus on identifying specific domains where current capabilities create immediate value, then expanding gradually as models improve. The organizations that embrace this approach gain competitive advantages while others wait for perfect solutions that may never arrive.

The Evaluation Revolution

Writing effective evaluations has emerged as the single most critical skill for AI product managers, fundamentally reshaping what it means to be successful in this role. Both OpenAI and Anthropic have recognized this shift by implementing comprehensive internal bootcamps specifically focused on evaluation design and execution.

Kevin explains how traditional product management roles have converged around this capability: "We ended up realizing that the job of a PM in 2024-2025 building AI-powered features looks more and more like research PMs than traditional product surface PMs." The quality of AI features now depends entirely on evaluation and prompting quality rather than conventional product metrics.

This shift has practical implications for day-to-day work. A product manager might successfully develop a feature to 80% completion using traditional methods, then discover they need sophisticated evaluation skills to achieve the crucial final 20% through fine-tuning and prompt optimization. Without these skills, even promising features fail to reach production quality.

Mike emphasizes the importance of examining actual model outputs rather than relying solely on aggregate performance scores. Teams often get trapped by misleading metrics, celebrating improvements from 78% to 80% accuracy without investigating whether those gains represent genuine progress. Sometimes deeper analysis reveals that higher scores reflect evaluation problems rather than model improvements.

Even more surprisingly, examination of "golden answers" in evaluation datasets sometimes reveals errors that humans wouldn't make. This means achieving 100% accuracy on some evaluations would actually indicate model failure rather than success. Building robust evaluation systems requires understanding these nuances and continuously refining assessment criteria.
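
To make that hands-on habit concrete, here is a minimal eval harness in the spirit of this advice: it reports an aggregate score, but its more important output is the list of individual failures, which is where a reviewer notices whether the model, the grader, or the golden answer itself is wrong. The task format and grading function are hypothetical, not either company's internal tooling.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    golden_answer: str   # may itself contain errors -- read failures by hand

@dataclass
class EvalResult:
    case: EvalCase
    model_output: str
    passed: bool

def run_eval(cases, call_model, grade=None):
    """Run every case; return the aggregate score AND the raw failures."""
    grade = grade or (lambda output, gold: output.strip() == gold.strip())
    results = []
    for case in cases:
        output = call_model(case.prompt)
        results.append(EvalResult(case, output, grade(output, case.golden_answer)))
    failures = [r for r in results if not r.passed]
    score = 1 - len(failures) / max(len(results), 1)
    return score, failures

# A jump from 78% to 80% means little until someone reads `failures` and checks
# whether the model is wrong, the grader is too strict, or the golden answer is.
```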

The evaluation challenge becomes exponentially more complex as AI systems tackle longer-form, more ambiguous tasks. While evaluating mathematical calculations remains straightforward—either the answer is correct or incorrect—assessing whether an AI agent successfully "found a good hotel in New York City" requires sophisticated rubrics that account for personal preferences, context, and subjective quality measures.

Mike draws an analogy to performance reviews: "Evals start looking more like performance review. Did the model meet your expectation of what a competent human would have done? Did it exceed it because it did it twice as fast or discovered some restaurant you wouldn't have known about?"

The models themselves can assist in evaluation creation, offering a bootstrapping solution for teams getting started. Kevin suggests asking models directly: "What makes a good eval? Can you write me a sample eval for this task?" This approach works surprisingly well for initial evaluation design, though human refinement remains essential.
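
For fuzzier tasks like the hotel example, one common pattern consistent with the performance-review analogy is to have a model grade another model's output against an explicit rubric, and to let a model draft that rubric in the first place. The sketch below assumes a generic `call_model(prompt) -> str` helper and an invented hotel rubric; it illustrates the pattern rather than documenting OpenAI's or Anthropic's actual evals.

```python
import json

HOTEL_RUBRIC = """\
Score the assistant's hotel recommendation for the user below on a 1-5 scale
for each criterion, as a competent human travel planner would:
- fit: matches the stated budget, dates, and neighborhood preferences
- evidence: cites concrete, checkable details (price, location, amenities)
- effort_saved: did it do work the user would otherwise have had to do?
Return JSON: {"fit": int, "evidence": int, "effort_saved": int, "notes": str}
"""

def grade_with_rubric(user_request: str, model_output: str, call_model) -> dict:
    """Performance-review style grading: did the model meet, or exceed, what a
    competent human would have done on an open-ended task?"""
    prompt = (
        f"{HOTEL_RUBRIC}\nUser request:\n{user_request}\n\n"
        f"Assistant's answer:\n{model_output}\n"
    )
    raw = call_model(prompt)  # the judge can be a different model than the one graded
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judges fail too; flag for human review rather than inventing a score.
        return {"error": "unparseable_judge_output", "raw": raw}
```

Bootstrapping works the same way at the rubric level: asking the model what a good eval for the task would look like usually produces a serviceable first draft that humans then tighten.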

Both leaders stress that successful evaluation requires hands-on data analysis rather than abstract theorizing. The most effective product managers regularly examine failure cases, understand specific error patterns, and iterate on evaluation criteria based on real-world performance rather than theoretical frameworks.

Anthropic has made evaluation skills so central to their hiring process that candidates must demonstrate the ability to transform poor evaluations into effective ones during interviews. This practical test reveals thinking processes and technical capabilities that traditional product management interviews miss entirely.

Enterprise vs Consumer: Two Different Worlds

Enterprise AI deployment operates on entirely different timelines, success metrics, and decision-making processes than consumer products, creating challenges that traditional product managers rarely encounter. These differences often surprise product leaders transitioning from consumer-focused companies.

Mike discovered that enterprise customers routinely request 60-day advance notice for any product changes—a timeline that product teams themselves rarely possess given the rapid pace of AI development. This mismatch between enterprise planning cycles and AI innovation speed creates ongoing tension that requires careful navigation.

The enterprise buyer-user dynamic fundamentally complicates traditional product thinking in ways that can frustrate even experienced product managers. Organizations might build the most intuitive, powerful product that every employee genuinely loves using, but if it doesn't meet the buyer's specific compliance requirements, budget constraints, or strategic goals, the entire deal fails regardless of user satisfaction.

Enterprise feedback cycles stretch over months through complex procurement processes that bear no resemblance to consumer product launches. Mike jokes about customers entering "some requisition state" where promising deals disappear for half a year before deployment even begins, with minimal communication during the waiting period. This forces product teams to think fundamentally differently about iteration cycles and user feedback loops.

The sales process itself becomes part of the product experience. Kevin describes meetings where enterprise customers express satisfaction with product functionality but immediately follow with unexpected requirements: "This is great, we're really happy. The one thing we need is for you to tell us 60 days before you launch anything." His reply, "I also would like to know 60 days ahead of time," captures the inherent unpredictability of AI development cycles.

However, enterprise deployment offers unique advantages that consumer products cannot replicate. Organizations can provide dedicated training sessions, comprehensive educational materials, and structured rollout processes that help users understand AI capabilities more effectively than consumer products that rely on intuitive interfaces alone.

Power users within enterprises often become internal evangelists, creating custom GPTs, specialized prompts, or training materials that make AI more accessible for their colleagues. This organic adoption pattern helps scale AI education throughout organizations more effectively than top-down mandates or external training programs.

Non-technical users experiencing AI chat interfaces for the first time in structured enterprise environments provide particularly valuable insights for broader user education strategies. Their unfiltered reactions and learning patterns reveal assumptions that technical teams take for granted.

The enterprise market also reveals the importance of industry-specific customization. The same AI model might transform workflows in one company while failing completely in another organization with different data structures, compliance requirements, or operational processes. Success requires understanding these contextual factors rather than assuming universal applicability.

Mike observes that enterprise customers often inherit AI initiatives from previous product managers, creating additional complexity around goal definition and success measurement. New stakeholders must first understand what problems the AI deployment was meant to solve before they can evaluate whether it's working effectively.

Prototyping at the Speed of Thought

The most successful AI product managers have fundamentally transformed their workflow by adopting AI models as sophisticated prototyping tools, creating competitive advantages that traditional product development processes cannot match. This shift represents one of the most practical immediate applications of AI within product development itself.

Instead of spending hours in meetings debating abstract UI approaches, leading product managers now prompt Claude or ChatGPT to generate and compare multiple interface concepts before designers even open Figma or other design tools. This approach transforms speculative discussions into concrete evaluations of working prototypes.

Mike emphasizes this capability as significantly underutilized across the industry, representing a massive opportunity for teams willing to adapt their processes. Product managers who master AI-assisted prototyping gain substantial competitive advantages in iteration speed, design exploration breadth, and stakeholder communication effectiveness.

The prototyping revolution enables testing dozens of variations at speeds that would have been impossible with traditional design processes. Teams can explore radically different approaches, evaluate them systematically, and arrive at meetings with concrete options rather than abstract concepts that require extensive explanation and imagination from stakeholders.

This capability extends beyond visual interfaces to include content strategy, user flows, feature specifications, and even evaluation criteria. Product managers can rapidly generate multiple versions of product requirements documents, compare different feature prioritization approaches, or prototype complex user journeys before investing significant team resources.

The models excel at generating variations based on specific constraints or requirements. Product managers can specify target user types, technical limitations, business objectives, or design principles, then receive multiple tailored approaches that address those specific needs. This targeted variation generation helps teams explore solution spaces more systematically than traditional brainstorming approaches.
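
As a hedged illustration of the workflow, the sketch below turns one set of constraints into several deliberately different interface concepts that a team can open side by side. The `PrototypeBrief` fields, the prompt wording, and the `call_model` helper are assumptions for the example; any chat-capable model API could sit behind it.

```python
from dataclasses import dataclass

@dataclass
class PrototypeBrief:
    feature: str          # e.g. "inline citation viewer for AI answers"
    target_user: str      # e.g. "enterprise analyst with low tolerance for clutter"
    constraints: str      # e.g. "must fit a 360px sidebar, keyboard-first"

def generate_ui_variations(brief: PrototypeBrief, call_model, n: int = 5) -> list[str]:
    """Ask the model for n deliberately different interface concepts, each as a
    self-contained HTML sketch the team can open side by side."""
    variations = []
    for i in range(n):
        prompt = (
            f"Design variation {i + 1} of {n} for: {brief.feature}.\n"
            f"Target user: {brief.target_user}.\n"
            f"Hard constraints: {brief.constraints}.\n"
            "Take a noticeably different approach from an obvious first idea. "
            "Return a single self-contained HTML file (inline CSS, no external "
            "assets) that demonstrates the concept."
        )
        variations.append(call_model(prompt))
    return variations

# Writing each result to variation_1.html ... variation_n.html gives a meeting
# clickable artifacts to react to instead of abstract descriptions.
```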

Advanced practitioners combine AI prototyping with traditional user research methods, using AI-generated prototypes as conversation starters in user interviews or usability tests. This hybrid approach allows teams to test more concepts with users while maintaining human-centered design principles.

The speed advantage compounds over traditional development cycles. While conventional processes might generate 3-5 design concepts over several weeks, AI-assisted prototyping can produce and refine dozens of approaches in days or hours. This acceleration allows teams to explore much broader solution spaces and identify optimal approaches faster.

However, successful AI prototyping requires developing new skills around prompt engineering, concept evaluation, and result synthesis. Product managers must learn to communicate design requirements clearly to AI models, evaluate generated concepts effectively, and combine AI-generated ideas with human insight and domain expertise.

The most effective teams use AI prototyping as a starting point for human creativity rather than a replacement for design thinking. AI-generated prototypes spark ideas, reveal possibilities, and accelerate initial exploration, but human judgment remains essential for evaluating feasibility, user value, and strategic alignment.

Designing for Non-Deterministic Systems

Building products around stochastic, non-deterministic systems requires fundamental design philosophy changes. Traditional software guarantees identical outputs for identical inputs. AI eliminates this certainty, challenging 25 years of interface design assumptions.

Product managers must design feedback mechanisms to close loops when models go astray. How do you collect rapid feedback about AI behavior? What guardrails prevent problematic outputs? How do you understand aggregate model behavior across millions of daily interactions?
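
Those questions translate fairly directly into plumbing. The sketch below shows one assumed shape for it: a per-interaction record that captures user feedback, a cheap guardrail hook that runs before output reaches the user, and an aggregation step that turns millions of interactions into a handful of numbers worth watching. The field names and checks are placeholders, not either company's actual telemetry.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class InteractionRecord:
    model_version: str
    prompt_category: str                 # e.g. "coding", "summarization"
    user_feedback: str | None = None     # "thumbs_up", "thumbs_down", or None
    guardrail_flags: list[str] = field(default_factory=list)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def apply_guardrails(output: str, checks) -> list[str]:
    """Run cheap output checks before anything reaches the user.
    `checks` maps a flag name to a predicate over the model output."""
    return [name for name, violates in checks.items() if violates(output)]

def summarize(records: list[InteractionRecord]) -> dict:
    """Aggregate behavior across many interactions so regressions show up as
    shifts in thumbs-down rate or guardrail flags, not as anecdotes."""
    down = [r for r in records if r.user_feedback == "thumbs_down"]
    return {
        "thumbs_down_rate": len(down) / max(len(records), 1),
        "worst_categories": Counter(r.prompt_category for r in down).most_common(3),
        "guardrail_flags": Counter(f for r in records for f in r.guardrail_flags),
    }
```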

This mirrors human psychology more than traditional software. Kevin notes that human intuitions about other people often help understand model behavior. If someone starts down the wrong conversational path, recovery becomes difficult—models exhibit similar patterns.

Users adapt surprisingly quickly to non-deterministic interfaces, but product teams must consider both their own adaptation and their users' learning curves. The magic of new AI capabilities quickly becomes expected baseline performance as users adjust expectations.

The Proactive and Asynchronous Future

The next evolution of AI products will fundamentally shift from reactive question-answering systems to proactive intelligence partners that anticipate needs and operate across extended time horizons. This transformation represents perhaps the most significant change coming to AI user experiences.

Mike envisions AI systems that actively monitor authorized information streams, automatically identify relevant trends and opportunities, and surface insights without explicit user requests. Instead of starting each day by asking AI what happened overnight, users will receive proactive briefings that include meeting preparations, trend analyses, and relevant research on upcoming projects.

This proactive capability extends beyond simple notifications to intelligent preparation and context synthesis. AI assistants might analyze upcoming calendar events, research attendees' backgrounds and interests, prepare talking points for important meetings, and identify potential areas of collaboration or conflict before users even realize they need this information.
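
Structurally, a proactive assistant like this might amount to little more than a standing job that reads only user-authorized sources, asks a model to synthesize them, and delivers the result on a schedule. The sketch below is speculative; the source names and the `fetch` and `call_model` helpers are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class BriefingJob:
    user: str
    authorized_sources: list[str]   # e.g. ["calendar", "team_wiki", "crm_notes"]
    schedule: str                   # e.g. "every weekday 07:30"

def run_briefing(job: BriefingJob, fetch, call_model) -> str:
    """Proactive rather than reactive: gather context and surface insights
    before the user asks."""
    context = "\n\n".join(f"[{src}]\n{fetch(src, job.user)}"
                          for src in job.authorized_sources)
    return call_model(
        "From the material below, produce a morning briefing: meeting prep, "
        "notable changes since yesterday, and anything that looks like it needs "
        "a decision today. Flag low-confidence items explicitly.\n\n" + context
    )
```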

The shift toward asynchronous operation breaks one of the most limiting constraints of current AI interactions: the expectation of immediate responses. Instead of demanding instant answers, users will assign longer-form, more complex tasks with instructions like: "Research this market opportunity, analyze competitive threats, create a preliminary business case, validate your assumptions, and present your findings with confidence levels in two hours."

This temporal expansion enables far more sophisticated AI assistance than current real-time interactions allow. AI systems can conduct thorough research, validate conclusions against multiple sources, identify potential counterarguments, and refine their analysis before presenting results. The quality of output improves dramatically when AI systems have time to "think" rather than responding instantly.
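
One way to picture the asynchronous side is a task record that decouples assignment from delivery: the user hands over an objective and a time budget, the agent works in the background (and may pause to ask for input), and the result arrives when it is ready. The statuses and fields below are assumptions for illustration, not a real agent framework.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum

class TaskStatus(Enum):
    QUEUED = "queued"
    WORKING = "working"
    NEEDS_INPUT = "needs_input"   # the agent can pause and ask a clarifying question
    DONE = "done"

@dataclass
class AsyncAgentTask:
    objective: str                # "Research this market, draft a business case..."
    deadline: datetime
    status: TaskStatus = TaskStatus.QUEUED
    progress_notes: list[str] = field(default_factory=list)
    result: str | None = None
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def assign(objective: str, hours: float) -> AsyncAgentTask:
    """User-facing call: fire and forget, then check back later."""
    return AsyncAgentTask(
        objective=objective,
        deadline=datetime.now(timezone.utc) + timedelta(hours=hours),
    )

# e.g. assign("Analyze competitive threats and present findings with "
#             "confidence levels.", hours=2)
```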

Kevin draws parallels to human reasoning processes, explaining how scientific breakthroughs and complex problem-solving require extended thinking time. Current AI systems operate primarily in "System 1" thinking mode—fast, intuitive responses based on pattern recognition. The future involves "System 2" thinking—deliberate, analytical reasoning that can work through complex problems systematically.

OpenAI's o1 reasoning model offers early glimpses of this future, where AI systems pause to consider problems, form hypotheses, test assumptions, and refine conclusions before responding. Kevin imagines scaling this thinking time from 30-60 seconds to hours or days for particularly complex challenges.

The combination of proactive and asynchronous capabilities will enable entirely new categories of AI assistance. Users might assign broad objectives like "help me prepare for next quarter's product planning" and receive comprehensive analyses that include market research, competitive intelligence, team capacity assessments, and strategic recommendations developed over several days.

This evolution requires new interface paradigms that manage multiple concurrent AI tasks, provide visibility into ongoing work, and enable users to interact with AI systems that operate on different timescales simultaneously. The challenge involves designing experiences that feel natural and manageable rather than overwhelming users with constant AI activity.

The implications extend beyond individual productivity to organizational transformation. Teams could assign AI systems to monitor industry trends, analyze customer feedback patterns, or identify operational inefficiencies continuously, creating institutional intelligence that enhances human decision-making rather than replacing it.

The Next Generation's AI Relationship

Young people interact with AI voice modes in ways that surprise even AI product leaders. TikTok videos show teenagers pouring their hearts out to voice interfaces, using them for emotional support and creative collaboration in completely natural ways.

Kevin's children exemplify this generational shift. As a child, he felt lucky just to pick which bedtime story got read; his kids demand real-time story creation with specific visual elements. They expect AI to generate custom entertainment tailored to their exact preferences, instantly.

This generation won't distinguish between AI and traditional software—they'll simply expect intelligent, responsive, creative tools. Their comfort with non-deterministic interfaces will drive product development in directions current users can barely imagine.

Model Personality as Product Feature

Users develop sophisticated relationships with AI models, recognizing nuanced personality differences between versions. Mike observes people befriending Claude, developing two-way empathy, and adapting to personality changes in model updates.

This creates new product challenges. If users prefer one model version's personality over another, that becomes a feature regression despite technical improvements. Model behavior becomes a core product attribute requiring careful consideration.

A viral Twitter phenomenon emerged where users asked models: "Based on everything you know about me, what would you say about me?" The responses created fascinating self-reflection moments, demonstrating how people increasingly view AI as entity-like rather than tool-like.

These personality relationships will likely drive model selection. People might choose OpenAI versus Anthropic products based on personality preferences rather than just capability differences—similar to friendship preferences in human relationships.

What's Next?

The future of AI product development will be shaped by proactive intelligence, asynchronous operation, and increasingly sophisticated human-AI relationships. Success will depend on mastering evaluation skills, designing for uncertainty, and understanding the profound behavioral changes AI enables.

Both leaders emphasize that we're still in the early stages of this transformation. The experiences that seem magical today will feel primitive within months as model capabilities advance and our design sophistication improves.

The next generation already shows us glimpses of this future—they expect AI to be creative, responsive, and deeply personalized. Product teams that can match this intuitive expectation will define the next era of human-computer interaction.
