Keras creator François Chollet reveals why the 50,000x scaling of AI models barely moved the needle on true intelligence and unveils his blueprint for achieving AGI through meta-learning systems.
Key Takeaways
- The 50,000x scaling from 2019 to 2024 only improved ARC benchmark performance from 0% to 10% while humans score 95%+
- Test-time adaptation emerged in 2024 as the paradigm shift beyond pre-training scaling, enabling models to modify behavior during inference
- Intelligence is the efficiency with which past information (experience and priors) is converted into the ability to handle novel future situations, not a stockpile of accumulated skills
- Static benchmarks like human exams measure memorized skills rather than fluid intelligence, creating misleading progress signals
- ARC-2 requires compositional reasoning that exposes current AI limitations: baseline models score 0%, static reasoning systems achieve 1-2%
- Two types of abstraction drive cognition: continuous value-centric (perception/intuition) and discrete program-centric (reasoning/planning)
- Transformers excel at Type 1 abstraction but struggle with Type 2, explaining why they can't reliably sort lists or add digit sequences
- Future AGI requires combining deep learning's fast approximations with discrete program search for genuine invention beyond automation
- The "kaleidoscope hypothesis" suggests intelligence extracts reusable abstractions from experience that recombine for novel situations
Timeline Overview
- 00:00–00:56 — The Falling Cost of Compute: Two orders of magnitude cost reduction per decade since 1940; compute and data as primary AI bottlenecks driving deep learning success
- 00:57–01:58 — Deep Learning's Scaling Era: GPU abundance enabled self-supervised text modeling; scaling laws showed predictable improvement with model/data size increases
- 01:59–03:01 — The ARC Benchmark Introduction: 2019 release of Abstraction and Reasoning Corpus highlighting difference between memorized skills and fluid general intelligence
- 03:02–05:00 — The 2024 Paradigm Shift: AI research pivoted to test-time adaptation enabling models to modify behavior during inference rather than static pre-trained responses
- 05:01–07:11 — Defining Intelligence: Minsky vs McCarthy views; intelligence as process vs skill as output; road-building company analogy for adaptive capability
- 07:12–08:56 — Benchmark Limitations: Why exam-based benchmarks mislead; shortcut rule causing target achievement while missing broader intelligence goals
- 08:57–10:57 — ARC-1 Scaling Resistance: 50,000x parameter increase yielded minimal improvement; test-time adaptation became necessary for fluid intelligence demonstration
- 10:58–12:54 — ARC-2 Compositional Reasoning: More sophisticated tasks requiring deliberate thinking; baseline models achieve 0% while static reasoning reaches 1-2%
- 12:55–14:57 — Human vs AI Performance: Random people including Uber drivers solve ARC-2 tasks that stump advanced AI; majority voting by 10 people yields 100% accuracy
- 14:58–16:59 — ARC-3 Interactive Agency: Departure from input-output format to assess exploration, goal-setting, and autonomous achievement in novel environments
- 17:00–21:59 — Kaleidoscope Hypothesis: Universe composed of similar patterns; intelligence mines reusable abstractions from experience for novel situation adaptation
- 22:00–25:59 — Type 1 vs Type 2 Abstractions: Continuous domain perception/intuition versus discrete program-centric reasoning; transformers excel at former but struggle with latter
- 26:00–28:59 — Discrete Program Search: Combinatorial search over symbolic operations enables invention beyond automation; all creative AI systems rely on discrete search
- 29:00–31:59 — Fusing Intuition with Reasoning: Type 1 approximations guide Type 2 search to overcome combinatorial explosion; chess-playing analogy combining pattern recognition with calculation
- 32:00–33:43 — Meta-Learning System Architecture: Programmer-like AI assembling task-specific models from global abstraction library; continuous improvement through experience accumulation
- 33:44–34:30 — Ndea Research Lab: New lab focused on independent AI discovery and invention to accelerate scientific progress through deep-learning-guided program search
The Scaling Paradigm's Fundamental Limitations
- The pre-training scaling approach dominated AI research through the 2010s, based on consistent improvements across benchmarks as model size and training data increased (a schematic of the underlying scaling relation follows this list). This created a widespread belief that general intelligence would spontaneously emerge from sufficient scale, leading the field to focus obsessively on ever-larger models and datasets.
- François Chollet's ARC benchmark revealed the critical flaw in this approach: despite 50,000x scaling in parameters from 2019 to 2024, performance improved from 0% to only 10% while humans consistently score above 95%. This massive disconnect exposed that scaling was optimizing for pattern memorization rather than genuine reasoning capability.
- The confusion between memorized skills and fluid intelligence created a fundamental measurement problem where benchmark improvements didn't translate to general problem-solving ability. Models became increasingly sophisticated at retrieving and applying pre-recorded templates while remaining unable to adapt to truly novel situations.
- Static inference models lacked the critical capability for on-the-fly recombination of learned concepts. At training time, they acquired useful abstractions, but at test time could only fetch and apply predetermined responses rather than synthesizing new solutions for unprecedented problems.
- The limitations became apparent when models that dominated language and vision benchmarks failed completely on tasks that four-year-old children could solve intuitively. This performance gap signaled that fundamental architectural changes were needed rather than continued scaling of existing approaches.
- The 2024 pivot to test-time adaptation represented the field's recognition that general intelligence requires dynamic behavior modification during inference, marking the end of the pre-training scaling era as the primary path to AGI.
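As a point of reference, the "predictable improvement" that motivated the scaling era is usually summarized by empirical power laws of the form below (after Kaplan et al., 2020). The functional form and symbols come from that literature, not from the talk; the point of this section is that ARC performance conspicuously failed to follow any such smooth curve despite 50,000x more parameters.

```latex
% Schematic scaling law: pre-training loss L falls as a power of parameter count N.
% N_c and \alpha_N are fitted constants from the scaling-law literature, shown only
% to illustrate the smooth, predictable gains that ARC did not exhibit.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```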
Test-Time Adaptation as the New Paradigm
- Test-time adaptation emerged in 2024 as the AI research community's response to scaling limitations, focusing on models that could modify their own behavior dynamically based on the specific data encountered during inference rather than relying solely on pre-trained parameters.
- This paradigm shift encompassed techniques like test-time training, program synthesis, and chain-of-thought reasoning, where models attempt to reprogram themselves for specific tasks (a minimal test-time-training sketch follows this list). The approach moved away from querying pre-loaded knowledge toward genuine learning and adaptation at inference time.
- OpenAI's o3 model demonstrated the paradigm's potential by achieving human-level performance on ARC-1 for the first time through test-time adaptation, though this required task-specific fine-tuning rather than fully general capability. The breakthrough validated that fluid intelligence requires dynamic adaptation mechanisms.
- Every AI system performing meaningfully above zero on ARC utilizes some form of test-time adaptation, proving that static models cannot demonstrate genuine reasoning regardless of their parameter count or training data volume. This pattern holds consistently across all current high-performing systems.
- The shift represents a fundamental change from scaling pre-computed responses to enabling real-time problem-solving, aligning AI development more closely with human cognition where novel situations trigger active reasoning processes rather than template retrieval.
- Test-time adaptation provides the missing recombination capability that allows models to synthesize new solutions from existing knowledge, bridging the gap between memorized patterns and genuine understanding that characterized the scaling paradigm's limitations.
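To make the idea concrete, here is a minimal sketch of test-time training under toy assumptions: a fresh linear model (standing in for a large pre-trained network) is adapted by gradient descent on a single task's demonstration pairs at inference time, then used to predict the test output. The model, task, and hyperparameters are all illustrative, not any lab's actual method.

```python
import numpy as np

def test_time_train(demo_inputs, demo_outputs, test_input, steps=2000, lr=0.02):
    """Adapt a fresh linear model to one task's demonstration pairs at
    inference time, then predict the test output.

    Illustrative only: real test-time-training systems adapt a large
    pre-trained network, not a linear map fitted from scratch."""
    X = np.stack([x.ravel() for x in demo_inputs])   # (n_demos, d_in)
    Y = np.stack([y.ravel() for y in demo_outputs])  # (n_demos, d_out)
    W = np.zeros((X.shape[1], Y.shape[1]))           # task-specific weights

    for _ in range(steps):                           # plain gradient descent on MSE
        grad = X.T @ (X @ W - Y) / len(X)
        W -= lr * grad

    return (test_input.ravel() @ W).reshape(demo_outputs[0].shape)

# Toy usage: the rule "add 1 to every cell" inferred from two demonstrations.
demos_in = [np.array([[1, 2], [3, 4]]), np.array([[0, 1], [2, 3]])]
demos_out = [g + 1 for g in demos_in]
print(np.round(test_time_train(demos_in, demos_out, np.array([[5, 6], [7, 8]]))))
```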
Redefining Intelligence Beyond Task Performance
- Chollet distinguishes between the Minsky view of AI (machines performing human tasks) and the McCarthy view (machines handling unprepared-for problems), arguing that true intelligence involves dealing with novel situations rather than accumulating task-specific skills across predetermined domains.
- The road network versus road-building company analogy illustrates this distinction: having roads between specific points enables travel along predetermined routes, while having road-building capability enables connection of new destinations as needs evolve. Intelligence represents the construction capability rather than the infrastructure.
- Intelligence is framed as a conversion ratio: the efficiency with which a system turns the information available to it (past experience plus developer-imparted priors) into operational area, meaning the range of novel, uncertain future situations it can handle (see the schematic formula after this list). This framing emphasizes adaptation capability over skill accumulation.
- The process-versus-output distinction prevents the category error of attributing intelligence to crystallized behavior programs. Skills represent intelligence outputs rather than intelligence itself, similar to how roads represent road-building outputs rather than construction capability.
- This redefinition challenges conventional AI evaluation methods that measure task performance rather than learning efficiency, adaptation speed, or novel problem-solving capability. Most benchmarks test skill demonstration rather than intelligence measurement.
- Understanding intelligence as adaptive capability rather than skill collection fundamentally changes AI development priorities from building increasingly capable task-specific systems toward creating systems that can rapidly acquire new capabilities when encountering unprecedented challenges.
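A schematic rendering of this conversion-ratio framing is below. It is only a mnemonic for the bullets above, not the formal, Algorithmic-Information-Theory-based definition Chollet gives in "On the Measure of Intelligence".

```latex
% Mnemonic only: intelligence as the efficiency of converting what a system has
% (priors + experience) into the range of novel future situations it can handle.
\text{Intelligence} \;\propto\;
\frac{\text{operational area over novel, uncertain future situations}}
     {\text{developer-imparted priors} \;+\; \text{past experience}}
```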
The Benchmark Problem and Shortcut Rule
- Human-designed benchmarks suffer from fundamental assumptions appropriate for human evaluation but misleading for machine assessment, particularly the assumption that test-takers haven't memorized all questions and answers beforehand. This creates exploitable shortcuts for AI systems with access to vast training data.
- The shortcut rule says that optimizing for a single success measure often hits the target while missing the broader objective: Kaggle-winning solutions prove too complex for production use, and chess AI reached superhuman play without advancing our understanding of intelligence.
- Exam-based benchmarks measure task-specific skill and knowledge rather than fluid intelligence, leading to systems that excel at automation (reproducing known solutions) rather than invention (creating novel solutions). This measurement mismatch has driven decades of AI development in directions that don't advance general intelligence.
- Static benchmarks become saturated as models improve, losing their ability to distinguish between genuine intelligence advances and brute-force scaling improvements. Most existing benchmarks reached saturation levels that couldn't differentiate between human-level reasoning and sophisticated pattern matching.
- The feedback signal from inappropriate benchmarks determines research directions and resource allocation, making benchmark design a critical factor in AI development rather than a technical detail. Measuring the wrong capabilities leads to building the wrong systems.
- Chollet's ARC benchmarks attempt to address these problems by focusing on novel problem-solving using minimal prior knowledge, though even ARC-1 proved insufficient for evaluating the full spectrum of intelligence capabilities, necessitating ARC-2 and the planned ARC-3 developments.
ARC-2's Compositional Reasoning Challenge
- ARC-2, released in March 2025, specifically targets the test-time adaptation paradigm by requiring compositional generalization rather than pattern recognition, featuring tasks that demand deliberate thinking while remaining feasible for ordinary humans without specialized knowledge (the task format and exact-match scoring are sketched after this list).
- The benchmark validation involved testing 400 people recruited in San Diego, including Uber drivers, UCSD students, and unemployed individuals, showing that people without prior training could solve the tasks: majority voting among groups of 10 yielded 100% accuracy.
- Baseline language models like GPT-4.5 and Llama achieve 0% on ARC-2, demonstrating that memorization-based approaches provide no benefit for compositional reasoning tasks. Even static reasoning systems using single chain-of-thought generation perform only marginally better at 1-2%.
- The performance gap between human capability (essentially 100% with group consensus) and current AI systems (near 0% for most approaches) indicates substantial room for improvement and validates that current systems lack fundamental reasoning capabilities that humans possess naturally.
- ARC-2 enables granular evaluation of test-time adaptation systems, revealing that even advanced models like o3 haven't reached human-level performance despite their success on ARC-1. This granularity provides a better feedback signal for research progress.
- The benchmark's resistance to brute-force approaches ensures that progress requires genuine advances in compositional reasoning rather than computational resource increases, making it a more reliable indicator of intelligence development than scalable pattern-matching tasks.
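For orientation, the sketch below shows the kind of input-output task format the ARC benchmarks use and the exact-match grading it implies. The field names follow the public ARC-AGI JSON layout; the task itself is a made-up toy, so treat the details as assumptions rather than a specification of ARC-2.

```python
import json

# One ARC-style task: a few demonstration pairs plus a held-out test pair.
# Field names follow the public ARC-AGI JSON layout; the task itself is a toy.
task = json.loads("""
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
""")

def score(predictions, task):
    """Exact-match grading: a test pair counts only if every cell is correct."""
    pairs = task["test"]
    correct = sum(pred == pair["output"] for pred, pair in zip(predictions, pairs))
    return correct / len(pairs)

# A hypothetical solver would infer "swap the columns" from the train pairs.
predicted = [[[0, 3], [3, 0]]]
print(score(predicted, task))  # 1.0
```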
ARC-3's Interactive Agency Assessment
- ARC-3 represents a significant departure from input-output pair formats toward assessing agency, exploration, interactive learning, goal-setting, and autonomous achievement in novel environments where AI systems must discover objectives and mechanics independently.
- The benchmark drops AI into unique environments without revealing controls, goals, or gameplay mechanics, requiring systems to figure out everything from scratch including determining what they're supposed to accomplish, similar to human problem-solving in completely unfamiliar contexts.
- Efficiency becomes central to ARC-3 evaluation, with models graded not just on task completion but on action efficiency compared to human performance. Strict limits on allowable actions prevent brute-force exploration strategies that don't demonstrate genuine understanding (a generic interaction loop with an action budget is sketched after this list).
- Every interactive task builds on core knowledge priors (objectness, basic physics, geometry) rather than specialized domain knowledge, ensuring that success depends on general reasoning ability rather than pre-training on specific environments or game types.
- The planned early 2026 launch timeline with July developer preview allows researchers to begin experimenting with interactive agency assessment, potentially revealing new bottlenecks in current AI systems that static benchmarks cannot capture.
- ARC-3 addresses limitations of current benchmarks that focus on static problem-solving rather than dynamic interaction, exploration, and goal discovery that characterize real-world intelligence applications requiring autonomous operation in novel environments.
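ARC-3 itself is not yet public, so the sketch below is only a generic illustration of the interactive setting described above: an agent dropped into an unknown environment, told nothing about the controls or the goal, and graded on whether it succeeds within a strict action budget. The toy environment, its hidden goal, and the random policy are all invented.

```python
import random

class ToyEnvironment:
    """Stand-in for an unknown interactive task: the agent is told nothing
    about the goal (here: reach cell 5 on a 1-D track) or what actions do."""
    def __init__(self):
        self.position = 0

    def step(self, action):            # actions 0/1 move left/right, undocumented
        self.position += 1 if action == 1 else -1
        observation = self.position
        solved = self.position == 5
        return observation, solved

def run_episode(agent_policy, action_budget=50):
    """Grade on success within a strict action budget, mirroring the
    efficiency-centric evaluation described above."""
    env = ToyEnvironment()
    history = []
    for actions_used in range(1, action_budget + 1):
        action = agent_policy(history)
        observation, solved = env.step(action)
        history.append((action, observation))
        if solved:
            return {"solved": True, "actions_used": actions_used}
    return {"solved": False, "actions_used": action_budget}

# A random explorer; a competent agent would infer the dynamics from history.
print(run_episode(lambda history: random.choice([0, 1])))
```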
The Kaleidoscope Hypothesis and Abstraction Mining
- The kaleidoscope hypothesis proposes that apparent novelty and complexity in the universe emerges from recombinations of a limited set of fundamental "atoms of meaning" that appear across different domains and scales, from trees resembling neurons to electromagnetic fields paralleling hydrodynamics.
- Intelligence functions as the ability to mine experience for reusable abstractions that transfer across situations, identifying invariant structures and repeated principles that can be recombined on-the-fly to create situation-specific models for novel problems.
- The hypothesis explains why nothing is truly novel despite appearances: every new situation contains familiar elements that can be understood through previously extracted abstractions, enabling intelligent systems to make sense of unprecedented scenarios through creative recombination.
- Abstraction acquisition involves efficiently extracting reusable building blocks from past experience, while on-the-fly recombination enables selecting and combining these components into models adapted for current situations. Both processes must operate efficiently to demonstrate intelligence.
- The efficiency emphasis distinguishes intelligence from mere capability: systems requiring hundreds of thousands of hours to acquire simple skills or exhaustive enumeration to find solutions demonstrate low intelligence regardless of eventual success. Intelligence measures learning and deployment efficiency.
- This framework explains why scaling pre-trained models without recombination capability failed to achieve general intelligence - they could acquire abstractions during training but lacked mechanisms for creative recombination during inference, limiting them to template retrieval rather than novel synthesis.
Type 1 vs Type 2 Abstraction Systems
- Chollet identifies two fundamental abstraction types underlying all cognition: Type 1 (value-centric) operating over continuous domains through distance functions, and Type 2 (program-centric) operating over discrete domains through exact structure matching and isomorphism detection.
- Type 1 abstraction drives perception, pattern recognition, and intuition by comparing instances via continuous distance functions and merging them into common templates by eliminating irrelevant details. This mirrors modern machine learning approaches and transformer architecture capabilities.
- Type 2 abstraction underlies reasoning, planning, and systematic thinking by comparing discrete programs (graphs) and identifying exact structural relationships rather than approximate similarities. This resembles software engineering refactoring and formal logical reasoning processes.
- Transformers excel at Type 1 abstraction, enabling breakthrough performance in perception, intuition, and pattern recognition tasks, but struggle with Type 2 abstraction, explaining their difficulty with simple discrete tasks like sorting lists or adding digit sequences (a minimal contrast between the two types is sketched after this list).
- Human intelligence combines both abstraction types seamlessly, using Type 1 intuition to guide Type 2 reasoning and vice versa. Chess playing exemplifies this integration where pattern recognition narrows move options that systematic calculation then evaluates.
- The left-brain/right-brain metaphor captures this distinction with one hemisphere specializing in perception and intuition (Type 1) while the other handles reasoning and planning (Type 2), though both systems typically collaborate in human cognition.
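A toy contrast between the two abstraction types, under invented representations: Type 1 compares instances with a continuous distance function over embedding vectors (things can be "close enough"), while Type 2 looks for exact shared structure between small expression trees (a subprogram either matches or it doesn't).

```python
import math

# --- Type 1: value-centric abstraction via a continuous distance function ---
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

cat, kitten, truck = [0.9, 0.8, 0.1], [0.85, 0.75, 0.2], [0.1, 0.2, 0.95]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, truck))  # True: "close enough"

# --- Type 2: program-centric abstraction via exact structural matching ---
# Expressions as nested tuples, e.g. ("add", "x", "1"). Two programs share an
# abstraction only if a subtree matches exactly (an isomorphism, not a distance).
def subtrees(expr):
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subtrees(child)

prog_a = ("mul", ("add", "x", "1"), "y")
prog_b = ("sub", ("add", "x", "1"), "z")
shared = set(subtrees(prog_a)) & set(subtrees(prog_b))
print(("add", "x", "1") in shared)  # True: exact shared structure, no "almost"
```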
Discrete Program Search for Inventive AI
- Discrete program search represents combinatorial exploration over graphs of symbolic operations from domain-specific languages, contrasting with continuous optimization approaches that manipulate interpolated parameter spaces through gradient descent methods.
- All AI systems demonstrating genuine invention and creativity rely on discrete search mechanisms, from 1990s genetic algorithms designing novel antennas to AlphaGo's move 37 and DeepMind's recent AlphaEvolve system, proving that search enables invention while pure deep learning achieves automation.
- Program synthesis involves searching discrete program spaces rather than fitting parametric curves, trading machine learning's data hunger for extreme compute requirements that grow combinatorially with problem complexity (a brute-force enumeration sketch follows this list).
- The fundamental trade-off between machine learning and program synthesis mirrors their complementary strengths: ML requires dense data sampling but enables efficient curve fitting, while program synthesis needs only 2-3 examples but faces combinatorial explosion in program space exploration.
- Deep learning excels at automation by interpolating between known examples, but invention requires exploring discrete possibilities that don't lie on continuous manifolds. This explains why scaling transformers couldn't achieve breakthrough innovation despite massive capability improvements.
- Future AI systems must combine both approaches to overcome their individual limitations, using continuous approximations to guide discrete search and prevent combinatorial explosion while maintaining the inventive capability that only symbolic reasoning provides.
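A minimal sketch of discrete program search under toy assumptions: enumerate compositions of a tiny hand-picked DSL in order of increasing length and keep the first program that reproduces every demonstration exactly. The DSL and examples are invented; real systems search vastly richer operator spaces, which is exactly where the combinatorial explosion discussed above comes from.

```python
from itertools import product

# A tiny domain-specific language of list transformations (illustrative).
DSL = {
    "reverse":   lambda xs: xs[::-1],
    "sort":      sorted,
    "drop_last": lambda xs: xs[:-1],
    "double":    lambda xs: [2 * x for x in xs],
}

def search_program(examples, max_depth=3):
    """Enumerate operation sequences by increasing length and return the
    first one that reproduces every demonstration exactly."""
    for depth in range(1, max_depth + 1):
        for ops in product(DSL, repeat=depth):
            def run(xs, ops=ops):
                for name in ops:
                    xs = DSL[name](xs)
                return xs
            if all(run(inp) == out for inp, out in examples):
                return ops
    return None

# Two demonstrations suffice: the target program is "sort, then double".
examples = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
print(search_program(examples))  # ('sort', 'double')
```

Note the data efficiency: two demonstrations pin down the program, but the number of candidates grows as |DSL| to the power of program length.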
Fusing Intuition with Symbolic Reasoning
- The integration of Type 1 and Type 2 abstraction requires using fast approximate judgments from continuous systems to guide discrete search processes, preventing combinatorial explosion while preserving the inventive capability of symbolic reasoning.
- Chess exemplifies this integration where players use pattern recognition (Type 1) to identify promising moves, then apply systematic calculation (Type 2) to evaluate selected options rather than exhaustively analyzing all possibilities. Intuition makes reasoning tractable.
- The map-drawing analogy illustrates the integration process: embedding discrete objects with discrete relationships into latent spaces where continuous distance functions enable fast approximate judgments about complex relationships, keeping combinatorial explosion manageable during search.
- Deep learning provides the continuous approximation capability needed to guide program search, offering rapid but approximate evaluations of program-space regions that would otherwise require exhaustive exploration through purely symbolic methods (a guided-search sketch follows this list).
- This fusion enables AI systems to maintain the precision and inventiveness of symbolic reasoning while avoiding the computational intractability that makes pure program synthesis impractical for complex problems requiring extensive search.
- The approach mirrors human problem-solving where unconscious pattern matching guides conscious reasoning, suggesting that successful AGI will require similar architectural integration rather than purely symbolic or purely connectionist approaches.
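Extending the brute-force sketch above, the following toy shows how a fast approximate score (a hand-written distance here, a learned Type 1 model in a real system) can guide a best-first search so that promising candidate programs are expanded before unpromising ones. The DSL, heuristic, and budget are all illustrative assumptions.

```python
import heapq
from itertools import count

DSL = {
    "reverse":   lambda xs: xs[::-1],
    "sort":      sorted,
    "drop_last": lambda xs: xs[:-1],
    "double":    lambda xs: [2 * x for x in xs],
}

def run(ops, xs):
    for name in ops:
        xs = DSL[name](xs)
    return xs

def intuition_score(ops, examples):
    """Cheap stand-in for a learned value function: how far are this
    program's outputs from the targets? Lower is more promising."""
    total = 0
    for inp, out in examples:
        got = run(ops, inp)
        total += abs(len(got) - len(out))
        total += sum(abs(a - b) for a, b in zip(got, out))
    return total

def guided_search(examples, max_depth=3, budget=200):
    """Best-first search over op sequences, expanding low-score candidates first."""
    tie = count()  # unique tie-breaker so the heap never compares op tuples
    frontier = [(intuition_score((), examples), next(tie), ())]
    expanded = 0
    while frontier and expanded < budget:
        score, _, ops = heapq.heappop(frontier)
        if score == 0:
            return ops, expanded
        expanded += 1
        if len(ops) < max_depth:
            for name in DSL:
                child = ops + (name,)
                heapq.heappush(frontier, (intuition_score(child, examples), next(tie), child))
    return None, expanded

examples = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
print(guided_search(examples))  # (('sort', 'double'), <nodes expanded>)
```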
Meta-Learning System Architecture
- Chollet envisions AGI as programmer-like systems that approach new tasks by synthesizing custom software, blending deep learning modules for perception-type problems with algorithmic modules for reasoning-type problems under the guidance of discrete program search.
- The meta-learning architecture maintains a global library of reusable abstractions that continuously evolves through experience, similar to software engineers using GitHub to share and reuse code libraries for common functionality across different projects.
- When encountering new problems, the system searches its abstraction library for relevant building blocks, synthesizes new components as needed, and uploads successful innovations back to the library for future use, creating a virtuous cycle of capability improvement (a toy rendering of this loop is sketched after this list).
- Deep learning based intuition guides the program search process by providing fast approximate judgments about program space structure, enabling efficient exploration of vast combinatorial possibilities without exhaustive enumeration that would be computationally prohibitive.
- The system architecture enables rapid assembly of working models for novel situations, paralleling how experienced software engineers quickly create solutions by combining existing tools and libraries rather than building everything from scratch for each new problem.
- Continuous self-improvement occurs through both library expansion (adding new abstractions) and search refinement (improving intuition about program space structure), creating systems that become increasingly capable over time through accumulated experience.
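A toy rendering of the library-and-synthesis loop described above, with every detail invented for illustration: a global dictionary of named abstractions, a solver that searches compositions of them for a new task, and successful compositions registered back into the library so later tasks can reuse them as single steps.

```python
from itertools import product

# Global library of reusable abstractions, seeded with a few primitives.
LIBRARY = {
    "reverse": lambda xs: xs[::-1],
    "sort":    sorted,
    "double":  lambda xs: [2 * x for x in xs],
}

def compose(names):
    def program(xs):
        for name in names:
            xs = LIBRARY[name](xs)
        return xs
    return program

def solve_and_learn(task_name, examples, max_depth=2):
    """Search compositions of library entries; if one fits the task,
    register it back into the library under a new name for future reuse."""
    for depth in range(1, max_depth + 1):
        for names in product(list(LIBRARY), repeat=depth):
            program = compose(names)
            if all(program(inp) == out for inp, out in examples):
                LIBRARY[task_name] = program  # the library grows with experience
                return names
    return None

print(solve_and_learn("sorted_doubling", [([3, 1, 2], [2, 4, 6])]))  # ('sort', 'double')
print("sorted_doubling" in LIBRARY)  # True: available as a single step next time
```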
Ndea's Scientific Discovery Mission
- Ndea, Chollet's new research lab, was founded on the premise that accelerating scientific progress requires AI capable of independent invention and discovery rather than automation systems that operate within existing knowledge boundaries but cannot expand them.
- The lab's approach uses deep-learning-guided program search to build programmer-like meta-learning systems, on the belief that scientific discovery requires invention capability beyond what pure automation can provide, no matter how sophisticated the automation becomes.
- Their first milestone involves solving ARC-2 using systems that start with zero knowledge about the benchmark, proving that their approach can achieve fluid intelligence through learning rather than task-specific optimization or fine-tuning on known problems.
- The ultimate goal focuses on empowering human researchers and accelerating scientific timelines rather than replacing human scientists, positioning AI as a tool for expanding research capability rather than substituting for human creativity and insight.
- Ndea's philosophy distinguishes between deep learning's strength in automation and the additional capabilities needed for scientific discovery, arguing that breakthrough research requires systems that can venture beyond known solution spaces into genuinely unexplored territory.
- The lab represents Chollet's bet that combining symbolic reasoning with neural approximation will unlock the inventive capability needed for scientific advancement, testing this hypothesis through concrete benchmarks before attempting real-world scientific applications.
Conclusion
François Chollet's analysis reveals why the scaling paradigm failed to achieve AGI despite massive computational and data increases: it optimized for memorization rather than fluid intelligence. His ARC benchmarks expose the fundamental gap between pattern matching and genuine reasoning, while test-time adaptation offers a promising path forward. However, true AGI requires combining deep learning's continuous approximations with discrete program search's inventive capability, creating meta-learning systems that can synthesize novel solutions rather than merely retrieving pre-trained responses. This architectural fusion, guided by efficiency principles and evaluated through progressively challenging benchmarks, represents the most promising current path toward artificial general intelligence.
Practical Implications
- Focus AI development on test-time adaptation capabilities rather than purely scaling pre-trained models
- Design benchmarks that measure fluid intelligence and novel problem-solving rather than memorized skill demonstration
- Invest in hybrid architectures combining continuous neural networks with discrete symbolic reasoning systems
- Develop meta-learning systems that can synthesize task-specific solutions rather than applying pre-trained templates
- Build global abstraction libraries that enable knowledge sharing and reuse across different problem domains
- Prioritize efficiency metrics for both learning and deployment rather than just final task performance
- Create evaluation frameworks that assess compositional generalization and creative recombination abilities
- Implement discrete program search guided by neural approximations to overcome combinatorial explosion challenges
- Build systems that can discover goals and mechanics in novel environments rather than operating in predetermined contexts
- Focus on invention capability that expands knowledge boundaries rather than automation that operates within existing limits
- Develop benchmarks that resist brute-force approaches and require genuine reasoning for success
- Integrate Type 1 (continuous) and Type 2 (discrete) abstraction systems rather than relying on either approach alone