The landscape of AI-assisted development shifted dramatically today with the simultaneous release of Anthropic’s Opus 4.6 and OpenAI’s GPT-5.3 Codex. For engineers and technical founders, this isn't just about incrementally faster models; it represents a bifurcation in how we interact with AI code generation. To cut through the noise and move beyond "hot takes," we sat down with Morgan Linton—CTO of Bold Metrics and a veteran engineer—to put these models through a rigorous head-to-head challenge.
The goal was not merely to review the changelogs but to execute a tactical build: recreating the multi-billion dollar prediction market app, Polymarket. The results revealed distinct personalities for each model, offering a roadmap for developers trying to decide which tool belongs in their tech stack.
Key Takeaways
- Philosophical Divergence: GPT-5.3 Codex acts as a rapid, interactive collaborator (the "Founding Engineer"), while Opus 4.6 functions as a thoughtful, autonomous system (the "Staff Engineer").
- Configuration is Critical: Unlocking Opus 4.6’s full potential requires specific updates to your `settings.json` file, particularly to enable experimental agent teams.
- The Polymarket Challenge: While Codex built a functional prototype in under four minutes, Opus 4.6 delivered a superior, architecturally sound product with 96 unit tests and a polished UI, albeit at a much higher token cost.
- Token Consumption: Opus 4.6 is token-hungry, utilizing over 150,000 tokens for a single complex build, compared to Codex’s lean efficiency.
A Tale of Two Philosophies
The most striking realization from testing both models is that Anthropic and OpenAI are no longer running the same race. They are optimizing for fundamentally different engineering workflows.
GPT-5.3 Codex is designed for speed and interactivity. It prioritizes "progressive execution," allowing the developer to steer the ship mid-course. It is the ideal partner for "vibe coding"—where you want to iterate fast, fix bugs on the fly, and maintain a tight feedback loop. It asks, "How fast can we ship this?"
Conversely, Opus 4.6 has leaned heavily into agentic behavior. It features a massive context window (one million tokens) and is designed to reason over entire repositories. It operates less like a pair programmer and more like a team of autonomous agents. When given a task, it asks, "Should we do this, and what is the most robust architecture to support it?"
With GPT-5.3 Codex, the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, and course-correct as it works. With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
Getting Under the Hood: Setup and Configuration
Many developers attempting to use Opus 4.6 immediately upon release may inadvertently be running older versions or missing out on the flagship "Agent Teams" feature due to configuration oversight. To ensure a fair comparison, proper setup is required.
Configuring Opus 4.6
To leverage the multi-agent capabilities, you must manually update your configuration settings. Simply updating the CLI is often insufficient.
- Verify Version: Run `npm update` and check your version. You should be on version 2.1.32 or higher.
- Edit Settings: Navigate to `~/.claude/settings.json`.
- Enable Agents: You must add the following flag to unlock the new capabilities: `"Claude Code Experimental Agent Teams": 1`.
- Tmux Integration: For developers using Warp or similar terminals, ensure `tmux` is installed. You can set the agent display to "split panes" in the JSON file to visualize agents working in parallel.
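Put together, the resulting `~/.claude/settings.json` might look like the fragment below. Only the agent-teams flag is quoted from the source; the `"agentDisplay"` key name is an illustrative assumption for the split-panes option described above:

```json
{
  "Claude Code Experimental Agent Teams": 1,
  "agentDisplay": "split panes"
}
```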
API-Level Adaptive Thinking
For those integrating Opus 4.6 via API, Anthropic has introduced "Adaptive Thinking." This allows developers to set an effort level. Setting the effort to max removes constraints on thinking depth, allowing the model to reason exhaustively before outputting code. Notably, if you attempt to use "max" effort on older models, the API will return an error.
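As a sketch of what an API request with Adaptive Thinking might look like: the shape below is an assumption based on the article's description, not a confirmed API schema. The `effort` field, the `"adaptive"` type, and the model identifier are all hypothetical placeholders.

```typescript
// Hypothetical request payload illustrating Adaptive Thinking at max effort.
// Field names (`thinking`, `effort`) and the model ID are assumptions, not
// documented API parameters.
interface ThinkingConfig {
  type: "adaptive";
  effort: "low" | "medium" | "high" | "max";
}

interface MessageRequest {
  model: string;
  max_tokens: number;
  thinking: ThinkingConfig;
  messages: { role: "user" | "assistant"; content: string }[];
}

const request: MessageRequest = {
  model: "claude-opus-4-6", // assumed model identifier
  max_tokens: 8192,
  // "max" removes constraints on thinking depth per the article; per the
  // article, older models reject this value with an API error.
  thinking: { type: "adaptive", effort: "max" },
  messages: [
    { role: "user", content: "Design an order book matching engine." },
  ],
};

console.log(JSON.stringify(request.thinking));
```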
The Showdown: Rebuilding Polymarket
To test the practical application of these philosophies, we tasked both models with building a competitor to Polymarket. The prompts were tailored slightly to leverage each model's strengths:
- Opus Prompt: "Build a competitor to Polymarket. Create an agent team to explore this from different angles: technical architecture, prediction market mechanics, UX, and testing."
- Codex Prompt: "Build a competitor to Polymarket. Think deeply about technical architecture, market mechanics, UX, and testing."
The Build Process
GPT-5.3 Codex lived up to its reputation for speed. Upon receiving the prompt, it immediately identified that the repository was empty and began scaffolding. Within three minutes and 47 seconds, it had deployed a functional application. It executed a "YOLO" style of coding—fast, functional, and direct.
Opus 4.6 took a radically different approach. It immediately spun up four parallel research agents. One researched order book matching engines, another studied Polymarket’s mechanics, a third focused on UX, and a fourth devised a testing strategy. It consumed over 25,000 tokens per agent just during the research phase. It didn't begin writing a single line of code until it had synthesized findings from all four agents.
The Results
When the dust settled, the difference in output was stark.
Codex (The Prototype):
The Codex build was functional. It passed 10 out of 10 unit tests and created a basic working market where users could buy and sell shares. However, the UI was spartan. Even when prompted to "redesign this like Jack Dorsey," the update was superficial—a minor font tweak and a "monochrome" palette without true design sensibility. It felt like a solid MVP built by a junior engineer.
Opus (The Product):
The Opus build was comprehensive. Instead of 10 tests, it wrote 96 unit tests covering edge cases in the order book and matching engine. The architecture was a modular monolith using Next.js 14. Visually, it was stunning right out of the box—featuring dark mode, hover states, proper data visualization, and a sophisticated card hierarchy. It even hallucinated (correctly) a leaderboard and portfolio section that hadn't been explicitly requested but made sense for the product context.
Verdict: The Cost of Quality
The "better" model depends entirely on your constraints regarding time and budget.
Choose GPT-5.3 Codex if: You need speed. If you want to iterate quickly, have a tight feedback loop, and prefer to steer the AI as it codes, Codex is the superior collaborative tool. It is cost-effective and low-latency.
Choose Opus 4.6 if: You need architectural depth and autonomy. Opus 4.6 is the "Senior Staff Engineer" you send off with a vague requirement, knowing they will return with a robust, well-tested, and polished solution. However, this comes at a cost. The Opus build consumed approximately 150,000 to 250,000 tokens—roughly 5 to 10 times the cost of the Codex build.
As Linton put it: "It’s a different personality type. Claude asks, 'Should we do this?' GPT-5.3 is like, 'How fast can I ship this?'"
Conclusion
We are entering an era of specialized AI engineering. Teams will likely find themselves using both models: Codex for the daily grind of function writing and rapid prototyping, and Opus 4.6 for complex refactors, system architecture, and "greenfield" projects where deep reasoning is non-negotiable.
For engineering leaders, the recommendation is clear: give your teams access to both. The cost of tokens is negligible compared to the productivity gains of having both a rapid prototyper and a thoughtful architect at your fingertips.