
CES 2026 - OpenAI Makes the Most Accessible Code (for an LLM)

At CES 2026, the GAAD Foundation and ServiceNow unveiled the AI Model Accessibility Checker. OpenAI’s GPT-5.2 secured the top spots for accessible code generation, while Google’s Gemini 3.0 Pro ranked last, revealing a major gap in inclusive software development priorities.


At CES 2026, the GAAD Foundation, in partnership with ServiceNow, unveiled the results of its inaugural AI Model Accessibility Checker (AAC), a new benchmark designed to evaluate the quality and inclusivity of code generated by large language models (LLMs). While OpenAI’s GPT-5.2 series secured the top rankings for producing accessible web code, Google’s Gemini 3.0 Pro finished last among the 36 models tested, highlighting a significant disparity in how major AI providers are prioritizing inclusive software development.

Key Points

  • Top Performer: OpenAI’s GPT-5.2 models claimed the top four spots in the AAC benchmark, demonstrating superior adherence to web accessibility standards.
  • Critical Failure: Despite Google owning the industry-standard Lighthouse testing tool, its Gemini 3.0 Pro model ranked 36th out of 36 models tested.
  • Common Errors: Color contrast violations accounted for 80% to 90% of the accessibility failures detected across all models.
  • Market Reality: Current data shows 94% of web pages and 72% of common mobile app user journeys still fail basic accessibility tests.

Benchmarking AI Code Generation

The AAC initiative marks a shift in how the tech industry evaluates artificial intelligence. Rather than solely measuring speed or conversational accuracy, this benchmark assesses the output code against established web accessibility standards. The initiative is led by the GAAD Foundation, the organization behind Global Accessibility Awareness Day, to encourage foundational model companies to prioritize inclusivity at the code level.

According to the test results released at CES, OpenAI has established a clear lead. Five of the top ten performing models belonged to OpenAI, with their GPT-5.2 series occupying the top four positions. In contrast, Anthropic’s Claude models placed in the middle of the pack, while Google suffered a notable defeat.

"The biggest shocker of all to me was Gemini 3.0 Pro came in dead last—36 out of 36. They have Lighthouse. If they just trained on Lighthouse, they would get a perfect score."

— Joe Devon, Chair of the GAAD Foundation

Devon noted the irony that Google, which develops Lighthouse—a premier developer tool for automated accessibility and SEO testing—failed to integrate those same standards effectively into its flagship AI model’s training data.

Methodology and Technical Findings

The AAC utilizes axe-core, an automated testing engine, to evaluate the generated code. Because automated tools can typically detect only 30% to 50% of accessibility issues, the benchmark focuses heavily on programmatic errors that machine validation can catch reliably.
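As a rough illustration of the kind of rule-based check an engine like axe-core automates reliably, the sketch below scans HTML for form inputs that lack an accessible name (no `aria-label` and no matching `<label for>`). It uses only the Python standard library and is a simplified toy, not axe-core's actual rule implementation:

```python
# Toy version of a "form elements must have labels" check, the kind of
# programmatic rule automated accessibility engines can catch reliably.
from html.parser import HTMLParser

class LabelCheck(HTMLParser):
    """Collect <input> elements and the ids targeted by <label for=...>."""

    def __init__(self):
        super().__init__()
        self.inputs = []          # (element id or None, has aria-label)
        self.label_targets = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") not in ("hidden", "submit", "button"):
            self.inputs.append((a.get("id"), "aria-label" in a))
        elif tag == "label" and "for" in a:
            self.label_targets.add(a["for"])

    def unlabeled(self):
        """Indices of inputs with neither an aria-label nor a matching label."""
        return [i for i, (el_id, aria) in enumerate(self.inputs)
                if not aria and el_id not in self.label_targets]

checker = LabelCheck()
checker.feed("""
  <label for="email">Email</label><input id="email" type="text">
  <input type="text" placeholder="Search">
""")
print(checker.unlabeled())  # [1]: the search box has no accessible name
```

A placeholder attribute, as on the search box above, is not a substitute for a label, which is exactly the class of mistake the benchmark flags in generated code.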

The analysis revealed that the vast majority of AI-generated code failures stem from basic design implementation:

  • Color Contrast: Approximately 80% to 90% of all flagged issues related to insufficient contrast between text and background, making content difficult for visually impaired users to read.
  • Missing Labels: A significant number of models failed to generate proper labels for form elements and interactive controls.
  • HTML Structure: The benchmark analyzed over 1,000 HTML pages across 28 categories to see how models handled structural elements.
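Color contrast, the dominant failure category above, is checked mechanically using the WCAG 2.x formula: linearize each sRGB channel, compute the relative luminance of each color, and compare the resulting ratio against the 4.5:1 AA threshold for normal-size text. A minimal Python sketch (the function names are illustrative, not from any particular library):

```python
# WCAG 2.x contrast-ratio check, as applied by tools like axe-core
# to text/background color pairs. Formula per the WCAG specification.

def _channel(c: int) -> float:
    """Linearize one 8-bit sRGB channel per the WCAG definition."""
    c = c / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color (0.0 = black, 1.0 = white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)

def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio between two colors, from 1:1 up to 21:1."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# AA requires 4.5:1 for normal text. Medium gray on white falls just short:
print(round(contrast_ratio("#777777", "#ffffff"), 2))  # 4.48 (fails AA)
print(round(contrast_ratio("#000000", "#ffffff"), 2))  # 21.0 (passes)
```

The gray-on-white case shows why this failure mode is so common: a color pair that looks perfectly readable to many sighted developers can still sit below the threshold, and a model trained without contrast-aware data has no reason to avoid it.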

Interestingly, the study found a divergence between accessibility and typographic quality: models that performed well on accessibility metrics often performed poorly on correct em-dash usage, suggesting that current training data may treat semantic code quality and typographic nuance as separate optimization tracks.

The Business Imperative and Demographic Shifts

Beyond the technical benchmarks, the release of the AAC underscores how slowly digital inclusion is progressing. Devon cited the WebAIM Million report, which indicates that the percentage of web pages failing accessibility tests has improved only from 97% to roughly 94% over the last six years. Similarly, a "State of Mobile App Accessibility" report by ArcTouch found that 72% of common user journeys in top mobile apps result in poor or failing experiences.

While AI creates efficiency, the failure to prioritize accessibility poses a long-term business risk, particularly regarding shifting demographics. With the Millennial generation entering their 40s, the prevalence of age-related disabilities—such as vision and hearing loss—is set to explode.

"We have 50% of the population... that's above 40. This is a demographic explosion that's going to happen, and the companies that are not focused on accessibility are going to learn very quickly. It's just like that tipping point where they're going to be like, 'Oh, okay. We actually have to pay attention to this because most of the world has some kind of disability that's age-related.'"

— Joe Devon, Chair of the GAAD Foundation

Devon argues that AI developers must treat accessibility not as a compliance checklist, but as a dataset of "edge cases." In machine learning, solving for edge cases—the diverse range of human abilities—typically results in a more robust and capable model for all users.

Future Developments

The GAAD Foundation intends to evolve the AAC from a static benchmark into an interactive feedback loop. Future versions of the tool are expected to feed failure data back to the models to test whether they can self-correct and generate improved code in subsequent attempts.

As the industry moves toward the 15th anniversary of Global Accessibility Awareness Day this May, the pressure is now on foundational model providers like Google and Anthropic to close the gap with OpenAI. The data suggests that without intentional intervention in training sets, AI risks perpetuating the digital barriers that have plagued the web for decades.
