
Google Bypasses Publisher AI Opt-Outs, Only Blocks DeepMind Training


Recent court testimony reveals Google continues training its search AI features on publisher content despite explicit opt-outs, creating a difficult choice for website owners between AI protection and search visibility.

Key Takeaways

  • Google confirmed in court that publisher opt-outs only block DeepMind's AI training, not Google Search's AI features like AI Overviews.
  • The "Google-Extended" directive in robots.txt files doesn't prevent content from being used in search-specific AI products.
  • Publisher opt-outs have cut Google DeepMind's training data in half, removing 80 billion of 160 billion content tokens.
  • The only way publishers can fully prevent AI training is to completely remove their sites from Google's index, sacrificing all search visibility.
  • The Department of Justice is examining these practices as part of its ongoing antitrust case against Google's search dominance.

The Opt-Out Loophole Revealed

In a significant revelation during Google's ongoing antitrust trial, a senior executive has confirmed that the company trains its search-specific AI products on web content even when publishers have explicitly opted out of AI training. Eli Collins, Vice President at Google DeepMind, testified that while publishers can block their content from being used to train DeepMind's AI models, this protection doesn't extend to Google's search organization.

When directly questioned by Diana Aguilar, an attorney for the Department of Justice, Collins acknowledged that "once you take the Gemini [AI model] and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training." This testimony has revealed a significant loophole in Google's publisher controls, raising serious concerns about content rights and digital competition.

How Publisher Opt-Outs Actually Work

In September 2023, Google introduced the "Google-Extended" user agent for robots.txt files, allowing websites to request that Google not use their content for large language model (LLM) training. This mechanism was presented as a way for publishers to protect their content from being used in generative AI without harming their SEO performance.
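
For reference, the opt-out is applied through a standard robots.txt rule. A minimal example using the Google-Extended token looks like the following (the directive is Google's documented syntax; the blanket Disallow shown here opts out the whole site):

    # robots.txt: ask Google not to use this site's content for LLM training
    User-agent: Google-Extended
    Disallow: /

As the testimony below makes clear, this rule governs only DeepMind's model training, not the AI features built by the Search organization.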

However, the court testimony has now clarified that this opt-out mechanism only applies to Google DeepMind's AI training activities. The Google Search organization, which develops and deploys features like AI Overviews, operates under different rules and can continue to train its AI models using publisher content regardless of opt-out preferences.

This distinction has had a measurable impact on Google's AI training data. According to documents presented in court, publisher opt-outs have reduced Google DeepMind's training corpus by half, removing approximately 80 billion of 160 billion tokens (the small units of text that models train on). When Judge Amit Mehta asked Collins to confirm this reduction, he stated, "That is correct."

The Publisher's Dilemma

This revelation creates a difficult choice for website owners and publishers. The only way to completely prevent Google from using content for AI training across all its divisions is to remove the site from Google's index entirely. However, this drastic step would eliminate all search visibility, effectively cutting off a major source of traffic and revenue.
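
As a point of reference beyond the testimony itself, de-indexing of this kind is typically done by blocking Google's main search crawler in robots.txt or by serving a noindex directive; a sketch of the robots.txt approach:

    # robots.txt: block Google's search crawler from the entire site,
    # which ends crawling and, over time, search visibility
    User-agent: Googlebot
    Disallow: /

Unlike the Google-Extended rule above, this forfeits all organic search traffic, which is precisely the trade-off publishers face.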

Publishers have already expressed concerns that Google's AI summarization features could discourage users from visiting their websites, impacting their revenue streams. Now they face the added frustration of learning that their attempts to opt out of AI training have been only partially effective.

The situation highlights the power imbalance between Google and content creators. While Google technically provides an opt-out mechanism, its limited scope forces publishers to choose between protecting their content from all AI training and maintaining their search visibility, a choice many cannot afford to make.

Antitrust Implications and Future Outlook

This revelation comes at a critical time in Google's antitrust battle. Having been found last year to have illegally monopolized the online search market, the company now faces potential structural remedies, including forced divestiture of its Chrome browser and restrictions on default search engine deals.

The Department of Justice is examining these AI training practices as part of its broader case against Google's search dominance. The concern is that by controlling both search and increasingly powerful AI features, Google could further entrench its market position while using publisher content in ways those publishers never intended or authorized.

For publishers and website owners, this situation underscores the need for more transparent and comprehensive controls over how their content is used in AI systems. As generative AI becomes more integrated into search and other digital services, the tension between content creators and technology platforms is likely to intensify, potentially leading to new regulatory frameworks or industry standards.

Google's practice of training search AI on opted-out publisher content reveals a significant gap between what website owners believe they're controlling and what's actually happening with their content, highlighting the growing tensions between AI advancement and content rights.
