robots.txt opt-outs are unenforceable and AI companies are ignoring them, leaving website owners with no practical way to prevent their content from being scraped for training
Website owners who add AI crawler blocks to their robots.txt files are discovering it makes little difference. The robots.txt protocol is a voluntary convention, not a legal requirement — crawlers can simply ignore it. In December 2025, OpenAI removed robots.txt compliance language from its ChatGPT-User crawler documentation. Multiple AI companies have been documented ignoring robots.txt exclusions entirely. And even when a company's primary crawler respects robots.txt, the content often exists on third-party sites, caches, or data brokers that the original publisher does not control.
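For context, this is what such an opt-out looks like. A minimal sketch of a robots.txt that disallows some widely documented AI training crawlers while leaving the site open to everything else (the user-agent names here are illustrative; the actual list changes as new crawlers appear, and nothing compels any of them to honor it):

```
# Ask known AI training crawlers to stay out.
# Grouping several User-agent lines over one rule set is valid per RFC 9309.
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Allow: /
```

Note that this file must be maintained by hand: a crawler launched under a new name tomorrow is not covered until the publisher learns its user-agent string and adds it.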
This matters because it means content creators have zero meaningful control over whether their work becomes AI training data. A photographer who hosts a portfolio site, a journalist who publishes articles, or a developer who maintains technical documentation cannot prevent their work from being ingested. Google compounds the problem: publishers who opt out of Google's AI training crawlers also lose visibility in regular search results, because Google uses the same indexed content for both traditional search and AI Overviews. This creates a coercive dynamic where opting out of AI training means opting out of search traffic — the primary discovery mechanism for most websites.
The structural reason this persists is that the internet was built on a model of open access, and robots.txt was designed for a world where the worst-case scenario of ignoring it was getting indexed by a search engine. There is no technical enforcement mechanism — robots.txt is a polite request, not a locked door. Legal remedies are uncertain: the scope of the Computer Fraud and Abuse Act was narrowed by the Supreme Court's 2021 Van Buren decision, and scraping publicly available data may not violate any existing statute. The Really Simple Licensing (RSL) framework launched in September 2025 with 50+ publishers is an attempt to standardize licensing, but adoption is voluntary and does not solve the enforcement gap. Cloudflare launched an AI bot blocking tool in July 2025, but determined scrapers can still circumvent technical blocks.
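The "polite request" nature of the protocol is visible in how compliance is actually implemented. A well-behaved crawler parses robots.txt client-side and checks each URL before fetching it; the server never learns whether that check happened. A minimal sketch using Python's standard-library parser (the GPTBot user-agent and example URLs are illustrative):

```python
# Compliance with robots.txt is a client-side choice: the crawler must
# voluntarily parse the file and consult it before each fetch. A scraper
# that skips this step can still issue the same GET requests successfully.
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks one AI crawler and allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler asks before fetching...
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article")) # True
# ...but the answer is purely advisory. The web server has no way to
# verify that can_fetch() was ever called, and a non-compliant client
# simply fetches the URL regardless of the "False".
```

This is the enforcement gap in miniature: the entire mechanism lives in code the scraper controls, so "blocking" a crawler amounts to trusting it to block itself.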
Evidence
OpenAI removed robots.txt compliance language from ChatGPT-User crawler docs in December 2025 (https://ppc.land/openai-revises-chatgpt-crawler-documentation-with-significant-policy-changes/).
Multiple AI companies ignoring robots.txt (https://www.tomshardware.com/tech-industry/artificial-intelligence/several-ai-companies-said-to-be-ignoring-robots-dot-txt-exclusion-scraping-content-without-permission-report).
Google uses opted-out content for AI Overviews (https://www.niemanlab.org/2025/05/google-is-using-content-from-publishers-who-opt-out-of-other-ai-training-to-power-ai-overviews/).
RSL launched September 2025 with 50+ publishers (https://digiday.com/media/here-are-the-biggest-moments-in-ai-for-publishers-in-2025/).