

Robots.txt Won't Save You

The AI vacuum hard at work

Is it too late? The ever-hungry AI titans demand more and more data, vacuuming up the entirety of the internet. You just got off a boring Zoom call? Chomp.1 Checked Facebook for upcoming birthdays or said hello to a friend? Slurp.2 Took a peek at your friend’s recent vacation on Instagram? Mmmm.3 How about the latest news aggregated by X? Scrumptious.4 Afterward, you look for nearby restaurants on Google because you are getting hungry? Delicious.5 Every action on these platforms feeds the insatiable blob of AI training models.

But we aren’t functioning in our own spaces when feeding these AI models. As users of these platforms we are beholden to the profiteering whims of their owners. In the past this was funded indirectly with advertisements or directly with subscription fees, but further profit extraction now relies on gathering behavioral data to train AI. Yet the gathering of this data isn’t limited to the platforms themselves. There simply isn’t enough data. It needs more!6

Subsequently, leading AI companies have turned to the wider public internet, scraping data from message forums, news articles, blogs, Wikipedia pages, images, and whatever else they can find to feed and further refine AI models. But it hasn’t been enough. In 2021, OpenAI was running out of content to consume and created a speech recognition tool called Whisper for no other reason than to transcribe more than a million hours of YouTube videos to give itself more content.7 Anything reachable on the public internet is treated as available for consumption, even millions of copyrighted articles from The New York Times.8 But how is this massive trawling system technically performed?

When you want to see a web page you type its name in the address bar, hit enter, and it loads. Disregarding underlying systems like IP addresses and DNS for simplicity, the browser sends what is called an HTTP GET request. The process begins when you hit enter, which instructs the browser to send a GET request to the server to obtain whatever resource lives at the specified path, such as a web page or an image.

The request itself includes a URL, which states the resource’s location, and various headers, such as the User-Agent, which provides information about whoever is making the request. The server processes this request and responds with the requested resource, which might be a video, image, or text. Once the client receives the response, the browser renders the content for the user to view.

As an example, the below GET request is being sent to the website www.example.com and is requesting the resource at the path /path/to/resource. The User-Agent header shows that the client is using the Chrome web browser on Windows 10.

GET /path/to/resource HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive

But these GET requests can be easily automated as part of scripts or software. This is where web scrapers come into play. These are automated programs that systematically extract data from websites, allowing anyone to gather vast amounts of information efficiently and at massive scale. They navigate through web pages and collect relevant data points such as text, images, and links, parsing the HTML structure of each page to identify and extract the desired content.
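The mechanics above can be sketched in a few lines of standard-library Python: issue a GET request, then walk the returned HTML for links. This is only a skeleton; a real scraper would add politeness delays, error handling, and a crawl queue.

```python
# Minimal scraper sketch: fetch a page with an HTTP GET request,
# then parse its HTML and collect every link it contains.
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scrape_links(url):
    # urlopen issues an HTTP GET request, just like a browser does
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

Point `scrape_links` at any page and it returns the outbound links, which a crawler would then queue up and visit in turn.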

Additionally, there are no rules about what you have to state in the User-Agent. Yes, people just go on the internet and tell lies. A GET request could claim to come from an iPhone when in fact it is just software running on a Linux server. This matters because in an ideal world every client request would state the correct User-Agent so that website administrators could manage appropriate access. If there was a need to block the Cloudflare Always Online crawler, it could easily be done by denying, dropping, or ignoring any HTTP GET requests with the below User-Agent:

Mozilla/5.0 (compatible; CloudFlare-AlwaysOnline/1.0; +http://www.cloudflare.com/always-online)
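To illustrate how cheap the lie is, the sketch below (standard-library Python; the URL is a placeholder) builds a request that claims to be Chrome on Windows while running anywhere Python does.

```python
# A request that lies about its User-Agent: the server only sees
# the header we chose to send, not what is actually running.
from urllib.request import Request

FAKE_BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/91.0.4472.124 Safari/537.36"
)

req = Request(
    "https://www.example.com/path/to/resource",
    headers={"User-Agent": FAKE_BROWSER_UA},
)
# Passing req to urlopen() would send the forged header as-is.
```

Nothing in HTTP validates this header; any User-Agent-based access rule is trusting the client to tell the truth.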

One way website owners can manage User-Agent access is through the contents of the robots.txt file. Located in the root directory of a website, this file serves as a guideline for web crawlers and scrapers on how to interact with the site’s content, specifying which parts of the site should be crawled and indexed and which should be ignored. If you didn’t want Google indexing your website for any reason, this is the file where you would say so. By setting rules for user agents, website administrators can control the access of web crawlers, protecting sensitive information, managing server load, and optimizing the indexing process. The robots.txt file helps balance accessibility and privacy, ensuring that only the desired content is made available to search engines and other automated agents.
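As a hypothetical illustration, a robots.txt that bars two named crawlers from the entire site while leaving it open to everyone else might look like this (the agent names here are stand-ins for whichever bots you want to exclude):

```
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
```

Each block pairs a User-agent with the paths it may or may not visit; `Disallow: /` covers the all-encompassing root directory, i.e. the whole site.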

HTTP GET requests do not have to obey the contents of robots.txt. The file is a convention rather than a strict protocol. Compliance is voluntary and based on the good behavior of web crawlers. It serves purely as a guideline for well-behaved crawlers and robots, instructing them on which parts of a website they are allowed to crawl and index. Malicious bots or poorly designed crawlers might ignore the robots.txt file entirely and access areas of a website they were instructed not to visit.
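Python’s standard library even ships a helper for well-behaved crawlers, which makes the voluntary nature of the system easy to see: the sketch below (with made-up rules) checks permission before fetching, but nothing stops a crawler from simply skipping the check.

```python
# A compliant crawler consults robots.txt before fetching. The check
# is entirely self-imposed: deleting it is a one-line change.
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse(
    """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()
)

# The rules say GPTBot may not fetch this page...
assert not rules.can_fetch("GPTBot", "https://example.com/post")
# ...while every other agent may.
assert rules.can_fetch("SomeOtherBot", "https://example.com/post")
```

A polite crawler calls `can_fetch` before every request; an impolite one gets the exact same HTTP responses without ever reading the file.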

The reasons for filtering certain User-Agent requests are endless. In the context of AI crawlers, it can be used to turn away well-behaved web scrapers that state the correct User-Agent information in their requests. The New York Times started solidifying its prohibition on using its content to train AI models in the summer of 2023.9 This is expectedly pragmatic: their primary product is selling publicly available text and images, which has necessitated the crafting of a strong robots.txt file. As of July 12, 2024 it discourages 26 User-Agents from the all-encompassing root directory of /. The popular site Reddit recently took the same measures but went further and disallowed everyone10 in its robots file. This is because it sold user-built data directly to Google for the tidy sum of $60 million.11

The sentiment of using a robustly defined robots.txt file to protect a site has spread to more independent platforms like personal blogs. Roelant, Neil Clarke, and Cory Dransfeldt have all shared their experiences updating their robots.txt files to block AI crawlers. The makers of the Pika blogging platform even baked the hardened settings into everyone’s robots.txt.

While I applaud these website fortifications, I have several fundamental doubts about their efficacy. The primary one concerns the truthfulness of the User-Agent field in requests. While many AI companies set this field correctly when crawling, there are cases of it being obfuscated or ignored.

In June of 2024, Robb Knight found that the Perplexity web scraper bot was not declaring the appropriate User-Agent in its requests.12 The expected User-Agent designation, PerplexityBot, was missing, and the information actually provided pointed to a user on Windows 10 using Chrome. The tech magazine Wired later confirmed Mr. Knight’s findings and went further, finding that the Perplexity bot was frequently inaccurate and crawled from undocumented IP addresses.13

My own experience with OpenAI’s ChatGPT has left me with a similar apprehension about supposed User-Agent settings. My robots file specifically blocks User-agent: ChatGPT-User, which is supposedly used when performing live tasks for human users. Yet when I asked the 4o model to summarize one of my recent blog posts, it did so without any issue. Pressing ChatGPT on this point produced a variety of half-truth answers. The immediate response stated:

I accessed the content using standard browsing tools without specifying a particular user-agent setting. This allows the browsing tool to simulate regular web browsing behavior, enabling the retrieval of information from most websites.

Then I asked it to be specific.

I don’t have the capability to specify or reveal the user-agent settings used during web browsing. My responses are based on the content retrieved through standard web browsing tools provided by OpenAI.

Later inquiries state the below agent is used when accessing websites.

When accessing websites, the user agent string typically used by my browsing tool is designed to mimic a modern web browser. It usually looks like this: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36

It then goes on to explain why using that particular agent is helpful for various reasons. One of those reasons stood out to me.

Avoid detection and blocking by websites that might restrict access based on non-standard user agents.

I haven’t seen any news articles about ChatGPT 4o specifically bypassing the instructions of robots.txt. I also have no real way of pulling connection logs, as this blog is hosted on Cloudflare Pages with almost no tracking. Additionally, the agent settings ChatGPT shared might very well be a fluke: a dreamed-up hallucination of the kind that plagues all AI models, or a misinterpretation of my query. It could also be relying on a cached version of my blog from Bing. But the method doesn’t really matter, because the result is that even with a specific statement in robots.txt my blog post was gobbled up.

Another point is the sheer breadth and dynamism of what website administrators are facing. As billions pour into the AI sector, how are we supposed to keep up with this or that new company that formed over the course of a few days? New web scraper companies might form in the fringes of the AI industry, built entirely to extract data wherever possible; they could both legally insulate the main AI models and satiate their hunger. The heightened maintenance required to keep robots.txt updated is not sustainable for most companies or independent creatives. Then we are faced with the prior existential question of whether it’s even effective. Perplexity’s behavior and ChatGPT 4o accessing my website are only compounded by the recent revelation from Reuters that multiple AI companies are ignoring robots.txt.14

However, we shouldn’t give up hope in the face of these daunting AI vacuums. Services are popping up, like Dark Visitors, that introduce a dynamic component to the contents of your robots.txt file. Another easy method is frequently checking The New York Times website’s robots.txt for the most up-to-date and relevant agents to block. Whether this approach is pragmatic or merely what Matthew Butterick calls a feel-good theatrical gesture15 is up to the organization or individual to decide.
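That borrow-from-the-Times approach can be partially automated. The sketch below, using only Python’s standard library, naively pulls the User-agent names out of a robots.txt body so they can be folded into your own file; the fetching function assumes the conventional /robots.txt location.

```python
# Borrow another site's blocklist: fetch its robots.txt and list
# every User-agent it names. Parsing is deliberately naive.
from urllib.request import urlopen


def blocked_agents(robots_txt):
    """Return the User-agent values declared in a robots.txt body."""
    agents = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "user-agent":
            agents.append(value.strip())
    return agents


def fetch_blocklist(url="https://www.nytimes.com/robots.txt"):
    # robots.txt conventionally lives at the site root
    with urlopen(url) as response:
        return blocked_agents(response.read().decode("utf-8"))
```

Run periodically, this gives a fresh list of agents to compare against your own robots.txt, though it inherits the same limitation: it only names the bots a given site has already noticed.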

For a more aggressive approach that doesn’t rely on the power of suggestion, you could use a web application firewall (WAF) to outright drop bot traffic. This requires some technical inclination and more care than simply updating a text file. However, ready-made services like Cloudflare’s Bot Categories16 are available for those who realize robots.txt isn’t enough. Cloudflare’s approach uses WAF rules to allow or block certain web scrapers at the network layer, which produces better enforcement because undesirable traffic is dropped before it ever reaches the site.
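For a sense of what that looks like, a Cloudflare custom rule matches requests with a filter expression and applies an action such as Block. A hypothetical expression matching two example crawler User-Agents (the agent names are illustrative, not a complete list) might read:

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "PerplexityBot")
```

Unlike robots.txt, a matching request never reaches the origin at all, though it still only catches bots that tell the truth about who they are.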

More nuanced approaches to preventing wholesale AI theft are quickly gaining in popularity. Tools like Nightshade and Glaze17 can poison data such as images as it is ingested by AI models. Everything looks normal to human eyes, but to the AI the style, content, and interpretation are completely askew, so the model produces an extremely flawed or simply wrong output. Yet researchers found a flaw in Glaze in early July.18 The flaw was quickly patched,19 but it reveals an ongoing arms race between creators and the AI titans devouring their work. The unfortunate takeaway is that creators need to be ever vigilant, while AI crawlers only need a small lapse for content to be available before it goes into the deep and mysterious belly of AI training models.


  1. Cristiano Lima-Strong and David DiMolfetta, “Zoom’s Privacy Tweaks Stoke Fears That Its Calls Will Be Used to Train AI,” Washington Post, August 8, 2023, https://www.washingtonpost.com/politics/2023/08/08/zooms-privacy-tweaks-stoke-fears-that-its-calls-will-be-used-train-ai/↩︎

  2. James Vincent, “Facebook’s Next Big AI Project Is Training Its Machines on Users’ Public Videos,” The Verge, March 12, 2021, https://www.theverge.com/2021/3/12/22326975/facebook-training-ai-public-videos-digital-memories↩︎

  3. Geoffrey A. Fowler, “Your Instagrams Are Training AI. There’s Little You Can Do About It.,” Washington Post, September 28, 2023, https://www.washingtonpost.com/technology/2023/09/08/gmail-instagram-facebook-trains-ai/↩︎

  4. Sarah Perez, “X’S Privacy Policy Confirms It Will Use Public Data to Train AI Models,” TechCrunch, September 6, 2023, https://techcrunch.com/2023/09/01/xs-privacy-policy-confirms-it-will-use-public-data-to-train-ai-models/↩︎

  5. Ben Schoon, “Google’s Updated Privacy Policy Doubles Down on Using Your Data for Training AI,” 9to5Google, July 3, 2023, https://9to5google.com/2023/07/03/google-privacy-policy-ai-training-data/↩︎

  6. Jared Kaplan et al., “Scaling Laws for Neural Language Models,” arXiv, January 23, 2020, https://arxiv.org/pdf/2001.08361↩︎

  7. Cade Metz et al., “How Tech Giants Cut Corners to Harvest Data for A.I.,” The New York Times, April 9, 2024, https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html↩︎

  8. Michael M. Grynbaum and Ryan Mac, “New York Times Sues OpenAI and Microsoft Over Use of Copyrighted Work,” The New York Times, December 27, 2023, https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html↩︎

  9. Jay Peters and Wes Davis, “The New York Times Blocks OpenAI’s Web Crawler,” The Verge, August 21, 2023, https://www.theverge.com/2023/8/21/23840705/new-york-times-openai-web-crawler-ai-gpt↩︎

  10. Alex Heath, “Reddit Blocks AI Bots From Crawling Its Website,” The Verge, June 25, 2024, https://www.theverge.com/2024/6/25/24185984/reddit-robots-txt-fight-ai-bots-scraping-crawlers↩︎

  11. Jason Koebler, “Google Is Paying Reddit $60 Million for Fucksmith to Tell Its Users to Eat Glue,” 404 Media, May 23, 2024, https://www.404media.co/google-is-paying-reddit-60-million-for-fucksmith-to-tell-its-users-to-eat-glue/↩︎

  12. Robb Knight, “Perplexity AI Is Lying About Their User Agent,” Robb Knight (blog), June 15, 2024, https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/↩︎

  13. Dhruv Mehrotra and Tim Marchman, “Perplexity Is a Bullshit Machine,” WIRED, June 19, 2024, https://www.wired.com/story/perplexity-is-a-bullshit-machine/↩︎

  14. Katie Paul, “Exclusive: Multiple AI Companies Bypassing Web Standard to Scrape Publisher Sites, Licensing Firm Says,” Reuters, June 21, 2024, https://www.reuters.com/technology/artificial-intelligence/multiple-ai-companies-bypassing-web-standard-scrape-publisher-sites-licensing-2024-06-21/↩︎

  15. Matthew Butterick, “AI Scraping & Publicly Available Web Data,” Matthew Butterick (blog), June 22, 2024, https://matthewbutterick.com/chron/ai-scraping-and-publicly-available-web-data.html↩︎

  16. Reid Tatoris and Pawel Klimek, “Easily Manage AI Crawlers With Our New Bot Categories,” The Cloudflare Blog, June 14, 2024, https://blog.cloudflare.com/ai-bots/↩︎

  17. Benj Edwards, “University of Chicago Researchers Seek to ‘Poison’ AI Art Generators With Nightshade,” Ars Technica, October 25, 2023, https://arstechnica.com/information-technology/2023/10/university-of-chicago-researchers-seek-to-poison-ai-art-generators-with-nightshade/↩︎

  18. Robert Hönig, Javier Rando, Nicholas Carlini, and Florian Tramèr, “Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI,” arXiv, June 17, 2024, https://arxiv.org/pdf/2406.12027↩︎

  19. Ashley Belanger, “Tool Preventing AI Mimicry Cracked; Artists Wonder What’s Next,” Ars Technica, July 5, 2024, https://arstechnica.com/tech-policy/2024/07/glaze-a-tool-protecting-artists-from-ai-bypassed-by-attack-as-demand-spikes/↩︎