AI bots ignore llms.txt but scan the internet at scale – 2 studies, 1 conclusion

For some time now, there has been discussion in the marketing industry about the need to structure content for language models. The proposed solution was the llms.txt file, which is intended to act as a guide for AI bots, providing them with clean, easy-to-process context about a given page in Markdown format.


Google Trends – llms.txt popularity in Google

The following data comes from Google Trends [1].

The Google Trends data shows spikes in interest, with a global peak in March 2026 and a sharp (though short-lived) surge in Poland at the end of 2025. However, these impressive charts may be nothing more than a fleeting curiosity. In this case, the popularity chart is not a measure of the technology’s success, but merely a record of collective hope and hype that has absolutely nothing to do with the interest of AI giants.

This is my second study, and this time I approached it more comprehensively in order to dispel any remaining doubts (if anyone still had them) that the llms.txt file makes absolutely no sense.

First study

I conducted the first study from 13.05.2025 to 01.09.2025, and the statistics look as follows:

  • Dataprovider – 1582x
  • Some custom ones – 1332x
  • Regular user – 11x
  • python-requests – 10x
  • Screaming Frog – 8x
  • Fake Googlebot – 2x
  • Semrush – 2x
I wrote about it on my LinkedIn (post in Polish).

Second study (and the last one) related to llms.txt

I analyzed server logs from the last 191 days, covering ~900 domains. The data comes from the period 04.09.2025 – 13.04.2026, that is, from the beginning of September 2025 to mid-April 2026.

llms.txt is a proposed standard that no one uses

I started by checking how often files related to the new standard are requested, namely:

  • /llms.txt
  • /llms-full.txt
  • /llms-ctx.txt

Over more than half a year – let me remind you, across ~900 domains – I recorded only 1227 requests for these files (on average about 6 requests per day). This traffic concerned 107 domains. The most frequent path was the standard /llms.txt, which accounted for 1215 of those requests.

| File / path | Number of requests |
| --- | --- |
| /llms.txt | 1215 |
| /llms-full.txt | 9 |
| /docs/llms.txt | 1 |
| /api/llms.txt | 1 |
| /.well-known/llms.txt | 1 |
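A count like the one in the table can be reproduced from raw access logs. The sketch below is only an illustration, not the script used in the study: it assumes logs in the common "combined" format, and the sample lines, IP addresses, and scanner name are hypothetical.

```python
import re
from collections import Counter

# Request line inside a combined-log-format entry, e.g. "GET /llms.txt HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

# File names from the proposed standard (the three listed in the article)
LLMS_FILES = ("llms.txt", "llms-full.txt", "llms-ctx.txt")


def count_llms_requests(log_lines):
    """Count requests whose path ends with one of the llms.txt file names."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a GET/HEAD request line we recognise
        path = match.group(1).split("?", 1)[0]  # drop any query string
        if path.rsplit("/", 1)[-1] in LLMS_FILES:
            counts[path] += 1
    return counts


# Hypothetical sample lines, not taken from the real logs
sample_log = [
    '203.0.113.7 - - [05/Sep/2025:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 404 153 "-" "ExampleScanner/1.0"',
    '203.0.113.8 - - [05/Sep/2025:10:01:00 +0000] "GET /llms.txt?x=1 HTTP/1.1" 404 153 "-" "ExampleScanner/1.0"',
    '203.0.113.9 - - [05/Sep/2025:10:02:00 +0000] "GET /docs/llms.txt HTTP/2.0" 404 153 "-" "ExampleScanner/1.0"',
    '203.0.113.9 - - [05/Sep/2025:10:03:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleScanner/1.0"',
]

llms_counts = count_llms_requests(sample_log)
for path, hits in llms_counts.most_common():
    print(path, hits)
```

Matching on the file name rather than the full path is a deliberate choice here, so that variants such as /docs/llms.txt or /.well-known/llms.txt are counted as well.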

Who is requesting these files?

Among the requesters there was not a single real AI bot. Instead of giants training their models (such as OpenAI, Anthropic, or Google), llms.txt is mainly of interest to:

  • Data aggregators and scanners – Dataprovider.com was responsible for the lion’s share of traffic (794 requests). There is also activity from tools such as AI-Security-Scanner, ReconTool, and SiteAuditBot.
  • People – Chrome (392 requests) and Firefox indicate that it was most likely administrators, researchers, or SEO auditors manually checking for the presence of this file on servers.
  • Simple scripts – llmstxtcrawler or robots-ai-permissions, which, judging by the User-Agent, turned out to be a Python script.

Requester details

| Client / Bot | Number of requests | Type / Purpose |
| --- | --- | --- |
| Dataprovider | 794 | Data aggregator / Analytical crawler |
| Chrome | 392 | Web browser (human/script) |
| llmstxtcrawler | 12 | Script dedicated to scanning llms.txt |
| AI-Security-Scanner | 8 | Security scanner |
| ReconTool | 5 | Audit tool |
| SiteAuditBot | 5 | Semrush bot |
| Googlebot (fake) | 4 | Impersonating Googlebot |
| Firefox | 3 | Web browser (human) |
| robots-ai-permissions | 2 | Script (Python) |
| DomainShield | 1 | Protection tool |
| Bingbot | 1 | Search engine crawler (Microsoft) |
| TOTAL | 1227 | |
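Grouping requesters like this usually comes down to matching substrings in the User-Agent header. The following is a minimal sketch under that assumption. The rule list and sample User-Agent strings are illustrative, not the classification used in the study, and the header can be spoofed (as the fake Googlebot entries show), so the labels are only a first approximation.

```python
from collections import Counter

# (substring, label) pairs checked in order; bot names come before browsers
# because many bots also carry browser-like tokens in their User-Agent.
# This rule list is illustrative only.
UA_RULES = [
    ("dataprovider", "Data aggregator"),
    ("llmstxtcrawler", "llms.txt script"),
    ("python-requests", "Script (Python)"),
    ("gptbot", "AI crawler"),
    ("claudebot", "AI crawler"),
    ("firefox", "Web browser"),
    ("chrome", "Web browser"),
]


def classify_user_agent(user_agent):
    """Return the label of the first rule whose substring occurs in the header."""
    ua = user_agent.lower()
    for needle, label in UA_RULES:
        if needle in ua:
            return label
    return "Unknown"


# Hypothetical User-Agent values for demonstration
sample_agents = [
    "Mozilla/5.0 (compatible; Dataprovider.com)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "python-requests/2.31.0",
    "curl/8.5.0",
]

requester_types = Counter(classify_user_agent(ua) for ua in sample_agents)
print(requester_types)
```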

Daily trend and hourly distribution
The request trend charts confirm that we are mainly dealing with mechanical, automated scans here. The traffic is small (peaks reach only 20-25 requests per day), and the hourly distribution is fairly flat and even throughout the day. There is no trace here of organic, massive interest from LLM crawlers.

Real AI traffic, or 45 million requests in the background

Someone might argue that AI bots do not visit the sites on which I conducted the study at all. Well, while llms.txt collected just over a thousand requests, the overall traffic from bots associated with AI amounted to nearly 45 million requests during the same time! Yes, to be precise, 44,996,657 – that is exactly how many times AI of various kinds scanned the sites during the analyzed period. I identified a total of 88 unique bots, which gives an astronomical average of over half a million requests per bot.

Breakdown of all companies associated with AI crawlers

So who consumes the most resources?

1. OpenAI

Looking at the breakdown by company, OpenAI is the absolute leader. It generates over 25% of all AI traffic in my study (more than 11.5 million requests). This is driven by bots such as GPTBot (almost 8.8 million requests – number 1 in the overall ranking), OAI-SearchBot, and ChatGPT-User.

2. Anthropic

In second place is Anthropic (the creators of Claude) with just under 6 million requests, mainly due to the aggressive ClaudeBot.

3. PetalBot

Next is PetalBot, which stands out in the Top 15 bots ranking: it takes second place there with nearly 8.3 million requests. PetalBot is a crawler belonging to Huawei (linked to its Petal Search engine and AI development). It is worth keeping this in mind, because administrators often accuse it of very aggressive behavior and of overloading servers.

4. Big tech is not far behind

Meta is responsible for nearly 3 million requests (mainly meta-externalagent), and the top group also includes Amazon’s bot (Amazonbot with nearly 4.4 million) and Apple’s (Applebot with about 2.6 million).

5. Google scans too!

Google also has its share, although it is low, at just under 170 thousand requests (e.g. GoogleOther, Google-NotebookLM, Gemini-Deep-Research). This is probably because Google may largely use data gathered earlier by the main Googlebot to train its models (which in fact is not a pure AI crawler).

A collective look at big tech

A collective look at the tech giants leaves no illusions about who is downloading the most data from our sites:

| LLM creator / Organization | Total number of requests | Share of total traffic |
| --- | --- | --- |
| Other (remaining bots) | 24,444,255 | ~54.3% |
| OpenAI (ChatGPT) | 11,521,228 | ~25.6% |
| Anthropic (Claude) | 5,923,626 | ~13.2% |
| Meta (Llama) | 2,939,423 | ~6.5% |
| Google (Gemini) | 168,125 | ~0.4% |
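The shares in the table follow directly from the request counts. As a quick check of the arithmetic, the snippet below recomputes the total, each company's percentage, and the average per bot; the figures are copied from the article, and only the variable names are mine.

```python
# Request counts per organization, as reported in the table
requests_by_company = {
    "Other (remaining bots)": 24_444_255,
    "OpenAI (ChatGPT)": 11_521_228,
    "Anthropic (Claude)": 5_923_626,
    "Meta (Llama)": 2_939_423,
    "Google (Gemini)": 168_125,
}

# Sum of all rows; should equal the overall figure of 44,996,657
total_requests = sum(requests_by_company.values())

# Share of each organization in percent, rounded to one decimal place
shares = {
    name: round(100 * count / total_requests, 1)
    for name, count in requests_by_company.items()
}

# Average number of requests per bot, given the 88 unique bots identified
unique_bots = 88
average_per_bot = total_requests // unique_bots

print(f"Total: {total_requests:,}")
for name, share in shares.items():
    print(f"{name}: {share}%")
print(f"Average per bot: {average_per_bot:,}")
```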

TOP15 AI crawlers

Here is the ranking of the 15 greediest AI crawlers I identified in the logs (based on an analysis of nearly 45 million requests):

| Place | Bot name | Total number of requests |
| --- | --- | --- |
| 1 | GPTBot (OpenAI) | 8,798,505 |
| 2 | PetalBot (Huawei) | 8,291,994 |
| 3 | ClaudeBot (Anthropic) | 5,921,228 |
| 4 | Amazonbot (Amazon) | 4,361,437 |
| 5 | Applebot (Apple) | 2,597,117 |
| 6 | LinkupBot | 2,462,636 |
| 7 | meta-externalagent (Meta) | 2,331,582 |
| 8 | IbouBot | 1,719,613 |
| 9 | OAI-SearchBot (OpenAI) | 1,457,764 |
| 10 | LCC | 1,403,196 |
| 11 | ChatGPT-User (OpenAI) | 1,264,907 |
| 12 | Bytespider (ByteDance/TikTok) | 1,129,001 |
| 13 | TerraCotta | 550,077 |
| 14 | Awario | 510,164 |
| 15 | spider | 354,905 |
TOP15 crawler table

Summary and conclusions

My study based on server data debunks (at least as of the publication date) the myth of the usefulness of llms.txt. Despite the huge and constantly growing traffic from AI bots, the technology giants have not widely implemented reading of this standard. They prefer to fetch and analyze the full HTML code "the old way".

What does this mean in practice?

  1. Do not waste your time – creating and maintaining llms.txt files is currently art for art’s sake. Instead, check your site technically and make sure the most important content does not depend on JavaScript to be displayed. AI crawlers generally do not render JavaScript, so such content may be invisible to them.
  2. Monitor logs – your servers are probably constantly being bombarded by GPTBot, PetalBot, and ClaudeBot. Server logs are a huge source of knowledge about who visits your sites, including Googlebot.
  3. Manage access – if you notice performance drops on your server, instead of creating useless but structured guides for AI, consider managing their traffic in a traditional robots.txt file or completely blocking the most resource-hungry crawlers, of course if you do not see any benefit from being in their training datasets 😉
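For the third point, a robots.txt rule set can be checked before deployment with Python's standard `urllib.robotparser`. The example below is a sketch under assumptions: the rules, the choice of blocked bots, and the example.com URL are hypothetical, and robots.txt is only advisory, so a crawler that ignores it has to be blocked at the server or firewall level instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two resource-hungry crawlers, allow the rest
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent, url="https://example.com/blog/post"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


for bot in ("GPTBot", "PetalBot", "ClaudeBot", "Googlebot"):
    print(bot, "allowed" if is_allowed(bot) else "blocked")
```

With these rules GPTBot and PetalBot are reported as blocked, while ClaudeBot and Googlebot fall through to the wildcard group and remain allowed.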

The llms.txt file is nothing more than a curiosity, which I think everyone scans except the real AI bots and Googlebot. Even if Googlebot did request it, it must have come across the file somewhere: text files are on the list of file types that Google indexes [2], but Googlebot does not go looking for llms.txt on its own.

  1. Google Trends is a free tool by Google that shows how often specific queries are entered into the search engine, presenting the relative popularity of topics on a chart on a scale from 0 to 100. Link: https://trends.google.com/
  2. File types indexed by Google, https://developers.google.com/search/docs/crawling-indexing/indexable-file-types
