17 Mar 2026
A long time ago I set up a toy content farm, which turned out to be very popular with web spiders, particularly the ones from AI companies. To help their training process, rather than just pages of links, it now has paragraphs of training text.
Originally I hosted the content farm on my main server, which collapsed from overload when the AI scrapebots found it, so I had to turn it off. In November I restarted it on a dedicated virtual server, which is quite overloaded by all the requests but manages to respond to about 25 requests/second.
Since November the ClaudeBot, which finds material to train Anthropic's LLM, has visited over 48 million times, and their SearchBot, which finds material to respond to questions, 18.7 million times. Meta's "facebookexternalhit" bot has visited 38 million times. Meta claims that it's crawling web sites that were shared on Facebook or Instagram, but given how lame the site is, it's hard to believe that anyone or anything would share it 38 times, much less 38 million times. There are also 6.8 million hits from Meta's web indexer bot, which is more plausible. The next most eager spider is DotBot, from an SEO company, with 5.7 million visits.

As a baseline, the Googlebot has only visited 648,000 times. I have made no effort to disguise this toy content farm as anything other than what it is. The pages all look the same, and they're all on the same IP address. Google had no trouble figuring out what's going on. (648K visits to a site with over 7 billion potential pages is pretty moderate.) I hope Anthropic and Meta are getting good value from all the wisdom on these pages, and I suppose I should see whether Claude now seems to know about it.
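For a sense of scale, here is a rough sketch of the arithmetic behind that comparison. It uses only the rounded counts quoted above; the dictionary labels, the function names, and the choice to treat "over 48 million" and "over 7 billion" as exactly 48 million and 7 billion are assumptions made for illustration, so the results are approximations rather than measurements.

```python
# Visit counts as quoted in the post, rounded. "Over 48 million" is
# treated as exactly 48 million (an assumption for this sketch).
visits = {
    "ClaudeBot": 48_000_000,
    "facebookexternalhit": 38_000_000,
    "SearchBot (Anthropic)": 18_700_000,
    "Meta web indexer": 6_800_000,
    "DotBot": 5_700_000,
    "Googlebot": 648_000,
}

# The post says "over 7 billion potential pages", so dividing by exactly
# 7 billion gives an upper bound on the share of pages a bot could have seen.
POTENTIAL_PAGES = 7_000_000_000


def ratio_to_baseline(counts, baseline="Googlebot"):
    """Return each bot's visit count as a multiple of the baseline bot's."""
    base = counts[baseline]
    return {name: n / base for name, n in counts.items()}


def coverage_upper_bound(counts, pages=POTENTIAL_PAGES):
    """Return visits divided by potential pages (assumes no page is fetched twice)."""
    return {name: n / pages for name, n in counts.items()}


if __name__ == "__main__":
    ratios = ratio_to_baseline(visits)
    cover = coverage_upper_bound(visits)
    for name in sorted(visits, key=visits.get, reverse=True):
        print(
            f"{name:22s} {visits[name]:>12,d}  "
            f"{ratios[name]:6.1f}x Googlebot  "
            f"at most {cover[name]:.4%} of pages"
        )
```

On these rounded figures ClaudeBot comes out at roughly 74 times the Googlebot's visit count, and even that is under 0.7% of the potential pages, while the Googlebot's share is under 0.01%.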
© 2005-2024 John R. Levine.
CAN SPAM address harvesting notice: the operator of this website will
not give, sell, or otherwise transfer addresses maintained by this
website to any other party for the purposes of initiating, or enabling
others to initiate, electronic mail messages.