Internet and e-mail policy and practice
including Notes on Internet E-mail



17 Mar 2026

AI scrapebot update

A long time ago I set up a toy web farm, which turned out to be very popular with web spiders, particularly the ones from AI companies. To help their training process, rather than just pages of links, it now has paragraphs of training text.

The content farm itself is just pages of links with pseudo-randomly selected baby names. The text comes from what I would call a very small language model. Once a day I scrape a set of blogs and make a table of the relative frequency of four-word sequences. Each page has sentence-like things randomly generated from the frequency table, such as "The government argued that the occupying force was too small a fish to fry, the plaintiff testified her feelings were hurt, not her reputation." With more effort I could probably produce better text, but that's not the point.
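For the curious, a four-word frequency table of this kind can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code the site runs: the function names `build_table` and `generate` are made up here, and I am assuming the table is used by taking the last three words as context and picking the next word in proportion to how often it followed them.

```python
import random
from collections import Counter, defaultdict


def build_table(text, n=4):
    """Count every n-word sequence, keyed by its first n-1 words."""
    words = text.split()
    table = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        *context, nxt = words[i:i + n]
        table[tuple(context)][nxt] += 1
    return table


def generate(table, length=30, rng=None):
    """Emit up to `length` words, each chosen by relative frequency."""
    if not table:
        return ""
    rng = rng or random.Random()
    # Start from a randomly chosen context that occurs in the source text.
    context = rng.choice(sorted(table))
    k = len(context)
    out = list(context)
    while len(out) < length:
        followers = table.get(tuple(out[-k:]))
        if not followers:
            break  # dead end: this context only appeared at the very end
        choices, weights = zip(*sorted(followers.items()))
        out.append(rng.choices(choices, weights=weights)[0])
    return " ".join(out)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat sat on the rug"
    print(generate(build_table(sample), length=12))
```

Every run of four consecutive words in the output occurs somewhere in the source text, which is why the results look locally plausible while wandering off topic over longer stretches.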

Originally I hosted the content farm on my main server, which collapsed from overload when the AI scrapebots found it, so I had to turn it off. In November I restarted it on a dedicated virtual server, which is quite overloaded by all the requests but manages to respond to about 25 requests/second.


The logs show that Meta and Anthropic find my content farm extremely interesting.

Since November the ClaudeBot, which finds material to train Anthropic's LLM, has visited over 48 million times, and their SearchBot, which finds material to respond to questions, 18.7 million times.

Meta's "facebookexternalhit" bot has visited 38 million times. Meta claims that it's crawling web sites that were shared on Facebook or Instagram, but given how lame the site is, it's hard to believe that anyone or anything would share it 38 times, much less 38 million times. There are also 6.8 million hits from their web indexer bot, which is more plausible.

The next most eager spider is DotBot, 5.7 million times, from an SEO company. As a baseline, the Googlebot has only visited 648,000 times.
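To put the counts above on one scale, here is a small Python tally that expresses each bot as a multiple of the Googlebot baseline. The figures are the ones quoted in this post; "over 48 million" is treated as exactly 48 million, so that row is a lower bound. The dictionary labels and the `summarize` helper are just names chosen for this sketch, and the shares cover only the six bots listed here, not all traffic to the site.

```python
# Visit counts quoted in the post (hits since November).
visits = {
    "ClaudeBot": 48_000_000,           # "over 48 million", so a lower bound
    "Anthropic SearchBot": 18_700_000,
    "facebookexternalhit": 38_000_000,
    "Meta web indexer": 6_800_000,
    "DotBot": 5_700_000,
    "Googlebot": 648_000,
}


def summarize(counts, baseline_bot="Googlebot"):
    """Return (bot, hits, multiple of baseline, share of listed hits), largest first."""
    baseline = counts[baseline_bot]
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(bot, hits, hits / baseline, hits / total) for bot, hits in ordered]


if __name__ == "__main__":
    for bot, hits, multiple, share in summarize(visits):
        print(f"{bot:<20} {hits:>11,} {multiple:6.1f}x Googlebot {share:6.1%}")
```

By this tally the two Anthropic bots together account for roughly 100 times as many visits as the Googlebot.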

I have made no effort to disguise this toy content farm as anything other than what it is. The pages all look the same, and they're all on the same IP address. Google had no trouble figuring out what's going on. (648K visits to a site with over 7 billion potential pages is pretty moderate.)

I hope Anthropic and Meta are getting good value from all the wisdom on these pages, and I suppose I should see whether Claude now seems to know about it.


  posted at: 13:48
Stable link is https://jl.ly/Internet/scrapeup.html

© 2005-2024 John R. Levine.
CAN SPAM address harvesting notice: the operator of this website will not give, sell, or otherwise transfer addresses maintained by this website to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages.