13 Jul 2012
Bing is Microsoft's newish search engine, whose name I am reliably informed stands for Bing Is Not Google.
A couple of months ago, as an experiment, I put up a one-page link farm at wild.web.sp.am. As should be apparent after about three seconds of clicking on the links there, each page has links to 12 other pages, with each page's host name made of three names, like http://aaron.louise.celia.web.sp.am. The pages are generated by a small Perl script and a database of a thousand first names. All the pages have the same IP address, although there are about a billion possible host names (1000 cubed, since there are three names in each host name). I forgot about it until earlier this week, when the disk with my web logs filled up.
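As a rough illustration of how such a generator might work, here is a minimal Python sketch. It is not the Perl script mentioned above: the short `NAMES` list stands in for the thousand-name database, and the function names and HTML layout are my own assumptions.

```python
import random

# Hypothetical stand-in for the database of a thousand first names.
NAMES = ["aaron", "louise", "celia", "boris", "dana", "edgar"]
BASE_DOMAIN = "web.sp.am"
LINKS_PER_PAGE = 12


def random_host(rng):
    """Build a host name from three names, e.g. aaron.louise.celia.web.sp.am."""
    labels = [rng.choice(NAMES) for _ in range(3)]
    return ".".join(labels + [BASE_DOMAIN])


def render_page(host, rng):
    """Return a minimal HTML page for `host` with 12 links to generated hosts."""
    links = [random_host(rng) for _ in range(LINKS_PER_PAGE)]
    items = "\n".join(f'<li><a href="http://{h}/">{h}</a></li>' for h in links)
    return f"<html><body><h1>{host}</h1>\n<ul>\n{items}\n</ul></body></html>"


def possible_hosts(name_count):
    """Three independent name slots give name_count ** 3 host names."""
    return name_count ** 3
```

With a thousand names, `possible_hosts(1000)` gives the billion host names, all of which can be answered by one script on one IP address.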
My web logs are normally 10 to 15 megabytes a week, but all of a sudden the logs ballooned past a gigabyte. A quick look at the logs revealed that my web server was getting hammered by the bingbot.
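A quick look of this kind can be as simple as counting requests per user agent. The sketch below assumes the logs are in the common "combined" format, where the user agent is the last quoted field on each line; the function name is made up for illustration.

```python
import re
from collections import Counter

# In the combined log format, each line ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def hits_by_agent(lines):
    """Count log lines per user-agent string, skipping lines that don't parse."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it an open log file and printing `hits_by_agent(f).most_common(5)` would show at a glance which client dominates the traffic.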
Every search engine has a "spider" or "bot" that visits web pages to collect data for its index. It's quite normal to see a fair number of log entries from bots as various search engines wander around your web pages looking to see what's changed.
But it was not normal to see the bingbot hammering on my link farm, ten queries a second, day after day. When I noticed it, the bingbot had already visited about 15 million times, fetching 15 million nearly identical pages. I added a robots.txt file, telling bingbot to go away. It didn't help, which wasn't that surprising: since each page is in a different domain, each page could hypothetically have its own different robots file, so while the robots file should stop future indexing, it won't affect any pages that Bing had queued up from previous visits. How many did it have queued up? A lot. Bing scooped up over a million copies of the robots file, at which point I adjusted the web server configuration to return an error page when the bingbot tried to fetch a link farm page, but to return the robots file normally. That still didn't help. It fetched a lot of robots files and a lot of error pages, I think for different domains.
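The server-side rule described here can be sketched in a few lines of Python. This is an illustration, not my actual web server configuration: the robots file contents, the 403 status, and the function names are assumptions. The second function uses the standard library's robots parser to check what the file means to a well-behaved crawler.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of the robots file: tell bingbot to stay out entirely.
ROBOTS_TXT = "User-agent: bingbot\nDisallow: /\n"


def respond(path, user_agent):
    """Return (status, body) for one request to any link farm host."""
    if path == "/robots.txt":
        # The robots file is served normally to everyone, bingbot included.
        return 200, ROBOTS_TXT
    if "bingbot" in user_agent.lower():
        # Assumption: 403 stands in for the unspecified "error page".
        return 403, "Forbidden\n"
    return 200, "<html>...link farm page...</html>"


def allowed(agent_token, url):
    """Ask the standard parser whether `agent_token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent_token, url)
```

Because a robots file applies per host, a crawler has to fetch `/robots.txt` once for every one of the generated host names before the rule takes effect for that host.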
Since the link farm has its own IP address, it was easy to add low-level packet filters to reject all traffic to that address from the 12 addresses of the bingbot. I unfiltered for a few minutes today, and it's still hammering as hard as ever.
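The filter logic itself is simple, as this Python sketch shows. Real packet filters are written in the firewall's own rule language, not Python, and every address below is a placeholder from the documentation ranges rather than the link farm's or the bingbot's real addresses.

```python
import ipaddress

# Placeholder addresses (documentation ranges), not the real ones.
LINK_FARM_ADDR = ipaddress.ip_address("192.0.2.80")
CRAWLER_ADDRS = {
    ipaddress.ip_address(a) for a in ("198.51.100.1", "198.51.100.2")
}


def verdict(src, dst):
    """Reject crawler traffic to the link farm address; accept the rest."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if dst_ip == LINK_FARM_ADDR and src_ip in CRAWLER_ADDRS:
        return "reject"
    return "accept"
```

Matching on the destination as well as the source is what keeps the same crawler free to visit the other sites on the server.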
While this isn't doing any great damage, if I didn't have the skills to look at logs and write suitable packet filters, or if I were paying by the byte for network traffic, it could have crashed my system or cost me a lot of money.
Bing is not the only search engine to have discovered my link farm. Google's Googlebot-Mobile/2.1 visits the link farm every few seconds, claiming to be various kinds of Japanese mobile phones. But Bing's traffic is orders of magnitude more than everyone else's put together. (This is just a problem for the link farm; the rest of my web sites get along with Bing just fine.)
My main question is how these highly sophisticated search engines have failed to notice that they have fetched several million almost identical pages from the same IP address, and failed to blacklist it. I have reason to believe that Bing management is aware of the issue, so maybe they'll stop it some time. Or maybe even let on what happened.
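To show how little it would take to notice, here is a hypothetical Python sketch of such a check. It says nothing about how any real search engine works; the host name pattern, the fingerprinting scheme, and the threshold are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict

# Rough pattern for host names; good enough for a sketch, not for production.
HOST_PATTERN = re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,}")


def fingerprint(html):
    """Hash the page with every host name replaced by a fixed placeholder."""
    skeleton = HOST_PATTERN.sub("HOST", html.lower())
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()


def suspicious_ips(fetches, threshold):
    """Return IPs where at least `threshold` hosts served the same skeleton.

    `fetches` is an iterable of (ip, host, html) tuples.
    """
    hosts_by_key = defaultdict(set)
    for ip, host, html in fetches:
        hosts_by_key[(ip, fingerprint(html))].add(host)
    return {ip for (ip, _), hosts in hosts_by_key.items() if len(hosts) >= threshold}
```

Pages that differ only in the host names they mention collapse to a single fingerprint, so millions of them on one IP address would stand out immediately.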
© 2005-2020 John R. Levine.
CAN SPAM address harvesting notice: the operator of this website will not give, sell, or otherwise transfer addresses maintained by this website to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages.