07 Oct 2009
A recent message in the Risks Digest called Risks of believing what you see on the WayBack Machine (archive.org) claims that:
I have now encountered 2 legal cases in 3 months in which a plaintiff saw images on the WayBack Machine (www.archive.org) and believed that they indicated events in the past that never happened.
This is a big deal in legal circles, since archive.org is widely used in court cases to show the state of a web site at a given time, which can be critical in, for example, cases where the site shows prior art for a patent or infringing copies of copyrighted material. If the archive entries aren't reliable, all of these cases are thrown into doubt. Needless to say, it would be many defendants' dream come true if courts were to stop accepting archived copies.
I have analyzed the material cited in the article and find that the archive is fine, and that the author's claims to the contrary are somewhere between disingenuous and deliberately misleading. Here's why.
The Risks article gives a purported example where a web page saved in 1997 includes an image that clearly was created more recently. The instructions to display it start:
And that's what's wrong.
A web page isn't really a document, it's a recipe that tells your browser how to construct a document to display on your screen or print on your printer. Every image in the page is a separate file with a separate URL. If the page uses frames to divide up the screen, each frame is a separate URL as well.
If the archive is going to reproduce the appearance of a page at the time it was saved, it has to include saved copies of all of the embedded images and frames. For example, the image to the right is regenerated each time this page is fetched, to show the current time and date. An archived copy of this page would need to refer to an archived copy of the image, to show the time that the page was saved. That means that rather than linking to the actual URL of the image, the saved page needs to link to the URL of a saved copy of the image.
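To make the link rewriting concrete, here is a minimal Python sketch of how a saved page's image reference could be turned into a reference to a saved copy. It assumes the archive's URLs take the form of a fixed prefix, a 14-digit date stamp, and then the original URL; the function name, the example site, and the save date are all hypothetical, and this is not the archive's actual code.

```python
from datetime import datetime

# Assumed layout of an archived URL: prefix / date stamp / original URL.
ARCHIVE_PREFIX = "http://web.archive.org/web"

def archived_url(original_url: str, saved_at: datetime) -> str:
    """Return the URL of the saved copy of original_url as of saved_at."""
    stamp = saved_at.strftime("%Y%m%d%H%M%S")  # e.g. 19970601120000
    return f"{ARCHIVE_PREFIX}/{stamp}/{original_url}"

# Hypothetical image and save time, for illustration only.
page_saved = datetime(1997, 6, 1, 12, 0, 0)
print(archived_url("http://www.example.com/clock.gif", page_saved))
```

A saved page that pointed at the live clock.gif would show today's time; pointing it at the stamped URL is what lets it show the time the page was saved.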
I see in the article that its author has already tried to bamboozle one court with this argument. I hope they don't fall for it.
Addendum: It turns out the situation is slightly more complicated than what I said, although the result is still that archived pages generally display a reasonable recreation of the prior state of a web page. The extra complication is that each element of a web page, such as the page itself, embedded images, and frames, is archived separately. The elements come from Alexa's web spider, which does not necessarily retrieve all of the elements at the same time. When the archive shows an archived page, the link to each element includes a date stamp that says which version of the element to retrieve. In the common case that no saved image has a date stamp identical to the page's, the archive returns the version with the closest date. This is a reasonable thing to do, since more often than not pages are updated at different times from their images. Moreover, if the exact version of an element is in question, it's easy to see what's in the archive. For each archived URL, a date stamp of * (an asterisk) returns a page with an index of all of the available versions, so if need be one can go through and determine the actual date of each element retrieved.
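The closest-date rule and the asterisk index can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the archive's implementation: the function names, the 14-digit stamp format, the URL prefix, and the list of saved versions are all assumptions made up for the example.

```python
from datetime import datetime
from typing import List

def parse_stamp(stamp: str) -> datetime:
    """Parse an assumed 14-digit date stamp such as 19970601120000."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

def closest_version(requested: str, available: List[str]) -> str:
    """Pick the saved version whose date is nearest the requested stamp."""
    target = parse_stamp(requested)
    return min(available, key=lambda s: abs(parse_stamp(s) - target))

def index_url(original_url: str) -> str:
    """URL that lists every saved version, using * as the date stamp."""
    return f"http://web.archive.org/web/*/{original_url}"

# Hypothetical saved versions of one image; the page itself was saved
# on 1 June 1997, a date for which no copy of the image exists.
image_versions = ["19961115080000", "19980302101500", "20040710093000"]
print(closest_version("19970601120000", image_versions))
print(index_url("http://www.example.com/clock.gif"))
```

With these made-up dates the November 1996 copy is chosen, since it is about 198 days from the page's date while the March 1998 copy is about 274 days away. Checking the index page is how one would confirm which copy was actually served.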
© 2005-2020 John R. Levine.