Internet and e-mail policy and practice
including Notes on Internet E-mail



07 Oct 2009

The Internet Archive Really Is Reliable

A recent message in the Risks Digest, titled "Risks of believing what you see on the WayBack Machine (archive.org)," claims that:

I have now encountered 2 legal cases in 3 months in which a plaintiff saw images on the WayBack Machine (www.archive.org) and believed that they indicated events in the past that never happened.

This is a big deal in legal circles, since archive.org is widely used in court cases to show the state of a web site at a given time, which can be critical in, for example, cases where the site shows prior art for a patent or infringing copies of copyrighted material. If the archive entries aren't reliable, all of these cases are thrown into doubt. Needless to say, it would be many defendants' dream come true if courts were to stop accepting archived copies.

I have analyzed the material cited in the article and find that the archive is fine, and that the author's claims to the contrary are somewhere between disingenuous and deliberately misleading. Here's why.

The Risks article gives a purported example where a web page saved in 1997 includes an image that clearly was created more recently. The instructions to display it start:

Disable javascript in your Web browser.

And that's what's wrong.

A web page isn't really a document; it's a recipe that tells your browser how to construct a document to display on your screen or print on your printer. Every image in the page is a separate file with a separate URL. If the page uses frames to divide up the screen, each frame is a separate URL as well.
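To make the "recipe" idea concrete, here is a minimal Python sketch, using only the standard library, that lists the separate URLs a browser would have to fetch to assemble a page. The sample HTML and the example.com addresses are made up for illustration, and the class and function names are hypothetical. It looks only at images and frames.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedResourceLister(HTMLParser):
    """Collect the URLs of the images and frames that a page refers to."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        # Only images and frames are handled here; a real page can also
        # pull in stylesheets, scripts, and other elements.
        if tag in ("img", "frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Relative references are resolved against the page URL.
                self.resources.append(urljoin(self.page_url, src))


def list_embedded_resources(page_url, html):
    lister = EmbeddedResourceLister(page_url)
    lister.feed(html)
    return lister.resources


# Hypothetical page, made up for this example.
SAMPLE_HTML = """
<html><body>
<img src="/images/clock.png">
<iframe src="sidebar.html"></iframe>
</body></html>
"""

for url in list_embedded_resources("http://example.com/news/index.html", SAMPLE_HTML):
    print(url)
```

Each URL printed is a separate fetch, which is why an archive has to save each piece separately as well.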

If the archive is going to reproduce the appearance of a page at the time it was saved, it has to include saved copies of all of the embedded images and frames. For example, the image to the right is regenerated each time this page is fetched, to show the current time and date. An archived copy of this page would need to refer to an archived copy of the image, to show the time that the page was saved. That means that rather than linking to the actual URL of the image, the saved page needs to link to the URL of a saved copy of the image.

This presents a problem for the archivist. For the best historical record, the archived page should be saved exactly as it was originally. But to display properly, the links in the page need to be changed. The Internet Archive takes a clever approach to meet both of these needs. The page is indeed saved in its original format, but the Archive adds a bit of Javascript at the bottom, which runs in the user's browser to edit the URLs to point to archived versions. If you look at the source code of an archived page, it's very clear what they're doing. Near the bottom of the allegedly screwed-up archived page, it says:

// FILE ARCHIVED ON 19971210055953 AND RETRIEVED FROM THE
// INTERNET ARCHIVE ON 20080109043126.
// JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.
// ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C.
// SECTION 108(a)(3)).
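The rewriting step can be sketched in a few lines. What follows is an illustration in Python, not the Archive's actual script, which is Javascript running in the browser. The function name is hypothetical, the page and image are made up, and the `https://web.archive.org/web/<timestamp>/<original URL>` layout is an assumption based on the Wayback Machine's public URLs.

```python
from urllib.parse import urljoin

# Assumed layout of Wayback Machine URLs:
# prefix, 14-digit timestamp, then the original URL.
ARCHIVE_PREFIX = "https://web.archive.org/web/"


def to_archived_url(page_url, timestamp, link):
    """Point a link found in an archived page at the archived copy of its target."""
    # Resolve the link against the page's original address first, so that
    # relative links such as "logo.gif" become full URLs.
    original = urljoin(page_url, link)
    return f"{ARCHIVE_PREFIX}{timestamp}/{original}"


# The timestamp is the archive date quoted above; the page and image are made up.
print(to_archived_url("http://example.com/index.html", "19971210055953", "logo.gif"))
```

With Javascript turned off, this kind of substitution never happens, so the browser fetches whatever is at the original image URL today.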

If you turn on Javascript and re-fetch the page, you get a reasonable version of the page, in this case with no image, probably because the Archive wasn't saving images in 1997.

It would be reasonable to warn people to be sure that they are displaying archived pages in a browser with Javascript enabled in order to get the proper display. But it is just plain misleading to deliberately turn off Javascript, then claim that the page is somehow archived wrong.

I see in the article that its author has already tried to bamboozle one court with this argument. I hope the court doesn't fall for it.

Addendum: It turns out the situation is slightly more complicated than what I said, although the result is still that archived pages generally display a reasonable recreation of the prior state of a web page. The extra complication is that each element of a web page, such as the page itself, embedded images, and frames, is archived separately. The elements come from Alexa's web spider, which does not necessarily retrieve all of the elements at the same time. When the archive shows an archived page, the reference to each element includes a date stamp to say which version of the element to retrieve. In the common case that there are no images with date stamps identical to that in the page, the archive returns the version with the closest date. This is a reasonable thing to do, since more often than not pages are updated at different times from the images.

Moreover, if the exact version of each element is in question, it's easy to see what's in the archive. For each archived URL, a date stamp of * (an asterisk) returns a page with an index of all of the available versions, so if need be one can go through and determine the actual date of each element retrieved.
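The closest-date rule is easy to sketch. The Python below is an illustration under stated assumptions, not the Archive's code: it parses 14-digit date stamps of the form shown in the appended comment and picks the available version nearest to the page's own stamp. The function names are hypothetical and the list of image versions is invented for the example.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"  # 14 digits, e.g. 19971210055953


def parse_stamp(stamp):
    """Turn a 14-digit date stamp into a datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT)


def closest_version(page_stamp, available_stamps):
    """Return the available stamp nearest in time to the page's stamp."""
    target = parse_stamp(page_stamp)
    # On a tie, min() keeps whichever candidate appears first in the list.
    return min(available_stamps, key=lambda s: abs(parse_stamp(s) - target))


# Invented versions of one image; the page stamp is the one quoted earlier.
image_versions = ["19970301120000", "19980115083000", "20080109043126"]
print(closest_version("19971210055953", image_versions))
```

Here the page was saved in December 1997 and the nearest image version is the one from January 1998, so that is the one chosen. When the nearest version is years away from the page's date, the index of available versions is the way to find that out.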


  posted at: 21:31
Stable link is https://jl.ly/iarchive.html


© 2005-2020 John R. Levine.