Not all snapshots are created equal – a time-saving Wayback Machine technique

We’re going to write about our daily work more often. I’ll go first with a nerdy research tip:

The Internet Archive’s Wayback Machine (the awesomeness of which I won’t bang on about) can show you when captures of the same page differ in some way from each other.

So what?

Here’s a long-dead page used by La Secretaría de la Defensa Nacional (SEDENA) in Mexico to list the commanding officers of Zonas Militares (a major tier of the army in Mexico).

http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm

It exists only in the Internet Archive’s Wayback Machine now. Here are two captures of that URL – made in 2004 and 2005 respectively. The screenshots below show only the first 10 entries (of over 40 in each). Can you spot the difference?

cdxblog1
Clipping from 8 February 2004 Wayback Machine snapshot of SEDENA army commanders page
cdxblog2
Clipping from 3 October 2005 Wayback Machine snapshot of SEDENA army commanders page

Although the archived URL is the same, the content is not. For example, in the February 2004 snapshot SEDENA lists “Noe Antonio Ordoñez Herran” as the commander of 1/a Z.M. However, by October 2005 SEDENA lists “Germán Redondo Azuara” as the commanding officer. This is a substantive difference that we want to capture; there are also other differences between these two snapshots.

How do we approach it? First, we establish the total number of snapshots. Helpfully, the Wayback Machine tells us this for any URL that it holds snapshots for. For example, the present SEDENA page was captured 57 times:

cdxblog3

A page like this was probably updated regularly: the little bar chart tells us that there are differences in the sizes of the snapshots, indicating that something changed. The changes could be an update to the text in the list of commanders, or a design change of some sort that affects the page size.

Do we have to wade through all of them to find out what the differences are? No. The Wayback Machine can tell us which snapshots differ from the previous ones. Therefore, we can just go to those that differ in some way from the others and extract information from those.

To do this, we have to use another way to ask the Wayback Machine questions: the Wayback CDX server. The CDX server is a more advanced way to query the Wayback Machine, though you can still use it from your browser. It doesn’t have a graphical user interface to browse the archived pages. Rather, it provides metadata about the snapshots.

Here’s Wayback Machine data about our URL, but viewed from the CDX server:

cdxblog4
Some output from the Wayback Machine CDX server.

Here’s the URL that gives you those results:

https://web.archive.org/cdx/search/cdx?url=http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm

These are a few rows of the same 57 results, but shown as metadata rather than as a navigable, graphical version of the web captures themselves. I’m sure you can figure out how to turn this list into a spreadsheet that you can use to organise your research (hint: copy-paste into your favourite spreadsheet, then text-to-columns using a space as the separator).
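If you would rather script the copy-paste step, here is a minimal sketch in Python. It assumes the CDX server's plain-text output uses its documented default column order (urlkey, timestamp, original, mimetype, statuscode, digest, length). The sample rows are made up for illustration: the dates match the two snapshots above, but the times, digests and lengths are placeholders, and the function names are my own.

```python
import csv
import io

# Default column order documented for the CDX server's plain-text output.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]

def parse_cdx(text):
    """Turn space-separated CDX lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < len(CDX_FIELDS):
            continue  # skip blank or malformed lines
        # zip() ignores any extra trailing columns beyond the seven defaults
        rows.append(dict(zip(CDX_FIELDS, parts)))
    return rows

def to_csv(rows):
    """Serialise parsed rows as CSV text, ready to open in a spreadsheet."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=CDX_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Placeholder sample in the CDX layout (times, digests, lengths are invented).
SAMPLE = """\
mx,gob,sedena)/ejercito/comandancias/zon_mil.htm 20040208000000 http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm text/html 200 AAAA 1000
mx,gob,sedena)/ejercito/comandancias/zon_mil.htm 20051003000000 http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm text/html 200 BBBB 1100
"""

rows = parse_cdx(SAMPLE)
print(to_csv(rows))
```

Paste the real CDX output in place of SAMPLE and save the printed text as a .csv file; the spreadsheet hint above gets you to the same place by hand.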

By changing the URL a bit we can filter out snapshots that are the same as the preceding one:

https://web.archive.org/cdx/search/cdx?url=http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm&showDupeCount=true&collapse=digest

We’ve tacked on two new bits to the end of the query URL:

&showDupeCount=true

This shows which of the snapshots have duplicates. And then:

&collapse=digest

This has the effect of removing data about snapshots that are the same as the previous one.
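If you run this kind of query for lots of pages, it may be easier to build the URL in code than to edit it by hand. The sketch below uses only Python's standard library; the helper name cdx_query is my own invention, and the two parameters are just the ones described above.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
PAGE = "http://www.sedena.gob.mx:80/ejercito/comandancias/zon_mil.htm"

def cdx_query(page_url, **params):
    """Build a CDX query URL; keyword arguments become extra query parameters."""
    query = {"url": page_url}
    query.update(params)
    # safe=":/" leaves the archived URL unescaped, like the hand-written version.
    return CDX_ENDPOINT + "?" + urlencode(query, safe=":/")

plain = cdx_query(PAGE)
deduped = cdx_query(PAGE, showDupeCount="true", collapse="digest")

print(plain)
print(deduped)
```

The two printed URLs should match the ones shown above, and either can be pasted into a browser. I have left the actual download out of the sketch so that it runs without a network connection.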

Overall, our results are filtered from 57 down to 31 snapshots. It’s removed 26 that were the same as the preceding one and saved us a good hour of work.
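To make "the same as the preceding one" concrete, here is a small local imitation of that collapsing step. It is not the CDX server's own code, just a sketch of the idea: walk through the captures in date order and keep the first of each run of neighbours that share a digest. The captures below are invented for illustration.

```python
from itertools import groupby

def collapse_adjacent(rows, key="digest"):
    """Keep the first row of each run of adjacent rows sharing the same key.

    Returns (row, run_length) pairs, where run_length counts the captures
    in that run, including the one that is kept.
    """
    collapsed = []
    for _, run in groupby(rows, key=lambda r: r[key]):
        run = list(run)
        collapsed.append((run[0], len(run)))
    return collapsed

# Hypothetical captures: timestamps and digests are made up.
captures = [
    {"timestamp": "20040208", "digest": "AAAA"},
    {"timestamp": "20040301", "digest": "AAAA"},
    {"timestamp": "20040615", "digest": "BBBB"},
    {"timestamp": "20050110", "digest": "AAAA"},
    {"timestamp": "20051003", "digest": "CCCC"},
    {"timestamp": "20051004", "digest": "CCCC"},
]

for row, count in collapse_adjacent(captures):
    print(row["timestamp"], row["digest"], count)
```

Six captures become four. Note that the digest AAAA shows up twice in the result: only neighbouring duplicates are merged, so a page that changes and later reverts still counts as a change.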

As it happens, of those 31 snapshots only 12 hold content that is useful to us. The remainder are captures of server errors, because SEDENA changed its official website (and URL structure) four times between 2004 and 2017. But that, my friends, is another blogpost.
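One possible shortcut for weeding out error captures is to look at the statuscode column. This is only a sketch, and it rests on an assumption I have not checked for the SEDENA page: that the error captures were recorded with a status other than 200. An error page served with a 200 status would slip through, so it is no substitute for looking at the snapshots. The rows are invented.

```python
def usable_captures(rows, ok_status="200"):
    """Keep only rows whose recorded HTTP status equals ok_status."""
    return [r for r in rows if r.get("statuscode") == ok_status]

# Hypothetical rows: timestamps and status codes are made up.
snapshots = [
    {"timestamp": "20040208", "statuscode": "200"},
    {"timestamp": "20080101", "statuscode": "404"},
    {"timestamp": "20100101", "statuscode": "500"},
    {"timestamp": "20051003", "statuscode": "200"},
]

for row in usable_captures(snapshots):
    print(row["timestamp"], row["statuscode"])
```

The CDX server documentation also describes a filter parameter that can do this sort of thing on the server side, if you prefer to keep it all in the URL.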

So, to wrap up:

  • The Wayback Machine has the equivalent of an advanced query that helps us find out when snapshots of the same page differ from each other.
  • It’s called the Wayback CDX server, and you can read more about what it does on its GitHub page.
  • Using it at the beginning of a bit of research can save you a lot of time.

I hope this helps some of you save time when trawling the Wayback Machine, and encourages you to experiment a bit with the more obscure features of well-known tools. It certainly helps us create the rich data you see on WhoWasInCommand.com.

Cheers!