Analytical Access to the Domain Dark Archive: June 2013

James Baker, Digital Curator, British Library

It is commonplace to describe something new in relation to something that is known: think 'motion picture', 'spaceship', 'email' or 'smartphone'. The word 'webpage' is no different. And indeed in a sense many webpages are similar to the pages found in books or newspapers: they hold static media (text, image); core elements of them read from top to bottom; their headers, footers, cut-aways and advertisements orientate, guide and entice the reader; and in URLs they possess a (relatively) unique system of identifiers. It is hard to think of another name these digital objects could have been given.

It is also commonplace for the new thing, linguistically speaking, to replace the old thing: think 'motion picture' and 'the pictures', 'spaceship' and 'ship', 'email' and 'mail', or 'smartphone' and 'phone'. The same goes for 'webpage' and 'page'. Here, by virtue of this act of redefinition, the 'page' absorbs features of the webpage that are not (or are less) possible in book or newspaper pages: features such as dynamic content, user interaction, and direct links to other pages (or, more precisely, other pages that are not part of a sequence defined by the author whose work is the main content held by the page).

All of this makes the webpage-cum-page appear both familiar and unsettling, conservative and disruptive, old and new. These elements of lineage are crucial, for they have allowed us (among other things) to think of preserving the webpage as akin to preserving the page. Yes, the challenges of novelty and disruption are discussed and debated (on which I'm not qualified to comment), but at the most basic level the webpage stuff that is being collected by the Internet Archive or the UK Web Archive is page-level stuff. (This is not to say I don't think page-level stuff should be archived. Far from it: the fragility of webpages is well known (see Rosenzweig, 2003), and without these efforts valuable data on our society would be lost.)

But what are these pages, and how can historians use them? A seminar jointly hosted by the Digital History seminar and the Archives and Society seminar at the Institute of Historical Research sought last night to tackle this very problem, asking quite simply, 'Is this a new class of primary source for historians?' After a presentation on the UK Web Archive and the Analytical Access to the Domain Dark Archive project, both the speakers and the audience were largely in agreement that yes, the web archive is a new class of primary source, of historical stuff.

Does this make our nomenclature for what this stuff is problematic? For to call a webpage a page is potentially to place it into a category for which it is ill-suited, and to place the techniques for investigating that category under huge strain. Take a normal news article from the Guardian website as an example. The page contains a story, framing, context and advertisements: all very page-like. But those adverts are dynamic as opposed to static, their content quite possibly targeted depending on the IP address accessing the URL and different each time the page is refreshed. The page also contains moderated comments, ranked by default with the oldest first but malleable to user preferences. In short, the page you see when you visit the website is unlikely to be the same as the page I see when I visit it, so an archived version can only be one possible version of a webpage at a particular historical moment. Not very page-like behaviour. Of course we might (quite rightly for the most part) say that the 'core' of the page, the textual content that historians are likely to be interested in, will remain the same regardless of these peripheral changes. And yet, as the growth of mainstream live blogs demonstrates (such as those covering the Taksim Square protests), the web is moving toward dynamic content over static content as the default: embedded video, maps and text content streams are now commonplace, and are likely to become more so as the web develops.

The webpage, then, is a rapidly evolving beast whose capacity to change whilst still being called a 'page' complicates how we do research using webpages and how we preserve the internet. It is a page, but not a page as we knew it: a semantic shift worth keeping in mind as we prepare for an era of born-digital historical scholarship.

This post was first published on the British Library's Digital Scholarship blog.

Thursday 13 June 2013

A page, but not as we know it