Analytical Access to the Domain Dark Archive

Friday 9 May 2014

Researchers' final reports (4)

In another of our series of researchers' final reports, I am posting a link to a PDF of a talk given by Martin Gorsky of the London School of Hygiene and Tropical Medicine at the recent European Social Science History Conference in Vienna. Martin goes into plenty of detail here about how he used the search interface to the Dark Archive to research public health in local government in England.

Tuesday 12 November 2013

Researchers' final reports (3)

This is the third in our series of final reports by the AADDA project researchers, posted with their permission. This one is by Dr Carole Taylor, a researcher at the House of Lords:

I. Research Background and Methodology

My historical expertise lies in early Georgian music, art and politics and was not obviously suited to the Domain Dark Archive focus on UK websites extant between 1996 and 2010. However, my work as Research and Parliamentary Assistant to peers in the House of Lords seemed a more promising fit. I discussed this with colleagues in the Lords who immediately recognised the potential value of the web archive for MPs and Peers with “a range of policy interests which will map onto those of academic researchers”. With the particular encouragement and advice of Dr Elizabeth Hallam Smith (Director of Information Services and Librarian, House of Lords Library) I identified political engagement as an area of obvious interest to parliamentarians, as well as a theme noted by Peter Webster during the 13 June 2012 seminar as a category, among others, that lent itself well to web archive research of this kind, and wrote up a proposal.

I undertook an intensive period of research to familiarise myself with the present state of serious research on political engagement in the UK in order to identify a manageable research exercise to take to the AADDA interface. I was advised by several academic colleagues, particularly two PhD students in the Department of Government at the University of Essex, one of the two main centres (together with the University of Lancaster) of political engagement studies in the UK. I was also assisted in this information gathering exercise by a Senior Researcher at the House of Lords Library where serious efforts are made to understand how parliamentarians are listening to and engaging with the public.

In advance of our access to the AADDA I presented a scaled-down version of my research proposal to the IHR/BL team in March. I suggested a focus on social media forums used by parliamentarians, particularly the House of Lords blog, launched in 2008. The House of Lords was the first parliamentary chamber in the world to set up a bipartisan blog which makes it a compelling example in the history of political engagement. Disappointingly, in the ensuing months leading up to our encounter with the AADDA dataset, I learned that social media sites with the exception of .co.uk would not be included in the dataset, which meant my topic was no longer viable. I re-thought the proposal and decided on a very narrow, entirely new subject that felt manageable to complete within the parameters of the consultation – Heathrow’s Third Runway. In February 2013 I had a meeting with Jane Winters and Jonathan Blaney at the IHR to confirm this third version of the research proposal was acceptable (it was).

Thus I was on track for the purposes of the consultation. However, with such limited access to social media sites the value of this exercise for serious researchers at Parliament was considerably eroded. Even the significance of the results below on the “Third Runway” was questioned, albeit sympathetically, by parliamentarians who cautioned that I appeared to be accessing information that is already well-known to parliamentary researchers. Their interest is obviously in what this resource can offer over and above what they know already. It may be that areas that did not receive such widespread public airing (as the Third Runway did) will deliver better results.

II. Research Results:

1st session: March 2013

I questioned the interface three times.

“third runway” – 171 items;
“third runway” AND “parliament” – 71 items;
“third runway” AND “heathrow” AND “parliament” – 69 items

Yield:

a lot of travel companies;
.gov.uk (5 items) – entirely predictable;
public suffixes are important for investigating engagement, but I didn’t readily grasp how to link the left and right sides of the search results page usefully

Questions arising:
How many of the 100 that were dropped between the first two searches might have included useful information? In this respect, I agree with GM at the 21 March 2013 meeting who said “there needs to be a ‘search within’ option, for when there are many thousands of results.” PW’s response that “in such cases adding more search terms should have the same effect” is helpful to reduce “many thousands” to a couple of hundred; however, at this point I might not want to lose potentially useful information in the course of adding a new search term.

What about people who are undecided or don’t express their views? An obvious but important qualitative question for historical researchers.
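The first question above, about the items lost when a search term is added, can be illustrated with a minimal sketch. It assumes, hypothetically, that result lists could be exported from the interface as records with an `id` and some `text`; the record layout, the sample records and the function names are illustrative only and are not part of AADDA.

```python
def search_within(results, term):
    """Filter an already-retrieved result list by a further term."""
    term = term.lower()
    return [r for r in results if term in r["text"].lower()]


def dropped_by_refinement(broad_ids, narrow_ids):
    """Return the IDs found by the broad search but lost after narrowing."""
    return sorted(set(broad_ids) - set(narrow_ids))


# Hypothetical exported results for the broad search "third runway".
results = [
    {"id": 1, "text": "Third runway debated in Parliament"},
    {"id": 2, "text": "Cheap flights: third runway could cut delays"},
    {"id": 3, "text": "Parliament votes on third runway at Heathrow"},
]

narrowed = search_within(results, "parliament")
dropped = dropped_by_refinement(
    [r["id"] for r in results],
    [r["id"] for r in narrowed],
)
print("kept:", [r["id"] for r in narrowed])
print("dropped:", dropped)
```

Keeping the list of dropped items alongside the narrowed list would let a researcher skim what a new search term has excluded instead of losing it silently.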

Suggestions:
It would be a great help if a preview screen were available to the right of each item. Through all my searches (in March and September), I clicked on countless items that were duplicates of what I’d clicked on two or three items earlier. (Titles of items often differ, so titles alone are not a dependable indicator.)

At the March meeting I asked Peter and Andrew how to turn the search data into ngrams; the answer was that AADDA will have a “click to create ngram” function. It is not there yet, but would be a great help.

2nd session: September 2013

I questioned the interface 10 or 15 times.

“third runway” AND “parliament” – yielded 990 items, but the breakdown of these results (crawl year, content type etc) proved both manageable and useful;
“third runway” AND “soley” – 122 items: Lord Soley was Chairman of “Future Heathrow”, the pro-expansion group; among the 122 items was an interview (helpful, though it appeared twice); the first 21 items were all the same and most were inaccessible or gobbledygook (cooking recipes); several items, eg travel sites, made no mention of Soley or the third runway at all;
“third runway” AND “house of lords” – 206 items; and “third runway” AND “aviation” – 2000+. For both of these, I checked the two extreme ends of the sentiment analysis (“very positive” and “very negative”). Many of these items failed to link the two search filters in any way. Nearly all of the 206 items in the first set were from the BBC – this was not only a problem of repetition (though there was plenty of this), but these are also widely public documents of little use to parliamentarians (who are already well-equipped with knowledge at this level).
I checked “third runway” AND “Howard Davies” on the off chance he was mentioned in this connection before he became Chairman of the Airports Commission in 2012 – eight items, all identical (a pdf report of the Association of British Insurers that had no mention of “third runway” or Davies) – disappointing!
Also checked “third runway” AND “future of aviation”; “third runway” AND “environment”; “third runway” AND “economy” – no new observations.

Suggestions:
It would be a great help if we could print the page with search results (or somehow export this material).

Questions and Concerns:
Clearly in this September round of questioning the dataset I was encountering problems with the Boolean AND search that didn’t arise in March. At best I seemed in September to be accessing OR rather than AND; at worst there was no connection to either search filter. I corresponded with Richard Deswarte about this; he could not see where the problem lay, and I have no idea what it was either.

Sentiment Analysis, where it hits items to do with the search term(s), was at least consistent and might therefore be of interest in early stages of research.

Repetition: This is my biggest concern about keyword searching: does the repetition of material from one crawl to the next render the totals listed on the search results meaningless for the historian? And will this problem be multiplied by 200 when the entire dataset is available? Peter cautioned users to avoid taking numbers of results for 2009 and 2010 as evidence of patterns in relation to the previous years; does repetition compound this problem for all years?
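One way to gauge how far repetition inflates the totals would be to compare the raw number of hits per crawl year with the number of distinct texts first seen in that year. The following is only a sketch under stated assumptions: it supposes that hits could be exported with their text and crawl year (the `crawl_year` and `text` field names and the sample hits are hypothetical), and it treats two captures as duplicates only when their text is identical after whitespace and case are normalised.

```python
import hashlib


def content_key(text):
    """Hash the text after collapsing whitespace and lower-casing it."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def counts_per_year(hits):
    """Return (raw, distinct) hit counts keyed by crawl year.

    A hit counts as distinct only in the earliest year its text appears.
    """
    seen = set()
    raw, distinct = {}, {}
    for hit in sorted(hits, key=lambda h: h["crawl_year"]):
        year = hit["crawl_year"]
        raw[year] = raw.get(year, 0) + 1
        distinct.setdefault(year, 0)
        key = content_key(hit["text"])
        if key not in seen:
            seen.add(key)
            distinct[year] += 1
    return raw, distinct


# Hypothetical exported hits: the same page captured in successive crawls.
hits = [
    {"crawl_year": 2008, "text": "Third runway approved"},
    {"crawl_year": 2008, "text": "Third  runway approved "},
    {"crawl_year": 2009, "text": "Third runway approved"},
    {"crawl_year": 2009, "text": "Campaigners oppose third runway"},
]

raw, distinct = counts_per_year(hits)
print("raw:", raw)
print("distinct:", distinct)
```

Exact-match hashing will miss pages that differ only in a date stamp or navigation bar, so the distinct counts are at best an upper bound on genuinely new material.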

III. Concluding Remarks

The Digital History and Archives seminar presented by Peter Webster and Richard Deswarte at the IHR on 23 September 2013 was an invaluable guide to my second round of searches on the interface: http://historyspot.org.uk/podcasts/digital-history/web-archives-new-class-pr - click on “Web Archives: A New Class of Primary Source for Historians?” I’d particularly like to highlight Peter’s observation that the traditional separation of historian and keeper of archives no longer holds in digitized systems of this kind. During the Q&A Tim Hitchcock expanded on this point, remarking that models of society now being digitized – newspapers, etc – were of course not digitized at the time. These changes demand a new skillset now being shaped by and for C21 historians. To this I would add that scholars will have questions about subjects they know well and subjects they are addressing for the first time, and this fact needs also to be built into the process of curating datasets of this kind – particularly in the present, pioneering stage of digital research.

Wednesday 23 October 2013

Researchers' final reports (2)

This is the second in our series of final reports by the AADDA project researchers, posted with their permission. This one is by Saskia Huc-Hepher:

AADDA Testing Report:

The French Community in London

by Saskia Huc-Hepher

1 - Methodology

The initial purpose of this research was two-fold: firstly, to use the geo-indexing tool to map out the areas of London with the greatest concentrations of French inhabitants on the basis of the post-codes associated with 'French' Web sites / spaces; and, secondly, to identify French community websites in the Domain Dark Archive (DDA) appropriate for subsequent multimodal analysis on the basis of their visual and textual meaning potentialities. The ultimate objective of the former was to triangulate the findings of additional empirical research conducted within the framework of my PhD, which sought to ascertain the actual numbers and hot-spots of the London French community, thereby serving to dispel the exclusively, or at least predominantly, South Kensington myth. The aim of the latter was to scrutinise the visual landscape of the London French over the period of the DDA data set, as (re)presented through the images – still or moving, in parallel to the technological advances of the Internet – displayed on the French community websites found in the DDA. It was envisaged that this historical visual data would provide the study with greater temporal contextualisation and depth, and, using social semiotic theory, in particular multimodality, would allow meaning to be inferred and ethnographic conclusions drawn from the images, on such subjects as the community's sense of belonging; how they perceive and conceive London and its inhabitants; how they (re)present and define their own identity through images; what elements of France and Frenchness they portray and promote; and whether any of these have changed over time.

Similarly, it was hoped that the geo-indexing analysis would be of historical value, determining whether or not there was any relationship between the areas most associated with the London French today and those districts favoured in previous waves of migration to the capital.

The final objective of the DDA research proposed here was for the image-tagging analytical tool to enable a word, or combination of words, such as 'French' and 'London', to search for photographs or images only, the visual data thereby potentially serving to triangulate the findings of the geo-indexing investigation in that the images and spaces associated with key words such as 'London', or specific areas within London, could have coincided with the places and spaces that were identified as being particularly French through the geo-indexing process and/or historically. This micro-investigation was therefore to be binary in its objectives: visual data for ethnosemiotic analysis and geo-indexing data for triangulation of previous qualitative research.

The methodology outlined above was adopted on several occasions over the course of the AADDA project time-span: firstly in March 2013, later in August 2013 and September 2013, with a final trial, using the most functional interface and comprehensive data set, in October 2013. The results, at every stage, however, were disappointing.

2 – Deep Search Data Testing

March 2013

The first trial session was carried out in the knowledge that at that point in time the DDA included only a random subset of the entire cohort of data, but one which was evenly spread over the archive in temporal terms. Therefore, in theory, trends, developments and patterns should have been identifiable, despite sentiment analysis and geographic options not being available at that stage. In practice, however, a number of basic search hurdles prevented any valuable findings from materialising. These included:

the lack of clarity regarding the need to click on the crawl date to access a website; choosing the website title would have been more intuitive. Such functionality was updated at the subsequent meeting (21/03/2013);
the lack of clarity regarding the purpose of the bar charts at the top of the page; they have since been removed;
the fact that not all web captures functioned at that time – e.g. Le Petit Parisien restaurant had no images and almost no text (but enabled me to do a current Google search for the website, only to find out that the restaurant – and website – is now closed; this is therefore an example of the potential historical worth of the DDA, had it been operating correctly, in allowing the analysis of obsolete Websites);
some websites cited in the list of 'hits' subsequently being found to be unavailable; the links to alternative sites proved to be useful, however;
time being wasted revisiting Websites which had already been scrutinised. Once a site has been viewed, it would be helpful and more time-efficient if the visited link appeared in a different colour (e.g. purple, cf. Google) from the others on the list;
the fact that search tools operated extremely slowly and the interface was not yet user-friendly. Speeds and appearance have since improved and the latter is no doubt a work in progress;
the repetition of identical snapshots, e.g. http://web.archive.org/web/20080601000000*/http://www.guardian.co.uk/world/2008/jul/12/france.islam Here, every separate date in the July (burka scandal) peak (as well as all the other dates in August and October 2008, the two snapshots available from 2009 and the single one from 2012) showed the same snapshot from The Guardian (12 July 2008). If the online material is unchanged in relation to another date, this should be immediately visible on the list of data (possibly via colour coding, as suggested for the pre-visited Web pages, or grouping by content & date);
the majority of search results not being particularly useful for my purposes; they were either not relevant (for instance displaying large numbers of Websites related to French tourism for English users) or not French-specific (that is, 'Londres' retrieved results in Portuguese, Spanish, etc., not French exclusively; while English search words retrieved sites aimed at Francophiles as opposed to Francophones);
phrase searching using the “double inverted commas” being equally disappointing (nothing of relevance was found following a search for “French community London”, or indeed '“French” and “community”', trialled at a later stage); “French London” was therefore tested, resulting in a list of sites relating to French teachers & jobs in London.

Conversely, it was useful to have the 'media' / 'pdf' search options at the bottom of the screen, as this enabled access to images and audio 'texts' (of relevance to the multimodal methodological / theoretical approach taken in my research).

Overall, the initial testing was found to be useful in assessing the lasting impact, or otherwise, of the French community on London, in a temporally comparative manner. That is, by identifying French restaurants/cafés/businesses through their retrospective on-line presence before submitting the titles to a live Google search at the time of testing, I was able to discover if such enterprises were growing, in decline or defunct. Whilst that limited use was of potential value to my research in assessing the lasting contribution of French businesses to London's cultural and economic landscape, I was nevertheless acutely aware (given my curation of the London French Special Collection for the UK Web Archive) of the mass of relevant data – such as community websites and blogs – which had not been detected or listed as featuring in the DDA. It was hoped at the time that this was due to the incomplete and arbitrary state of the data set.

August 2013

This trial was more successful than the last as regards the speed and efficiency of the data search tools, despite there still being only a five per cent random, if temporally representative, sample of websites available. Somewhat paradoxically, those searches which pinpointed the early years of Internet use, namely 1996 and 1997, proved to be the most valuable. Several different searches were tested on this occasion, as follows:

a) A search for the terms “French community” was filtered by language, using the “French” option. This functionality was found to be extremely useful in reducing the large amount of irrelevant data to a more manageable subset. Again, by filtering further, this time by year (in this case 1996 and 1997), I was able to focus in on yet more pertinent Web pages. Thus, when I began to analyse the <Associations Françaises> site, I noted that the landing page directed the visitor to separate sites, one for French expatriates and one for Belgians. Not only are these sites an indication of the relative establishment of the said Francophone communities in the UK, each warranting an on-line home for the long list of associations set up in the country of residence, but the fact that a distinction is made between Belgian and Franco-French populations has implications regarding identity.

Using the same search terms, another site <Les Grenouilles Cablées>, harvested in 1996, proved worthy of an initial analysis. Firstly, the landing page pointed the visitor in the direction of three separate sub-sections: <Grenouilles du monde>, <Grenouilles des USA> and <Grenouilles de Californie>. These distinctions suggest that either the French expatriate community was more significant in the USA than elsewhere at that time (including London, which is no longer the case and perhaps related to the opening of European borders) or that US residents, including French ones, were earlier adopters of Internet technology than in the UK. When examining the site more closely and entering the <Grenouilles du monde> space, it was telling that the first choice was then <Nouvelles de France> (before the hyperlink to Quebec), which suggests that this website is indeed aimed at the French expat diaspora worldwide, linked together by their shared affinity to France, and keen to maintain links with the homeland. Further, when choosing the French news link, the selection of newspapers available was a left-leaning one. Again, the possible implications of this are two-fold: either the political leanings of the newspapers featured are an indication of the papers' social commitment, i.e. making information freely available to all, or they are an indication of the profile of the diaspora visiting on-line sites at that time, i.e. Libération and Charlie Hebdo both target a young, left-wing readership. If this is the case, it is thus a profile at odds with the predominantly right-wing (particularly at that time) expat community of the South Kensington stereotype, which serves to substantiate the hypothesis posited at the beginning of this report. There are also hyperlinks to <Metéo France> (suggestive of a need for a physical sense of proximity to the homeland, despite the geographical distance separating the community from it) and to <Les dernières nouvelles d'Alsace> and <Pariscope>, both of which could be indicative of a longing for insignificant local minutiae in the globalised age, made possible through the worldwide Web, as well as pointing towards greater emigration from eastern France (and Belgium, as confirmed by the first website) and the French capital than other geographical zones.

This site offers links to French audiovisual sites including radio and TV and, perhaps more importantly for my research, to two on-line fora, <French Talk> and <Francopolis> which are evidence of the formation of both Internet and French communities (despite other empirical evidence suggesting that the French community per se does not exist, or if at all, in South Kensington alone). Finally, this website creator's recommended sites are telling in terms of identity (just as a Blog would be today in its related networks) especially within the theoretical framework of Pierre Bourdieu's Habitus, with the Vatican, Charlie Hebdo, the RATP (equivalent to TFL in London) and various French sports sites (football, Formula 1 and rugby) featuring among others.

Another site displayed following this search was the <Association des Francophones de Cranfield>, in which advice is provided on low-cost means of transport to France and Belgium. This in itself demonstrates that the target audience are medium- to long-term French residents of the UK, rather than short-term visitors, and that they have been attracted to England by its (Higher) education system – a point which, as incongruous as it may appear, is corroborated by the qualitative data gathered outside the AADDA project.

b) The second search undertaken in the August trial was “London French” by “content type”, notably “image”. This was highly disappointing and of little use given that the few images which were displayed related to French football or simply contained a set of codes, with no discernible image.

c) To counter the insufficiency of the image search above, a “format search” was instead chosen from the AADDA homepage. This was more successful in terms of number, with some 6,369 items listed for the “French London + format” search trialled, filtered by year (2006). However, given that the images were not tagged and stood in complete isolation, their usefulness was questionable, as many appeared to relate not to the French community in London, but linked to websites on French property or university Webpages.

d) This search attempted to assess the value of the post-code filter, which initially was again rather disappointing. Given the lack of pertinence of the majority of the sites identified after the early years (1996, 1997), their related post-codes were of equal irrelevance. Furthermore, there were no apparent clusters of London websites, with many coming from outside London; no micro-geographical/demographic conclusions could therefore be drawn. A subsequent search (“French community” filtered by language and year), despite listing only one Website, revealed two potentially telling post-codes, N7 and NW5, for 2010, which could have been related to the forthcoming opening of a new French State school in Kentish Town (NW5) (but the insignificant numbers involved are again inconclusive).

e) A search for “communauté française”, filtered by year (2001) and language (French) identified a Blog, which would have been of particular pertinence to my research. However, it transpired that the said Blog was the work of an English-speaker, practising their written French, rather than a French Londoner's Blog. The lack of Blogs retrieved by the DDA search engine was perplexing, as many are known to me within the framework of my UK Web Archive Special Collection work. The question of whether this is due to the domains favoured by the London French Bloggers as hosts for their autobiographical logs is therefore worth consideration, and if so, the possibility of accessing them through the DDA should also be contemplated.

f) The same search as in item (e), this time written in and filtered by the English language for the year 2010, found only one Website, the <Ile aux enfants> school in North London. Despite the unexpected limitedness of the search results in this case, the “links to host” tool was telling, particularly in terms of “mapping the field” and Bourdieu's “three-stage analysis” paradigm. That is, by scrutinising the – predominantly institutional – list of Websites linked to the <Ile aux enfants>, such as <ambafrance>, <assemblee-afe>, <bienvenuealondres> and <edufrance>, socio-cultural assessments were facilitated. Nevertheless, it was frustrating that these links to the host site were not functioning during the trial, directing the visitor back to the host page as opposed to opening the linked Webpage itself. It was not clear, therefore, whether their inclusion was exclusively for quantitative analysis (the number of visits was in brackets), as they were of no qualitative worth without access to the content of the linked Websites.

September 2013

The most notable and satisfying difference between this trial and the preceding ones was that all the links to related Websites were at least partially, and in the great majority of cases completely, successful. This meant that the discovery of one website (from a long list of still relatively futile others), namely the “Londoscope” reference pages of <www.acticours.freeserve.co.uk>, proved to be invaluable through its hyperlinks, as opposed to the content of the site itself. Thus, several pertinent results were attained, as detailed below:

a) The appearance of London French social-networking-type pages, known as <Londoscope>, is perhaps indicative of the growing numbers of French Londoners seeking a physical sense of community by means of digital linking and dissemination mechanisms. Entries such as “Eglise protestante française de Londres: Soirée anti-stress” and the enumeration of French films on show at the Ciné Lumière and the NFT, together with other French cultural events at the Institute of Contemporary Arts, bear witness to the importance of French culture to London's overall cultural capital and are also evidence of community belonging in practice.

b) The <Londoscope> pages from 2003 enabled the identification of a culturally and historically pertinent French amateur dramatics group which has been performing in London since 1929: Le Cercle dramatique français (CDF). My research into this amateur theatre company can now be taken forward in an effort to ascertain whether it is still in existence and, if so, its place in French community life today.

c) Another link on the same Website, from 2004, referred to the Francophone television channel TV5 celebrating its 20th anniversary and revealed some useful viewer figures, including it being watched in 167 million households in 2003, with some 56 million weekly viewers. This constitutes further evidence as to the impact of the French language and culture worldwide and potentially to the growing French diaspora.

d) The final finding of relevance during this trial session was the <Londoscope> link to the ADFE (Association Démocratique des Français à l'Etranger), created in 1980 'par des Français qui voulaient, pour les représenter, une association dynamique et correspondant aux nouvelles réalités de l'expatriation' (i.e. by French people who sought representation through a dynamic association in tune with the new realities of expatriation). This quotation alone is of worth for a number of reasons; firstly the notion of 'representation' itself is key, as it begs the question of 'representation to whom?', which, reading further, it appears is to the French authorities. This in turn indicates that the need to be politically represented in France has its roots much further back historically than the election in 2012 of the first ever Député for French overseas residents implies, as well as demonstrating an unwillingness to integrate fully in the London socio-political scene and an attachment to the homeland. Similarly, the notion of “new realities” suggests a shift from an old form of migration to a new one, acting as a temporal forerunner to the massive wave of cross-Channel immigration which began in the early nineties and continues to this day. The term “dynamique” could also be seen to illustrate the London “pull factor” for French expats living in the capital; that is, many are arguably escaping the inertia and complacency of French institutions and mindsets in their decision to emigrate to London, as exemplified in other forms of empirical evidence gathered for this research. Here, therefore, the data gathered from a single Website in the DDA has served to triangulate several key findings in my PhD.

October 2013

Having exhausted most of the available search options during the previous trial sessions, this was the shortest and least enlightening of all. It was necessary, nonetheless, to conduct a final test with the most functional interface to date and a now complete data set. The search tools also provided an opportunity for sentiment analysis, unavailable in previous trials.

This experiment involved a phrase search for “London French community” combined with “English language” and “very negative” sentiment filters. No results were identified. When the French language was used and chosen as a filter, 240 matches were found, but these were of little relevance to my research given their pedagogical focus. One potentially valuable find for historians of the French presence in the UK was a Website on Augustine monks, in which the flight of monks from France during and after the French Revolution, and the creation of brotherhoods in York (1802), Bristol (1818) and Ealing (1897), where a Benedictine monastery was founded, were reported. However, in view of the contemporary emphasis of my research, this proved of little relevance, once again.

Further searches, using different phrases/words, content types and language/sentiment filters were also trialled, to no avail. Furthermore, it was disappointing to note that the post-code and media filters appeared to have been removed, or were not readily visible.

Overall, if not the least successful of the trials conducted to date, this was the most frustrating, given the unfulfilled aspirations of working with the complete data set.

3 - Lessons Learnt

The lessons learnt from this exercise are as follows:

“Think small” – minimising one’s research objectives is perhaps the only way of navigating the enormity of the data.
Maximise material – as the deep search process is akin to searching for the proverbial needle in a haystack, any data identified as pertinent should be analysed immediately, or saved for subsequent analysis, due to the apparent randomness of the retrieval process.
Use big data for its quantitative value, but not for drawing representative conclusions or in an attempt to test large-scale hypotheses, due to the apparent fallibility of the findings. Therefore, restrict qualitative research to the micro-findings of those Web sites and Web pages found to be of value – albeit somewhat arbitrarily – and optimise this data for its comparative and preservation worth.

4 - Future research and AADDA Recommendations

As regards my own research, I intend to explore the identity / Habitus evidence found in early Websites (1996 / 1997) in greater detail and compare it with contemporary Blogs to establish whether the same affiliations are present and the same sense of group, or otherwise, identity. These findings will also be compared and triangulated with the qualitative data gathered from one-to-one interviews with members of the contemporary French population in London. It is also possible that I will study sample historical Websites / Webpages alongside their contemporary equivalents, from a multimodal perspective, to gain an understanding of how technological constraints might influence the making of meaning to varying degrees over time.

It is unlikely that the post-code filter searches will be used to inform my research, given the weakness of the findings, but the process was worthwhile in its disproving of my theory, and some cautious, small-scale conclusions could be drawn from the associations with the NW5 district.

With respect to the AADDA project looking forward, the following recommendations have been tentatively made:

(colour?) coding to indicate both sites already visited and replica Webpages (identified repeatedly according to the sweep date)

inclusion of a Blog filter (in addition to the <.org.uk>, <.ac.uk>, etc. domain filters)

“links to host” tool to open link in new window

retention of the post-code and media type (image, audio, video, etc.) filters, with tags / provenance if possible

user-friendly search “help” / search “tutorial” function as cursor hovers over certain fields (such as host links, domains, numbers, etc.) on the deep search landing page (giving particular advice regarding correct wording and punctuation, for example)

5 - Conclusion

The lasting impression, having carried out several trial sessions using the DDA data and its current search tools, is that the results can present islands of valuable resources within a sea of irrelevant material, but that the likelihood of finding them is dictated by chance rather than design. Throughout this testing process, I have pondered the reason for the seemingly arbitrary nature of my AADDA findings and for my failure to access a greater amount of material relevant to my research; that is, the question of whether my lack of technological expertise was the cause or whether such outcomes are inherent to searching this vast set of data has been recurrent and remains unanswered. Instructions offering clear guidelines on the best ways to use the archive and acknowledging its limitations would therefore be both helpful and reassuring to researchers.

Wednesday 16 October 2013

Researchers' final reports (1)

Our project researchers on AADDA have kindly written up the research they planned to do with the web archive, a summary of how it went and the problems that they encountered. I'll be posting these as blog posts over the next few months. Here is the first, from Helen Taylor:

AADDA Report: Sentiment Analysis and the Reception of the Liverpool Poets

My project and the AADDA: a lesson in ‘digging down’

When I proposed my research project for the Analytical Access to the Domain Dark Archive project, it was based on a ‘wish list’ of tools that scholars might want to use to access this resource. The tools my proposed project required were sentiment analysis, proximity search, and geo-indexing. This latter was not available during this test period, but the first two were. However, this report is less a record of my findings than a caution against making assumptions about the data produced via these two tools.

I sought to access information about the reception of the Liverpool Poets (in practice, I focused solely on Adrian Henri). With the Domain Dark Archive I could find avenues – fan pages, forums, and the like – which would provide me with information to consider alongside newspapers, interviews, and archival material. I wanted to see what labels were attached to the poets, and how they were viewed, in informal recollections and non-academic contexts. I would then combine and compare this data with searches for the same terms from newspaper and published works. There is a marked difference between academic and popular attitudes to the poets, and the internet archival searches should be able to provide evidence for how the people who actually received the work viewed their experiences.

Methodology: considerations and consequences

It must be noted that the AADDA project involved only a slice of the full dataset, and that my results will almost certainly differ greatly when it goes live. (Just as an example, a search for “Adrian Henri” on the AADDA browser returns 1847 results, compared to over 8,200 current UK hits on Google.) The lack of references is almost certainly due to the smaller dataset, rather than the data not being there at all (1).

Another issue was that very search term, “Adrian Henri”. Searching for just ‘Adrian’ or ‘Henri’ rather than ‘Adrian Henri’ is unhelpful in that it throws up results of which the majority are not relevant: ‘“Henri” NEAR “painter”’ might give you Matisse; ‘“Adrian” NEAR “poet”’ might give you Mitchell. My own research and interview experience has been that people are likely to refer to him as ‘Henri’ or as ‘Adrian’, so the fact that I was only searching for ‘Adrian Henri’ might have excluded some results. However, articles on online magazines and the like do usually follow academic and journalistic traditions of referring to the subject by their full name in the first instance, and then surname, so therefore are caught by the crawl.

I had to decide what labels to search for in relation to Henri, and my initial searches – using what terms I was already aware of – may have excluded other labels and ways of talking about Henri. I also found that my own academic assumptions were not the standard – there were 203 results for the label ‘Liverpool poet’, versus only 3 for ‘Merseybeat poet’, the term I am using in my thesis!

Search for ‘“Adrian Henri” AND …’	Number of items returned
“painter and poet”	5
“poet and painter”	2
“painter/poet”	5
“poet/painter”	10
“performance poet”	0
“performer”	10
“entertainer”	16

Fig 1 – examples of search terms and results

The five results for both “painter and poet” and “painter/poet” were all from the Tate Archives.(2) This – with search terms placing the artistic side of his output first – is not surprising, given that the Tate is an art gallery. It did surprise me that “performance poet” did not prove a useful search term, although this is perhaps an academic designation rather than a layman’s term – as evidenced by the results for “entertainer”. But none of these results can be taken at face value, as this report shall discuss.

Boolean searching: How near is NEAR?

These initial exploratory searches bring me to my first problem with the data. Throughout this report what I refer to as problems are not faults with the dataset or the browser but rather potential issues for the users interacting with it. Parameters for how close together the two search terms must be can differ, but I found that the NEAR search was sometimes not near enough here. I found two issues when reading the actual results: firstly, that the terms were often not that close together; and secondly, that the second term was not actually being used to discuss Henri:

Fig 2 - search result for “Adrian Henri” NEAR “painter”, post on www.ancestryaid.co.uk (3)

Therefore, the results in the table listed above are not a reliable source for enumerating the most common labels attached to Henri – one cannot rely on reading only the initial search results.
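To make the ‘how near is NEAR?’ question concrete, here is a minimal sketch in Python of a token-window proximity check. It is not the AADDA implementation, which is not described in this report: the tokeniser, the gap measure, the window sizes and both sample sentences are assumptions invented purely for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_positions(tokens, phrase):
    """Return the start index of every occurrence of `phrase` in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def min_gap(text, term_a, term_b):
    """Smallest number of tokens lying between term_a and term_b, or None."""
    tokens = tokenize(text)
    a, b = tokenize(term_a), tokenize(term_b)
    gaps = []
    for i in phrase_positions(tokens, a):
        for j in phrase_positions(tokens, b):
            if j >= i:
                gaps.append(max(j - (i + len(a)), 0))
            else:
                gaps.append(max(i - (j + len(b)), 0))
    return min(gaps) if gaps else None


def near(text, term_a, term_b, window):
    """True if the two terms occur within `window` tokens of each other."""
    gap = min_gap(text, term_a, term_b)
    return gap is not None and gap <= window


# Invented sample sentences (not taken from the archive).
close_text = "Adrian Henri, the painter and poet, read at the gallery."
far_text = ("My grandfather was a painter and decorator in Leeds. "
            "Years later I read a poem by Adrian Henri.")

print(min_gap(close_text, "Adrian Henri", "painter"))
print(near(far_text, "Adrian Henri", "painter", window=5))
print(near(far_text, "Adrian Henri", "painter", window=20))
```

With a generous window the second sentence counts as a match even though ‘painter’ describes somebody else entirely, which is the situation described above: a hit only shows that the terms co-occur, not that the label is being applied to Henri.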

Crawl dates: Encountering a display problem

I have already stated that some results could not be ‘clicked through’ and their content displayed past that initial search results page, such as the Tate results for “painter and poet” and “painter/poet”. There is therefore no way of knowing what the pages actually contained. At other times, there were results which could not be viewed for a different reason: they did not even appear on the search results page.

This revealed itself to me when running an exploratory query. After a basic search for “Adrian Henri”, one of the things that I noticed is that there is a ‘jump’ in the number of hits in the year 2000. Whilst this is not the highest number (2007 has 345), I thought that this could be explained by this being the year that he died – obituaries, tributes, more ‘noise’ around his name.

Fig 3 – showing results for “Adrian Henri” by crawl year (4)

Clicking through to filter these results by that year – and hoping to find relevant obituary results – I encountered my first problem. From 242 results on the initial search, the “Search found 202 items”:

Fig 4 – filtering “Adrian Henri” results by crawl year “2000” (5)

Furthermore, when clicking through to the second page of these already shrinking items, the number jumped down again to 186:

Fig 5 – filtering “Adrian Henri” results by crawl year “2000”, page 2 (6)

This was repeated elsewhere – for example, the following year, 2001, went from 53 potential results to 37 search items being displayed. It was not the case that the items were only those which could be ‘clicked through’ – as the Tate example above shows, those which the Wayback Machine could not display were still included in the search items.

One potential explanation for the discrepancy between the total number of results and number of items which the “search found” is that the results returned here might omit duplications, perhaps where a second crawl finds nothing different from the first. I am unsure whether this is a valid response, as I have found many instances of crawls where the Wayback Machine’s results are exactly the same from crawl to crawl. Furthermore, of the 242 results for 2000, 235 were from Amazon.co.uk, and not related to his death. I would, therefore, propose that the ‘jump’ came simply from there being more crawls in that year, as it must be remembered that the dates are dates at which the sites were recorded, not the dates at which the material was published.(7) Whatever the reason, this shows that the results must be interrogated further along the line from the initial search, as however innocent the numbers appear, they cannot be presented without ‘digging down’ to the actual website results themselves.
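The distinction between crawl date and publication date can be read straight off a Wayback Machine address, which carries a 14-digit capture timestamp before the original URL. The following is a small sketch, assuming that standard address layout; the helper name parse_capture is my own, and the example address is the capture cited in this report for the Stockport school newsletter.

```python
import re
from datetime import datetime

# Assumed layout: http://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
WAYBACK_CAPTURE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_capture(url):
    """Split a Wayback Machine address into (crawl datetime, original URL)."""
    match = WAYBACK_CAPTURE.match(url)
    if match is None:
        raise ValueError("not a Wayback Machine capture address: " + url)
    crawled = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return crawled, match.group(2)


capture = ("http://web.archive.org/web/19991008172118/"
           "http://webserv1.stockportmbc.gov.uk:80/pages/links/schools/"
           "primary/ourlarc/oct1998.htm")

crawled, original = parse_capture(capture)
print(crawled.year)  # the year this result is filed under when filtering by crawl year
print(original)      # the file name suggests an October 1998 newsletter
```

Here the page would be counted under crawl year 1999, although the newsletter itself dates from 1998: a count by crawl year measures when the crawler visited, not when people were writing.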

Sentiment Analysis: Don’t take it on face value

Taking a quick look at the totals when doing a basic search for “Adrian Henri” reveals mostly neutral results, as one might expect from an analysis over a large amount of text, but the results are also far more positive than negative, if a sentiment is found – 136 “very positive” versus 11 “very negative”. However, this is another lesson in ‘digging down’ and not taking the results at face value.

Fig 6 – showing sentiment totals for the “Adrian Henri” search (8)

The success of sentiment analysis relies in part on how positivity or negativity is determined across the whole search parameters. This quote from a 1998 school newsletter is clearly – and does indeed appear under the term – very positive:

Many thanks to Stockport Art Gallery staff for the invitation to bring our Junior children to meet Adrian Henri, the famous artist and poet, on Wednesday 21 October. Adrian was terrific, telling us the stories behind many of the pictures currently on exhibition at the Gallery and reading from his poetry collections. We can really recommend a visit to see his work. Many thanks to Adrian for a great day with you in Stockport! (9)

However, other results which were listed as “very positive” must be discounted from this total for the same reason as the proximity searches above: the positive nature of the whole is not related to Henri’s part. See, for example, the discussion of Carol Ann Duffy’s The World’s Wife in an AQA English Literature Examiner’s Report from June 2005:

Once again, The World’s Wife proved highly popular: more centres study this text than any other on the paper. As last year, examiners were impressed by the enthusiasm and engagement with which many candidates approach Duffy’s poetry … Examiners were also concerned that intrusive, and often irrelevant, biographical material (such as lengthy character assassinations of Adrian Henri) prevented candidates from meeting the Assessment Objectives.(10)
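The gap between the sentiment of a whole document and the sentiment of the passage that actually mentions the search term can be illustrated with a toy scorer. This is only a sketch: the method AADDA uses to assign sentiment is not described in this report, and the word lists below are invented and deliberately tiny, so the numbers mean nothing beyond the contrast they show.

```python
import re

# Invented toy lexicon, for illustration only.
POSITIVE = {"popular", "impressed", "enthusiasm", "engagement"}
NEGATIVE = {"intrusive", "irrelevant", "assassinations"}

REPORT = (
    "Once again, The World's Wife proved highly popular: more centres study "
    "this text than any other on the paper. As last year, examiners were "
    "impressed by the enthusiasm and engagement with which many candidates "
    "approach Duffy's poetry … Examiners were also concerned that intrusive, "
    "and often irrelevant, biographical material (such as lengthy character "
    "assassinations of Adrian Henri) prevented candidates from meeting the "
    "Assessment Objectives."
)


def score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def sentences_mentioning(text, name):
    """Return the sentences of `text` that contain `name`."""
    sentences = re.split(r"(?<=[.!?…])\s+", text)
    return [s for s in sentences if name in s]


whole_document = score(REPORT)
around_henri = sum(score(s) for s in sentences_mentioning(REPORT, "Adrian Henri"))

print(whole_document)
print(around_henri)
```

Scored as a whole, the passage comes out mildly positive; scored only on the sentence that names Henri, it comes out negative. A document-level label therefore says little about how the search term itself is treated.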

Whilst this, therefore, means one cannot blithely cite all 136 “very positive” results in Henri’s favour, we also need to revise the total of “very negative” results. Firstly, of the 11 results, the 6 items which can be displayed are all the same Peter Finch interview:

Fig 7 – results page “Adrian Henri” with sentiment “very negative” (11)

And secondly, in this interview Henri actually appears very favourably:

The Liverpool Scene arrived, and with it the merging of music and poetry with Roger McGough, Brian Patten, Adrian Henri, and others. I eventually met Adrian Henri, who was also a painter, and the most interesting, I thought, of the three. We became frends and he pointed me in some new directions.(12)

The Wayback Machine has 12 captures of this page on this site, from October 2006 to July 2013. Each crawl obviously takes a snapshot of whatever is on the page at the time, and the crawl date is clearly indicated in the results, but the 11 apparently different “very negative” results are, in practice, all the exact same interview, the text of which has not changed (bar the removal of the first line under the title), although the formatting of the page itself has slightly changed (see the links beneath the header), as illustrated here:

Fig 8a – first Wayback Machine capture of www.argotistonline.co.uk (13)

Fig 8b – last Wayback Machine capture of www.argotistonline.co.uk (14)

I have suggested that one reason for the discrepancy between the total number of results and the items which can be displayed is that the duplications might not be shown, and the snapshots for this page do show that there have been changes over time, but what this also shows is the need to interrogate the results, at the level of those snapshots, rather than making assumptions based on the initial totals. Whilst this may be deliberately simplifying the issue, the message to take away here is not to take the results on face value: there aren’t 11 “very negative” results – there are none at all!
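One way to ‘dig down’ before quoting a total is to collapse repeated captures of the same page. The sketch below groups Wayback Machine addresses by their original URL, using four capture addresses cited in the footnotes to this report. The normalisation rule (lower-casing the host and dropping the port, so that the ‘:80’ variant is treated as the same page) and the helper names are my own assumptions, not a description of how the AADDA interface handles duplicates.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed layout: http://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
WAYBACK_CAPTURE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def page_key(original_url):
    """Normalise an original URL so captures of one page share one key."""
    parts = urlsplit(original_url)
    key = (parts.hostname or "").lower() + parts.path  # port is dropped
    if parts.query:
        key += "?" + parts.query
    return key


def group_captures(capture_urls):
    """Map each distinct page to the list of crawl years it was captured in."""
    groups = defaultdict(list)
    for url in capture_urls:
        match = WAYBACK_CAPTURE.match(url)
        if match is None:
            continue  # skip anything that is not a capture address
        timestamp, original = match.groups()
        groups[page_key(original)].append(timestamp[:4])
    return dict(groups)


captures = [
    "http://web.archive.org/web/20070208145352/http://www.argotistonline.co.uk:80/Finch%20interview.htm",
    "http://web.archive.org/web/20061019024105/http://www.argotistonline.co.uk/Finch%20interview.htm",
    "http://web.archive.org/web/20130723093311/http://www.argotistonline.co.uk/Finch%20interview.htm",
    "http://web.archive.org/web/20070514010256/http://www.ancestryaid.co.uk:80/boards/archive/index.php/t-928.html",
]

groups = group_captures(captures)
print(len(captures), "captures,", len(groups), "distinct pages")
for page, years in groups.items():
    print(page, sorted(years))
```

Grouping by address only shows that the captures point at the same page; whether the text changed between crawls still has to be checked by reading the snapshots, as was done for the Finch interview.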

Brief Conclusions

This report has attempted to present some of the potential mishaps involved with looking at the Web Archive results on the surface, at face value. What my exploratory searches have shown is that one cannot make assumptions based purely on looking at the initial search results – you have to dig down.

Being involved in the AADDA project was certainly useful for my own research, as I found sources of information which I wouldn’t have found otherwise, such as pages which are no longer live, or places I hadn’t thought to look. It was also fascinating to read non-academic histories of performance poetry and the 1960s underground, where Henri and the Merseybeat poets appear as far more important than in ‘official’ criticism.(15) These histories were also presented as if public knowledge, proving my theory that those ‘ordinary’ people who received the work did have an idea of its importance, and that the audiences for this kind of poetry were significant, particularly in terms of recognising the legacy of the Merseybeat poets where academia has dismissed them. However, what my research experiences have been far more useful for, I believe, is pointing up some of the potential issues – both with the interface (display problems) and the users (making assumptions) – before the Domain Dark Archive goes live.

(1) I am aware of sites which were not included in the slice available for this initial project, as well as those without a UK domain suffix which are beyond the scope of the project, such as www.my-liverpool.co.uk or www.mudcat.org.

(2) The Tate Archive results could not be shown by the Wayback Machine, due to ‘robots.txt’ on the site – see http://web.archive.org/web/20060824234002/http://archive.tate.org.uk:80/DServe/dserve.exe?dsqServer=tg_calm&dsqApp=Archive&dsqDb=Catalog&dsqCmd=Browse.tcl&dsqSearch=*(RefNo='TAp*')&dsqKey=RefNo

(3) http://web.archive.org/web/20070514010256/http://www.ancestryaid.co.uk:80/boards/archive/index.php/t-928.html

(4) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian+henri%22&sort_by=solr_document&sort_order=ASC

(5) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian%20henri%22&sort_by=solr_document&sort_order=ASC&f[0]=crawl_year%3A%222000%22

(6) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian%20henri%22&sort_by=solr_document&sort_order=ASC&page=1&f[0]=crawl_year%3A%222000%22

(7) This is something which we have discussed at AADDA meetings, and I feel that the interface does make this clear; it is just something which should be stressed to users in any guidance material, to avoid misunderstanding.

(8) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian+henri%22&sort_by=solr_document&sort_order=ASC

(9) http://web.archive.org/web/19991008172118/http://webserv1.stockportmbc.gov.uk:80/pages/links/schools/primary/ourlarc/oct1998.htm

(10) http://web.archive.org/web/20060618094849/http://www.aqa.org.uk:80/qual/pdf/AQA-5741-6741-WRE-Jun05.pdf

(11) http://www.webarchive.org.uk/aadda-discovery/browse?text=%22adrian%20henri%22&sort_by=solr_document&sort_order=ASC&f[0]=sentiment%3A%22Very%20Negative%22

(12) http://web.archive.org/web/20070208145352/http://www.argotistonline.co.uk:80/Finch%20interview.htm

(13) http://web.archive.org/web/20061019024105/http://www.argotistonline.co.uk/Finch%20interview.htm

(14) http://web.archive.org/web/20130723093311/http://www.argotistonline.co.uk/Finch%20interview.htm

(15) See, for example, http://web.archive.org/web/19961221024212/http://www.users.dircon.co.uk:80/~dirkje/pjmanif.htm or http://web.archive.org/web/20020701043924/http://www.artcircus.org.uk:80/route/version5/paper/paper_article_detail.asp?idno=3

Thursday 13 June 2013

A page, but not as we know it

James Baker, Digital Curator, British Library

It is commonplace to describe something new in relation to something that is known: think 'motion picture', 'spaceship', 'email' or 'smartphone'. The word 'webpage' is no different. And indeed in a sense many webpages are similar to the pages found in books or newspapers: they hold static media (text, image); core elements of them read from top to bottom; their headers, footers, cut-aways and advertisements orientate, guide and entice the reader; and in URLs they possess a (relatively) unique system of identifiers. It is hard to think of another name these digital objects could have been given.

It is also commonplace for the new thing to - linguistically speaking - replace the old thing: think 'motion picture' and 'the pictures', 'spaceship' and 'ship', 'email' and 'mail', or 'smartphone' and 'phone'. The same goes for 'webpage' and 'page'. Here by virtue of this act of redefinition, the 'page' absorbs features of the webpage not (or less) possible in book or newspaper pages: features such as dynamic content, user interaction, and direct links to other pages (or, more precisely, other pages that are not part of a sequence defined by the author whose work is the main content held by the page).

All of this makes the webpage-cum-page appear both familiar and unsettling, conservative and disruptive, old and new. These elements of lineage are crucial, for they have allowed us (among other things) to think of preserving the webpage as akin to preserving the page. Yes the challenges of novelty and disruption are discussed and debated (on which I'm not qualified to comment), but at the most basic level the webpage stuff that is being collected by Internet Archive or the UK Web Archive is page level stuff. (This is not to say I don't think page level stuff should be archived. Far from it, the fragility of webpages is well known (see Rosenzweig, 2003) and without these efforts valuable data on our society would be lost.)

But what are these pages and how can historians use them? A seminar jointly hosted by the Digital History seminar and the Archives and Society seminar at the Institute of Historical Research sought last night to tackle this very problem, asking quite simply 'Is this a new class of primary source for historians?'. After a presentation on the UK Web Archive and the Analytical Access to the Domain Dark Archive project both the speakers and the audience were largely in agreement that yes, the web archive is a new class of primary source, of historical stuff.

Does this make our nomenclature for what this stuff is problematic? For to call a webpage a page is to potentially place it into a category for which it is ill-suited, and to put the techniques for investigating that category under huge strain. Take a normal news article from the Guardian website as an example. The page contains a story, framing, context and advertisements: all very page like. But those adverts are dynamic as opposed to static, their content quite possibly targeted depending on the IP address accessing the URL and different each time the page is refreshed. The page also contains moderated comments, ranked as default by oldest first but malleable to user preferences. In short, when you visit the website it is unlikely to be the same as when I visit the website, so an archived version can only be one possible version of a webpage at a particular historical moment. Not very page like behaviour. Of course we might (quite rightly for the most part) say that the 'core' of the page, the textual content that historians are likely to be interested in, will remain the same regardless of these peripheral changes. And yet as the growth of mainstream live blogs demonstrates (such as those covering the Taksim Square protests), the web is moving toward dynamic content over static content as default: embedded video, maps and text content streams are now commonplace, and are likely to become more so as the web develops.

The webpage then is a rapidly evolving beast whose capacity to change whilst still being called a 'page' complicates how we do research using webpages and how we preserve the internet. It is a page but not a page as we knew it, a semantic shift worth keeping in mind as we prepare for an era of born-digital historical scholarship.

This post was first published on the British Library's Digital Scholarship blog.