The Atlas of Digitised Newspapers and Metadata

In 2020, the Atlas of Digitised Newspapers and Metadata was released. Its database histories took the series of blog posts first published here on the Oceanic Exchanges website as their starting point, and the Atlas provides a deep contextualisation of ten collections of digitised newspapers:

Chronicling America (The Library of Congress)
The Hemeroteca Nacional Digital de México
The British Library
The Times Digital Archive
Delpher (Koninklijke Bibliotheek)
Europeana
The Suomen Kansalliskirjaston Digitoidut Sanomalehdet
Trove (The National Library of Australia)
Papers Past (The National Library of New Zealand)
ZEFYS (Berlin State Library)

Within the ‘Ontologies’ work package, we conducted interviews with staff at thirteen libraries and digitisers around the world, including representatives from the British Library, the National Library of Scotland, the National Library of Australia and the National Library of the Netherlands. We also interviewed the publishing companies Readex, ProQuest, and Gale, a Cengage Company. In these interviews, we asked about the history, selection processes, funding and development of the collections.

We combined these oral histories with an analysis of the metadata, which we entered into a spreadsheet that ran to 3300 lines of metadata fields. The full dataset is available on Figshare, and includes metadata in three different formats, many different file structures (such as files for alternate newspaper titles, page files broken down into text and image data, files for issue data, and so on), and metadata from more than sixteen countries, thanks to the aggregation of Europeana and the remit of also representing colonies and near neighbours in some collections.

Although most of the databases used variants of the METS/ALTO standard, these were not implemented in a way that would allow for simple equivalencies. The variance in terminology, and in the interpretation of the correct range of inputs for a given field, arose from the use of a hodgepodge of different vocabularies, including variants of Dublin Core, METS/ALTO, MPEG-21 and PREMIS, as well as other bespoke or proprietary taxonomies. These overlapping and ambiguous vocabularies were also structured inconsistently, with some collections combining data at the article, page or issue level and others separating the metadata and content for these elements into multiple files. Our initial attempts to account for both internal structures and field equivalencies across these databases made the level of irregularity strikingly clear.
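To give a sense of why simple equivalencies break down, here is a minimal Python sketch of a field crosswalk. It is an illustration only, not the method used to build the Atlas: the collection names, field names and category labels are all hypothetical, and none of them are taken from the Figshare dataset. It shows how the same field name can end up pointing at different kinds of data in different collections.

```python
from collections import defaultdict

# Hypothetical crosswalk: (collection, field name) -> shared category.
# Every name here is invented for illustration, not drawn from the Atlas.
CROSSWALK = {
    ("collection_a", "dc:title"): "newspaper_title",
    ("collection_a", "dc:date"): "issue_date",
    ("collection_b", "title"): "article_title",
    ("collection_b", "issued"): "issue_date",
    ("collection_c", "title"): "newspaper_title",
}


def group_by_category(crosswalk):
    """Invert the crosswalk: category -> sorted list of (collection, field)."""
    grouped = defaultdict(list)
    for source, category in crosswalk.items():
        grouped[category].append(source)
    return {category: sorted(sources) for category, sources in grouped.items()}


def ambiguous_field_names(crosswalk):
    """Return field names that map to more than one category across collections."""
    categories_by_name = defaultdict(set)
    for (_collection, field), category in crosswalk.items():
        categories_by_name[field].add(category)
    return {
        name: sorted(categories)
        for name, categories in categories_by_name.items()
        if len(categories) > 1
    }


if __name__ == "__main__":
    print(group_by_category(CROSSWALK))
    print(ambiguous_field_names(CROSSWALK))
```

In this toy data, "title" means the newspaper title in one collection and the article title in another, so a mapping keyed on field name alone would silently merge two different things. Keying on the collection and the field together is one possible way to keep them apart.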

To explain these metadata fields, we put questions to our archival partners and drew on both public documentation and documentation made available to us on request. Sometimes we had to work out what specific fields meant from their content, or rely on forums and blogs rather than official sources, as the initial decision-making was not always recorded.

Finally, building upon previous research by team members and our interviews, we were able to develop a longitudinal understanding of how the data has been augmented or repackaged by institutions over the past twenty years. We combined this deep understanding of collections with a literature review of the historical development of newspaper layout around the world. This gave us our own hierarchy of the newspaper and the metadata, bringing together the logic of the file structures and the logic of the newspaper; a kind of ontology, although not the one we had planned when we started. This involved defining meta-metadata categories and, to account for the differences in these metadata fields, not only providing technical definitions but also considering how these terms are used in the academic literature. Our analysis of these terms ‘in the wild’ of scholarly discussion of periodicals has shown that the ambiguity of terms within the metadata is often reflected in historical use, too.

Our initial research purpose allowed one month for mapping metadata fields, based on the assumption that it would be possible – at least broadly – to identify fields for the same kind of data across the collections. In the end, we spent a year crafting the Atlas. The report and website contain three major sections: the histories of the collections, multidisciplinary discussions of the metadata, and glossaries of terms used by journalists, periodical scholars, library scientists and digitisation specialists to refer to the components of these collections. Rather than reduce to the lowest common denominator of official documentation, we scoured the archives and, in many cases, truly annoyed and pestered the holders of these collections in order to provide the complete view, from deposit and cataloguing standards for the hardcopy collections to microfilming priorities, the digitisation process, and the development of online interfaces. Although not all questions could be fully answered with this initial offering, for the first time ever, a true like-for-like catalogue of international digitised newspapers is available to read, explore, and expand online.

Download the Atlas: Beals, M. H. and Emily Bell, with contributions by Ryan Cordell, Paul Fyfe, Isabel Galina Russell, Tessa Hauswedell, Clemens Neudecker, Julianne Nyhan, Mila Oiva, Sebastian Padó, Miriam Peña Pimentel, Lara Rose, Hannu Salmi, Melissa Terras, and Lorella Viola. The Atlas of Digitised Newspapers and Metadata: Reports from Oceanic Exchanges. Loughborough: 2020. DOI:10.6084/m9.figshare.11560059.