The Times Digital Archive

Collection History

The Times Digital Archive was the first digitized collection of British newspapers to be made available online. This early adoption of digitization, building upon the ubiquity of Palmer’s index of the Times, ensured its prominence in historical and journalistic research, including its use by the House of Lords in researching past legal debates. As of 2013, it was the most searched digitized newspaper database among Cengage’s news media collections.

Produced by the publisher then known as Thomson Gale, the collection debuted in 2002. Its initial remit was to make available the entirety of the Times, including its previous incarnations, from 1785 to 1985. This initial material was digitized in a relatively short period (2002-2003), allowing for consistency of staff, equipment, method and product, in terms of both image and OCR quality. The content was released in several batches, beginning with 1936-1946 and growing monthly to cover 1880-1985 by the end of 2002 and the whole 200 years by the close of 2003.

Since being acquired by Cengage in 2007, Gale has continued to expand the collection, which now offers the complete run of the publication from 1785 to 2010.

Collection Composition

The Times Digital Archive currently contains material from 1785 to 2010. This includes over 1.6 million pages from 70,000 issues, sub-divided or zoned into 11.8 million articles, catalogued by category, including advertising, editorial and commentary, news, business, people and photojournalism. Although the modern Times began publication in 1788, the collection includes digital issues of its precursors, The Daily Universal Register (1785-1787) and the Times, or, Daily Universal Register (1788). The collection continues to expand, with a further year of content added annually.

Data and Metadata

The data for the digitized newspapers comes in two forms: a scanned image of each newspaper page at 300 dpi, zoned and sub-divided at article level, and an XML file containing the text (OCR) and metadata for each article. The machine-readable text appears within the XML file, surrounded by metadata that describes various features of the article, including the title, issue, date, section, and page number. The data for articles from 1785 to 1985 was created during a single project, undertaken by the same staff and using the same equipment and processes. Subsequent additions have been included on a rolling basis, and their data is contained in a separate but similar substructure within the collection.

Each XML file contains information for a single article, or textual unit, within the collection. Within a holding element, it contains the following metadata tree:

Issue, with unique, project and collection IDs and publication type
Publication Metadata
- Publication ID
- Project code
- Physical object ID
- Digitized Collection ID
- OCR process information
- Source information
  - Source collection name
  - Source collection location
  - Copyright statement
- Digital collection information
  - Release Date
  - Unique ID
- Publication Information
  - Issue Number
  - Date
    - Year
    - Month
    - Day
    - Printed Date
    - Standardized Date
  - Page count
Article Metadata
- Page
  - Page ID, with physical descriptors of image
  - Page Number
  - Asset Unique ID
  - Article Type
  - Digital Article Unique ID
  - OCR Language
  - OCR Process Data
  - Object Type
  - Page ID
  - Column ID
  - Page Count (for article)
  - Word Count
  - Title or Headline
  - Category

This is followed by the machine-readable text, in which each individual word is encoded with spatial coordinates of its location on the corresponding image, as well as marker elements indicating new pages or columns.
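The structure described above can be read with a standard XML parser. The following is a minimal sketch in Python, not the archive's actual schema: the element and attribute names (article, issue, meta, wd, pos, pageBreak, columnBreak) and the sample content are hypothetical stand-ins, since the real tag names are not given here and should be taken from the files supplied by Gale. It extracts a few metadata fields and then collects each word with its bounding box, using the column markers to track which column a word falls in.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample article. All tag and attribute names below are
# invented for illustration and will differ from the real files.
SAMPLE = """<article>
  <issue>
    <date><year>1850</year><month>03</month><day>14</day></date>
    <pageCount>8</pageCount>
  </issue>
  <meta>
    <pageNumber>5</pageNumber>
    <title>The Weather</title>
    <category>News</category>
    <wordCount>3</wordCount>
  </meta>
  <text>
    <pageBreak/>
    <wd pos="120,340,210,372">Heavy</wd>
    <wd pos="218,340,270,372">rain</wd>
    <columnBreak/>
    <wd pos="500,100,560,132">fell</wd>
  </text>
</article>"""


def parse_article(xml_string):
    """Return (metadata dict, list of word dicts) for one article file."""
    root = ET.fromstring(xml_string)

    # Build an ISO-style date from the separate year, month and day fields.
    date = "-".join(
        root.findtext(f"issue/date/{part}") for part in ("year", "month", "day")
    )
    meta = {
        "title": root.findtext("meta/title"),
        "category": root.findtext("meta/category"),
        "page": int(root.findtext("meta/pageNumber")),
        "date": date,
    }

    words = []
    column = 0  # incremented each time a column marker is encountered
    for element in root.find("text"):
        if element.tag == "columnBreak":
            column += 1
        elif element.tag == "wd":
            # Assumed coordinate order: left, top, right, bottom (pixels).
            box = tuple(int(v) for v in element.get("pos").split(","))
            words.append({"text": element.text, "box": box, "column": column})
    return meta, words


meta, words = parse_article(SAMPLE)
print(meta["date"], meta["title"])
print(" ".join(w["text"] for w in words))
```

Keeping the coordinates alongside each word makes it possible to map a search hit back to its position on the page image, while joining only the word text yields a plain-text version of the article for analysis.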

Access and Use

The collection is available in many state and institutional libraries throughout the world through a commercial licensing arrangement with Gale. Subscribing institutions can access the underlying text and metadata on request, subject to a cost recovery fee.

References

“The Times Digital Archive”.

Liddle, Dallas. “Reflections on 20,000 Victorian Newspapers: ‘Distant Reading’ The Times using The Times Digital Archive”, Journal of Victorian Culture 17:2 (2012), 230-237.

MacQueen, Donald S. “Developing Methods for Very-Large-Scale Searches in Proquest Historical Newspapers Collection and Infotrac the Times Digital Archive: The Case of Two Million Versus Two Millions”, Journal of English Linguistics 32:2 (2004), 124-143.

Hobbs, Andrew. “The Deleterious Dominance of The Times in Nineteenth-Century Scholarship”, Journal of Victorian Culture 18:4 (2013), 472-497.

Fritze, Ronald H., Brian E. Coutts, Louis Andrew Vyhnanek. Reference Sources in History: An Introductory Guide. New York: Modern Language Association of America, 2003.