The Times Digital Archive
Collection History
The Times Digital Archive was the first online collection of digitized British newspapers. This early adoption of digitization, building upon the ubiquity of Palmer’s index of the Times, ensured its prominence in historical and journalistic research, including its use by the House of Lords to research past legal debates. As of 2013, it was the most searched digitized newspaper database among Cengage’s news media collections.
Produced by what was then Thomson Gale, the collection debuted in 2002. Its initial remit was to make available the entirety of the Times, including its previous incarnations, from 1785 to 1985. This initial material was digitized in a relatively short period (2002-2003), allowing for consistency of staff, equipment, method and product, in terms of both image and OCR quality. The content was released in several batches: the first covered 1936-1946, and the collection grew monthly to include 1880-1985 by the end of 2002 and the whole 200 years by the close of 2003.
Since its acquisition by Cengage in 2007, Gale has continued to expand the collection, which now offers the complete run of the publication from 1785 to 2010.
Collection Composition
The Times Digital Archive currently contains material from 1785 to 2010. This includes over 1.6 million pages from 70,000 issues, sub-divided or zoned into 11.8 million articles, catalogued by category, including advertising, editorial and commentary, news, business, people and photojournalism. Although the modern Times began publication in 1788, the collection includes digital issues of its precursors, The Daily Universal Register (1785-1787) and the Times, or, Daily Universal Register (1788). The collection continues to expand, with a further year of content added annually.
Data and Metadata
The data for the digitized newspapers comes in two forms: a scanned image of each newspaper page at 300 dpi, zoned and sub-divided at article level, and an XML file containing the text (OCR) and metadata for each article. The machine-readable text appears within the XML file, surrounded by metadata that describes various features of the article, including the title, issue, date, section, and page number. Articles dated from 1785 to 1985 were created during a single project, undertaken by the same staff and using the same equipment and processes. Subsequent additions have been included on a rolling basis, and their data is contained in a separate but similar substructure within the collection.
Each XML file contains information for a single article, or textual unit, within the collection. Within a holding element, it contains the following metadata tree:
- Issue, with unique, project and collection IDs and publication type
- Publication Metadata
- Publication ID
- Project code
- Physical object ID
- Digitized Collection ID
- OCR process information
- Source information
- Source collection name
- Source collection location
- Copyright statement
- Digital collection information
- Release Date
- Unique ID
- Publication Information
- Issue Number
- Date
- Year
- Month
- Day
- Printed Date
- Standardized Date
- Page count
- Article Metadata
- Page
- Page ID, with physical descriptors of image
- Page Number
- Asset Unique ID
- Article Type
- Digital Article Unique ID
- OCR Language
- OCR Process Data
- Object Type
- Page ID
- Column ID
- Page Count (for article)
- Word Count
- Title or Headline
- Category
- Page
This is followed by the machine-readable text, in which each individual word is encoded with spatial coordinates of its location on the corresponding image, as well as marker elements indicating new pages or columns.
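As a rough illustration of how an article file of this kind might be read, the following Python sketch separates the descriptive metadata from the word-level text. The element and attribute names used here (article, metadata, text, word, pageBreak, columnBreak, x, y, width, height) are hypothetical placeholders invented for the example; the archive's actual schema is not reproduced in this description and will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file. Every tag and attribute name below is an
# assumption made for illustration, not the archive's real schema.
SAMPLE_ARTICLE_XML = """\
<article>
  <metadata>
    <title>The Weather</title>
    <category>News</category>
    <standardizedDate>1936-01-02</standardizedDate>
    <pageNumber>4</pageNumber>
  </metadata>
  <text>
    <pageBreak page="4"/>
    <columnBreak column="1"/>
    <word x="120" y="340" width="58" height="22">Rain</word>
    <word x="184" y="340" width="31" height="22">is</word>
    <columnBreak column="2"/>
    <word x="640" y="96" width="102" height="22">expected</word>
  </text>
</article>
"""


def read_article(xml_string):
    """Return (metadata dict, list of word records) for one article file."""
    root = ET.fromstring(xml_string)

    # Flatten the metadata children into a simple tag -> text mapping.
    metadata = {
        child.tag: (child.text or "").strip()
        for child in root.find("metadata")
    }

    words = []
    page = None
    column = None
    for element in root.find("text"):
        if element.tag == "pageBreak":
            # A new page resets the current column.
            page = int(element.get("page"))
            column = None
        elif element.tag == "columnBreak":
            column = int(element.get("column"))
        elif element.tag == "word":
            # Bounding box of the word on the page image, in pixels.
            box = tuple(
                int(element.get(key))
                for key in ("x", "y", "width", "height")
            )
            words.append(
                {"text": element.text, "page": page, "column": column, "box": box}
            )
    return metadata, words


metadata, words = read_article(SAMPLE_ARTICLE_XML)
plain_text = " ".join(word["text"] for word in words)
```

Keeping the page and column markers alongside each word's coordinates allows a search hit in the OCR text to be traced back to its position on the scanned page image.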
Access and Use
The collection is available in many state and institutional libraries throughout the world through a commercial licensing arrangement with Gale. Subscribing institutions can access the underlying text and metadata on request, subject to a cost-recovery fee.