Biometrika #

This is a canonical data source description, pencelab:source/biometrika.
The current dataSourceVersion described by this documentation is 1. The dataSource name for this data is Biometrika.

Coverage: The journal Biometrika, from its founding in 1901 until May 30, 2021
Size: 8,240 articles
Copyright: JSTOR and/or Oxford University Press (see each article)
License: an agreement with JSTOR DFR for articles on or prior to December 1, 2013, and the Oxford Journals Site License for articles after December 1, 2013
Credits: Christophe Malaterre, Francis Lareau, Nicola Bertoldi, C.H. Pence

How we got it #

Articles on or prior to December 1, 2013 were provided to us by JSTOR via their DFR service. Articles after December 1, 2013 were downloaded directly in HTML format from the Oxford Journals website, following the text and data mining provisions in their institutional site license.
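
As an illustration of the date split described above, the following minimal sketch maps a publication date to the provider of the article text. The function name `article_origin` and the returned labels are hypothetical and are not part of the dataset or its tooling. The sketch assumes the cutoff date itself falls on the JSTOR side, as the wording "on or prior to" suggests.

```python
from datetime import date

# Cutoff stated in this documentation: articles on or prior to this date
# came from JSTOR DFR; later articles came from the Oxford Journals website.
JSTOR_CUTOFF = date(2013, 12, 1)


def article_origin(published: date) -> str:
    """Return a (hypothetical) label for the provider of an article's text."""
    if published <= JSTOR_CUTOFF:
        return "JSTOR DFR"
    return "Oxford Journals"


if __name__ == "__main__":
    print(article_origin(date(2013, 12, 1)))  # the cutoff date itself
    print(article_origin(date(2013, 12, 2)))  # the first day after the cutoff
```

The same boundary applies to the license terms, since each provider's agreement covers the articles it supplied.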

This corpus was prepared jointly by the Pence Lab and the research group of Christophe Malaterre at UQaM, with particular thanks to Francis Lareau and Nicola Bertoldi.

Processing #

  • OCR to plain text (JSTOR articles): Performed by JSTOR using an unknown method (see discussion of JSTOR OCR on our JSTOR data source page).
  • Plain text (OUP articles): Extracted directly from the HTML files from the journal.
  • Metadata (JSTOR articles): Provided by JSTOR.
  • Metadata (OUP articles): Extracted directly from the HTML files from the journal.
  • Canonical JSON: The canonical JSON was initially generated from a Python data frame provided by the Malaterre group.
  • Keywords and Tags: There are no keywords or tags in this data source.
  • Cleaning: This data was cleaned by the Malaterre group prior to being provided to us:
    • Front matter, back matter, reviews, errata, acknowledgments, and similar content are included as metadata records, but their full text is not present
    • Full text of articles was also excluded if it contained fewer than 400 characters of text content per page (to filter out tables of statistical constants, very common in earlier volumes of Biometrika)
    • Full text of articles in foreign languages was passed through Google Translate
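
The 400-characters-per-page rule above can be sketched as follows. This is not the Malaterre group's actual code, which is not documented here. It assumes that "characters per page" means the total length of the article text divided by its page count, and the names `chars_per_page` and `filter_full_text` are hypothetical.

```python
from typing import Optional

# Threshold stated in the cleaning notes above.
MIN_CHARS_PER_PAGE = 400


def chars_per_page(full_text: str, page_count: int) -> float:
    """Average characters per page, ignoring leading and trailing whitespace.

    Assumption: density is measured over the whole article rather than
    page by page.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return len(full_text.strip()) / page_count


def filter_full_text(full_text: str, page_count: int) -> Optional[str]:
    """Return the text if it is dense enough, otherwise None.

    A None result stands for a record kept as metadata only.
    """
    if chars_per_page(full_text, page_count) < MIN_CHARS_PER_PAGE:
        return None
    return full_text
```

Under these assumptions, a two-page article with 799 characters of text averages 399.5 characters per page and would be kept as metadata only, while the same article with 800 characters would retain its full text.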

Changelog #

  • Data Source Version 1 (2021-12-13): Imported this dataset for the first time from the Malaterre group.