Crossref Scraping

Crossref Scraping #

Many of our datasets come with DOI information that we must later turn into machine-readable bibliographic data. This can be accomplished using Crossref, the central repository for DOI information (which is found at https://doi.org/). The support on their servers for this access is documented here and here.

In short, if you query the Crossref server URL for a particular DOI (e.g., https://dx.doi.org/10.1038/165387a0), but set the “Accept” HTTP header to a non-HTML content type, then instead of being served a redirect to the canonical URL for that article, you will get a file representing Crossref’s bibliographic information about that article.

For our purposes, we use the content type application/vnd.citationstyles.csl+json, which returns data in the form of Citation Style Language JSON. All of the CSL file formats are clearly documented in their Github repository, and the CSL JSON reference format in particular is documented here. This format can be easily parsed into our canonical JSON.