COVID-19 Open-Source Helpdesk

API for medRxiv

I looked and couldn’t find one. Is there an API that will let you download COVID-19 articles?

I do not think there is a public API (yet), but they do have an RSS feed for the COVID papers (https://connect.medrxiv.org/relate/content/181). It links to the paper landing pages, which in turn link to the PDFs.

There is ongoing work on a public API. We'll bring this question to the attention of the people working on it.

Thanks. I know one institution that is downloading all of the PDFs by hand.

In the absence of an API, this site looks easy to crawl using Beautiful Soup and Selenium. If you'd like a script to do so for personal use, I'd be happy to put one together.

I'm not a lawyer, so I know there could be legal issues or a violation of the terms of service, but given the situation I'm hoping for leniency.

It looks like the PDFs exist at predictable URLs based on the DOI, so you could download them without needing to scrape. For example, the feed links to https://biorxiv.org/cgi/content/short/2020.01.20.913368v1?rss=1, and the corresponding PDF is at https://www.biorxiv.org/content/10.1101/2020.01.20.913368v1.full.pdf
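A minimal sketch of that mapping, in case it helps. It is based only on the single example pair above: the function name is made up for illustration, and the `10.1101` DOI prefix and the `www.biorxiv.org` host are assumed to hold for other papers, which I have not verified (medRxiv links in particular are not handled).

```python
import re

# Matches feed links shaped like the example in this thread:
#   https://biorxiv.org/cgi/content/short/<suffix>v<version>?rss=1
# Assumption: every feed link follows this shape.
FEED_LINK = re.compile(
    r"^https?://(?:www\.)?biorxiv\.org/cgi/content/short/(?P<suffix>[^?#/]+v\d+)"
)


def pdf_url_from_feed_link(link: str) -> str:
    """Build the full-text PDF URL from an RSS feed link (hypothetical helper)."""
    match = FEED_LINK.match(link)
    if match is None:
        raise ValueError(f"Unrecognised feed link: {link}")
    # Assumption: the DOI prefix is always 10.1101, as in the example.
    return f"https://www.biorxiv.org/content/10.1101/{match.group('suffix')}.full.pdf"


if __name__ == "__main__":
    print(pdf_url_from_feed_link(
        "https://biorxiv.org/cgi/content/short/2020.01.20.913368v1?rss=1"
    ))
```

Raising on unrecognised links, rather than guessing, makes it obvious when a feed entry does not fit the pattern.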

I don’t believe there are legal issues with just downloading the papers, but redistributing them to others may technically contravene copyright where the license doesn’t allow that; authors choose the license for their own work. I doubt many scientists would object in practice to you sharing important work that’s already freely available.

Mass downloading may go against the terms of the sites which provide the papers (medRxiv, bioRxiv). bioRxiv mentions on its about page that text mining is allowed, which implies that downloading a lot of papers is expected, but they may prefer you to sign up or observe some rate limit.

(I am not a lawyer either, so don’t take any of this as gospel)

I wrote something using Beautiful Soup against their JSON download. Two points for others to consider:

  • The JSON payload is invalid as served. Remove the first and the last character before parsing:
      data = json.loads(line[1:-1])  # Drop invalid json
  • The JSON record does not have a version number, so the article might be out of date. For example:
      {'rel_doi': '10.1101/782409', 'rel_title': 'Coronavirus Interferon Antagonists Differentially Modulate the Host Response during Replication in Macrophages', 'rel_date': '2019-09-25', 'rel_grpid': '181', 'rel_gname': 'COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv', 'rel_type': 'span', 'rel_site': 'biorxiv'}
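Here is roughly what the first point looks like as a standalone snippet. The sample payload is made up: I am assuming the extra characters are a wrapping pair of parentheses, so adjust to whatever the download actually returns. The function name is also just for illustration.

```python
import json


def parse_wrapped_json(line: str):
    """Drop one wrapper character from each end, then parse the rest as JSON."""
    # strip() first so a trailing newline is not mistaken for the wrapper.
    return json.loads(line.strip()[1:-1])


# Hypothetical payload: the wrapping "(" and ")" are an assumption.
sample = (
    '({"rel_doi": "10.1101/782409", "rel_date": "2019-09-25", '
    '"rel_site": "biorxiv"})\n'
)

record = parse_wrapped_json(sample)
print(record["rel_doi"], record["rel_site"])
# The record carries no version field, so the DOI alone cannot tell you
# which revision of the preprint you are looking at.
print("version" in record)
```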

The XML has the same issues as the JSON with the additional challenge of pagination.

After digging through the XML/JSON, I came to the same realization. The challenge is finding the correct version number.

If you use the RSS feed, it seems to have links which include the version number:

<item rdf:about="https://biorxiv.org/cgi/content/short/2020.03.21.001933v1?rss=1">
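A rough sketch of pulling the versioned identifier out of each item. It assumes the feed is RDF-style RSS where every `<item>` carries an `rdf:about` attribute like the one above; the sample feed and the function name are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical, trimmed-down feed built around the item shown in this thread.
SAMPLE_FEED = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="https://biorxiv.org/cgi/content/short/2020.03.21.001933v1?rss=1">
    <title>Example preprint</title>
  </item>
</rdf:RDF>
"""

# Last path segment, split into "<id>" and the digits after "v".
VERSIONED = re.compile(r"/([^/?]+)v(\d+)(?:\?|$)")


def versioned_ids(feed_xml: str):
    """Return (identifier, version) pairs for every item in the feed."""
    results = []
    for elem in ET.fromstring(feed_xml).iter():
        # Compare on the local tag name so the item namespace does not matter.
        if elem.tag.rsplit("}", 1)[-1] != "item":
            continue
        match = VERSIONED.search(elem.get(f"{{{RDF_NS}}}about", ""))
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results


if __name__ == "__main__":
    print(versioned_ids(SAMPLE_FEED))
```

Items whose link does not end in a `v<number>` segment are skipped rather than guessed at.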

We’re using the bioRxiv JSON API to get the latest preprints from bioRxiv and medRxiv, and we assemble the results in an Observable notebook: https://observablehq.com/@ismms-himc/covid-19-sars-cov-2-preprints-from-medrxiv-and-biorxiv

We don’t yet have links to full text.

I just spotted http://biomed-sanity.com/, which aims to present the COVID papers from bioRxiv in a conveniently searchable form. It’s open source under the MIT license, so the code it uses to access bioRxiv could be reused.
