This repository was archived by the owner on May 10, 2022. It is now read-only.

crminer v0.4.0

Latest

sckott released this 26 Jun 18:28

b863a45

NEW FEATURES

crm_pdf() and crm_text() lose the cache parameter, which toggled caching on and off. these functions now always cache requests (#37)
crm_extract() gains parameter try_ocr (logical, default: FALSE) to optionally try Optical Character Recognition (OCR) to extract text from pdf pages when the pdf consists of scanned images. extraction can take a while, but the result is cached, so subsequent requests for the same article will be very fast (#37)
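
As a rough illustration of the try_ocr parameter described above, the sketch below wraps crm_extract() in a small helper. The helper name extract_with_ocr and the file path in the usage comment are hypothetical. It assumes crminer 0.4.0 or later is installed, that crm_extract() accepts a local file through its path argument, and that whatever system dependencies OCR needs are available.

```r
# Hypothetical helper, not part of crminer: extract text from a local pdf,
# asking crm_extract() to try OCR if the pdf turns out to be scanned images.
# Assumes crminer >= 0.4.0 is installed.
extract_with_ocr <- function(pdf_path) {
  stopifnot(is.character(pdf_path), length(pdf_path) == 1)
  crminer::crm_extract(path = pdf_path, try_ocr = TRUE)
}

# Usage (the path is a placeholder):
# res <- extract_with_ocr("article.pdf")
```

Because the release notes say the OCR result is cached, a second call for the same article should return much faster than the first.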

MINOR IMPROVEMENTS

crm_plain(), crm_xml(), crm_html(), and crm_text() now cache articles, as crm_pdf() has done for a while. Along with this change, caching is now split into separate folders for pdf, txt (for plain), xml, and html (#17)
internally force Pensoft publisher urls to https from http (#48)
added docs section User-agent to crm_html(), crm_pdf(), crm_plain(), crm_xml(), and crm_text() detailing how users can set a user agent string with the useragent curl option (#41) (#42)
fix a link in the README (#47) thanks @salim-b
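
The User-agent docs section mentioned above concerns the useragent curl option. A minimal sketch of how that might look in practice follows. The helper name fetch_with_user_agent, its default user agent string, and the doi variable in the usage comment are all hypothetical, and the sketch assumes that extra arguments to crm_text() are forwarded to the underlying HTTP client as curl options.

```r
# Hypothetical helper, not part of crminer: fetch article text while
# sending a custom user agent string.
# Assumption: `useragent` is passed on as a curl option via `...`.
fetch_with_user_agent <- function(links, type = "pdf",
                                  ua = "my-project/0.1 (mailto:me@example.org)") {
  stopifnot(is.character(ua), length(ua) == 1, nzchar(ua))
  crminer::crm_text(links, type = type, useragent = ua)
}

# Usage (doi is a placeholder for a DOI you supply):
# links <- crminer::crm_links(doi)
# txt <- fetch_with_user_agent(links, type = "pdf")
```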

BUG FIXES

for Wiley articles, replace the pdf part of the url with pdfdirect for better access (#40)
extracted out internal function try_extract_pdf_errors(), initially for Wiley-specific errors, to attempt to capture the various errors that occur when trying to download and extract text from pdfs (#40)
eLife-specific url fix in crm_links(); the older url was leading to article landing pages (#6)
fix for cases in which Elsevier returns just the first page of a pdf instead of the whole article. we now show the user a warning when this occurs and delete the one-page pdf file (#43)
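
Two of the url changes in this release (forcing https for Pensoft urls, and swapping pdf for pdfdirect in Wiley urls) amount to simple string rewrites. The base R sketch below shows the general idea only. The function names and example urls are hypothetical, and the patterns are an assumption about the url shapes, not crminer's internal code.

```r
# Hypothetical helpers, not crminer's internals.

# Rewrite a leading http:// scheme to https://
force_https <- function(url) {
  sub("^http://", "https://", url)
}

# Replace the first "/pdf/" path segment with "/pdfdirect/"
# (assumed url shape; the real package may match differently)
use_pdfdirect <- function(url) {
  sub("/pdf/", "/pdfdirect/", url, fixed = TRUE)
}

force_https("http://example.org/article/1/download/pdf/")
# "https://example.org/article/1/download/pdf/"

use_pdfdirect("https://example.org/doi/pdf/10.0000/abc")
# "https://example.org/doi/pdfdirect/10.0000/abc"
```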
