This repository was archived by the owner on May 10, 2022. It is now read-only.
NEW FEATURES
- `crm_pdf()` and `crm_text()` lose the `cache` parameter, which toggled whether or not to use caching. Those functions now always cache requests (#37)
- `crm_extract()` gains parameter `try_ocr` (logical, default: `FALSE`) to optionally try Optical Character Recognition (OCR) to extract text from pdf pages if the pdf consists of scanned images. Extraction can take a while, but the result is cached, so subsequent requests for the same article will be very fast (#37)
MINOR IMPROVEMENTS
- `crm_plain()`, `crm_xml()`, `crm_html()`, and `crm_text()` now cache articles as `crm_pdf()` has for a while. Along with this change, caching is now split into separate folders for pdf, txt (for plain), xml, and html (#17)
- internally force Pensoft publisher urls to https from http (#48)
- added docs section `User-agent` to `crm_html()`, `crm_pdf()`, `crm_plain()`, `crm_xml()`, and `crm_text()` detailing how users can set a user agent string with the `useragent` curl option (#41) (#42)
- fix a link in the README (#47) thanks @salim-b
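
As a rough illustration of two of the changes above (per-format cache folders and forcing https), here is a minimal base R sketch. The helpers `cache_dir_for()` and `force_https()`, the cache root, and the example url are hypothetical and used only for illustration; they are not the package's internal functions.

```r
# Hypothetical helper: map a content type to its cache subfolder.
# Folder names follow the changelog entry: pdf, txt (for plain), xml, html.
cache_dir_for <- function(type, root = file.path(tempdir(), "crminer-cache")) {
  folders <- c(pdf = "pdf", plain = "txt", xml = "xml", html = "html")
  if (!type %in% names(folders)) {
    stop("unsupported type: ", type)
  }
  file.path(root, folders[[type]])
}

# Hypothetical helper: upgrade an http url to https, leaving other urls as-is.
force_https <- function(url) {
  sub("^http://", "https://", url)
}

basename(cache_dir_for("plain"))
force_https("http://example.org/article/1234")
```

Keeping one folder per format means a cached pdf and a cached xml copy of the same article cannot overwrite each other.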
BUG FIXES
- for Wiley articles, replace part of url `pdf` with `pdfdirect` for better access (#40)
- initially for Wiley specific errors, extracted out internal function `try_extract_pdf_errors()` to attempt to extract various errors that occur when trying to download and extract text from pdfs (#40)
- eLife specific url fix in `crm_links()`, older url was leading to article landing pages (#6)
- fix for cases in which Elsevier returns just the first page of a pdf instead of the whole article. We show the user a warning when this occurs and delete the 1 page pdf file (#43)
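
The Wiley url change above can be pictured with a small base R sketch. The helper name `wiley_pdfdirect()`, the assumption that `pdf` appears as a `/pdf/` path segment, and the placeholder DOI are all illustrative assumptions, not the package's actual implementation.

```r
# Hypothetical helper: rewrite a Wiley pdf url to its pdfdirect form.
# Assumes "pdf" appears as the path segment "/pdf/"; only the first
# occurrence is replaced, and urls without that segment are returned unchanged.
wiley_pdfdirect <- function(url) {
  sub("/pdf/", "/pdfdirect/", url, fixed = TRUE)
}

# Placeholder url, not a real article
wiley_pdfdirect("https://onlinelibrary.wiley.com/doi/pdf/10.1002/example")
```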