Releases: alephdata/ingest-file
3.18.4-rc3
What's Changed
Dependency upgrades
- Bump pillow from 9.4.0 to 9.5.0 by @dependabot in #448
- Bump google-cloud-vision from 3.4.0 to 3.4.1 by @dependabot in #447
- Bump openpyxl from 3.1.1 to 3.1.2 by @dependabot in #446
- Bump pytest from 7.2.1 to 7.2.2 by @dependabot in #444
- Bump tesserocr from 2.5.2 to 2.6.0 by @dependabot in #445
Full Changelog: 3.18.4-rc1...3.18.4-rc3
3.18.4-rc1
What's Changed
- Use PyMuPDF instead of pikepdf + pdfminer.six for PDF ingestion (text and image extraction). #441
Dependency upgrades
- Bump google-cloud-vision from 3.3.0 to 3.4.0 by @dependabot in #439
- Bump pantomime from 0.5.3 to 0.6.0 by @dependabot in #436
- Bump cryptography from 38.0.4 to 39.0.1 by @dependabot in #431
- Bump pytest from 7.2.0 to 7.2.1 by @dependabot in #424
- Bump openpyxl from 3.0.10 to 3.1.1 by @dependabot in #435
- Bump spacy from 3.4.4 to 3.5.1 by @dependabot in #440
- Bump fingerprints from 1.0.3 to 1.1.0 by @dependabot in #438
Full Changelog: 3.18.2...3.18.4-rc1
3.18.3-rc2
What's Changed
- Fix PDF ingest bug by @catileptic in #430
Dependency upgrades
- Bump pikepdf from 6.2.8.post1 to 7.1.1 by @dependabot in #434
- Bump google-cloud-vision from 3.3.0 to 3.4.0 by @dependabot in #439
- Bump pantomime from 0.5.3 to 0.6.0 by @dependabot in #436
- Bump cryptography from 38.0.4 to 39.0.1 by @dependabot in #431
- Bump pytest from 7.2.0 to 7.2.1 by @dependabot in #424
- Bump openpyxl from 3.0.10 to 3.1.1 by @dependabot in #435
- Bump spacy from 3.4.4 to 3.5.1 by @dependabot in #440
- Bump fingerprints from 1.0.3 to 1.1.0 by @dependabot in #438
Full Changelog: 3.18.2...3.18.3-rc2
3.18.2
IMPORTANT NOTE: this release was pulled. At this time 3.17.1 is the latest release.
What's Changed
- Update public error message for password protected PDFs by @catileptic in #422
Dependency upgrades
- Bump requests[security] from 2.28.1 to 2.28.2 by @dependabot in #413
- Bump google-cloud-vision from 3.2.0 to 3.3.0 by @dependabot in #412
- Bump pikepdf from 6.2.7 to 6.2.8.post1 by @dependabot in #411
New Contributors
- @catileptic made their first contribution in #422
Full Changelog: 3.18.0...3.18.2
3.18.1
IMPORTANT NOTE: this release was pulled. At this time 3.17.1 is the latest release.
What's Changed
- Handle TIFFs in PDFs by converting to PNG by @stchris in #419
- PDF ingest: ignore unsupported image file formats
- PDF ingest: normalize text using unicode.normalize
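As a rough illustration of what Unicode normalization of extracted PDF text can look like, here is a minimal sketch using Python's standard-library `unicodedata.normalize`, which is presumably what the note refers to. The helper name `normalize_text` and the choice of the NFKC form are assumptions made for this example; the release note does not say which normalization form ingest-file applies.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Return `text` in the given Unicode normalization form.

    Hypothetical helper for illustration only; NFKC is an assumed default,
    not necessarily the form used by ingest-file.
    """
    return unicodedata.normalize(form, text)


# The "fi" ligature (U+FB01) often appears in PDF text layers;
# NFKC replaces it with the two plain letters "f" and "i".
ligature = normalize_text("\ufb01le")

# A decomposed "e" followed by a combining acute accent (U+0301)
# is composed into the single code point U+00E9.
composed = normalize_text("cafe\u0301")
```

Normalizing this way makes visually identical strings compare equal, which helps when extracted text is later searched or indexed.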
Full Changelog: 3.18.0...3.18.1
3.18.1-rc3
IMPORTANT NOTE: this release was pulled. At this time 3.17.1 is the latest release.
What's Changed
- PDF ingest: ignore unsupported image file formats
- PDF ingest: normalize text using unicode.normalize
Full Changelog: 3.18.0...3.18.1-rc3
3.18.1-rc2
IMPORTANT NOTE: this release was pulled. At this time 3.17.1 is the latest release.
What's Changed
- Handle TIFFs in PDFs by converting to PNG by @stchris in #419
- Change dependabot schedules to monthly by @stchris in #414
Full Changelog: 3.18.0...3.18.1-rc2
3.18.0
IMPORTANT NOTE: this release was pulled. At this time 3.17.1 is the latest release.
What's Changed
Major PDF library change
We are hereby deprecating pdflib and replacing it with well-maintained, performant libraries. This enables local development on hardware with Apple Silicon CPUs and adds support for JBIG2 images in PDF files.
- Replace pdflib with pdfminer.six (for text) & pikepdf (for images) by @stchris in #380
- Properly link page entities to the Pages entity they belong to by @stchris in #410
- Remove poppler by @stchris in #393
- Better word recognition with large spaces between letters by @stchris in #402
- Prefer small text over spaced-apart text by @stchris in #403
Integrating convert-document into ingest-file
- Merge convert-document into ingest-file by @stchris in #395
- Better logging when converting documents to PDF by @Rosencrantz in #376
Smaller changes
- Allow rc releases, aligning with aleph by @stchris in #388
- Document JSON logging format option by @stchris in #392
- Replace nosetests with pytest by @stchris in #381
Dependency updates
- Bump servicelayer[amazon,google] from 1.19.1 to 1.20.5 by @dependabot in #386
- Bump followthemoney from 3.1.0 to 3.2.0 by @dependabot in #387
- Bump flask from 2.1.2 to 2.2.2 by @dependabot in #385
- Bump normality from 2.3.3 to 2.4.0 by @dependabot in #384
- Bump pillow from 9.2.0 to 9.3.0 by @dependabot in #383
- Bump bump2version from 0.5.4 to 1.0.1 by @dependabot in #382
- Bump psutil from 5.9.2 to 5.9.4 by @dependabot in #372
- Bump google-cloud-vision from 3.1.2 to 3.1.4 by @dependabot in #356
- Bump cryptography from 38.0.1 to 38.0.3 by @dependabot in #367
- Bump pantomime from 0.5.1 to 0.5.3 by @dependabot in #379
- Bump spacy from 3.4.1 to 3.4.3 by @dependabot in #374
- Bump pyicu from 2.9 to 2.10.2 by @dependabot in #364
- Bump icalendar from 4.1.0 to 5.0.3 by @dependabot in #389
- Bump pymediainfo from 5.1.0 to 6.0.1 by @dependabot in #391
- Bump cryptography from 38.0.3 to 38.0.4 by @dependabot in #390
- Bump pikepdf from 6.2.4 to 6.2.5 by @dependabot in #394
- Bump lxml from 4.9.1 to 4.9.2 by @dependabot in #396
- Bump spacy from 3.4.3 to 3.4.4 by @dependabot in #397
- Bump pikepdf from 6.2.5 to 6.2.6 by @dependabot in #399
- Bump google-cloud-vision from 3.1.4 to 3.2.0 by @dependabot in #400
- Bump pikepdf from 6.2.6 to 6.2.7 by @dependabot in #406
- Bump icalendar from 5.0.3 to 5.0.4 by @dependabot in #405
- Bump dbf from 0.99.2 to 0.99.3 by @dependabot in #404
- Bump followthemoney from 3.2.0 to 3.2.1 by @dependabot in #409
- Bump pillow from 9.3.0 to 9.4.0 by @dependabot in #408
Full Changelog: 3.17.1...3.18.0
3.18.0-rc4
IMPORTANT NOTE: this release was pulled. At this time 3.17.1 is the latest release.
What's Changed
- Properly link page entities to the Pages entity they belong to (which fixes #398) by @stchris in #410
Dependency updates
- Bump followthemoney from 3.2.0 to 3.2.1 by @dependabot in #409
- Bump pillow from 9.3.0 to 9.4.0 by @dependabot in #408
Full Changelog: 3.18.0-rc3...3.18.0-rc4
3.18.0-rc3
IMPORTANT NOTE: this release was pulled. At this time 3.17.1 is the latest release.
What's Changed
Version bumps
- Bump lxml from 4.9.1 to 4.9.2 by @dependabot in #396
- Bump spacy from 3.4.3 to 3.4.4 by @dependabot in #397
- Bump pikepdf from 6.2.5 to 6.2.6 by @dependabot in #399
- Bump google-cloud-vision from 3.1.4 to 3.2.0 by @dependabot in #400
- Prefer small text over spaced-apart text by @stchris in #403
- Bump pikepdf from 6.2.6 to 6.2.7 by @dependabot in #406
- Bump icalendar from 5.0.3 to 5.0.4 by @dependabot in #405
- Bump dbf from 0.99.2 to 0.99.3 by @dependabot in #404
Full Changelog: 3.18.0-rc2...3.18.0-rc3