Extending PDF to Text Capability #6377

demongolem-biz2 · 2023-11-21T17:47:34Z

demongolem-biz2
Nov 21, 2023

As part of a pipeline, I would like to convert PDF files to text (most of my inputs are PDF files). There are a number of things I would like to remove as part of a cleaning process. For one, my documents have headers and footers on each page, which should be removed because they add noise. Other things are quite standard: figure captions, which would otherwise bleed into the rest of the text; mathematical equations, which often do not survive conversion; and so on.

I see that Haystack provides two mechanisms to convert PDFs, and I don't quite understand why both exist:

Use PDFToTextConverter in haystack/nodes/file_converter/pdf.py
Use convert_files_to_docs in haystack/utils/preprocessing

The latter has a clean_func parameter which I can see is very useful. The former has a parameter to suppress tables which could be useful in some situations.
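To make the clean_func idea concrete, here is a minimal sketch of what such a function could look like, assuming it receives the extracted text as a str and returns the cleaned str. The heuristic for spotting leftover equation fragments (a line where fewer than half of the non-space characters are letters) and the names looks_like_equation and min_letter_share are my own illustration, not anything Haystack provides.

```python
def looks_like_equation(line: str, min_letter_share: float = 0.5) -> bool:
    """Heuristic (an assumption): treat a line as equation debris when
    fewer than `min_letter_share` of its non-space characters are letters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False  # keep blank lines so paragraph breaks survive
    letters = sum(c.isalpha() for c in chars)
    return letters / len(chars) < min_letter_share


def clean_func(text: str) -> str:
    """Drop lines that look like leftover equation fragments.

    Splits on "\n" only, so any form-feed page markers in the text are kept.
    """
    lines = text.split("\n")
    return "\n".join(ln for ln in lines if not looks_like_equation(ln))
```

A function of this shape could then be passed as clean_func to convert_files_to_docs. Note that the heuristic will also drop other mostly numeric lines (dates, bare page numbers), so the threshold would need tuning on real documents.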

What is the best way to clean the extracted text? Is one of the two clearly better? I would guess the second, because it is more customizable. Is there a way to hook in at a deeper level within the Haystack codebase? Or would a tool external to Haystack meet my needs better?

anakin87 · 2023-11-22T16:59:27Z

anakin87
Nov 22, 2023
Maintainer

Hello, @demongolem-biz2!

Things could be clearer and more modular. We are working in this direction for the upcoming Haystack 2.0.

(My impression is that convert_files_to_docs is almost a relic of the past, and I would not rely on it.)

I suggest the following approach:

Use PDFToTextConverter to produce Documents from files.
Documents are simple objects whose content attribute contains the text, so I propose writing your own Python function to manipulate/clean the content.
The PreProcessor also has some cleaning parameters, but I don't know if they fit your use case.
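
The custom-function approach above can be sketched roughly as follows. This is only an illustration under stated assumptions: that pages in the converted text are separated by a form feed ("\f"), that headers and footers are the first and last non-empty line of a page, and that captions start with "Figure N:" or "Fig. N.". The Doc class is a hypothetical stand-in for a Haystack Document (only the content attribute matters here), and all function names are made up for the example.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Assumed caption style: "Figure 3: ..." or "Fig. 2. ..." at the start of a line.
CAPTION_RE = re.compile(r"^\s*(Figure|Fig\.)\s*\d+\s*[:.]", re.IGNORECASE)


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    content: str


def _key(line):
    # Replace digit runs so "Page 1" and "Page 2" count as the same footer.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, min_share=0.6):
    """Drop first/last lines that recur on at least `min_share` of the pages.

    Blank lines are discarded as a side effect of the line filtering.
    """
    firsts, lasts, split_pages = Counter(), Counter(), []
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        split_pages.append(lines)
        if lines:
            firsts[_key(lines[0])] += 1
            lasts[_key(lines[-1])] += 1
    # Require at least two occurrences so a single page is never stripped.
    threshold = max(2, min_share * len(pages))
    headers = {k for k, n in firsts.items() if n >= threshold}
    footers = {k for k, n in lasts.items() if n >= threshold}
    cleaned = []
    for lines in split_pages:
        if lines and _key(lines[0]) in headers:
            lines = lines[1:]
        if lines and _key(lines[-1]) in footers:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned


def clean_content(text):
    """Remove repeated headers/footers and caption lines, keeping page breaks."""
    pages = strip_repeated_edges(text.split("\f"))
    return "\f".join(
        "\n".join(ln for ln in lines if not CAPTION_RE.match(ln))
        for lines in pages
    )


def clean_documents(docs):
    """Apply clean_content to the content of each document in place."""
    for doc in docs:
        doc.content = clean_content(doc.content)
    return docs
```

In a real pipeline the list passed to clean_documents would be the Documents returned by the converter rather than Doc instances. Only the first line of a multi-line caption is removed by this sketch, and a body line that happens to start with "Figure 1." would be dropped too, so treat the patterns as a starting point.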
