Extending PDF to Text Capability #6377
Unanswered
demongolem-biz2
asked this question in
Questions
Replies: 1 comment
-
Hello, @demongolem-biz2! Things could be clearer and more modular, and we are working in this direction for the upcoming Haystack 2.0. I suggest the following approach:
-
As part of a pipeline, I would like to convert PDF files to text (most of my inputs are PDF files). There are a number of things I would like to remove as part of a cleaning process. For one, my documents have headers and footers on each page, which should be removed because they create extra noise. The other items are fairly standard: figure captions, which would otherwise bleed into the rest of the text, mathematical equations, which often do not survive conversion, and so on.
I see that Haystack provides two mechanisms to convert PDFs, and I don't quite understand why both exist:
The latter has a clean_func parameter which I can see is very useful. The former has a parameter to suppress tables which could be useful in some situations.
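To make the header and footer case concrete, here is a minimal sketch of a text-cleaning function in plain Python. It is not taken from Haystack: the name strip_repeated_edges and the min_share parameter are hypothetical, and it assumes the converted text separates pages with a form feed character and that a cleaning hook such as clean_func accepts a callable that takes a string and returns a string. Both assumptions should be checked against the Haystack documentation for the version in use.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Normalise digit runs so "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def _first_idx(lines):
    # Index of the first non-blank line, or None if the page is blank.
    return next((i for i, line in enumerate(lines) if line.strip()), None)


def _last_idx(lines):
    # Index of the last non-blank line, or None if the page is blank.
    return next(
        (i for i in range(len(lines) - 1, -1, -1) if lines[i].strip()), None
    )


def strip_repeated_edges(text: str, min_share: float = 0.6) -> str:
    """Remove first/last lines that repeat on at least min_share of pages.

    Assumption: pages are separated by form feed characters.
    """
    pages = [page.split("\n") for page in text.split("\f")]
    if len(pages) < 2:
        return text  # nothing to compare against

    # Count how often each (normalised) first and last line occurs.
    heads = Counter(
        _key(p[_first_idx(p)]) for p in pages if _first_idx(p) is not None
    )
    feet = Counter(
        _key(p[_last_idx(p)]) for p in pages if _last_idx(p) is not None
    )
    needed = min_share * len(pages)

    cleaned = []
    for lines in pages:
        lines = list(lines)
        i = _first_idx(lines)
        if i is not None and heads[_key(lines[i])] >= needed:
            del lines[i]  # drop the repeated header
        j = _last_idx(lines)
        if j is not None and feet[_key(lines[j])] >= needed:
            del lines[j]  # drop the repeated footer
        cleaned.append("\n".join(lines))
    return "\f".join(cleaned)
```

This only handles single-line headers and footers. Captions and equations would need their own heuristics, and the threshold would likely need tuning per document set.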
What is the best way to clean the extracted text? Does either mechanism excel? I would guess the second, because it is more customizable. Is there a way to drill down to a deeper level within the Haystack codebase? Would a tool external to Haystack perhaps meet my needs better?