Add possibility to deactivate OCR #2467

Open
@thomascerbelaud

Description

Is your feature request related to a problem? Please describe.
I am working with large text-based PDF files and would like to parse them as fast as possible while keeping the high-resolution strategy (strategy="hi_res"). I am interested in extracting tables, including tables embedded in pictures, but I would like to deactivate OCR for regions detected as images. The most obvious reason is speed, but the image content also does not matter much for my use case.

Describe the solution you'd like
A keyword argument that enables or disables OCR would probably be the easiest thing to implement and would be a nice additional feature, especially if it could differentiate between tables and images. Another nice feature would be to skip OCR on tables that lie in text-based regions, in order to speed up the partitioning process.
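
To make the request concrete, here is a minimal sketch of the per-region decision such a keyword argument could drive. Everything in it is an assumption made for illustration: the `Region` class, the `should_run_ocr` helper, and the `ocr_images` / `ocr_tables` flags are hypothetical names and do not exist in the library.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical stand-in for a layout region detected on a PDF page."""
    kind: str                # "table", "image", or "text"
    has_embedded_text: bool  # True if the PDF text layer covers this region


def should_run_ocr(region: Region, ocr_images: bool = True, ocr_tables: bool = True) -> bool:
    """Decide whether a region needs OCR under the proposed (hypothetical) flags."""
    if region.has_embedded_text:
        # Text can be read straight from the PDF, so OCR would be wasted work.
        return False
    if region.kind == "image":
        return ocr_images
    if region.kind == "table":
        return ocr_tables
    # Any other region without a text layer still needs OCR to be readable.
    return True


regions = [
    Region("table", has_embedded_text=True),   # text-based table
    Region("table", has_embedded_text=False),  # table inside a picture
    Region("image", has_embedded_text=False),  # graph or photo
]
decisions = [should_run_ocr(r, ocr_images=False) for r in regions]
print(decisions)  # [False, True, False]
```

With `ocr_images=False`, only the table inside a picture is sent to OCR, which is the behaviour described in this request.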

Describe alternatives you've considered
Adding a new OCRMode.NO_OCR
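
A rough sketch of how this alternative might behave, assuming an enum member that makes the OCR step return early. The `OCRMode` class below is a simplified stand-in written for this example, not the library's definition, and `ocr_regions` and `fake_ocr` are hypothetical helpers.

```python
from enum import Enum
from typing import Callable, List


class OCRMode(Enum):
    """Simplified stand-in for illustration; not copied from the library."""
    INDIVIDUAL_BLOCKS = "individual_blocks"
    NO_OCR = "no_ocr"  # the proposed new member (hypothetical value)


def ocr_regions(region_ids: List[str], mode: OCRMode, ocr_fn: Callable[[str], str]) -> List[str]:
    """Run ocr_fn on each region unless the mode disables OCR entirely."""
    if mode is OCRMode.NO_OCR:
        # Short-circuit: the OCR backend is never invoked.
        return []
    return [ocr_fn(region_id) for region_id in region_ids]


calls: List[str] = []


def fake_ocr(region_id: str) -> str:
    """Fake OCR backend that records which regions it was asked to read."""
    calls.append(region_id)
    return f"text from {region_id}"


skipped = ocr_regions(["img-1", "img-2"], OCRMode.NO_OCR, fake_ocr)
calls_after_no_ocr = list(calls)  # snapshot: still empty, nothing was read
processed = ocr_regions(["img-1", "img-2"], OCRMode.INDIVIDUAL_BLOCKS, fake_ocr)
```

The drawback compared with a keyword argument is that a single mode switches OCR off for everything, so it could not tell tables and images apart on its own.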

Additional context
I am not interested in images such as graphs or photos.

Metadata

Labels: enhancement (New feature or request)