feat: Add failed_files to the output of pdf converters #9851

@medsriha

Is your feature request related to a problem? Please describe.
It would be helpful to have access to the list of files that failed to convert, so they can be sent to another converter, such as an OCR-based one. Ideally, this feature would work for both PyPDFToDocument and PDFMinerToDocument.

Describe the solution you'd like
When converting a file raises an exception, that source would be appended to a list, something like this:

  try:
      pdf_reader = PdfReader(io.BytesIO(bytestream.data))
      text = self._default_convert(pdf_reader)
  except Exception as e:
      logger.warning(
          "Could not read {source} and convert it to Document, skipping. {error}", source=source, error=e
      )
      failed_files.append(source)  # return this list along with `documents`
      continue
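
To make the proposed return shape concrete, here is a minimal, self-contained sketch of the idea. It does not use the actual Haystack converter code: `convert_sources`, `read_text`, and `fake_reader` are hypothetical names introduced only for illustration, and `read_text` stands in for the real PDF parsing step. The `failed_files` output key follows the name suggested in this issue, not an existing API.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


def convert_sources(
    sources: List[str],
    read_text: Callable[[str], str],
) -> Dict[str, List[Any]]:
    """Convert each source to text and collect the sources that fail.

    `read_text` is a hypothetical stand-in for the real parsing step
    (for example, building a PdfReader and extracting its text).
    """
    documents: List[Dict[str, str]] = []
    failed_files: List[str] = []

    for source in sources:
        try:
            text = read_text(source)
        except Exception as e:
            logger.warning(
                "Could not read %s and convert it to Document, skipping. %s", source, e
            )
            failed_files.append(source)  # remember the failure instead of dropping it
            continue
        documents.append({"source": source, "content": text})

    # Return the failures alongside the successfully converted documents.
    return {"documents": documents, "failed_files": failed_files}


def fake_reader(path: str) -> str:
    # Hypothetical parser: pretends that scanned PDFs have no extractable text.
    if path.endswith("scanned.pdf"):
        raise ValueError("no extractable text")
    return f"text of {path}"


result = convert_sources(["a.pdf", "scanned.pdf", "b.pdf"], fake_reader)
```

With an output like this, a caller could pass `result["failed_files"]` on to a fallback converter (OCR or similar) while using `result["documents"]` as usual.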

    Labels

    P2: Medium priority, add to the next sprint if no P1 available