Inline image parsing fails to detect the correct EI marker when two EI markers are separated by only a few renderable characters #3468

@AbdiHaryadi

Description

pypdf raises a PdfReadError when I extract text from the document linked below. The error is unexpected because other viewers, for example Adobe Acrobat Reader DC, read the document without problems.

Environment

I used Google Colab.

$ python -m platform
Linux-6.1.123+-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.0.0, crypt_provider=('cryptography', '43.0.3'), PIL=11.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader("/content/UNIQ_LAP_TAHUNAN_2021-69.pdf")

for page in reader.pages:
    text = page.extract_text()
    print(text)

Access the PDF here.
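
To narrow down which page triggers the error, the loop above can be wrapped so that one failing page does not abort the whole run. This is only a sketch: `extract_text_per_page` and `_FakePage` are hypothetical names introduced here, and `_FakePage` merely stands in for a page object so that the snippet runs without the PDF.

```python
def extract_text_per_page(pages, errors=(Exception,)):
    """Call extract_text() on every page and collect failures.

    Returns (texts, failures). failures maps the zero-based page index
    to the exception raised for that page.
    """
    texts, failures = [], {}
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except errors as exc:
            texts.append("")  # keep list positions aligned with page indices
            failures[index] = exc
    return texts, failures


class _FakePage:
    """Hypothetical stand-in for a page; raises if no text is given."""

    def __init__(self, text=None):
        self._text = text

    def extract_text(self):
        if self._text is None:
            raise ValueError("simulated parse failure")
        return self._text


texts, failures = extract_text_per_page([_FakePage("a"), _FakePage(), _FakePage("c")])
print(texts)
print(sorted(failures))
```

With pypdf, the same helper could be called with `reader.pages` and `errors=(PdfReadError,)`, where `PdfReadError` is imported from `pypdf.errors`.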

Traceback

This is the complete traceback I see:

---------------------------------------------------------------------------

PdfReadError                              Traceback (most recent call last)

/tmp/ipython-input-3995888517.py in <cell line: 0>()
      3 
      4 for page in reader.pages:
----> 5     text = page.extract_text()
      6     print(text)

/usr/local/lib/python3.12/dist-packages/pypdf/_page.py in extract_text(self, orientations, space_width, visitor_operand_before, visitor_operand_after, visitor_text, extraction_mode, *args, **kwargs)
   2036             orientations = (orientations,)
   2037 
-> 2038         return self._extract_text(
   2039             self,
   2040             self.pdf,

/usr/local/lib/python3.12/dist-packages/pypdf/_page.py in _extract_text(self, obj, pdf, orientations, space_width, content_key, visitor_operand_before, visitor_operand_after, visitor_text)
   1719         extractor.initialize_extraction(orientations, visitor_text, cmaps)
   1720 
-> 1721         for operands, operator in content.operations:
   1722             if visitor_operand_before is not None:
   1723                 visitor_operand_before(operator, operands, extractor.cm_matrix, extractor.tm_matrix)

/usr/local/lib/python3.12/dist-packages/pypdf/generic/_data_structures.py in operations(self)
   1404     def operations(self) -> list[tuple[Any, bytes]]:
   1405         if not self._operations and self._data:
-> 1406             self._parse_content_stream(BytesIO(self._data))
   1407             self._data = b""
   1408         return self._operations

/usr/local/lib/python3.12/dist-packages/pypdf/generic/_data_structures.py in _parse_content_stream(self, stream)
   1297                     peek = stream.read(1)
   1298             else:
-> 1299                 operands.append(read_object(stream, None, self.forced_encoding))
   1300 
   1301     def _read_inline_image(self, stream: StreamType) -> dict[str, Any]:

/usr/local/lib/python3.12/dist-packages/pypdf/generic/_data_structures.py in read_object(stream, pdf, forced_encoding)
   1474     stream.seek(pos)
   1475     read_until_whitespace(stream)
-> 1476     raise PdfReadError(
   1477         f"Invalid Elementary Object starting with {tok!r} @{pos}: {stream_extract!r}"
   1478     )

PdfReadError: Invalid Elementary Object starting with b'\xff' @1157515: b'\xa0\x9d0\nh\x08 \xa9z0\x00\x01~~ \t\x00EI\n\xff\nEI Q\nQ\nq\n4205.52 6751.53 2.47266 576.254 re W n\nq 0 576.25'
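
The bytes quoted in the error message contain two whitespace-delimited `EI` tokens, separated only by `\n\xff\n`. The following stdlib-only sketch illustrates why stopping at the first one leaves the parser at `\xff`. It is not pypdf's implementation: the function names and the "next bytes look like operators" heuristic are assumptions made for illustration.

```python
import re

# Tail of the content stream, copied from the error message above.
TAIL = (
    b"\xa0\x9d0\nh\x08 \xa9z0\x00\x01~~ \t\x00EI\n\xff\nEI Q\nQ\nq\n"
    b"4205.52 6751.53 2.47266 576.254 re W n\nq 0 576.25"
)

# PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
PDF_WS = b"\x00\t\n\x0c\r "

# "EI" with one PDF whitespace byte on each side.
EI_CANDIDATE = re.compile(rb"(?<=[\x00\t\n\x0c\r ])EI(?=[\x00\t\n\x0c\r ])")


def ei_candidates(data):
    """Return the offsets of all whitespace-delimited EI tokens."""
    return [m.start() for m in EI_CANDIDATE.finditer(data)]


def looks_like_operators(rest, probe=16):
    """Assumed heuristic: real content after EI is printable ASCII or whitespace."""
    return all(b in PDF_WS or 0x20 < b < 0x7F for b in rest[:probe])


def pick_ei(data):
    """Return the first EI offset whose following bytes pass the heuristic."""
    for pos in ei_candidates(data):
        if looks_like_operators(data[pos + 2:]):
            return pos
    return None


print(ei_candidates(TAIL))  # [17, 22]
print(pick_ei(TAIL))        # 22
```

The first candidate is followed by `\xff`, which is not a valid start of an operator or operand, while the second is followed by `Q`, `Q`, `q` and an `re W n` clipping sequence. This is consistent with the reported failure position, but whether a lookahead like this is the right fix is for the maintainers to decide.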

Labels

generic: The generic submodule is affected
is-robustness-issue: From a user's perspective, this is about robustness
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
