Open
Labels
generic: The generic submodule is affected
is-robustness-issue: From a users perspective, this is about robustness
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
Description
I found a document for which page.extract_text() raises a PdfReadError. The error is unexpected because other viewers, for example Adobe Acrobat Reader DC, read the document properly.
Environment
I used Google Colab.
$ python -m platform
Linux-6.1.123+-x86_64-with-glibc2.35
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.0.0, crypt_provider=('cryptography', '43.0.3'), PIL=11.3.0
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader
reader = PdfReader("/content/UNIQ_LAP_TAHUNAN_2021-69.pdf")

for page in reader.pages:
    text = page.extract_text()
    print(text)
Access the PDF here.
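To narrow down which page triggers the error, the loop above could be wrapped so that one failing page does not stop the rest. This is only a sketch: the helper name `extract_pages_tolerantly` and its `error_types` parameter are hypothetical and not part of pypdf. It relies only on each page object exposing `extract_text()`.

```python
from typing import Any, Iterable, Iterator, Optional, Tuple, Type


def extract_pages_tolerantly(
    pages: Iterable[Any],
    error_types: Tuple[Type[BaseException], ...] = (Exception,),
) -> Iterator[Tuple[int, Optional[str], Optional[BaseException]]]:
    """Yield (page_index, text, error) for every page.

    Hypothetical helper, not a pypdf API. When extract_text() raises one of
    error_types, text is None and error holds the exception; otherwise error
    is None.
    """
    for index, page in enumerate(pages):
        try:
            yield index, page.extract_text(), None
        except error_types as exc:
            yield index, None, exc


# Possible usage with pypdf (assumes pypdf is installed and the file exists):
#
#   from pypdf import PdfReader
#   from pypdf.errors import PdfReadError
#
#   reader = PdfReader("/content/UNIQ_LAP_TAHUNAN_2021-69.pdf")
#   for index, text, error in extract_pages_tolerantly(reader.pages, (PdfReadError,)):
#       if error is not None:
#           print(f"page {index} failed: {error}")
```

This does not fix the underlying parsing problem. It only isolates the affected page(s) so the remaining text can still be collected.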
Traceback
This is the complete traceback I see:
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
/tmp/ipython-input-3995888517.py in <cell line: 0>()
      3
      4 for page in reader.pages:
----> 5     text = page.extract_text()
      6     print(text)

/usr/local/lib/python3.12/dist-packages/pypdf/_page.py in extract_text(self, orientations, space_width, visitor_operand_before, visitor_operand_after, visitor_text, extraction_mode, *args, **kwargs)
   2036             orientations = (orientations,)
   2037
-> 2038         return self._extract_text(
   2039             self,
   2040             self.pdf,

/usr/local/lib/python3.12/dist-packages/pypdf/_page.py in _extract_text(self, obj, pdf, orientations, space_width, content_key, visitor_operand_before, visitor_operand_after, visitor_text)
   1719         extractor.initialize_extraction(orientations, visitor_text, cmaps)
   1720
-> 1721         for operands, operator in content.operations:
   1722             if visitor_operand_before is not None:
   1723                 visitor_operand_before(operator, operands, extractor.cm_matrix, extractor.tm_matrix)

/usr/local/lib/python3.12/dist-packages/pypdf/generic/_data_structures.py in operations(self)
   1404     def operations(self) -> list[tuple[Any, bytes]]:
   1405         if not self._operations and self._data:
-> 1406             self._parse_content_stream(BytesIO(self._data))
   1407             self._data = b""
   1408         return self._operations

/usr/local/lib/python3.12/dist-packages/pypdf/generic/_data_structures.py in _parse_content_stream(self, stream)
   1297                     peek = stream.read(1)
   1298             else:
-> 1299                 operands.append(read_object(stream, None, self.forced_encoding))
   1300
   1301     def _read_inline_image(self, stream: StreamType) -> dict[str, Any]:

/usr/local/lib/python3.12/dist-packages/pypdf/generic/_data_structures.py in read_object(stream, pdf, forced_encoding)
   1474         stream.seek(pos)
   1475         read_until_whitespace(stream)
-> 1476         raise PdfReadError(
   1477             f"Invalid Elementary Object starting with {tok!r} @{pos}: {stream_extract!r}"
   1478         )

PdfReadError: Invalid Elementary Object starting with b'\xff' @1157515: b'\xa0\x9d0\nh\x08 \xa9z0\x00\x01~~ \t\x00EI\n\xff\nEI Q\nQ\nq\n4205.52 6751.53 2.47266 576.254 re W n\nq 0 576.25'
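The byte extract in the error message can be inspected directly. The sketch below copies the extract from the message and lists where the two-byte sequence `EI` (the operator that ends an inline image in a content stream) occurs. The variable names are illustrative. The interpretation in the note after the code is an assumption, not a confirmed diagnosis.

```python
# Byte extract copied verbatim from the PdfReadError message above.
extract = (
    b"\xa0\x9d0\nh\x08 \xa9z0\x00\x01~~ \t\x00EI\n\xff\nEI Q\nQ\nq\n"
    b"4205.52 6751.53 2.47266 576.254 re W n\nq 0 576.25"
)

# Offsets within the extract where the sequence b"EI" starts.
ei_offsets = [i for i in range(len(extract) - 1) if extract[i:i + 2] == b"EI"]

# Offset of the byte that the error message reports as the invalid token.
bad_token_offset = extract.index(b"\xff")

print(ei_offsets)        # [17, 22]
print(bad_token_offset)  # 20
```

The reported token b'\xff' sits between two `EI` sequences. One possible reading is that the first `EI` is part of the binary image data and was taken as the end of the inline image, so the remaining image bytes were then parsed as operands. Confirming this would require examining the content stream of the affected page.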