ENH: Wrap and align text in flattened PDF forms #3465

PJBrs · 2025-09-13T14:47:27Z

This patch implements text wrapping and alignment in appearance streams.

The scale_text method was vibe-coded, as well as the code for right-aligned text and centered text, but they both work great.

The result offers a good basis for text wrapping. I did notice, however, that the results with pdftk are better. In the future, it would be nice to read the info for the annotation border from the annotiation instead of just adding some padding here and there (which is the case now). Also, I notice there's also an annotation option called "comb" that is not taken into account. Then there is annotation text colour... Finally, pdftk takes into account the font bounding box / ascent in deciding scaled font size.

For now, however, this PR "finishes" PDF flattening in the sense that it correctly wraps long texts and aligns it as intended.

Related but not fixed here: #2153
I think this does fix the alignment part of #1919

codecov · 2025-09-13T14:56:42Z

Codecov Report

❌ Patch coverage is 97.91667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.13%. Comparing base (85b53d8) to head (e4231bc).

Files with missing lines	Patch %	Lines
pypdf/generic/_appearance_stream.py	97.77%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3465      +/-   ##
==========================================
+ Coverage   97.11%   97.13%   +0.02%     
==========================================
  Files          57       57              
  Lines        9713     9784      +71     
  Branches     1758     1772      +14     
==========================================
+ Hits         9433     9504      +71     
  Misses        168      168              
  Partials      112      112

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

pypdf/_writer.py

PJBrs · 2025-09-26T11:20:53Z

This is a reworked version on top of #3466

Not for review right now.

PJBrs · 2025-11-05T14:37:24Z

@stefan6419846 First, thanks very much for merging the refactoring of appearance stream code from _writer.py to generic_appearance_stream.py!

With that in place, it should now be easier to review this PR, which adds text wrapping, scaling and alignment for text appearance streams.

pypdf/generic/_appearance_stream.py

docs/user/forms.md

pypdf/generic/_appearance_stream.py

tests/test_appearance_stream.py

tests/test_writer.py

PJBrs · 2025-11-08T09:35:00Z

@stefan6419846 Thanks very much for your prompt review! It took me a while to see that you wanted a couple of newlines in the test that I added, but I finally figured that out too ;-)

I've added a long note about the changed font test result. We might want to revert the default font resource...

As far as I can see, the only potential question left would be where to put the TextAlignment IntEnum.

docs/user/forms.md

pypdf/generic/_appearance_stream.py

mypy complained that the .from_font_resource method's return type is Optional[FontDescriptor]. Change the code to not confuse mypy.

This adds a method to calculate the width of a text string. This method can later be used to wrap text at a certain length. Code blatantly copied from the _font.py file in the text extractor code.

This patch adds a method to scale and wrap text, depending on whether or not text is allowed to be wrapped. It takes a couple of arguments, including the text string itself, field width and height, font size, a FontDescriptor with character widths, and a bool specifying whether or not text is allowed to wrap. Returns the text in in the form of list of tuples, each tuple containing the length of a line and its contents, and the font size for these lines and lengths.

This patch scales and/or wrap text that does not fit into a text field unaltered, under the condition that font size was set to 0 in the default appearance stream. We only wrap text if the multiline bit was set in the corresponding annotation's field flags, otherwise we just scale the font until it fits. We move the escaping of parentheses below, so that it does not interfere with calculating the width of a text string.

Make sure that we always have Helvetica as a viable font resource, for which we surely have all necessary font metrics needed for text wrapping.

This adds a class TextAlignment that defines alignment options.

This patch changes the TextAppearanceStream code so that it can deal with right alignment and centered text. Note that both require correct font metrics in order to work.

We need the info that is in CORE_FONT_METRICS, and that is the same information as in _default_fonts_space_width anyway. So this patch removes a bit of redundancy.

Add tests for the TextStreamAppearance.

PJBrs · 2025-11-10T19:22:15Z

@stefan6419846 Thanks for your next review! I think I have addressed all points.

PJBrs · 2025-11-12T11:59:45Z

@stefan6419846 Apologies for adding one final patch; I noticed two small mistakes in docstrings, and one instance where I could use the new TextAlignment IntEnum instead of just writing 0. Won't do anything else before your next review.

docs/user/forms.md

stefan6419846 · 2025-11-12T12:07:28Z

pypdf/generic/_appearance_stream.py

+            font_glyph_byte_map = {v: bytes((k,)) for k, v in font_encoding.items()}
+            font_encoding_rev = {v: bytes((k,)) for k, v in font_encoding.items()}
+            for key, value in font_map.items():
+                font_glyph_byte_map[value] = font_encoding_rev.get(key, key)


Are we able to cover this line in the tests as well?

I do not know how. This was already there in the previous code. In fact, this code looks a bit weird and we did have some back and forth about, because it appears as if we're overwriting font_glyph_byte_map. I remember having asked some AI suggestions, but we decided to not address this previously.

In a more general sense, I do think some ToDo's remain in the existing code:

Making the above bit simpler (and make it be covered by the test as well)

Making the escaping of parentheses more robust (the present code will gladly escape parentheses that are already escaped)

And then there is some missing functionality:

Dealing with comb fields (I already have code for this locally)

Dealing with password fields

Dealing with field borders

Using font ascent and descent while scaling font size and deciding line distance in the scaling method.

(Perhaps) being more robust about finding font resources

In any case, this is an agenda that I'm trying to work on.

That being said, I'd rather fix this code in a separate PR than try to fix the tests.

https://github.com/py-pdf/pypdf/pull/3466/files#r2444808295

Tested some more; the test never gets here, because font_map is always {} when font_encoding is not a string.

OK, I think it is highly likely that we hit this code path, after having consulted Google Gemini.

The Overlap
The logic for determining encoding is separate from the logic for determining map_dict.

A PDF font dictionary can easily define both:

A simple 8-bit /Encoding (which causes encoding to be a dictionary of size 256).

A /ToUnicode CMap (which causes map_dict to be non-empty).

The get_encoding function handles this overlap (PDF reference 1.7, §5.9.1):

Python # Apply rule from PDF ref 1.7 §5.9.1, 1st bullet: # if cmap not empty encoding should be discarded # (here transformed into identity for those characters) # If encoding is a string, it is expected to be an identity translation. if isinstance(encoding, dict): for x in int_entry: if x <= 255: encoding[x] = chr(x)

This check ensures that if a /ToUnicode CMap exists (making int_entry non-empty, which usually implies map_dict is non-empty), any codes it defines that are also in the 8-bit range are not re-translated by the 8-bit encoding map.

In summary, a font with both an explicit 8-bit /Encoding dictionary AND a /ToUnicode CMap will produce a non-string encoding (dict) and a non-empty map_dict (dict).

Therefore, the original simplification you were considering (keeping the for loop) is technically necessary to handle PDFs that define both a standard encoding and a ToUnicode map, even if your current test suite doesn't cover that case.

The 14 Base Fonts are defined via font dictionaries that typically do not contain a /ToUnicode entry.

In conclusion: for a proper test case, we need a pdf form that uses a ttf font or type 0 cid font (says gemini) for filling forms.

stefan6419846 reviewed Sep 13, 2025

View reviewed changes

pypdf/_writer.py Outdated Show resolved Hide resolved

stefan6419846 reviewed Sep 13, 2025

View reviewed changes

pypdf/_writer.py Outdated Show resolved Hide resolved

PJBrs marked this pull request as draft September 15, 2025 12:02

PJBrs force-pushed the wrap branch from 438cc37 to 6af7101 Compare September 26, 2025 10:56

PJBrs force-pushed the wrap branch from 6af7101 to 4ccfa3a Compare October 6, 2025 18:54

PJBrs force-pushed the wrap branch from 4ccfa3a to 5f10527 Compare October 26, 2025 20:58

PJBrs force-pushed the wrap branch from 5f10527 to 30b73c6 Compare November 5, 2025 14:26

PJBrs marked this pull request as ready for review November 5, 2025 14:35

PJBrs commented Nov 5, 2025

View reviewed changes

pypdf/generic/_appearance_stream.py Outdated Show resolved Hide resolved

PJBrs force-pushed the wrap branch from 30b73c6 to fcb45b3 Compare November 5, 2025 17:02

PJBrs requested a review from stefan6419846 November 5, 2025 17:31

stefan6419846 requested changes Nov 6, 2025

View reviewed changes

PJBrs force-pushed the wrap branch from fcb45b3 to aeea105 Compare November 8, 2025 09:16

PJBrs requested a review from stefan6419846 November 8, 2025 09:35

stefan6419846 reviewed Nov 10, 2025

View reviewed changes

docs/user/forms.md Outdated Show resolved Hide resolved