Skip to content

JSON.dump("\x82\xAC\xEF".b) no error with the C extension #634

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
eregon opened this issue Oct 21, 2024 · 9 comments · Fixed by #633
Closed

JSON.dump("\x82\xAC\xEF".b) no error with the C extension #634

eregon opened this issue Oct 21, 2024 · 9 comments · Fixed by #633
Labels

Comments

@eregon
Copy link
Member

eregon commented Oct 21, 2024

On json 2.7.2:

$ ruby -rjson/ext -e 'p JSON.dump("\x82\xAC\xEF".b)'
json-2.7.2/lib/json/common.rb:306:in `generate': source sequence is illegal/malformed utf-8 (JSON::GeneratorError)
$ ruby -rjson/pure -e 'p JSON.dump("\x82\xAC\xEF".b)'
json-2.7.2/lib/json/pure/generator.rb:453:in `encode': "\x82" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

On json master:

$ ruby -rjson/ext -e 'p JSON.dump("\x82\xAC\xEF".b)'
"\"\\u0082\\u00ac\xEF\""
$ ruby -rjson/pure -e 'p JSON.dump("\x82\xAC\xEF".b)'
json-2.7.2/lib/json/pure/generator.rb:341:in `encode': "\x82" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
@byroot
Copy link
Member

byroot commented Oct 21, 2024

It is known, and I don't plan to change it until 3.0.

Ref: #609

@byroot byroot added this to the 3.0 milestone Oct 21, 2024
@byroot byroot added the bug label Oct 21, 2024
@eregon
Copy link
Member Author

eregon commented Oct 21, 2024

I think it's due to using obj = rb_str_export_to_enc(obj, rb_utf8_encoding()); vs using String#encode, so a potential fix would be to use String#encode from the C extension to get a Encoding::UndefinedConversionError.
It could also be a JSON::GeneratorError like the latest release (although inconsistent with the pure-Ruby version), that would still be better than silently accepting broken strings.

@byroot
Copy link
Member

byroot commented Oct 21, 2024

I added the behavior back on purpose because I realized a bunch of apps relied on it. It's easy to remove, I just want to emit deprecations first to help people fix their code.

@byroot
Copy link
Member

byroot commented Oct 21, 2024

Hum, wait. I think I'm misreading your report.

@byroot
Copy link
Member

byroot commented Oct 21, 2024

yeah, this is a regression from 2.7.2, good catch I'll fix it.

@byroot byroot removed this from the 3.0 milestone Oct 21, 2024
@eregon
Copy link
Member Author

eregon commented Oct 21, 2024

I think it's related but not the same, #609 is about bytes valid in UTF-8 and this is about bytes invalid in UTF-8.

I'm not sure about the semantics of rb_str_export_to_enc() is it like String#encode (from ruby -rjson/ext -e 'p JSON.dump("A".encode("UTF-32LE"))') but without exceptions?
What does it do if given a BINARY string with valid UTF-8 bytes?
It seems to be like force_encoding in that case, but unsure of the details:
ruby -rjson/ext -e 'p JSON.dump("Aé".b)' -> `""Aé""

@byroot
Copy link
Member

byroot commented Oct 21, 2024

I think the function to use is rb_str_conv_enc

@eregon
Copy link
Member Author

eregon commented Oct 21, 2024

I guess in the C ext to preserve #609 we'll need something equivalent to:

if str.encoding == BINARY
  str = str.dup.force_encoding(UTF_8)
else
  str = str.encode(UTF_8)
end

And then the rb_enc_str_coderange check after deals with the case it's not valid.

@byroot
Copy link
Member

byroot commented Oct 21, 2024

Yup. I got a quick patch in that effect that passes the tests: byroot@170ec37

I'll clean it up tomorrow and merge this PR with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants