Skip to content

mbstring: Make encoding detection stricter #4

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 15, 2025
Merged

mbstring: Make encoding detection stricter #4

merged 1 commit into from
Mar 15, 2025

Conversation

jtojnar
Copy link
Member

@jtojnar jtojnar commented Mar 15, 2025

PHP 8.3 changed how source encoding detection works:
https://www.php.net/manual/en/migration83.other-changes.php#migration83.other-changes.functions.mbstring
Most locales only consider ASCII and UTF-8 (see mb_detect_order()),
and when a byte sequence invalid in both tested encodings (such as 0x91 for ‘ in Windows-1252) is encountered,
one of them might now be chosen as the most fitting encoding.
(This is done using the heuristics introduced in PHP 8.1:
php/php-src@28b346b)

Compare the output of the following script across PHP versions:

<?php
$result = hex2bin("91");
var_dump(mb_detect_encoding($result));
var_dump(mb_detect_encoding($result, 'auto', true));
var_dump(mb_convert_encoding($result, 'UTF-8', 'auto'));

Let’s run the mb_detect_encoding() ourselves with $strict argument set to true, to ensure consistent behaviour across all PHP versions.
This might potentially cause a regression is some cases. Not sure.

Additionally, since we are now ensuring all encodings are valid, we can drop the warning capture mechanism.
It does not work on PHP ≥ 8.0 anyway, since that raises a ValueError instead of a warning when an invalid encoding is provided.
https://www.php.net/manual/en/function.mb-convert-encoding.php#refsect1-function.mb-convert-encoding-errors

Also adjust the confusing string in tests.

https://www.php.net/manual/en/function.mb-convert-encoding.php
https://www.php.net/manual/en/function.mb-detect-encoding.php
https://www.php.net/manual/en/function.mb-detect-order.php

PHP 8.3 changed how source encoding detection works:
https://www.php.net/manual/en/migration83.other-changes.php#migration83.other-changes.functions.mbstring
Most locales only consider `ASCII` and `UTF-8` (see `mb_detect_order()`),
and when a byte sequence invalid in both tested encodings (such as 0x91 for ‘ in Windows-1252) is encountered,
one of them might now be chosen as the most fitting encoding.
(This is done using the heuristics introduced in PHP 8.1:
php/php-src@28b346b)

Compare the output of the following script across PHP versions:

    <?php
    $result = hex2bin("91");
    var_dump(mb_detect_encoding($result));
    var_dump(mb_detect_encoding($result, 'auto', true));
    var_dump(mb_convert_encoding($result, 'UTF-8', 'auto'));

Let’s run the `mb_detect_encoding()` ourselves with `$strict` argument set to `true`, to ensure consistent behaviour across all PHP versions.
This might potentially cause a regression is some cases. Not sure.

Additionally, since we are now ensuring all encodings are valid, we can drop the warning capture mechanism.
It does not work on PHP ≥ 8.0 anyway, since that raises a `ValueError` instead of a warning when an invalid encoding is provided.
https://www.php.net/manual/en/function.mb-convert-encoding.php#refsect1-function.mb-convert-encoding-errors

Also adjust the confusing string in tests.

https://www.php.net/manual/en/function.mb-convert-encoding.php
https://www.php.net/manual/en/function.mb-detect-encoding.php
https://www.php.net/manual/en/function.mb-detect-order.php
@jtojnar jtojnar added the bug Something isn't working label Mar 15, 2025
@jtojnar jtojnar merged commit 5d40f6b into master Mar 15, 2025
0 of 12 checks passed
@jtojnar jtojnar deleted the stricter branch March 15, 2025 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant