Commit 5d40f6b

mbstring: Make encoding detection stricter
PHP 8.3 changed how source encoding detection works:
https://www.php.net/manual/en/migration83.other-changes.php#migration83.other-changes.functions.mbstring

Most locales only consider `ASCII` and `UTF-8` (see `mb_detect_order()`), and when a byte sequence invalid in both tested encodings (such as 0x91 for ‘ in Windows-1252) is encountered, one of them might now be chosen as the most fitting encoding. (This is done using the heuristics introduced in PHP 8.1: php/php-src@28b346b)

Compare the output of the following script across PHP versions:

    <?php
    $result = hex2bin("91");
    var_dump(mb_detect_encoding($result));
    var_dump(mb_detect_encoding($result, 'auto', true));
    var_dump(mb_convert_encoding($result, 'UTF-8', 'auto'));

Let’s run `mb_detect_encoding()` ourselves with the `$strict` argument set to `true`, to ensure consistent behaviour across all PHP versions. This might potentially cause a regression in some cases. Not sure.

Additionally, since we are now ensuring all encodings are valid, we can drop the warning capture mechanism. It does not work on PHP ≥ 8.0 anyway, since that raises a `ValueError` instead of a warning when an invalid encoding is provided.
https://www.php.net/manual/en/function.mb-convert-encoding.php#refsect1-function.mb-convert-encoding-errors

Also adjust the confusing string in tests.

https://www.php.net/manual/en/function.mb-convert-encoding.php
https://www.php.net/manual/en/function.mb-detect-encoding.php
https://www.php.net/manual/en/function.mb-detect-order.php
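The claim about `ValueError` can be checked with a small probe. This is only a sketch: the function name `describeInvalidEncodingFailure` and the encoding name `no-such-encoding` are made up for illustration, and the PHP 7 branch assumes the older behaviour of emitting a warning and returning `false`.

```php
<?php

// Hypothetical helper, not part of the library: reports how
// mb_convert_encoding() reacts to an unknown source encoding name.
function describeInvalidEncodingFailure(): string
{
    try {
        // "@" silences the warning that older PHP versions emit here.
        $result = @mb_convert_encoding('abc', 'UTF-8', 'no-such-encoding');
    } catch (\ValueError $e) {
        // PHP >= 8.0: an exception is thrown, so a handler installed with
        // set_error_handler() is never invoked.
        return 'ValueError';
    }

    // Assumed pre-8.0 behaviour: a warning plus a false return value.
    return $result === false ? 'warning and false' : 'unexpected result';
}

echo describeInvalidEncodingFailure(), "\n";
```

Because the exception bypasses error handlers entirely, validating encodings up front (as `assertSupported()` does in the diff) is the only place such mistakes can be caught consistently.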
1 parent 941ab65 commit 5d40f6b

3 files changed: 23 additions & 20 deletions

CHANGELOG.md

Lines changed: 7 additions & 1 deletion
@@ -2,6 +2,11 @@

 This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [3.0.0] – unreleased
+
+- Mb: Make encoding detection stricter.
+
+
 ## [2.0.0] – 2023-03-07

 - Iconv: Fix warning on PHP 8.2 when passing `null` as source encoding.
@@ -23,5 +28,6 @@ The project has been revived and is now available under the name [`fossar/transc
 - Added Nix expression for easier development and sharing the environment with CI.
 - Switched to GitHub Actions for CI and added more PHP versions.

-[2.0.0]: https://github.com/fossar/transcoder/compare/1.0.1...v2.0.0
+[3.0.0]: https://github.com/fossar/transcoder/compare/v2.0.0...v3.0.0
+[2.0.0]: https://github.com/fossar/transcoder/compare/v1.0.1...v2.0.0
 [1.0.1]: https://github.com/fossar/transcoder/compare/1.0.0...v1.0.1

src/MbTranscoder.php

Lines changed: 15 additions & 18 deletions
@@ -55,34 +55,31 @@ public function transcode(string $string, $from = null, ?string $to = null): str
             } else {
                 $this->assertSupported($from);
             }
+        } else {
+            $from = 'auto';
         }

         if ($to) {
             $this->assertSupported($to, false);
         }

-        $handleErrors = !$from || 'auto' === $from;
-        if ($handleErrors) {
-            set_error_handler(
-                function ($no, $warning) use ($string): void {
-                    throw new UndetectableEncodingException($string, $warning);
-                },
-                E_WARNING
-            );
+        if ($from === 'auto') {
+            $from = mb_detect_encoding($string, 'auto', true);
         }

-        try {
-            $result = mb_convert_encoding(
-                $string,
-                $to ?: $this->defaultEncoding,
-                $from ?: 'auto'
-            );
-        } finally {
-            if ($handleErrors) {
-                restore_error_handler();
-            }
+        if ($from === false) {
+            throw new UndetectableEncodingException($string, 'Unable to detect character encoding');
         }

+        $result = mb_convert_encoding(
+            $string,
+            $to ?: $this->defaultEncoding,
+            $from
+        );
+
+        // For PHPStan: We check the encoding is valid.
+        assert($result !== false);
+
         return $result;
     }
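The detect-then-convert flow introduced in this hunk can be tried outside the class with a sketch like the one below. The function name `convertWithStrictDetection` and the use of `RuntimeException` are illustrative stand-ins, not the library's API. It also passes an explicit `['ASCII', 'UTF-8']` candidate list instead of `'auto'`, so the outcome does not depend on the configured `mbstring.language`.

```php
<?php

// Hypothetical standalone helper mirroring the new control flow:
// detect strictly first, and only convert once an encoding is known.
function convertWithStrictDetection(string $input, string $to = 'UTF-8'): string
{
    // Strict mode (third argument true): returns false unless the whole
    // input is valid in one of the candidate encodings.
    $from = mb_detect_encoding($input, ['ASCII', 'UTF-8'], true);

    if ($from === false) {
        // The library throws UndetectableEncodingException at this point;
        // RuntimeException is used here to keep the sketch self-contained.
        throw new RuntimeException('Unable to detect character encoding');
    }

    // $from is a valid encoding name here, so no warning or ValueError
    // is expected from the conversion.
    return mb_convert_encoding($input, $to, $from);
}

try {
    // 0x91 is neither valid ASCII nor valid UTF-8.
    convertWithStrictDetection("\x91");
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

Doing the detection explicitly means an undetectable input is reported in the same way on every PHP version, instead of depending on how `mb_convert_encoding()` treats `'auto'` internally.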

tests/MbTranscoderTest.php

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ public function testUndetectableEncoding(): void
         $this->expectException(\Ddeboer\Transcoder\Exception\UndetectableEncodingException::class);
         $this->expectExceptionMessage('is undetectable');
         $result = $this->transcoder->transcode(
-            '‘curly quotes make this incompatible with 1252',
+            'Windows-1252 encodes curly quotes as 0x91 and 0x92, which are indistinguishable from any other single-byte encoding',
             null,
             'windows-1252'
         );
