@darrachequesne commented Oct 16, 2016
Contributor

Closes #2 and #5

@coveralls

Coverage Status

Coverage increased (+0.4%) to 92.958% when pulling c373d19 on darrachequesne:patch-1 into 2fa80fa on mathiasbynens:master.

@darrachequesne
Contributor Author

@mathiasbynens does that implementation comply with what you had in mind? Could you please review when you have time?

@mathiasbynens
Owner

Of course! It might take a while until I get around to it, though.

@darrachequesne
Contributor Author

No problem! Please tell me if I can help in any way.

@darrachequesne
Contributor Author

Hi @mathiasbynens! Do you know when you'll be able to review this PR, please?

@coveralls commented Dec 18, 2016

Coverage Status

Coverage increased (+0.4%) to 92.958% when pulling 41c4eef on darrachequesne:patch-1 into 5566334 on mathiasbynens:master.

@chharvey

@darrachequesne Does this handle the case of missing or extra continuation bytes?

The encoding 1110xxxx 10xxxxxx 10xxxxxx 0xxxxxxx (a 3-byte sequence followed by a 1-byte sequence) is well-formed and decodes to two code points. But if one of the “continuation bytes” were lost in transmission, the result, 1110xxxx 10xxxxxx 0xxxxxxx, would error. With {strict: false}, we would want the first character to resolve to U+FFFD instead of erroring, and the second character to resolve as normal. Example:

utf8.decode(
	'\xE2\xAC\xE2\x82\xAC', // 11100010 10101100 11100010 10000010 10101100
	{strict: false},
) === '\uFFFD\u20AC';

Likewise, 1110xxxx 10xxxxxx 10xxxxxx 10xxxxxx is not well-formed either. With strict turned off, the first character (the 3-byte sequence) should resolve as normal, but then a single U+FFFD should be returned for the run of remaining continuation bytes, up to the next “header byte” (that is, a byte whose two leading bits are 00, 01, or 11). Example:

utf8.decode(
	'\xE2\x82\xAC\x82\xAC\xE2\x82\xAC', // 11100010 10000010 10101100 10000010 10101100 11100010 10000010 10101100
	{strict: false},
) === '\u20AC\uFFFD\u20AC';
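
To make the two expectations above concrete, here is a minimal sketch of a lenient decoder that behaves the way those examples describe. It is not the code in this PR and not part of utf8.js: the function name `decodeLenient` is hypothetical, and it assumes the input is a byte string (every character code in the range 0 to 255). It also assumes that a whole run of stray continuation bytes collapses into one U+FFFD, as in the second example; other decoders may emit one replacement character per stray byte instead.

```javascript
// Hypothetical helper for illustration only; not the utf8.js implementation.
// Decodes a byte string leniently, substituting U+FFFD for malformed input.
function decodeLenient(byteString) {
  const REPLACEMENT = '\uFFFD';
  const n = byteString.length;
  const byteAt = (k) => byteString.charCodeAt(k) & 0xFF;
  const isContinuation = (b) => (b & 0xC0) === 0x80; // 10xxxxxx
  let out = '';
  let i = 0;

  while (i < n) {
    const b = byteAt(i);

    if (b < 0x80) {
      // 0xxxxxxx: a 1-byte sequence decodes to itself.
      out += String.fromCharCode(b);
      i += 1;
      continue;
    }

    if (isContinuation(b)) {
      // Stray continuation bytes: skip the whole run, emit one U+FFFD.
      while (i < n && isContinuation(byteAt(i))) i += 1;
      out += REPLACEMENT;
      continue;
    }

    // Header byte: determine how many continuation bytes should follow.
    let needed, codePoint, min;
    if ((b & 0xE0) === 0xC0) { needed = 1; codePoint = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) === 0xE0) { needed = 2; codePoint = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) === 0xF0) { needed = 3; codePoint = b & 0x07; min = 0x10000; }
    else { out += REPLACEMENT; i += 1; continue; } // 0xF8..0xFF are never valid

    // Consume continuation bytes, stopping early at the next non-continuation byte.
    let j = i + 1;
    let got = 0;
    while (got < needed && j < n && isContinuation(byteAt(j))) {
      codePoint = (codePoint << 6) | (byteAt(j) & 0x3F);
      j += 1;
      got += 1;
    }

    // Reject truncated, overlong, out-of-range, and surrogate results.
    const valid = got === needed &&
      codePoint >= min &&
      codePoint <= 0x10FFFF &&
      !(codePoint >= 0xD800 && codePoint <= 0xDFFF);
    out += valid ? String.fromCodePoint(codePoint) : REPLACEMENT;
    i = j;
  }
  return out;
}
```

Under these assumptions, `decodeLenient('\xE2\xAC\xE2\x82\xAC')` yields `'\uFFFD\u20AC'` and `decodeLenient('\xE2\x82\xAC\x82\xAC\xE2\x82\xAC')` yields `'\u20AC\uFFFD\u20AC'`, matching the two examples. The key design choice is that a truncated sequence stops at the first non-continuation byte without consuming it, so the following character still decodes normally.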

Development

Successfully merging this pull request may close these issues.

Add error-tolerant mode

4 participants