Detecting a UTF-8 string with control characters leads to unexpected behavior #49

@ruicc

import Encoding from 'encoding-japanese';

// A single backspace control character (0x08)
const bs = Buffer.from(new Uint8Array([0x08]));
console.log('detect("<BS>"):', Encoding.detect(bs));

// A UTF-8 string containing multi-byte (Japanese) characters
const utf8str = Buffer.from("UTF8の文字列です", 'utf8');
console.log('detect("UTF8の文字列です"):', Encoding.detect(utf8str));

// The same UTF-8 string with the backspace appended
const utf8BS = Buffer.concat([utf8str, bs]);
console.log('detect("UTF8の文字列です<BS>"):', Encoding.detect(utf8BS));

Output:

detect("<BS>"): ASCII
detect("UTF8の文字列です"): UTF8
detect("UTF8の文字列です<BS>"): UNICODE

I think the final result should be "UTF8": ASCII is a subset of UTF-8, so appending a backspace (0x08) to a valid UTF-8 string still yields valid UTF-8.
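To illustrate the expectation, here is a minimal sketch that checks UTF-8 validity with the standard TextDecoder in strict (fatal) mode instead of Encoding.detect. The helper name isValidUtf8 is hypothetical and not part of encoding-japanese; this only shows that the byte sequence is well-formed UTF-8, and says nothing about how the library's detection works internally.

```javascript
// Hypothetical helper (not part of encoding-japanese):
// returns true when the bytes decode as well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws on any malformed byte sequence.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

const bs = new Uint8Array([0x08]);
const utf8str = new TextEncoder().encode('UTF8の文字列です');
const utf8BS = new Uint8Array([...utf8str, ...bs]);

console.log('<BS>:', isValidUtf8(bs));                      // true
console.log('UTF8の文字列です:', isValidUtf8(utf8str));       // true
console.log('UTF8の文字列です<BS>:', isValidUtf8(utf8BS));    // true
console.log('0xFF:', isValidUtf8(new Uint8Array([0xff])));  // false
```

All three inputs from the report are valid UTF-8, which is why "UNICODE" for the last one looks surprising.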
