This repository was archived by the owner on Jan 22, 2019. It is now read-only.

Invalid UTF8-character in non-UTF8 file is detected too early, so parsing can not be continued #132

@flappingeagle

The following code example can be tested with the attached file (test8.csv). The file is in ISO-8859 format and contains a UTF-8 character: é

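            import java.io.File;
            import java.io.InputStream;
            import java.nio.file.Files;
            import java.nio.file.StandardOpenOption;
            import java.util.Map;
            import com.fasterxml.jackson.databind.MappingIterator;
            import com.fasterxml.jackson.databind.ObjectReader;
            import com.fasterxml.jackson.dataformat.csv.CsvMapper;
            import com.fasterxml.jackson.dataformat.csv.CsvSchema;

            File file = new File("test8.csv");
            InputStream in = Files.newInputStream(file.toPath(), StandardOpenOption.READ);

            // The first line of the file is the header; each row is read as a Map.
            CsvSchema schema = CsvSchema.emptySchema().withHeader();
            CsvMapper mapper = new CsvMapper();
            ObjectReader reader = mapper.readerFor(Map.class).with(schema);
            MappingIterator<Map<String, String>> mappingIterator = reader.readValues(in);

            while (mappingIterator.hasNextValue()) {
                Map<String, String> line = mappingIterator.nextValue(); // the exception is thrown here
                System.out.println(line);
            }
            mappingIterator.close();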

Parsing fails at the call to "nextValue()" for line 152, but the problematic UTF-8 character is in line 185. So the exception is not thrown at the position of the problematic character but much earlier (presumably because of buffering?).
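
The buffering guess can be illustrated without Jackson. The following is a minimal sketch using only the JDK, not Jackson's actual decoder: a strict UTF-8 decoder sits behind a BufferedReader, and the data has a bad byte in line 185. The class name, the helper names and the synthetic data are all made up for this illustration. Because the decoder works on a whole buffer at a time, the error can surface while earlier lines are still being served, so the count of lines read before the failure is at most 184 and is usually much lower.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EarlyDecodeFailure {

    /**
     * Builds synthetic CSV-like bytes (hypothetical data, not test8.csv).
     * Line number badLine (1-based) contains the byte 0xE9 followed by 'e',
     * which is not a valid UTF-8 sequence. Pass 0 for fully clean data.
     */
    static byte[] sampleData(int lines, int badLine) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 1; i <= lines; i++) {
            byte[] prefix = ("row" + i + ",").getBytes(StandardCharsets.US_ASCII);
            out.write(prefix, 0, prefix.length);
            if (i == badLine) {
                out.write(0xE9); // ISO-8859-1 'é'; in UTF-8 this would start a 3-byte sequence
            }
            out.write('e');
            out.write('\n');
        }
        return out.toByteArray();
    }

    /**
     * Returns the number of complete lines read before the strict UTF-8
     * decoder reported an error, or -1 if the data decoded cleanly.
     */
    static int linesReadBeforeFailure(byte[] data) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        int count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), strict))) {
            while (reader.readLine() != null) {
                count++;
            }
            return -1;
        } catch (CharacterCodingException e) {
            // The decoder hit the bad byte while filling its buffer.
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int read = linesReadBeforeFailure(sampleData(200, 185));
        System.out.println("Lines read before the decoding error: " + read);
    }
}
```

If the printed count is well below 184, that matches the behaviour described here: the consumer has not yet reached the bad line when the decoder fails on it.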

I ask because if parsing failed at the exact position of the UTF-8 character, we could simply ignore that line and continue with the next one. As it is, parsing fails earlier and cannot be recovered or continued.

The following parse exception is output:

java.io.CharConversionException: Invalid UTF-8 middle byte 0x65 (at char #4861, byte #3999): check content encoding, does not look like UTF-8

The problematic character in the file test8.csv can be found in the vi editor with ":goto 4861".
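
One possible way around the error, sketched under assumptions: the message "Invalid UTF-8 middle byte 0x65" would be consistent with an ISO-8859-1 'é' (byte 0xE9) followed by the letter 'e' (0x65), so decoding the stream explicitly as ISO-8859-1 might avoid the problem altogether. The class and helper names below are hypothetical, and the sample bytes are invented rather than taken from test8.csv.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {

    /** Wraps the stream in a Reader that always decodes ISO-8859-1. */
    static Reader latin1Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the given bytes as ISO-8859-1 (helper for the demonstration only). */
    static String decodeLatin1(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try (Reader reader = latin1Reader(new ByteArrayInputStream(bytes))) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0xE9 followed by 'e' is invalid as UTF-8 but valid as ISO-8859-1.
        byte[] bytes = {'c', 'a', 'f', (byte) 0xE9, 'e', '\n'};
        System.out.print(decodeLatin1(bytes));
    }
}
```

With Jackson, a Reader created this way could be passed to reader.readValues(Reader) in place of the InputStream, so that the parser does not have to decode the bytes as UTF-8 itself. This only helps if the whole file really is ISO-8859-1.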

test8.csv.zip
