
[BUG] Intermittent pylibcudf CI failures in JSON writing #19900

@vyasr

Description

We've been seeing the io/test_json.py::test_write_json_basic[100-source_or_sink1-False-100-stream1] pylibcudf test fail intermittently. The failure is always in "wheel-tests-cudf / 12.9.1, 3.13, arm64, ubuntu22.04, a100, latest-driver, latest-deps". Based on the matrix of tests that we run in PRs for conda and wheels, we have seen both x86 + Python 3.13 and arm + Python 3.12 succeed, and we've seen the same driver and hardware also pass with other matrix runs, so it's not immediately clear what variable or combination of variables is implicated. We have attempted to reproduce it consistently in CI in #19865, but have yet to find a way to see it happen regularly. Here are some previous runs showing the error:

The only consistent fact is that the failing test is the first one in the test_json.py file to run on a non-default stream. That makes stream-ordering a very likely culprit. Upon inspection of the test suite, I noticed that the streams are never synchronized, which I attempted to fix in #19889. However, on further inspection of rmm I realized that this fix was unnecessary because of rapidsai/rmm#2029. All rmm streams are created as blocking, and the fixtures all run on the default stream, so the fixtures should be valid on exit as currently constructed.

The specific error that we observe is that a single character in the written JSON file is incorrect:

AssertionError: assert '\x01{"col_in...92.379533}}}]' == '[{"col_int64...92.379533}}}]'

Note that the first character on the left is \x01, a non-printing control character, whereas on the right it is the expected [ character.
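When triaging a failure like this, it can help to locate the first differing character programmatically rather than reading the truncated pytest diff. The following is a minimal sketch, not part of the pylibcudf test suite: `first_mismatch` is a hypothetical helper, and the two strings are shortened stand-ins shaped like the assertion above rather than the real test output.

```python
def first_mismatch(actual: str, expected: str):
    """Return (index, actual_char, expected_char) at the first difference.

    Returns None when the strings are identical. If one string is a prefix
    of the other, the missing side is reported as an empty string.
    """
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i, a, e
    if len(actual) != len(expected):
        i = min(len(actual), len(expected))
        return i, actual[i:i + 1], expected[i:i + 1]
    return None


# Illustrative inputs only (assumed, not copied from a CI log).
actual = '\x01{"col_int64": 1}]'
expected = '[{"col_int64": 1}]'

mismatch = first_mismatch(actual, expected)
if mismatch is not None:
    index, got, want = mismatch
    # repr() makes non-printing characters such as \x01 visible.
    print(f"index {index}: got {got!r}, expected {want!r}")
```

Reporting the index and the `repr` of both characters makes it easy to confirm whether the corruption is always confined to the first byte of the output across failing runs.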

Labels: bug
