Skip to content

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Sep 4, 2025

Description

As pylibcudf is working to enable stream-ordered APIs and we add tests accordingly, those tests will all be running on non-default streams. Since we create those streams using rmm's APIs, by default they will be non-blocking streams that do not synchronize with the default stream. For such tests to be valid, any fixtures used in those tests must synchronize the streams used to create those fixtures before the tests queue up any work on the new streams (either synchronize the stream or enqueue an event on that stream for the test stream to wait on, but the latter is more complicated and probably unnecessary). Doing so ensures valid data since the host thread will block on the first synchronization, which will occur before any work is queued on the new stream that could use data on the old one.

We've been seeing the io/test_json.py::test_write_json_basic[100-source_or_sink1-False-100-stream1] pylibcudf test fail intermittently. The failure is always in "wheel-tests-cudf / 12.9.1, 3.13, arm64, ubuntu22.04, a100, latest-driver, latest-deps". Based on the matrix of tests that we run in PRs for conda and wheels, we have seen both x86 + Python 3.13 and arm + Python 3.12 succeed, and we've seen the same driver and hardware also pass with other matrix runs, so it's not immediately clear what variable or combination of variables is implicated. We have attempted to reproduce it consistently in CI in #19865, but have yet to find a way to see it happen regularly. Here are some previous runs showing the error:

The only consistent fact is that the failing test is the first one in the test_json.py file to run on a non-default stream. That makes stream-ordering a very likely culprit. Upon inspection of the test suite, I noticed the lack of synchronization of the streams. I don't know for sure if this is the problem, but it seems like a plausible culprit. If we stop seeing this failure consistently once this PR merges, then we can go through and update the rest of our fixtures as well (we should do that anyway, but I want this PR in to see if it resolves the JSON test issue).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr vyasr self-assigned this Sep 4, 2025
@vyasr vyasr requested a review from a team as a code owner September 4, 2025 19:35
@vyasr vyasr requested review from mroeschke and wence- September 4, 2025 19:35
@vyasr vyasr added tests Unit testing for project non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Sep 4, 2025
@vyasr
Copy link
Contributor Author

vyasr commented Sep 4, 2025

/merge

@rapids-bot rapids-bot bot merged commit 532e6b9 into rapidsai:branch-25.10 Sep 5, 2025
131 checks passed
@vyasr vyasr deleted the test/stream_sync_fixtures branch September 5, 2025 18:11
@vyasr
Copy link
Contributor Author

vyasr commented Sep 5, 2025

It turns out that due to rapidsai/rmm#2029 this PR is actually not necessary for correctness AFAIU.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improvement / enhancement to an existing function non-breaking Non-breaking change tests Unit testing for project

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants