vyasr
Contributor

@vyasr vyasr commented Sep 2, 2025

Description

This PR should not be merged; it exists solely to reproduce an intermittent failure that we have been observing in CI.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr vyasr added the DO NOT MERGE Hold off on merging; see PR for details label Sep 2, 2025
@vyasr vyasr requested a review from a team as a code owner September 2, 2025 17:39
@vyasr vyasr requested a review from msarahan September 2, 2025 17:39
rapids-bot bot pushed a commit that referenced this pull request Sep 5, 2025
As pylibcudf works to enable stream-ordered APIs and we add tests accordingly, those tests will all run on non-default streams. Since we create those streams using rmm's APIs, by default they are non-blocking streams that do not synchronize with the default stream. For such tests to be valid, any fixture used in them must synchronize the stream on which the fixture was created before the test queues any work on the new stream. (An alternative is to enqueue an event on the fixture's stream for the test stream to wait on, but that is more complicated and probably unnecessary.) Synchronizing ensures valid data: the host thread blocks on the fixture's stream first, and that happens before any work that could use the fixture's data is queued on the new stream.
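
The ordering hazard described above can be sketched without a GPU. The toy model below is an illustration only and uses nothing from rmm or pylibcudf: `ToyStream`, `make_fixture`, and `run_test` are hypothetical names, and each "stream" is assumed to behave like a worker thread that runs its own tasks in FIFO order but is not ordered against any other stream.

```python
import queue
import threading
import time


class ToyStream:
    """Toy stand-in for a GPU stream: a worker thread that runs tasks in FIFO order."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def enqueue(self, task):
        """Queue work and return immediately, like an asynchronous launch."""
        self._tasks.put(task)

    def synchronize(self):
        """Block the host thread until all work queued so far has finished."""
        self._tasks.join()


def make_fixture(stream, delay=0.2):
    """Hypothetical fixture: fills a buffer asynchronously on `stream`."""
    buffer = {"data": None}

    def fill():
        time.sleep(delay)  # simulate a slow kernel or copy
        buffer["data"] = [1, 2, 3]

    stream.enqueue(fill)
    return buffer


def run_test(sync_fixture_stream):
    """Read the fixture from a second stream, optionally synchronizing first."""
    fixture_stream = ToyStream()  # plays the role of the default stream
    test_stream = ToyStream()     # independent stream, not ordered with the first
    buffer = make_fixture(fixture_stream)
    if sync_fixture_stream:
        # The host blocks here, so the fill is complete before the test
        # stream receives any work that reads the buffer.
        fixture_stream.synchronize()
    seen = []
    test_stream.enqueue(lambda: seen.append(buffer["data"]))
    test_stream.synchronize()
    fixture_stream.synchronize()  # drain outstanding work before returning
    return seen[0]
```

With `run_test(True)` the read always observes the filled buffer. With `run_test(False)` the read will usually run before the fill completes and observe stale data, and because that outcome depends on timing it is the kind of failure that shows up only intermittently.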

We've been seeing the `io/test_json.py::test_write_json_basic[100-source_or_sink1-False-100-stream1]` pylibcudf test fail intermittently. The failure is always in "wheel-tests-cudf / 12.9.1, 3.13, arm64, ubuntu22.04, a100, latest-driver, latest-deps". Based on the matrix of tests that we run in PRs for conda and wheels, we have seen both x86 + Python 3.13 and arm + Python 3.12 succeed, and we've seen the same driver and hardware also pass with other matrix runs, so it's not immediately clear what variable or combination of variables is implicated. We have attempted to reproduce it consistently in CI in #19865, but have yet to find a way to see it happen regularly. Here are some previous runs showing the error:
- https://github.com/rapidsai/cudf/actions/runs/17078533043/job/48428636573?pr=19738
- https://github.com/rapidsai/cudf/actions/runs/17088133288/job/48458607661?pr=19729
- https://github.com/rapidsai/cudf/actions/runs/17078069473/job/48427585028
- https://github.com/rapidsai/cudf/actions/runs/17108385574/job/48525491249?pr=19743#step:11:416

The only consistent fact is that the failing test is the first one in the test_json.py file to run on a non-default stream, which makes stream ordering a very likely culprit. Upon inspecting the test suite, I noticed that the streams are not synchronized. I don't know for sure that this is the problem, but it seems plausible. If the failure stops appearing once this PR merges, we can go through and update the rest of our fixtures as well (we should do that anyway, but I want this PR in to see whether it resolves the JSON test issue).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #19889
@vyasr vyasr force-pushed the test/json_stream_failure branch from 0ff778b to c7d0a64 Compare September 8, 2025 17:34
@vyasr vyasr requested a review from a team as a code owner September 8, 2025 17:34
@vyasr vyasr requested review from bdice and mroeschke September 8, 2025 17:34
@vyasr vyasr force-pushed the test/json_stream_failure branch from c7d0a64 to 08e687a Compare September 8, 2025 17:35
@vyasr vyasr force-pushed the test/json_stream_failure branch from 08e687a to 7bc8352 Compare September 9, 2025 19:03
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Sep 16, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Sep 16, 2025
@vyasr vyasr changed the base branch from branch-25.10 to branch-25.12 September 24, 2025 17:45
Labels

DO NOT MERGE Hold off on merging; see PR for details pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.

Projects

Status: In Progress

2 participants