@TomAugspurger TomAugspurger commented Oct 9, 2025

Description

The so-called "low-overhead" tracing added in #19895 can have measurable overhead in some cases (see below).

This PR adds configuration options to control which metrics are collected when tracing is enabled. By default no traces are collected, which adds no overhead. With CUDF_POLARS_LOG_TRACES=1, all tracing is enabled, which includes information on memory from RMM and NVML, and information on the input / output dataframes from cudf-polars. Users can disable certain metrics by setting another environment variable. For example, this would disable logging of memory (from RMM and NVML):

CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 python ...

and this would disable the memory and dataframe-related metrics:

CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0 python ...
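To make the interaction between the master switch and the per-metric switches concrete, here is a minimal sketch of how the three environment variables could be resolved into flags. It is not the code in this PR: `TraceSettings`, `resolve_trace_settings`, and `_env_flag` are hypothetical names, and the rule for which strings count as "off" is an assumption.

```python
import os
from dataclasses import dataclass


def _env_flag(env, name, default):
    """Read a boolean flag from a mapping of environment variables.

    Assumption: an empty string, "0", or "false" means off; anything else is on.
    """
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in {"", "0", "false"}


@dataclass(frozen=True)
class TraceSettings:  # hypothetical name, not the cudf-polars API
    enabled: bool
    memory: bool
    dataframes: bool


def resolve_trace_settings(env=None):
    """Resolve the mixed opt-in / opt-out scheme described above."""
    env = os.environ if env is None else env
    # Master switch: tracing is off unless explicitly enabled.
    enabled = _env_flag(env, "CUDF_POLARS_LOG_TRACES", default=False)
    # Per-metric switches default to on, but only matter when tracing is enabled.
    memory = enabled and _env_flag(env, "CUDF_POLARS_LOG_TRACES_MEMORY", default=True)
    dataframes = enabled and _env_flag(
        env, "CUDF_POLARS_LOG_TRACES_DATAFRAMES", default=True
    )
    return TraceSettings(enabled=enabled, memory=memory, dataframes=dataframes)
```

With this sketch, setting only CUDF_POLARS_LOG_TRACES_MEMORY=1 without the master switch leaves everything disabled, which matches the "opt in to everything, opt out of specific metrics" behavior.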

This boxplot shows the runtime of our PDSH benchmarks at SF-3K with the distributed scheduler, using 8 workers with an H100 each, 5 iterations per run. Three runs are shown:

  1. "on": tracing was enabled with CUDF_POLARS_LOG_TRACES=1
  2. "off": tracing was not enabled
  3. "time-only": tracing was enabled, but memory and dataframe metrics were disabled, with CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0
[Figure: tracing-overhead boxplot]

The interesting parts are the large gaps between the "on" box and the "off" / "time-only" boxes, which I've highlighted. These indicate that the tracing overhead is relatively large with all the metrics turned on. The limited tracing that only measures durations doesn't have that same overhead: the "off" and "time-only" boxes overlap.
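For anyone who wants to repeat a comparison like this on their own workload, the three configurations can be expressed as environment overrides for a subprocess. This is only a sketch, not the harness used for the benchmark above: `RUNS`, `build_env`, and `time_run` are hypothetical helpers, and the command to run is left to the caller.

```python
import os
import subprocess
import time

# Environment overrides for the three runs described above.
RUNS = {
    "off": {},
    "on": {"CUDF_POLARS_LOG_TRACES": "1"},
    "time-only": {
        "CUDF_POLARS_LOG_TRACES": "1",
        "CUDF_POLARS_LOG_TRACES_MEMORY": "0",
        "CUDF_POLARS_LOG_TRACES_DATAFRAMES": "0",
    },
}

TRACE_PREFIX = "CUDF_POLARS_LOG_TRACES"


def build_env(label, base=None):
    """Return a copy of the base environment configured for one run."""
    base = dict(os.environ if base is None else base)
    # Drop any tracing variables inherited from the caller so runs don't leak
    # settings into each other.
    env = {k: v for k, v in base.items() if not k.startswith(TRACE_PREFIX)}
    env.update(RUNS[label])
    return env


def time_run(label, command, iterations=5):
    """Run `command` (a list of arguments) repeatedly and return wall-clock durations."""
    env = build_env(label)
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        subprocess.run(command, env=env, check=True)
        durations.append(time.perf_counter() - start)
    return durations
```

Collecting `time_run(label, command)` for each key in `RUNS` gives the per-run samples that a boxplot would summarize.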


A note on the implementation: I wasn't sure whether to make things opt-in or opt-out. Right now we have a mix (opt in to everything with CUDF_POLARS_LOG_TRACES=1, and opt out of specific metrics with CUDF_POLARS_LOG_TRACES_MEMORY=0). We could easily make it opt-in to specific metrics (e.g. CUDF_POLARS_LOG_TRACES_MEMORY=1 would enable just memory, CUDF_POLARS_LOG_TRACES_DATAFRAMES=1 would enable just dataframe tracing). Neither option seemed obviously better to me.
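To illustrate the trade-off, here is a small sketch of the two policies side by side. Both functions are hypothetical and simplified (only the exact string "1" turns a flag on, and only "0" turns a metric off); they are meant to show how the same environment resolves differently, not how cudf-polars parses its configuration.

```python
MEMORY = "CUDF_POLARS_LOG_TRACES_MEMORY"
DATAFRAMES = "CUDF_POLARS_LOG_TRACES_DATAFRAMES"
MASTER = "CUDF_POLARS_LOG_TRACES"


def resolve_mixed(env):
    """Current scheme: opt in to everything, then opt out of specific metrics."""
    enabled = env.get(MASTER, "0") == "1"
    return {
        "memory": enabled and env.get(MEMORY, "1") != "0",
        "dataframes": enabled and env.get(DATAFRAMES, "1") != "0",
    }


def resolve_opt_in(env):
    """Alternative scheme: each metric is enabled only by its own variable."""
    return {
        "memory": env.get(MEMORY, "0") == "1",
        "dataframes": env.get(DATAFRAMES, "0") == "1",
    }


# Setting only the memory variable enables nothing under the mixed scheme,
# but enables just memory tracing under the opt-in scheme.
only_memory = {MEMORY: "1"}
print(resolve_mixed(only_memory))
print(resolve_opt_in(only_memory))
```

The mixed scheme keeps a single switch for "give me everything", while the opt-in scheme makes each metric's cost explicit at the point where it is requested.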

copy-pr-bot bot commented Oct 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Oct 9, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Oct 9, 2025
@TomAugspurger TomAugspurger added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 9, 2025
@TomAugspurger TomAugspurger changed the title from "Configuration for which metrics are enabled during tracing=" to "Configuration for which metrics are enabled during tracing" Oct 9, 2025
@TomAugspurger TomAugspurger marked this pull request as ready for review October 9, 2025 14:26
@TomAugspurger TomAugspurger requested a review from a team as a code owner October 9, 2025 14:26
@TomAugspurger

/merge

@rapids-bot rapids-bot bot merged commit c0d228e into rapidsai:branch-25.12 Oct 10, 2025
236 of 253 checks passed
@TomAugspurger TomAugspurger deleted the tom/tracing-config branch October 10, 2025 16:07
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Oct 10, 2025