Configuration for which metrics are enabled during tracing #20223
Description
The so-called "low-overhead" tracing added in #19895 can have some measurable overhead in some cases (see below).
This PR adds configuration options to control which metrics are collected when tracing is enabled. The default is to collect no traces, which adds no overhead. With CUDF_POLARS_LOG_TRACES=1, all tracing is enabled, which includes information on memory from RMM and NVML, and information on the input / output dataframes from cudf-polars. Users can disable certain metrics by setting another environment variable. For example, this would disable logging of memory (from RMM and NVML):
CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0
and this would disable both the memory and dataframe-related metrics:
CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0
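As a rough illustration of the behavior described above, here is a minimal sketch of how the three environment variables could be resolved into a set of switches. The `TraceConfig` class and the `_flag` and `resolve_trace_config` helpers are hypothetical names made up for this example, and treating only the literal string "1" as "on" is an assumption; this is not the code in this PR.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceConfig:
    """Hypothetical container for the resolved switches (not a cudf-polars type)."""

    enabled: bool
    memory: bool
    dataframes: bool


def _flag(env, name, default):
    """Return True for "1", False for any other value, and `default` when unset."""
    value = env.get(name)
    if value is None:
        return default
    return value == "1"


def resolve_trace_config(env=None):
    """Opt in to everything with the main switch, opt out of individual metrics."""
    env = os.environ if env is None else env
    enabled = _flag(env, "CUDF_POLARS_LOG_TRACES", False)
    return TraceConfig(
        enabled=enabled,
        # Per-metric switches default to on, but only matter when tracing is on.
        memory=enabled and _flag(env, "CUDF_POLARS_LOG_TRACES_MEMORY", True),
        dataframes=enabled and _flag(env, "CUDF_POLARS_LOG_TRACES_DATAFRAMES", True),
    )
```

In this sketch, setting only CUDF_POLARS_LOG_TRACES_MEMORY=0 without CUDF_POLARS_LOG_TRACES=1 leaves everything disabled, matching the default of collecting no traces.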
This boxplot shows the runtime of our PDSH benchmarks at SF-3K with the distributed scheduler, using 8 workers with an H100 each, 5 iterations per run. Three runs are shown:
off: tracing disabled (the default)
on: CUDF_POLARS_LOG_TRACES=1
time-only: CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0
The interesting parts are the large gaps between the "on" box and the "off" / "time-only" boxes, which I've highlighted. These indicate that the tracing overhead is relatively large with all the metrics turned on. The limited tracing that only measures durations doesn't show that same overhead: the "off" and "time-only" boxes overlap.
A note on the implementation: I wasn't sure whether to make things opt-in or opt-out. Right now we have a mix (opt in to everything with CUDF_POLARS_LOG_TRACES=1, and opt out of specific metrics with CUDF_POLARS_LOG_TRACES_MEMORY=0). We could easily make it opt-in to specific metrics (e.g. CUDF_POLARS_LOG_TRACES_MEMORY=1 would enable just memory, CUDF_POLARS_LOG_TRACES_DATAFRAMES=1 would enable just dataframe tracing). Neither option seemed obviously better to me.
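For comparison, the opt-in alternative might look roughly like the sketch below. The function name `enabled_metrics_opt_in` and the metric labels are hypothetical. The sketch also assumes that a per-metric variable turns tracing on by itself, and that durations are recorded whenever any tracing is active. The description above does not settle either point.

```python
def enabled_metrics_opt_in(env):
    """Return the set of metrics to collect under a purely opt-in policy.

    `env` is a mapping of environment variable names to string values,
    for example `os.environ`.
    """
    metrics = set()
    if env.get("CUDF_POLARS_LOG_TRACES_MEMORY") == "1":
        metrics.add("memory")
    if env.get("CUDF_POLARS_LOG_TRACES_DATAFRAMES") == "1":
        metrics.add("dataframes")
    # Assumption: durations come along with any enabled tracing.
    if metrics or env.get("CUDF_POLARS_LOG_TRACES") == "1":
        metrics.add("duration")
    return metrics
```

Under this policy CUDF_POLARS_LOG_TRACES=1 alone would give the cheap duration-only tracing, whereas with the current mix it enables every metric.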