bdice
Contributor

@bdice bdice commented Sep 24, 2025

Description

Closes #1931

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Sep 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@bdice bdice changed the base branch from branch-25.10 to branch-25.12 September 25, 2025 20:23

copy-pr-bot bot commented Sep 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@bdice
Contributor Author

bdice commented Sep 26, 2025

Does the benchmark in this PR seem useful? The results I get locally are:

--------------------------------------------------------------------------------
Benchmark                              Time       CPU Iterations UserCounters...
--------------------------------------------------------------------------------
BM_AsyncPrimingImpact/primed         209 us    209 us       3171 first_round_throughput=759.964T latency_to_first_ns=4.059k second_round_throughput=491.074T
BM_AsyncPrimingImpact/unprimed       218 us    218 us       3018 first_round_throughput=760.279T latency_to_first_ns=3.971k second_round_throughput=489.498T
BM_AsyncConstructionTime/primed   471741 us 471732 us          2 construction_time_ns=17.6421M
BM_AsyncConstructionTime/unprimed   19.9 us   19.7 us      33521 construction_time_ns=15.864k

In summary: I think priming costs a lot at MR construction time and I don't think it has any significant performance impact on subsequent allocation times. I am not sure if this benchmark is really a fair measurement or not.
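
To put the counters above in proportion, here is a small sketch that compares the primed and unprimed values. It only reuses numbers from the benchmark output; the variable names are illustrative and are not part of the benchmark itself.

```python
# Counter values copied from the benchmark output above (nanoseconds).
primed_construction_ns = 17.6421e6    # construction_time_ns=17.6421M
unprimed_construction_ns = 15.864e3   # construction_time_ns=15.864k

# Mean per-iteration times of BM_AsyncPrimingImpact (microseconds).
primed_alloc_us = 209.0
unprimed_alloc_us = 218.0

# How many times longer the primed constructor takes.
construction_ratio = primed_construction_ns / unprimed_construction_ns

# Relative difference of the allocation loop, as a percentage of the unprimed time.
alloc_diff_pct = 100.0 * (unprimed_alloc_us - primed_alloc_us) / unprimed_alloc_us

print(f"primed construction takes ~{construction_ratio:.0f}x longer")
print(f"allocation-loop time differs by {alloc_diff_pct:.1f}%")
```

On these numbers the constructor gap is roughly three orders of magnitude, while the allocation loop differs by only a few percent.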

I am going to run PDS-H workflows to get more data.

@bdice
Contributor Author

bdice commented Sep 27, 2025

Here are the PDS-H benchmarks with this PR. cc: @GregoryKimball

Hardware: V100 32GB, CUDA 12.9, driver 575.

Profiling command:

$ nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --env-var CUDA_VISIBLE_DEVICES=5,POLARS_GPU_ENABLE_CUDA_MANAGED_MEMORY=0 --output=branch-25.12 python pdsh.py all --path /velox-100 --rmm-async --no-validate -e streaming --suffix "" --iterations 2

Note: --rmm-async doesn't use the async allocator unless you also use the distributed scheduler. Instead, I relied on the environment variable POLARS_GPU_ENABLE_CUDA_MANAGED_MEMORY=0 to force the use of the async MR. See cudf#20129.

Overall Performance

Results for branch-25.12 (priming enabled):

Total mean time across all queries: 57.1874 seconds
branch-25.12 JSON results
{"queries": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], "suffix": "", "executor": "streaming", "scheduler": "synchronous", "n_workers": 1, "versions": {"cudf_polars": {"version": "25.12.00", "commit": ""}, "polars": "1.32.3", "python": "3.13.7", "rapidsmpf": null}, "records": {"1": [{"query": 1, "duration": 3.6291721519082785, "shuffle_stats": null}, {"query": 1, "duration": 2.5483336290344596, "shuffle_stats": null}], "2": [{"query": 2, "duration": 0.4386968738399446, "shuffle_stats": null}, {"query": 2, "duration": 0.39029303286224604, "shuffle_stats": null}], "3": [{"query": 3, "duration": 2.3394500641152263, "shuffle_stats": null}, {"query": 3, "duration": 2.33677189797163, "shuffle_stats": null}], "4": [{"query": 4, "duration": 1.3956746738404036, "shuffle_stats": null}, {"query": 4, "duration": 1.3462188951671124, "shuffle_stats": null}], "5": [{"query": 5, "duration": 2.9932695999741554, "shuffle_stats": null}, {"query": 5, "duration": 2.9835128230042756, "shuffle_stats": null}], "6": [{"query": 6, "duration": 1.3697084370069206, "shuffle_stats": null}, {"query": 6, "duration": 1.3562008100561798, "shuffle_stats": null}], "7": [{"query": 7, "duration": 3.3783102575689554, "shuffle_stats": null}, {"query": 7, "duration": 3.331870677880943, "shuffle_stats": null}], "8": [{"query": 8, "duration": 4.035966556053609, "shuffle_stats": null}, {"query": 8, "duration": 4.024940917268395, "shuffle_stats": null}], "9": [{"query": 9, "duration": 6.08514833310619, "shuffle_stats": null}, {"query": 9, "duration": 6.041446194052696, "shuffle_stats": null}], "10": [{"query": 10, "duration": 4.739399116951972, "shuffle_stats": null}, {"query": 10, "duration": 3.994611604139209, "shuffle_stats": null}], "11": [{"query": 11, "duration": 0.3541259649209678, "shuffle_stats": null}, {"query": 11, "duration": 0.3385001323185861, "shuffle_stats": null}], "12": [{"query": 12, "duration": 2.401626610662788, "shuffle_stats": null}, {"query": 12, 
"duration": 2.354862041771412, "shuffle_stats": null}], "13": [{"query": 13, "duration": 3.509920444339514, "shuffle_stats": null}, {"query": 13, "duration": 3.53349726786837, "shuffle_stats": null}], "14": [{"query": 14, "duration": 1.8350142962299287, "shuffle_stats": null}, {"query": 14, "duration": 1.8336891261860728, "shuffle_stats": null}], "15": [{"query": 15, "duration": 1.6993464292027056, "shuffle_stats": null}, {"query": 15, "duration": 1.6658845716156065, "shuffle_stats": null}], "16": [{"query": 16, "duration": 1.9789668591693044, "shuffle_stats": null}, {"query": 16, "duration": 1.9449447789229453, "shuffle_stats": null}], "17": [{"query": 17, "duration": 2.9797028959728777, "shuffle_stats": null}, {"query": 17, "duration": 3.007782726082951, "shuffle_stats": null}], "18": [{"query": 18, "duration": 2.3274956489913166, "shuffle_stats": null}, {"query": 18, "duration": 2.286834402009845, "shuffle_stats": null}], "19": [{"query": 19, "duration": 2.602522084955126, "shuffle_stats": null}, {"query": 19, "duration": 2.602529602125287, "shuffle_stats": null}], "20": [{"query": 20, "duration": 2.3213649629615247, "shuffle_stats": null}, {"query": 20, "duration": 2.285520649049431, "shuffle_stats": null}], "21": [{"query": 21, "duration": 5.445147569756955, "shuffle_stats": null}, {"query": 21, "duration": 5.445033299271017, "shuffle_stats": null}], "22": [{"query": 22, "duration": 0.44194518122822046, "shuffle_stats": null}, {"query": 22, "duration": 0.4195217848755419, "shuffle_stats": null}]}, "dataset_path": "/velox-100", "scale_factor": 100, "shuffle": null, "gather_shuffle_stats": false, "broadcast_join_limit": null, "blocksize": null, "max_rows_per_partition": null, "threads": 1, "iterations": 2, "timestamp": "2025-09-27T03:24:05.385135+00:00", "hardware": {"gpus": [{"name": "Tesla V100-SXM2-32GB", "index": 0, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 1, "free_memory": 
34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 2, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 3, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 4, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 5, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 6, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 7, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}]}, "rmm_async": true, "rapidsmpf_oom_protection": false, "rapidsmpf_spill": false, "spill_device": 0.5, "query_set": "pdsh", "stats_planning": false, "config_options": {"raise_on_fail": true, "parquet_options": {"chunked": true, "n_output_chunks": 1, "chunk_read_limit": 0, "pass_read_limit": 0, "max_footer_samples": 3, "max_row_group_samples": 1}, "executor": {"name": "streaming", "scheduler": "synchronous", "fallback_mode": "warn", "max_rows_per_partition": 1000000, "unique_fraction": {"c_custkey": 0.05, "l_orderkey": 1.0, "l_partkey": 0.1, "o_custkey": 0.25}, "target_partition_size": 1000000000, "groupby_n_ary": 32, "broadcast_join_limit": 32, "shuffle_method": "tasks", "rapidsmpf_spill": false, "sink_to_directory": false, "stats_planning": {"use_io_partitioning": true, "use_reduction_planning": false, "use_join_heuristics": true, "use_sampling": true, "default_selectivity": 0.8}}, "device": null}}

Results for async-priming (priming disabled):

Total mean time across all queries: 58.3694 seconds
async-priming JSON results
{"queries": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], "suffix": "", "executor": "streaming", "scheduler": "synchronous", "n_workers": 1, "versions": {"cudf_polars": {"version": "25.12.00", "commit": ""}, "polars": "1.32.3", "python": "3.13.7", "rapidsmpf": null}, "records": {"1": [{"query": 1, "duration": 3.6397354789078236, "shuffle_stats": null}, {"query": 1, "duration": 2.641171406954527, "shuffle_stats": null}], "2": [{"query": 2, "duration": 0.4379019048064947, "shuffle_stats": null}, {"query": 2, "duration": 0.3898875010199845, "shuffle_stats": null}], "3": [{"query": 3, "duration": 2.3516018842346966, "shuffle_stats": null}, {"query": 3, "duration": 2.3232951653189957, "shuffle_stats": null}], "4": [{"query": 4, "duration": 1.3856513118371367, "shuffle_stats": null}, {"query": 4, "duration": 1.3312080642208457, "shuffle_stats": null}], "5": [{"query": 5, "duration": 2.9755894858390093, "shuffle_stats": null}, {"query": 5, "duration": 3.6367362341843545, "shuffle_stats": null}], "6": [{"query": 6, "duration": 1.3454122920520604, "shuffle_stats": null}, {"query": 6, "duration": 1.3441804316826165, "shuffle_stats": null}], "7": [{"query": 7, "duration": 3.291757849045098, "shuffle_stats": null}, {"query": 7, "duration": 3.295462945010513, "shuffle_stats": null}], "8": [{"query": 8, "duration": 3.9435508949682117, "shuffle_stats": null}, {"query": 8, "duration": 3.90307339373976, "shuffle_stats": null}], "9": [{"query": 9, "duration": 5.971479965839535, "shuffle_stats": null}, {"query": 9, "duration": 7.103332689963281, "shuffle_stats": null}], "10": [{"query": 10, "duration": 3.6968108317814767, "shuffle_stats": null}, {"query": 10, "duration": 3.28237788612023, "shuffle_stats": null}], "11": [{"query": 11, "duration": 0.34587280498817563, "shuffle_stats": null}, {"query": 11, "duration": 0.33308263309299946, "shuffle_stats": null}], "12": [{"query": 12, "duration": 2.3720550439320505, "shuffle_stats": null}, {"query": 12, 
"duration": 2.3155864560976624, "shuffle_stats": null}], "13": [{"query": 13, "duration": 3.5541658513247967, "shuffle_stats": null}, {"query": 13, "duration": 3.4868649072013795, "shuffle_stats": null}], "14": [{"query": 14, "duration": 1.8379328828305006, "shuffle_stats": null}, {"query": 14, "duration": 1.8286241302266717, "shuffle_stats": null}], "15": [{"query": 15, "duration": 1.6941107548773289, "shuffle_stats": null}, {"query": 15, "duration": 1.6617944450117648, "shuffle_stats": null}], "16": [{"query": 16, "duration": 1.9736291877925396, "shuffle_stats": null}, {"query": 16, "duration": 1.9502548277378082, "shuffle_stats": null}], "17": [{"query": 17, "duration": 3.005155229009688, "shuffle_stats": null}, {"query": 17, "duration": 3.0089298533275723, "shuffle_stats": null}], "18": [{"query": 18, "duration": 2.3264262438751757, "shuffle_stats": null}, {"query": 18, "duration": 2.711486048065126, "shuffle_stats": null}], "19": [{"query": 19, "duration": 2.6837993119843304, "shuffle_stats": null}, {"query": 19, "duration": 2.5807790379039943, "shuffle_stats": null}], "20": [{"query": 20, "duration": 2.2770903329364955, "shuffle_stats": null}, {"query": 20, "duration": 2.263146643061191, "shuffle_stats": null}], "21": [{"query": 21, "duration": 5.44541895808652, "shuffle_stats": null}, {"query": 21, "duration": 5.420293725095689, "shuffle_stats": null}], "22": [{"query": 22, "duration": 0.42978309467434883, "shuffle_stats": null}, {"query": 22, "duration": 0.4130848487839103, "shuffle_stats": null}]}, "dataset_path": "/velox-100", "scale_factor": 100, "shuffle": null, "gather_shuffle_stats": false, "broadcast_join_limit": null, "blocksize": null, "max_rows_per_partition": null, "threads": 1, "iterations": 2, "timestamp": "2025-09-27T02:48:34.866401+00:00", "hardware": {"gpus": [{"name": "Tesla V100-SXM2-32GB", "index": 0, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 1, 
"free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 2, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 3, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 4, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 5, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 6, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 7, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}]}, "rmm_async": true, "rapidsmpf_oom_protection": false, "rapidsmpf_spill": false, "spill_device": 0.5, "query_set": "pdsh", "stats_planning": false, "config_options": {"raise_on_fail": true, "parquet_options": {"chunked": true, "n_output_chunks": 1, "chunk_read_limit": 0, "pass_read_limit": 0, "max_footer_samples": 3, "max_row_group_samples": 1}, "executor": {"name": "streaming", "scheduler": "synchronous", "fallback_mode": "warn", "max_rows_per_partition": 1000000, "unique_fraction": {"c_custkey": 0.05, "l_orderkey": 1.0, "l_partkey": 0.1, "o_custkey": 0.25}, "target_partition_size": 1000000000, "groupby_n_ary": 32, "broadcast_join_limit": 32, "shuffle_method": "tasks", "rapidsmpf_spill": false, "sink_to_directory": false, "stats_planning": {"use_io_partitioning": true, "use_reduction_planning": false, "use_join_heuristics": true, "use_sampling": true, "default_selectivity": 0.8}}, "device": null}}

I ran both branches 3 times and got results within 1-2 seconds of the above on both branches. Any changes in workflow performance are well within the run-to-run noise.
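
The reported totals are consistent with summing each query's mean duration over its iterations. That reading is inferred from the numbers, not taken from pdsh.py, so treat the following as a sketch under that assumption. The function name `total_mean_time` is hypothetical, and the input is a two-query excerpt of the branch-25.12 records rather than the full run.

```python
from statistics import mean


def total_mean_time(results: dict) -> float:
    """Sum, over all queries, the mean duration of that query's iterations."""
    return sum(
        mean(run["duration"] for run in runs)
        for runs in results["records"].values()
    )


# Two-query excerpt of the branch-25.12 JSON above (not the full result set).
excerpt = {
    "records": {
        "1": [
            {"query": 1, "duration": 3.6291721519082785, "shuffle_stats": None},
            {"query": 1, "duration": 2.5483336290344596, "shuffle_stats": None},
        ],
        "2": [
            {"query": 2, "duration": 0.4386968738399446, "shuffle_stats": None},
            {"query": 2, "duration": 0.39029303286224604, "shuffle_stats": None},
        ],
    }
}

print(f"excerpt total: {total_mean_time(excerpt):.4f} s")

# Relative gap between the two totals reported above.
primed_total_s = 57.1874     # branch-25.12 (priming enabled)
unprimed_total_s = 58.3694   # async-priming (priming disabled)
gap_pct = 100.0 * (unprimed_total_s - primed_total_s) / primed_total_s
print(f"gap between totals: {gap_pct:.2f}%")
```

Loading the full JSON with `json.load` and passing it to the same function should work the same way, since only the "records" key is read.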

Investigating profiles

Here is the async-priming (priming disabled) profile. I also looked at profiles for branch-25.12 (priming enabled) and the only significant difference I saw was in the first touch, described below.

image

First-touch access

On branch-25.12, the "first touch" of the memory is the priming step in the MR constructor. This is a large allocation (free / 2), so it takes considerable time to allocate and deallocate. Subsequent allocations are fast, as expected.

Name	Start	Duration	TID
cudaMallocFromPoolAsync	6.72649s	132.726 ms	670507
image
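
For a sense of scale, here is a rough sketch of what free / 2 works out to on this hardware. It assumes the free memory seen by the MR constructor equals the free_memory value recorded in the JSON hardware block above, which may not hold exactly at construction time.

```python
GIB = 1024 ** 3

# "free_memory" reported for each V100 in the JSON hardware block above.
free_memory_bytes = 34_068_758_528

# Priming size as described above: half of the free device memory.
priming_bytes = free_memory_bytes // 2

print(f"priming allocation: {priming_bytes} bytes (~{priming_bytes / GIB:.2f} GiB)")
```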

On the async-priming branch, the first touch of the memory pool is a small allocation, only 16 bytes, but it takes a few milliseconds (probably doing work to get the pool set up). As above, subsequent allocations are fast, and take the same amount of time as in the primed pool. I did not see any differences in the timings for the first few allocations (some of which are bytes and some of which are several megabytes) between primed and unprimed.

Name	Start	Duration	TID
cudaMallocFromPoolAsync	1.53844s	9.473 ms	1278614
image

I would say that the cost of first-touch access shouldn't be a problem, because we are skipping the more expensive priming call in the memory resource constructor, which doesn't appear to show any benefits. Overall we are doing less work with priming disabled.
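
As a quick check of that claim, the sketch below compares only the two cudaMallocFromPoolAsync durations listed above. It ignores the matching deallocation on the primed branch, so the real gap is probably somewhat larger.

```python
# First-touch durations from the two profile tables above (milliseconds).
primed_first_touch_ms = 132.726    # branch-25.12, priming allocation
unprimed_first_touch_ms = 9.473    # async-priming, first 16-byte allocation

saved_ms = primed_first_touch_ms - unprimed_first_touch_ms
ratio = primed_first_touch_ms / unprimed_first_touch_ms

print(f"first touch is {saved_ms:.3f} ms shorter without priming (~{ratio:.1f}x)")
```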

Contributor Author

Now that I have run the PDS-H benchmarks, I am less sure if adding this benchmark is really useful. Reviewers, please weigh in on whether you'd like to see this added or if it provides no value.

@bdice bdice marked this pull request as ready for review September 27, 2025 03:50
@bdice bdice requested review from a team as code owners September 27, 2025 03:50
@bdice bdice requested review from shrshi and harrism September 27, 2025 03:50
@bdice bdice added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Sep 27, 2025
@bdice bdice self-assigned this Sep 27, 2025
Member

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Approved trivial CMake changes

Contributor

@wence- wence- left a comment

I am neutral on the benchmark, but +1 on the priming change.

@bdice
Contributor Author

bdice commented Sep 29, 2025

/merge

@rapids-bot rapids-bot bot merged commit ade7c9f into rapidsai:branch-25.12 Sep 29, 2025
78 checks passed
@GregoryKimball

Here is a visualization for the data @bdice posted above.
image

It looks like this PDS-H run used the query setting all. I'm not seeing any difference in Q1 execution in the first iteration with and without priming. Looks good to me!

meta-codesync bot pushed a commit to facebookincubator/velox that referenced this pull request Oct 2, 2025
Summary:
This PR skips the default "priming" step of the RMM async memory resource. This reduces initialization costs and has benefits for multi-process applications.

xref:
- rapidsai/rmm#2060
- rapidsai/rmm#1931
- rapidsai/rmm#2051

Pull Request resolved: #14997

Reviewed By: Yuhta

Differential Revision: D83668107

Pulled By: kgpai

fbshipit-source-id: 41b6bd5807f60b0e1c76a1c91dd26f7e5451255a
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[PERF] Measure impact of async allocator priming the memory pool
5 participants