Disable async MR priming by default #2051

bdice · 2025-09-24T20:21:39Z

Description

Closes #1931

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-09-24T20:21:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot · 2025-09-25T22:30:09Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

bdice · 2025-09-26T01:48:53Z

Does the benchmark in this PR seem useful? The results I get locally are:

--------------------------------------------------------------------------------
Benchmark                              Time       CPU Iterations UserCounters...
--------------------------------------------------------------------------------
BM_AsyncPrimingImpact/primed         209 us    209 us       3171 first_round_throughput=759.964T latency_to_first_ns=4.059k second_round_throughput=491.074T
BM_AsyncPrimingImpact/unprimed       218 us    218 us       3018 first_round_throughput=760.279T latency_to_first_ns=3.971k second_round_throughput=489.498T
BM_AsyncConstructionTime/primed   471741 us 471732 us          2 construction_time_ns=17.6421M
BM_AsyncConstructionTime/unprimed   19.9 us   19.7 us      33521 construction_time_ns=15.864k

In summary: I think priming costs a lot at MR construction time and I don't think it has any significant performance impact on subsequent allocation times. I am not sure if this benchmark is really a fair measurement or not.
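To make the gap concrete, here is a minimal sketch that computes the primed/unprimed ratios from the user counters in the output above. It assumes the `k` and `M` suffixes are SI multipliers (1e3 and 1e6); the variable and function names are illustrative and not part of the benchmark.

```python
# Counter values copied from the benchmark output above, converted to
# nanoseconds under the assumption that k = 1e3 and M = 1e6.
construction_time_ns = {"primed": 17.6421e6, "unprimed": 15.864e3}
latency_to_first_ns = {"primed": 4.059e3, "unprimed": 3.971e3}


def primed_over_unprimed(counters):
    """Return how many times larger the primed value is than the unprimed one."""
    return counters["primed"] / counters["unprimed"]


construction_ratio = primed_over_unprimed(construction_time_ns)
latency_ratio = primed_over_unprimed(latency_to_first_ns)

print(f"construction time, primed / unprimed: {construction_ratio:.0f}x")
print(f"latency to first allocation, primed / unprimed: {latency_ratio:.3f}x")
```

Under these assumptions, construction is roughly three orders of magnitude slower with priming, while the latency to the first allocation differs by only a few percent.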

I am going to run PDS-H workflows to get more data.

bdice · 2025-09-27T03:48:32Z

Here are the PDS-H benchmarks with this PR. cc: @GregoryKimball

Hardware: V100 32GB, CUDA 12.9, driver 575.

Profiling command:

$ nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --env-var CUDA_VISIBLE_DEVICES=5,POLARS_GPU_ENABLE_CUDA_MANAGED_MEMORY=0 --output=branch-25.12 python pdsh.py all --path /velox-100 --rmm-async --no-validate -e streaming --suffix "" --iterations 2

Note: --rmm-async doesn't use the async allocator unless you also use the distributed scheduler. Instead, I relied on the environment variable POLARS_GPU_ENABLE_CUDA_MANAGED_MEMORY=0 to force the use of the async MR. See cudf#20129.

Overall Performance

Results for branch-25.12 (priming enabled):

Total mean time across all queries: 57.1874 seconds

branch-25.12 JSON results

{"queries": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], "suffix": "", "executor": "streaming", "scheduler": "synchronous", "n_workers": 1, "versions": {"cudf_polars": {"version": "25.12.00", "commit": ""}, "polars": "1.32.3", "python": "3.13.7", "rapidsmpf": null}, "records": {"1": [{"query": 1, "duration": 3.6291721519082785, "shuffle_stats": null}, {"query": 1, "duration": 2.5483336290344596, "shuffle_stats": null}], "2": [{"query": 2, "duration": 0.4386968738399446, "shuffle_stats": null}, {"query": 2, "duration": 0.39029303286224604, "shuffle_stats": null}], "3": [{"query": 3, "duration": 2.3394500641152263, "shuffle_stats": null}, {"query": 3, "duration": 2.33677189797163, "shuffle_stats": null}], "4": [{"query": 4, "duration": 1.3956746738404036, "shuffle_stats": null}, {"query": 4, "duration": 1.3462188951671124, "shuffle_stats": null}], "5": [{"query": 5, "duration": 2.9932695999741554, "shuffle_stats": null}, {"query": 5, "duration": 2.9835128230042756, "shuffle_stats": null}], "6": [{"query": 6, "duration": 1.3697084370069206, "shuffle_stats": null}, {"query": 6, "duration": 1.3562008100561798, "shuffle_stats": null}], "7": [{"query": 7, "duration": 3.3783102575689554, "shuffle_stats": null}, {"query": 7, "duration": 3.331870677880943, "shuffle_stats": null}], "8": [{"query": 8, "duration": 4.035966556053609, "shuffle_stats": null}, {"query": 8, "duration": 4.024940917268395, "shuffle_stats": null}], "9": [{"query": 9, "duration": 6.08514833310619, "shuffle_stats": null}, {"query": 9, "duration": 6.041446194052696, "shuffle_stats": null}], "10": [{"query": 10, "duration": 4.739399116951972, "shuffle_stats": null}, {"query": 10, "duration": 3.994611604139209, "shuffle_stats": null}], "11": [{"query": 11, "duration": 0.3541259649209678, "shuffle_stats": null}, {"query": 11, "duration": 0.3385001323185861, "shuffle_stats": null}], "12": [{"query": 12, "duration": 2.401626610662788, "shuffle_stats": null}, {"query": 12, 
"duration": 2.354862041771412, "shuffle_stats": null}], "13": [{"query": 13, "duration": 3.509920444339514, "shuffle_stats": null}, {"query": 13, "duration": 3.53349726786837, "shuffle_stats": null}], "14": [{"query": 14, "duration": 1.8350142962299287, "shuffle_stats": null}, {"query": 14, "duration": 1.8336891261860728, "shuffle_stats": null}], "15": [{"query": 15, "duration": 1.6993464292027056, "shuffle_stats": null}, {"query": 15, "duration": 1.6658845716156065, "shuffle_stats": null}], "16": [{"query": 16, "duration": 1.9789668591693044, "shuffle_stats": null}, {"query": 16, "duration": 1.9449447789229453, "shuffle_stats": null}], "17": [{"query": 17, "duration": 2.9797028959728777, "shuffle_stats": null}, {"query": 17, "duration": 3.007782726082951, "shuffle_stats": null}], "18": [{"query": 18, "duration": 2.3274956489913166, "shuffle_stats": null}, {"query": 18, "duration": 2.286834402009845, "shuffle_stats": null}], "19": [{"query": 19, "duration": 2.602522084955126, "shuffle_stats": null}, {"query": 19, "duration": 2.602529602125287, "shuffle_stats": null}], "20": [{"query": 20, "duration": 2.3213649629615247, "shuffle_stats": null}, {"query": 20, "duration": 2.285520649049431, "shuffle_stats": null}], "21": [{"query": 21, "duration": 5.445147569756955, "shuffle_stats": null}, {"query": 21, "duration": 5.445033299271017, "shuffle_stats": null}], "22": [{"query": 22, "duration": 0.44194518122822046, "shuffle_stats": null}, {"query": 22, "duration": 0.4195217848755419, "shuffle_stats": null}]}, "dataset_path": "/velox-100", "scale_factor": 100, "shuffle": null, "gather_shuffle_stats": false, "broadcast_join_limit": null, "blocksize": null, "max_rows_per_partition": null, "threads": 1, "iterations": 2, "timestamp": "2025-09-27T03:24:05.385135+00:00", "hardware": {"gpus": [{"name": "Tesla V100-SXM2-32GB", "index": 0, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 1, "free_memory": 
34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 2, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 3, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 4, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 5, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 6, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 7, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}]}, "rmm_async": true, "rapidsmpf_oom_protection": false, "rapidsmpf_spill": false, "spill_device": 0.5, "query_set": "pdsh", "stats_planning": false, "config_options": {"raise_on_fail": true, "parquet_options": {"chunked": true, "n_output_chunks": 1, "chunk_read_limit": 0, "pass_read_limit": 0, "max_footer_samples": 3, "max_row_group_samples": 1}, "executor": {"name": "streaming", "scheduler": "synchronous", "fallback_mode": "warn", "max_rows_per_partition": 1000000, "unique_fraction": {"c_custkey": 0.05, "l_orderkey": 1.0, "l_partkey": 0.1, "o_custkey": 0.25}, "target_partition_size": 1000000000, "groupby_n_ary": 32, "broadcast_join_limit": 32, "shuffle_method": "tasks", "rapidsmpf_spill": false, "sink_to_directory": false, "stats_planning": {"use_io_partitioning": true, "use_reduction_planning": false, "use_join_heuristics": true, "use_sampling": true, "default_selectivity": 0.8}}, "device": null}}

Results for async-priming (priming disabled):

Total mean time across all queries: 58.3694 seconds

async-priming JSON results

{"queries": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], "suffix": "", "executor": "streaming", "scheduler": "synchronous", "n_workers": 1, "versions": {"cudf_polars": {"version": "25.12.00", "commit": ""}, "polars": "1.32.3", "python": "3.13.7", "rapidsmpf": null}, "records": {"1": [{"query": 1, "duration": 3.6397354789078236, "shuffle_stats": null}, {"query": 1, "duration": 2.641171406954527, "shuffle_stats": null}], "2": [{"query": 2, "duration": 0.4379019048064947, "shuffle_stats": null}, {"query": 2, "duration": 0.3898875010199845, "shuffle_stats": null}], "3": [{"query": 3, "duration": 2.3516018842346966, "shuffle_stats": null}, {"query": 3, "duration": 2.3232951653189957, "shuffle_stats": null}], "4": [{"query": 4, "duration": 1.3856513118371367, "shuffle_stats": null}, {"query": 4, "duration": 1.3312080642208457, "shuffle_stats": null}], "5": [{"query": 5, "duration": 2.9755894858390093, "shuffle_stats": null}, {"query": 5, "duration": 3.6367362341843545, "shuffle_stats": null}], "6": [{"query": 6, "duration": 1.3454122920520604, "shuffle_stats": null}, {"query": 6, "duration": 1.3441804316826165, "shuffle_stats": null}], "7": [{"query": 7, "duration": 3.291757849045098, "shuffle_stats": null}, {"query": 7, "duration": 3.295462945010513, "shuffle_stats": null}], "8": [{"query": 8, "duration": 3.9435508949682117, "shuffle_stats": null}, {"query": 8, "duration": 3.90307339373976, "shuffle_stats": null}], "9": [{"query": 9, "duration": 5.971479965839535, "shuffle_stats": null}, {"query": 9, "duration": 7.103332689963281, "shuffle_stats": null}], "10": [{"query": 10, "duration": 3.6968108317814767, "shuffle_stats": null}, {"query": 10, "duration": 3.28237788612023, "shuffle_stats": null}], "11": [{"query": 11, "duration": 0.34587280498817563, "shuffle_stats": null}, {"query": 11, "duration": 0.33308263309299946, "shuffle_stats": null}], "12": [{"query": 12, "duration": 2.3720550439320505, "shuffle_stats": null}, {"query": 12, 
"duration": 2.3155864560976624, "shuffle_stats": null}], "13": [{"query": 13, "duration": 3.5541658513247967, "shuffle_stats": null}, {"query": 13, "duration": 3.4868649072013795, "shuffle_stats": null}], "14": [{"query": 14, "duration": 1.8379328828305006, "shuffle_stats": null}, {"query": 14, "duration": 1.8286241302266717, "shuffle_stats": null}], "15": [{"query": 15, "duration": 1.6941107548773289, "shuffle_stats": null}, {"query": 15, "duration": 1.6617944450117648, "shuffle_stats": null}], "16": [{"query": 16, "duration": 1.9736291877925396, "shuffle_stats": null}, {"query": 16, "duration": 1.9502548277378082, "shuffle_stats": null}], "17": [{"query": 17, "duration": 3.005155229009688, "shuffle_stats": null}, {"query": 17, "duration": 3.0089298533275723, "shuffle_stats": null}], "18": [{"query": 18, "duration": 2.3264262438751757, "shuffle_stats": null}, {"query": 18, "duration": 2.711486048065126, "shuffle_stats": null}], "19": [{"query": 19, "duration": 2.6837993119843304, "shuffle_stats": null}, {"query": 19, "duration": 2.5807790379039943, "shuffle_stats": null}], "20": [{"query": 20, "duration": 2.2770903329364955, "shuffle_stats": null}, {"query": 20, "duration": 2.263146643061191, "shuffle_stats": null}], "21": [{"query": 21, "duration": 5.44541895808652, "shuffle_stats": null}, {"query": 21, "duration": 5.420293725095689, "shuffle_stats": null}], "22": [{"query": 22, "duration": 0.42978309467434883, "shuffle_stats": null}, {"query": 22, "duration": 0.4130848487839103, "shuffle_stats": null}]}, "dataset_path": "/velox-100", "scale_factor": 100, "shuffle": null, "gather_shuffle_stats": false, "broadcast_join_limit": null, "blocksize": null, "max_rows_per_partition": null, "threads": 1, "iterations": 2, "timestamp": "2025-09-27T02:48:34.866401+00:00", "hardware": {"gpus": [{"name": "Tesla V100-SXM2-32GB", "index": 0, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 1, 
"free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 2, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 3, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 4, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 5, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 6, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}, {"name": "Tesla V100-SXM2-32GB", "index": 7, "free_memory": 34068758528, "used_memory": 290979840, "total_memory": 34359738368}]}, "rmm_async": true, "rapidsmpf_oom_protection": false, "rapidsmpf_spill": false, "spill_device": 0.5, "query_set": "pdsh", "stats_planning": false, "config_options": {"raise_on_fail": true, "parquet_options": {"chunked": true, "n_output_chunks": 1, "chunk_read_limit": 0, "pass_read_limit": 0, "max_footer_samples": 3, "max_row_group_samples": 1}, "executor": {"name": "streaming", "scheduler": "synchronous", "fallback_mode": "warn", "max_rows_per_partition": 1000000, "unique_fraction": {"c_custkey": 0.05, "l_orderkey": 1.0, "l_partkey": 0.1, "o_custkey": 0.25}, "target_partition_size": 1000000000, "groupby_n_ary": 32, "broadcast_join_limit": 32, "shuffle_method": "tasks", "rapidsmpf_spill": false, "sink_to_directory": false, "stats_planning": {"use_io_partitioning": true, "use_reduction_planning": false, "use_join_heuristics": true, "use_sampling": true, "default_selectivity": 0.8}}, "device": null}}

I ran both branches 3 times and got results within 1-2 seconds of the above on both branches. Any changes in workflow performance are well within the run-to-run noise.
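For anyone who wants to recompute the totals, here is a rough sketch. It assumes "total mean time" is the sum over queries of the mean duration across iterations, which is consistent with the numbers in the JSON above. The helper names are hypothetical, and the sample uses only two queries from the branch-25.12 results with durations rounded to four decimal places.

```python
import json
from statistics import mean


def total_mean_time(results: dict) -> float:
    """Sum, over all queries, of the mean duration across iterations."""
    return sum(
        mean(run["duration"] for run in runs)
        for runs in results["records"].values()
    )


def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change of candidate relative to baseline."""
    return (candidate - baseline) / baseline


# Truncated sample in the same shape as the JSON results
# (queries 2 and 11, durations rounded for brevity).
sample_json = """
{"records": {
  "2":  [{"query": 2,  "duration": 0.4387}, {"query": 2,  "duration": 0.3903}],
  "11": [{"query": 11, "duration": 0.3541}, {"query": 11, "duration": 0.3385}]
}}
"""
sample_total = total_mean_time(json.loads(sample_json))

# Totals reported above: priming enabled vs. priming disabled.
change = relative_change(57.1874, 58.3694)
print(f"sample total: {sample_total:.4f} s")
print(f"priming disabled vs. enabled: {change:+.2%}")
```

Loading the full JSON with `json.loads` and passing it to `total_mean_time` should give the reported totals. The difference between the two totals is about 2%, smaller than the 1-2 second run-to-run spread mentioned above.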

Investigating profiles

Here is the async-priming (priming disabled) profile. I also looked at profiles for branch-25.12 (priming enabled) and the only significant difference I saw was in the first touch, described below.

First-touch access

On branch-25.12, the "first touch" of the memory is the priming step in the MR constructor. This is a large allocation (free / 2), so it takes considerable time to allocate and deallocate. Subsequent allocations are fast, as expected.

Name	Start	Duration	TID
cudaMallocFromPoolAsync	6.72649s	132.726 ms	670507

On the async-priming branch, the first touch of the memory pool is a small allocation of only 16 bytes, but it takes a few milliseconds (probably because it does the work of setting up the pool). As above, subsequent allocations are fast and take the same amount of time as in the primed pool. I did not see any differences between primed and unprimed in the timings for the first few allocations (some of which are a few bytes and some of which are several megabytes).

Name	Start	Duration	TID
cudaMallocFromPoolAsync	1.53844s	9.473 ms	1278614

I would say that the cost of first-touch access shouldn't be a problem, because we are skipping the more expensive priming call in the memory resource constructor, which doesn't appear to show any benefits. Overall we are doing less work with priming disabled.
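As a back-of-the-envelope check on the first-touch numbers, the sketch below estimates the priming allocation size and the time difference. It assumes the priming size is half of the `free_memory` value reported in the JSON hardware block; the free memory at MR construction time may have been different.

```python
# "free_memory" reported for each V100 in the JSON results (bytes).
free_memory_bytes = 34_068_758_528

# Assumed priming size: free / 2, as described above.
priming_bytes = free_memory_bytes // 2
priming_gib = priming_bytes / 2**30

# First cudaMallocFromPoolAsync durations from the two profiles (ms).
primed_first_touch_ms = 132.726    # branch-25.12, priming enabled
unprimed_first_touch_ms = 9.473    # async-priming, priming disabled
saved_ms = primed_first_touch_ms - unprimed_first_touch_ms

print(f"estimated priming allocation: {priming_gib:.2f} GiB")
print(f"first-touch time saved without priming: {saved_ms:.3f} ms")
```

This only compares the first allocation call in each profile. It does not account for the matching deallocation in the primed case.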

bdice · 2025-09-27T03:50:19Z

cpp/benchmarks/async_priming/async_priming_bench.cpp

Now that I have run the PDS-H benchmarks, I am less sure if adding this benchmark is really useful. Reviewers, please weigh in on whether you'd like to see this added or if it provides no value.

cpp/include/rmm/mr/device/cuda_async_memory_resource.hpp

KyleFromNVIDIA

Approved trivial CMake changes

wence-

I am neutral on the benchmark, but +1 on the priming change.

bdice · 2025-09-29T13:55:24Z

/merge

GregoryKimball · 2025-09-29T19:06:22Z

Here is a visualization for the data @bdice posted above.

It looks like this PDS-H run used the query setting "all". I'm not seeing any difference in Q1 execution in the first iteration with and without priming. Looks good to me!

Summary: This PR skips the default "priming" step of the RMM async memory resource. This reduces initialization costs and has benefits for multi-process applications.

xref:
- rapidsai/rmm#2060
- rapidsai/rmm#1931
- rapidsai/rmm#2051

Pull Request resolved: #14997
Reviewed By: Yuhta
Differential Revision: D83668107
Pulled By: kgpai
fbshipit-source-id: 41b6bd5807f60b0e1c76a1c91dd26f7e5451255a

bdice added 2 commits September 24, 2025 14:49

Disable async MR priming by default.

d399b8e

Add benchmark

1af36cc

github-project-automation bot added this to RMM Project Board Sep 24, 2025

bdice added 2 commits September 25, 2025 15:12

Revise benchmark

e727e50

Revise benchmark

f9cacfb

bdice changed the base branch from branch-25.10 to branch-25.12 September 25, 2025 20:23

Synchronize device

2cb0115

Remove unnecessary data

1fe21bb

bdice commented Sep 27, 2025

bdice marked this pull request as ready for review September 27, 2025 03:50

bdice requested review from a team as code owners September 27, 2025 03:50

bdice requested review from shrshi and harrism September 27, 2025 03:50

bdice added the "feature request" and "non-breaking" labels Sep 27, 2025

bdice self-assigned this Sep 27, 2025

bdice commented Sep 27, 2025

cpp/include/rmm/mr/device/cuda_async_memory_resource.hpp (outdated)

Remove note about priming making later allocations faster.

a2db22c

bdice mentioned this pull request Sep 29, 2025

[PERF] Measure impact of async allocator priming the memory pool #1931

Closed

bdice moved this to In Progress in RMM Project Board Sep 29, 2025

KyleFromNVIDIA approved these changes Sep 29, 2025

wence- approved these changes Sep 29, 2025

davidwendt approved these changes Sep 29, 2025

rapids-bot bot merged commit ade7c9f into rapidsai:branch-25.12 Sep 29, 2025
78 checks passed

github-project-automation bot moved this from In Progress to Done in RMM Project Board Sep 29, 2025

This was referenced Sep 29, 2025

Discourage applications from providing non-default values in async MR #2060

Open

feat(cudf): Disable async MR priming in Velox cuDF facebookincubator/velox#14997

Closed

wence- mentioned this pull request Oct 1, 2025

Adding pinned host buffer impl rapidsai/rapidsmpf#549

Open
