perf(virtqueue): replace vec-based `MemPools` with bitmap-based `IndexAlloc` #2049

mkroening · 2025-11-04T08:19:03Z

This replaces the vec-based MemPool with a bitmap-based IndexAlloc. To track 256 indexes, we now need 32 bytes instead of 512 bytes.
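The size difference for 256 indices can be checked with a small sketch. It assumes the old pool stored one `u16` per index, which is consistent with the 512-byte figure but is not confirmed by the description. The helper names `bitmap_bytes` and `vec_bytes` are hypothetical.

```rust
use std::mem::size_of;

/// Bytes needed by a bitmap tracking `len` indices, rounded up to whole `usize` words.
fn bitmap_bytes(len: usize) -> usize {
    len.div_ceil(usize::BITS as usize) * size_of::<usize>()
}

/// Bytes needed by a vector holding one `u16` per index
/// (assumed element type of the old pool, not confirmed by the PR).
fn vec_bytes(len: usize) -> usize {
    len * size_of::<u16>()
}

fn main() {
    let len = 256;
    println!("vec: {} bytes, bitmap: {} bytes", vec_bytes(len), bitmap_bytes(len));
}
```

The bitmap spends one bit per index, so its footprint is `len / 8` bytes whenever `len` is a multiple of the word size.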

These are measurements from an Apple M2 of creating the allocator, allocating all indices, and then deallocating them again:

len	size old (bytes)	size new (bytes)	time old	time new
256	512	32	1.242 µs	1.370 µs
1024	2048	64	5.505 µs	7.293 µs
2048	4096	128	10.460 µs	21.273 µs

While this is a strict slowdown in this case, I think it is still worth it.
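For readers who want to reproduce the shape of this measurement, here is a rough sketch of the create, allocate-all, deallocate-all round trip. The `IndexAlloc` below is a hypothetical stand-in modelled on the loop in the diff, not the code from the PR. The `deallocate` method and the `round_trip` helper are assumptions, and `std::time::Instant` is only a coarse substitute for a real benchmark harness.

```rust
use std::time::Instant;

const USIZE_BITS: usize = usize::BITS as usize;

/// Hypothetical stand-in for the PR's `IndexAlloc`: one bit per index, set = in use.
struct IndexAlloc {
    bits: Box<[usize]>,
}

impl IndexAlloc {
    /// Creates an allocator for `len` indices; `len` is assumed to be a multiple of the word size.
    fn new(len: usize) -> Self {
        assert!(len % USIZE_BITS == 0);
        Self { bits: vec![0; len / USIZE_BITS].into_boxed_slice() }
    }

    /// Claims the lowest free index, or returns `None` if every bit is set.
    fn allocate(&mut self) -> Option<usize> {
        for (word_index, word) in self.bits.iter_mut().enumerate() {
            let trailing_ones = word.trailing_ones();
            if trailing_ones < usize::BITS {
                *word |= 1usize << trailing_ones;
                return Some(word_index * USIZE_BITS + trailing_ones as usize);
            }
        }
        None
    }

    /// Clears the bit for `index` (assumed API; the PR's deallocation is not shown in the thread).
    fn deallocate(&mut self, index: usize) {
        let mask = 1usize << (index % USIZE_BITS);
        debug_assert!(self.bits[index / USIZE_BITS] & mask != 0, "index was not allocated");
        self.bits[index / USIZE_BITS] &= !mask;
    }
}

/// Creates an allocator, allocates every index, then frees them all.
/// Returns how many indices were handed out.
fn round_trip(len: usize) -> usize {
    let mut alloc = IndexAlloc::new(len);
    let mut count = 0;
    while let Some(index) = alloc.allocate() {
        std::hint::black_box(index);
        count += 1;
    }
    for index in 0..len {
        alloc.deallocate(index);
    }
    count
}

fn main() {
    for len in [256, 1024, 2048] {
        let start = Instant::now();
        let allocated = round_trip(len);
        println!("len {len}: {allocated} indices in {:?}", start.elapsed());
    }
}
```

Each `allocate` call in this sketch scans from the first word, so filling the whole bitmap costs more as `len` grows.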

github-actions

Benchmark Results

Benchmark	Current: `663348a`	Previous: `06380c9`	Performance Ratio
startup_benchmark Build Time	`113.89` s	`111.30` s	`1.02`
startup_benchmark File Size	`0.91` MB	`0.91` MB	`1.00`
Startup Time - 1 core	`0.89` s (`±0.03` s)	`0.94` s (`±0.02` s)	`0.95`
Startup Time - 2 cores	`0.90` s (`±0.03` s)	`0.94` s (`±0.02` s)	`0.96`
Startup Time - 4 cores	`0.92` s (`±0.03` s)	`0.95` s (`±0.03` s)	`0.97`
multithreaded_benchmark Build Time	`110.24` s	`111.47` s	`0.99`
multithreaded_benchmark File Size	`1.01` MB	`1.01` MB	`1.00`
Multithreaded Pi Efficiency - 2 Threads	`89.83` % (`±9.75` %)	`88.67` % (`±8.42` %)	`1.01`
Multithreaded Pi Efficiency - 4 Threads	`43.82` % (`±2.86` %)	`44.22` % (`±3.59` %)	`0.99`
Multithreaded Pi Efficiency - 8 Threads	`25.72` % (`±2.26` %)	`25.40` % (`±2.21` %)	`1.01`
micro_benchmarks Build Time	`296.93` s	`298.11` s	`1.00`
micro_benchmarks File Size	`1.02` MB	`1.02` MB	`1.00`
Scheduling time - 1 thread	`168.68` ticks (`±18.95` ticks)	`174.36` ticks (`±24.88` ticks)	`0.97`
Scheduling time - 2 threads	`104.37` ticks (`±21.43` ticks)	`102.58` ticks (`±17.96` ticks)	`1.02`
Micro - Time for syscall (getpid)	`11.01` ticks (`±5.23` ticks)	`12.86` ticks (`±5.49` ticks)	`0.86`
Memcpy speed - (built_in) block size 4096	`59637.77` MByte/s (`±42372.57` MByte/s)	`54678.51` MByte/s (`±39898.38` MByte/s)	`1.09`
Memcpy speed - (built_in) block size 1048576	`12960.08` MByte/s (`±10659.99` MByte/s)	`13904.98` MByte/s (`±11992.20` MByte/s)	`0.93`
Memcpy speed - (built_in) block size 16777216	`10050.70` MByte/s (`±8137.58` MByte/s)	`9871.01` MByte/s (`±8007.61` MByte/s)	`1.02`
Memset speed - (built_in) block size 4096	`59714.07` MByte/s (`±42415.64` MByte/s)	`54863.90` MByte/s (`±40016.65` MByte/s)	`1.09`
Memset speed - (built_in) block size 1048576	`13186.88` MByte/s (`±10769.73` MByte/s)	`14267.94` MByte/s (`±12189.39` MByte/s)	`0.92`
Memset speed - (built_in) block size 16777216	`10287.70` MByte/s (`±8271.47` MByte/s)	`10102.41` MByte/s (`±8139.80` MByte/s)	`1.02`
Memcpy speed - (rust) block size 4096	`55410.73` MByte/s (`±40567.99` MByte/s)	`53878.52` MByte/s (`±40203.47` MByte/s)	`1.03`
Memcpy speed - (rust) block size 1048576	`14137.26` MByte/s (`±11717.97` MByte/s)	`15011.79` MByte/s (`±13075.07` MByte/s)	`0.94`
Memcpy speed - (rust) block size 16777216	`9851.41` MByte/s (`±7955.33` MByte/s)	`9892.57` MByte/s (`±8037.27` MByte/s)	`1.00`
Memset speed - (rust) block size 4096	`55969.88` MByte/s (`±40866.13` MByte/s)	`54652.16` MByte/s (`±40703.26` MByte/s)	`1.02`
Memset speed - (rust) block size 1048576	`14512.63` MByte/s (`±11933.75` MByte/s)	`15339.99` MByte/s (`±13218.29` MByte/s)	`0.95`
Memset speed - (rust) block size 16777216	`10112.19` MByte/s (`±8117.36` MByte/s)	`10120.58` MByte/s (`±8166.06` MByte/s)	`1.00`
alloc_benchmarks Build Time	`293.46` s	`292.86` s	`1.00`
alloc_benchmarks File Size	`0.98` MB	`0.98` MB	`1.00`
Allocations - Allocation success	`100.00` %	`100.00` %	`1`
Allocations - Deallocation success	`100.00` %	`100.00` %	`1`
Allocations - Pre-fail Allocations	`100.00` %	`100.00` %	`1`
Allocations - Average Allocation time	`12010.36` Ticks (`±872.02` Ticks)	`22133.78` Ticks (`±874.32` Ticks)	`0.54`
Allocations - Average Allocation time (no fail)	`12010.36` Ticks (`±872.02` Ticks)	`22133.78` Ticks (`±874.32` Ticks)	`0.54`
Allocations - Average Deallocation time	`2405.42` Ticks (`±226.21` Ticks)	`3223.16` Ticks (`±1689.20` Ticks)	`0.75`
mutex_benchmark Build Time	`295.47` s	`295.42` s	`1.00`
mutex_benchmark File Size	`1.02` MB	`1.02` MB	`1.00`
Mutex Stress Test Average Time per Iteration - 1 Threads	`38.10` ns (`±4.15` ns)	`37.08` ns (`±4.18` ns)	`1.03`
Mutex Stress Test Average Time per Iteration - 2 Threads	`30.98` ns (`±3.34` ns)	`30.58` ns (`±3.16` ns)	`1.01`

This comment was automatically generated by workflow using github-action-benchmark.

src/drivers/virtio/virtqueue/mod.rs

cagatay-y · 2025-11-06T16:24:15Z

+			for (word_index, word) in self.bits.iter_mut().enumerate() {
+				let trailing_ones = word.trailing_ones();
+				if trailing_ones < usize::BITS {
+					let mask = 1 << trailing_ones;
+					*word |= mask;
+					let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
+					return Some(index);
+				}
+			}
+
+			None


Suggested change

-			for (word_index, word) in self.bits.iter_mut().enumerate() {
-				let trailing_ones = word.trailing_ones();
-				if trailing_ones < usize::BITS {
-					let mask = 1 << trailing_ones;
-					*word |= mask;
-					let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
-					return Some(index);
-				}
-			}
-
-			None
+			let (word_index, trailing_ones) = self
+				.bits
+				.iter()
+				.copied()
+				.map(usize::trailing_ones)
+				.enumerate()
+				.find(|(_, trailing_ones)| *trailing_ones < usize::BITS)?;
+			let mask = 1 << trailing_ones;
+			self.bits[word_index] |= mask;
+			let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
+			Some(index)

I am not sure if it would be an improvement but wanted to offer it as an option. It would save us from some nesting.

mkroening · 2025-11-06

Interesting! I have looked into this, and with this version the compiler fails to optimize away the bounds check when setting the bit. Also, maybe because the trailing-ones calculation is now too far away from the masking, the compiler no longer optimizes the `shl` and `or` into a single `bts`.

For details, see Compiler Explorer.

So I'd keep it as is, even though the performance difference is small, of course (about 5%). :D
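To make the comparison concrete, here is a sketch of both variants as free functions over a `&mut [usize]` bitmap. The function names `allocate_loop` and `allocate_find` are hypothetical, and the sketch only checks that the two forms behave the same. It does not reproduce the codegen difference, which depends on the compiler and target.

```rust
const USIZE_BITS: usize = usize::BITS as usize;

/// Loop form from the diff: mutate the word in place as soon as a free bit is found.
fn allocate_loop(bits: &mut [usize]) -> Option<usize> {
    for (word_index, word) in bits.iter_mut().enumerate() {
        let trailing_ones = word.trailing_ones();
        if trailing_ones < usize::BITS {
            *word |= 1usize << trailing_ones;
            return Some(word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap());
        }
    }
    None
}

/// Iterator form from the suggestion: search first, then index back into the slice.
fn allocate_find(bits: &mut [usize]) -> Option<usize> {
    let (word_index, trailing_ones) = bits
        .iter()
        .copied()
        .map(usize::trailing_ones)
        .enumerate()
        .find(|(_, trailing_ones)| *trailing_ones < usize::BITS)?;
    // Re-indexing the slice here is where a bounds check may remain in the generated code.
    bits[word_index] |= 1usize << trailing_ones;
    Some(word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap())
}

fn main() {
    // First word is full, second word has bits 0, 1 and 3 set, third word is empty.
    let mut a = [usize::MAX, 0b1011, 0];
    let mut b = a;
    for _ in 0..4 {
        assert_eq!(allocate_loop(&mut a), allocate_find(&mut b));
        assert_eq!(a, b);
    }
    println!("both variants agree");
}
```

The loop form holds a `&mut usize` to the word it just inspected, so no second lookup is needed; the iterator form trades that for less nesting.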

Gelbpunkt

The only additional thing I could come up with was using `u128` as the word type, but it was consistently slightly slower than `usize` in my benchmarks.

mkroening self-assigned this Nov 4, 2025

mkroening changed the title ~~perf(virtqueue): remove unused MemPool::limit field~~ perf(virtqueue): replace vec-based MemPools with bitmap-based IndexAlloc Nov 4, 2025

mkroening force-pushed the mempool-bitvec branch 2 times, most recently from 7320231 to fc37370 Compare November 4, 2025 08:53

mkroening marked this pull request as ready for review November 4, 2025 08:54

mkroening requested review from Gelbpunkt and cagatay-y and removed request for Gelbpunkt November 4, 2025 08:54

github-actions bot reviewed Nov 4, 2025

mkroening marked this pull request as draft November 4, 2025 13:05

mkroening force-pushed the mempool-bitvec branch 2 times, most recently from e3958f3 to 34f9060 Compare November 4, 2025 18:06

cagatay-y approved these changes Nov 6, 2025

perf(virtqueue): remove unused MemPool::limit field

a52430a

mkroening force-pushed the mempool-bitvec branch from 34f9060 to a52430a Compare November 6, 2025 16:41

mkroening marked this pull request as ready for review November 6, 2025 17:05

Gelbpunkt approved these changes Nov 13, 2025

perf(virtqueue): replace vec-based MemPools with bitmap-based `Inde…

663348a

…xAlloc`

mkroening force-pushed the mempool-bitvec branch from 78d301c to 663348a Compare November 17, 2025 08:24

mkroening added this pull request to the merge queue Nov 17, 2025

Merged via the queue into main with commit 4b38f57 Nov 17, 2025
17 checks passed

mkroening deleted the mempool-bitvec branch November 19, 2025 15:33
