@mkroening
Member

@mkroening mkroening commented Nov 4, 2025

This replaces the vec-based MemPool with a bitmap-based IndexAlloc. Tracking 256 indices now takes 32 bytes instead of 512 bytes.
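As a rough illustration of where the 32 versus 512 bytes come from, here is a minimal sketch. It assumes the old pool stored one `u16` per index (which matches 512 bytes for 256 indices but is not confirmed here) and that the bitmap uses one bit per index in `usize` words. The helper names `vec_bytes`, `bitmap_bytes`, `mark_used`, `mark_free`, and `is_used` are hypothetical and not taken from the PR.

```rust
use core::mem::size_of;

const USIZE_BITS: usize = usize::BITS as usize;

/// Bytes needed when every index is stored as its own `u16`
/// (assumed layout of the old vec-based pool).
pub const fn vec_bytes(len: usize) -> usize {
    len * size_of::<u16>()
}

/// Bytes needed when every index is a single bit in a bitmap.
pub const fn bitmap_bytes(len: usize) -> usize {
    len.div_ceil(8)
}

/// Sets the bit for `index`, marking it as in use.
pub fn mark_used(bits: &mut [usize], index: usize) {
    bits[index / USIZE_BITS] |= 1usize << (index % USIZE_BITS);
}

/// Clears the bit for `index`, marking it as free.
pub fn mark_free(bits: &mut [usize], index: usize) {
    bits[index / USIZE_BITS] &= !(1usize << (index % USIZE_BITS));
}

/// Returns whether the bit for `index` is currently set.
pub fn is_used(bits: &[usize], index: usize) -> bool {
    (bits[index / USIZE_BITS] & (1usize << (index % USIZE_BITS))) != 0
}
```

With these helpers, `vec_bytes(256)` evaluates to 512 and `bitmap_bytes(256)` to 32, matching the numbers above.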

These are measurements from an Apple M2 of creating the allocator, allocating all indices, and then deallocating them again:

| len | size old (bytes) | size new (bytes) | time old | time new |
| --- | --- | --- | --- | --- |
| 256 | 512 | 32 | 1.242 µs | 1.370 µs |
| 1024 | 2048 | 64 | 5.505 µs | 7.293 µs |
| 2048 | 4096 | 128 | 10.460 µs | 21.273 µs |

While the new allocator is strictly slower in this benchmark, I think the smaller memory footprint is still worth it.
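The measurement procedure described above (create the allocator, allocate every index, then deallocate every index) can be sketched roughly as follows. This is not the benchmark used for the table: the `Pool` trait, `VecPool`, `BitmapPool`, and `time_cycle` are hypothetical names, the vec-based pool is assumed to be a stack of free `u16` indices, and `len` is assumed to be a multiple of `usize::BITS`.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

const USIZE_BITS: usize = usize::BITS as usize;

/// Minimal interface shared by both pools (exists only for this sketch).
pub trait Pool {
    fn new(len: u16) -> Self;
    fn allocate(&mut self) -> Option<u16>;
    fn deallocate(&mut self, index: u16);
}

/// Assumed shape of a vec-based pool: a stack of free indices.
pub struct VecPool {
    free: Vec<u16>,
}

impl Pool for VecPool {
    fn new(len: u16) -> Self {
        // Reversed so that `pop` hands out index 0 first.
        Self { free: (0..len).rev().collect() }
    }
    fn allocate(&mut self) -> Option<u16> {
        self.free.pop()
    }
    fn deallocate(&mut self, index: u16) {
        self.free.push(index);
    }
}

/// Bitmap-based pool: a set bit means the index is in use.
pub struct BitmapPool {
    bits: Vec<usize>,
}

impl Pool for BitmapPool {
    fn new(len: u16) -> Self {
        let len = usize::from(len);
        assert!(len % USIZE_BITS == 0, "sketch assumes whole words");
        Self { bits: vec![0; len / USIZE_BITS] }
    }
    fn allocate(&mut self) -> Option<u16> {
        for (word_index, word) in self.bits.iter_mut().enumerate() {
            let ones = word.trailing_ones();
            if ones < usize::BITS {
                // The lowest zero bit sits right above the trailing ones.
                *word |= 1usize << ones;
                return u16::try_from(word_index * USIZE_BITS + ones as usize).ok();
            }
        }
        None
    }
    fn deallocate(&mut self, index: u16) {
        let index = usize::from(index);
        self.bits[index / USIZE_BITS] &= !(1usize << (index % USIZE_BITS));
    }
}

/// Times one create / allocate-all / deallocate-all cycle.
pub fn time_cycle<P: Pool>(len: u16) -> Duration {
    let start = Instant::now();
    let mut pool = P::new(len);
    for _ in 0..len {
        black_box(pool.allocate().expect("pool exhausted early"));
    }
    for index in 0..len {
        pool.deallocate(black_box(index));
    }
    black_box(&pool);
    start.elapsed()
}
```

Calling `time_cycle::<VecPool>(len)` and `time_cycle::<BitmapPool>(len)` for `len` in 256, 1024, and 2048 gives a single-shot comparison. A single run like this is noisy, so a proper benchmark would repeat the cycle many times and report statistics.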

@mkroening mkroening self-assigned this Nov 4, 2025
@mkroening mkroening changed the title from "perf(virtqueue): remove unused MemPool::limit field" to "perf(virtqueue): replace vec-based MemPools with bitmap-based IndexAlloc" Nov 4, 2025
@mkroening mkroening force-pushed the mempool-bitvec branch 2 times, most recently from 7320231 to fc37370 Compare November 4, 2025 08:53
@mkroening mkroening marked this pull request as ready for review November 4, 2025 08:54
@mkroening mkroening requested review from Gelbpunkt and cagatay-y and removed request for Gelbpunkt November 4, 2025 08:54
Contributor

@github-actions github-actions bot left a comment

Benchmark Results

| Benchmark | Current: 663348a | Previous: 06380c9 | Performance Ratio |
| --- | --- | --- | --- |
| startup_benchmark Build Time | 113.89 s | 111.30 s | 1.02 |
| startup_benchmark File Size | 0.91 MB | 0.91 MB | 1.00 |
| Startup Time - 1 core | 0.89 s (±0.03 s) | 0.94 s (±0.02 s) | 0.95 |
| Startup Time - 2 cores | 0.90 s (±0.03 s) | 0.94 s (±0.02 s) | 0.96 |
| Startup Time - 4 cores | 0.92 s (±0.03 s) | 0.95 s (±0.03 s) | 0.97 |
| multithreaded_benchmark Build Time | 110.24 s | 111.47 s | 0.99 |
| multithreaded_benchmark File Size | 1.01 MB | 1.01 MB | 1.00 |
| Multithreaded Pi Efficiency - 2 Threads | 89.83 % (±9.75 %) | 88.67 % (±8.42 %) | 1.01 |
| Multithreaded Pi Efficiency - 4 Threads | 43.82 % (±2.86 %) | 44.22 % (±3.59 %) | 0.99 |
| Multithreaded Pi Efficiency - 8 Threads | 25.72 % (±2.26 %) | 25.40 % (±2.21 %) | 1.01 |
| micro_benchmarks Build Time | 296.93 s | 298.11 s | 1.00 |
| micro_benchmarks File Size | 1.02 MB | 1.02 MB | 1.00 |
| Scheduling time - 1 thread | 168.68 ticks (±18.95 ticks) | 174.36 ticks (±24.88 ticks) | 0.97 |
| Scheduling time - 2 threads | 104.37 ticks (±21.43 ticks) | 102.58 ticks (±17.96 ticks) | 1.02 |
| Micro - Time for syscall (getpid) | 11.01 ticks (±5.23 ticks) | 12.86 ticks (±5.49 ticks) | 0.86 |
| Memcpy speed - (built_in) block size 4096 | 59637.77 MByte/s (±42372.57 MByte/s) | 54678.51 MByte/s (±39898.38 MByte/s) | 1.09 |
| Memcpy speed - (built_in) block size 1048576 | 12960.08 MByte/s (±10659.99 MByte/s) | 13904.98 MByte/s (±11992.20 MByte/s) | 0.93 |
| Memcpy speed - (built_in) block size 16777216 | 10050.70 MByte/s (±8137.58 MByte/s) | 9871.01 MByte/s (±8007.61 MByte/s) | 1.02 |
| Memset speed - (built_in) block size 4096 | 59714.07 MByte/s (±42415.64 MByte/s) | 54863.90 MByte/s (±40016.65 MByte/s) | 1.09 |
| Memset speed - (built_in) block size 1048576 | 13186.88 MByte/s (±10769.73 MByte/s) | 14267.94 MByte/s (±12189.39 MByte/s) | 0.92 |
| Memset speed - (built_in) block size 16777216 | 10287.70 MByte/s (±8271.47 MByte/s) | 10102.41 MByte/s (±8139.80 MByte/s) | 1.02 |
| Memcpy speed - (rust) block size 4096 | 55410.73 MByte/s (±40567.99 MByte/s) | 53878.52 MByte/s (±40203.47 MByte/s) | 1.03 |
| Memcpy speed - (rust) block size 1048576 | 14137.26 MByte/s (±11717.97 MByte/s) | 15011.79 MByte/s (±13075.07 MByte/s) | 0.94 |
| Memcpy speed - (rust) block size 16777216 | 9851.41 MByte/s (±7955.33 MByte/s) | 9892.57 MByte/s (±8037.27 MByte/s) | 1.00 |
| Memset speed - (rust) block size 4096 | 55969.88 MByte/s (±40866.13 MByte/s) | 54652.16 MByte/s (±40703.26 MByte/s) | 1.02 |
| Memset speed - (rust) block size 1048576 | 14512.63 MByte/s (±11933.75 MByte/s) | 15339.99 MByte/s (±13218.29 MByte/s) | 0.95 |
| Memset speed - (rust) block size 16777216 | 10112.19 MByte/s (±8117.36 MByte/s) | 10120.58 MByte/s (±8166.06 MByte/s) | 1.00 |
| alloc_benchmarks Build Time | 293.46 s | 292.86 s | 1.00 |
| alloc_benchmarks File Size | 0.98 MB | 0.98 MB | 1.00 |
| Allocations - Allocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Deallocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Pre-fail Allocations | 100.00 % | 100.00 % | 1 |
| Allocations - Average Allocation time | 12010.36 Ticks (±872.02 Ticks) | 22133.78 Ticks (±874.32 Ticks) | 0.54 |
| Allocations - Average Allocation time (no fail) | 12010.36 Ticks (±872.02 Ticks) | 22133.78 Ticks (±874.32 Ticks) | 0.54 |
| Allocations - Average Deallocation time | 2405.42 Ticks (±226.21 Ticks) | 3223.16 Ticks (±1689.20 Ticks) | 0.75 |
| mutex_benchmark Build Time | 295.47 s | 295.42 s | 1.00 |
| mutex_benchmark File Size | 1.02 MB | 1.02 MB | 1.00 |
| Mutex Stress Test Average Time per Iteration - 1 Threads | 38.10 ns (±4.15 ns) | 37.08 ns (±4.18 ns) | 1.03 |
| Mutex Stress Test Average Time per Iteration - 2 Threads | 30.98 ns (±3.34 ns) | 30.58 ns (±3.16 ns) | 1.01 |

This comment was automatically generated by workflow using github-action-benchmark.

@mkroening mkroening marked this pull request as draft November 4, 2025 13:05
@mkroening mkroening force-pushed the mempool-bitvec branch 2 times, most recently from e3958f3 to 34f9060 Compare November 4, 2025 18:06
Comment on lines +546 to +554
for (word_index, word) in self.bits.iter_mut().enumerate() {
    let trailing_ones = word.trailing_ones();
    if trailing_ones < usize::BITS {
        let mask = 1 << trailing_ones;
        *word |= mask;
        let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
        return Some(index);
    }
}

None
Contributor

Suggested change
- for (word_index, word) in self.bits.iter_mut().enumerate() {
-     let trailing_ones = word.trailing_ones();
-     if trailing_ones < usize::BITS {
-         let mask = 1 << trailing_ones;
-         *word |= mask;
-         let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
-         return Some(index);
-     }
- }
- None
+ let (word_index, trailing_ones) = self
+     .bits
+     .iter()
+     .copied()
+     .map(usize::trailing_ones)
+     .enumerate()
+     .find(|(_, trailing_ones)| *trailing_ones < usize::BITS)?;
+ let mask = 1 << trailing_ones;
+ self.bits[word_index] |= mask;
+ let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
+ Some(index)

I am not sure whether this would be an improvement, but I wanted to offer it as an option. It would save us some nesting.

Member Author

Interesting! I have looked into this, and the compiler fails to optimize away the bounds check when setting the bit. Also, perhaps because the trailing-ones calculation is now too far from the masking, the compiler no longer optimizes the `shl` and `or` into a single `bts`.

For details, see Compiler Explorer.

So I'd keep it as is, even though the performance difference is small, of course (about 5%). :D
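To check that the two variants in this thread differ only in generated code and not in behavior, they can be run side by side. The sketch below lifts both bodies into free functions over a `&mut [usize]` bitmap; the function names `allocate_loop`, `allocate_find`, and `variants_agree` are hypothetical, and `USIZE_BITS` is assumed to be `usize::BITS` as a `usize`.

```rust
const USIZE_BITS: usize = usize::BITS as usize;

/// Loop variant, following the reviewed lines.
pub fn allocate_loop(bits: &mut [usize]) -> Option<usize> {
    for (word_index, word) in bits.iter_mut().enumerate() {
        let trailing_ones = word.trailing_ones();
        if trailing_ones < usize::BITS {
            let mask = 1usize << trailing_ones;
            *word |= mask;
            let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
            return Some(index);
        }
    }
    None
}

/// Iterator variant, following the suggested change.
pub fn allocate_find(bits: &mut [usize]) -> Option<usize> {
    let (word_index, trailing_ones) = bits
        .iter()
        .copied()
        .map(usize::trailing_ones)
        .enumerate()
        .find(|(_, trailing_ones)| *trailing_ones < usize::BITS)?;
    let mask = 1usize << trailing_ones;
    // Indexing here is where the bounds check mentioned above can remain.
    bits[word_index] |= mask;
    let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
    Some(index)
}

/// Runs both variants on identical bitmaps until both are exhausted and
/// reports whether they returned the same indices and left the same state.
pub fn variants_agree(words: usize) -> bool {
    let mut a = vec![0usize; words];
    let mut b = vec![0usize; words];
    if words > 0 {
        // Start from a fragmented word so the first free bit is not bit 0.
        a[0] = 0b1011;
        b[0] = 0b1011;
    }
    loop {
        let (x, y) = (allocate_loop(&mut a), allocate_find(&mut b));
        if x != y || a != b {
            return false;
        }
        if x.is_none() {
            return true;
        }
    }
}
```

This only confirms functional equivalence. The roughly 5% difference mentioned above comes from code generation, which has to be inspected in the compiler output.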

@mkroening mkroening marked this pull request as ready for review November 6, 2025 17:05
Member

@Gelbpunkt Gelbpunkt left a comment

The only additional idea I could come up with was using u128, but it was consistently slightly slower than usize in my benchmarks.

@mkroening mkroening added this pull request to the merge queue Nov 17, 2025
Merged via the queue into main with commit 4b38f57 Nov 17, 2025
17 checks passed
@mkroening mkroening deleted the mempool-bitvec branch November 19, 2025 15:33

4 participants