
Conversation

@shayanh (Contributor) commented Sep 2, 2025

This PR implements several small optimizations for the reth benchmark program.

Optimizations

  1. Improve allocations: reserve a larger vector for MPT nodes up front and remove capacity increases on inserts.
  2. Better memory alignment: encode the RLP representation of each node with extra padding so that it is word-aligned. This lets us avoid memcpy operations when calculating keccak(rlp_encoded) during decoding.
  3. Improve witness db: revm's CacheDB already caches state keys and values, so we remove this duplicate caching in the witness db by simplifying it. The trade-off is that we might now calculate the keccak of some addresses and storage slots more than once. I tried caching the keccak hashes of addresses and storage slots, but that turned out to be worse, so I removed it.

Results

Benchmark Runs

Summary

| Metric | shayanh/perf-opt | main | % Change |
| --- | --- | --- | --- |
| Proof time | 205.40 | 208.76 | -1.61% |
| Parallel proof time | 14.76 | 14.60 | +1.10% |
| Total cells used | 4,278,572,934 | 4,492,961,526 | -4.77% |
| Executed instructions | 128,306,456 | 132,404,772 | -3.09% |

Overall this seems to improve all metrics except parallel proof time, which is surprising. I'm not sure where the extra ~160ms comes from; it might be an unrelated performance regression. We would get a better answer by running the benchmarks on each branch multiple times.

@shayanh marked this pull request as ready for review September 2, 2025 17:34
@Qumeric (Contributor) commented Sep 2, 2025

The "Parallel proof time" regression might be because there are fewer segments now. If so, you could compare with a constant number of segments (by decreasing "Total main trace cells (excluding memory)" accordingly).

I had similar results in my initial optimization PR, and I believe the reason was the segment count.

@shayanh (Contributor, Author) commented Sep 2, 2025

@Qumeric where do you see the total number of segments in the benchmark results?

@Qumeric (Contributor) commented Sep 2, 2025

I think it's not reported (would be nice to add), but I believe it always hits roughly the max cell count, within a reasonably small range, except for the last segment, which has some padding.

I think if you decrease max cells proportionally (-4.77%), you would likely get performance matching the reduction in cells/segments.

@Qumeric (Contributor) left a comment

LGTM

Good find about the revm cache; I guess they added it at some point after the initial version of the MPT.

```diff
 // More advanced improvement: either pre-execute block at guest to know exact allocations in
 // advance, or allocate a separate arena specifically for updates.
-let capacity = num_nodes + num_nodes / 10;
+let capacity = num_nodes + (num_nodes / 2);
```
Contributor:

Maybe worth trying to tune it. One way is to run with dhat on some block, find roughly the optimal value, and set it to slightly more than that.

Unlikely to change much but maybe we could get something like -0.5% for ~free

shayanh (Contributor, Author):

I ran dhat but I didn't find any sign of vector doubling. What do you look for in dhat's output?

@shayanh (Contributor, Author) commented Sep 3, 2025

I think there is just some variance in the order of 100ms in parallel proving time.

  • Another main run with parallel proving time 14.67 (link)
  • Another shayanh/perf-opt run with parallel proving time 14.68 (link)
  • shayanh/perf-opt run with a smaller max cells value has parallel proving time of 14.70 (link).

```rust
let rlp_node = &rlp_node_header_start[..rlp_node_length];

let padding_len = (MIN_ALIGN - (rlp_node_length % MIN_ALIGN)) % MIN_ALIGN;
unsafe { advance_unchecked(bytes, padding_len) };
```
Contributor:

Best to add a // SAFETY comment for every unsafe use.

@jonathanpwang (Contributor) left a comment

LGTM, please add SAFETY comments

- reserve a bigger vector for trie nodes from the beginning; remove capacity increases on inserts
- use an uninit array for branch children during decoding
- encode the RLP representation of each node with extra padding to make it word-aligned; this lets us avoid memcpy operations when calculating `keccak(rlp_encoded)` during decoding
- revm's `CacheDB` already caches state keys and values; we remove the duplicate caching in the witness db by simplifying it
- add a keccak cache of addresses and storage slots, so we avoid hashing some addresses and storage slots more than once
@shayanh merged commit 0b92abe into main Sep 10, 2025
2 checks passed
@shayanh deleted the shayanh/perf-opt branch September 10, 2025 22:54
@jonathanpwang added the `input-format` label ("The input format of the host binary changed") Sep 16, 2025