
Conversation

@shayanh (Contributor) commented Sep 2, 2025

This PR implements several small optimizations for the reth benchmark program.

Optimizations

  1. Improve allocations: reserve a larger vector for MPT nodes up front and remove capacity increases on inserts.
  2. Better memory alignment: encode the RLP representation of each node with extra padding so that it is word-aligned. This lets us avoid memcpy operations when calculating keccak(rlp_encoded) during decoding.
  3. Improve witness db: revm's CacheDB already caches state keys and values, so we remove this duplicate caching in the witness db by simplifying it. The trade-off is that we might now calculate the keccak of some addresses and storage slots more than once. I tried caching the keccak hashes of addresses and storage slots, but that turned out to be worse, so I removed it.

Results

Benchmark Runs

Summary

| Metric | shayanh/perf-opt | main | % Change |
| --- | --- | --- | --- |
| Proof time | 205.40 | 208.76 | -1.61% |
| Parallel proof time | 14.76 | 14.60 | +1.10% |
| Total cells used | 4,278,572,934 | 4,492,961,526 | -4.77% |
| Executed instructions | 128,306,456 | 132,404,772 | -3.09% |

Overall this seems to improve all metrics except parallel proof time, which is surprising. I'm not sure where the extra ~160ms comes from; it might be an unrelated performance regression. We would get a better answer by running the benchmarks on each branch multiple times.

@shayanh marked this pull request as ready for review September 2, 2025 17:34
@Qumeric (Contributor) commented Sep 2, 2025

The "Parallel proof time" regression might be because there are fewer segments now. If so, you could compare with a constant number of segments (by decreasing "Total main trace cells (excluding memory)" accordingly).

I had similar results in my initial optimization PR, and I believe the reason was the segment count.

@shayanh (Contributor, Author) commented Sep 2, 2025

@Qumeric where do you see the total number of segments in the benchmark results?

@Qumeric (Contributor) commented Sep 2, 2025

I think it's not reported (would be nice to add), but I believe it always hits roughly the max cell count, within a reasonably small range, except for the last segment, which has some padding.

I think if you decrease max cells proportionally (-4.77%), you would likely get performance matching the reduction in cells/segments.

@Qumeric (Contributor) left a comment

LGTM

Good find about the revm cache; I guess they added it at some point after the initial version of the MPT.

```diff
 // More advanced improvement: either pre-execute block at guest to know exact allocations in
 // advance, or allocate a separate arena specifically for updates.
-let capacity = num_nodes + num_nodes / 10;
+let capacity = num_nodes + (num_nodes / 2);
```
Contributor:

Maybe worth trying to tune it. One way is to run with dhat on some block, find roughly the optimal value, and set it to slightly more than that.

Unlikely to change much but maybe we could get something like -0.5% for ~free

shayanh (Contributor, Author):

I ran dhat but I didn't find any sign of vector doubling. What do you look for in dhat's output?

@shayanh (Contributor, Author) commented Sep 3, 2025

I think there is just some variance in the order of 100ms in parallel proving time.

  • Another main run with parallel proving time 14.67 (link)
  • Another shayanh/perf-opt run with parallel proving time 14.68 (link)
  • shayanh/perf-opt run with a smaller max cells value has parallel proving time of 14.70 (link).

```rust
let rlp_node = &rlp_node_header_start[..rlp_node_length];

let padding_len = (MIN_ALIGN - (rlp_node_length % MIN_ALIGN)) % MIN_ALIGN;
unsafe { advance_unchecked(bytes, padding_len) };
```
Contributor:

Best to add a // SAFETY comment for every unsafe use.

@jonathanpwang (Contributor) left a comment

LGTM, please add SAFETY comments

- reserve a bigger vector for trie nodes from the beginning; remove capacity increases on inserts
- use an uninit array for branch children during decoding
- encode the RLP representation of each node with extra padding to make it word-aligned; this lets us avoid memcpy operations when calculating `keccak(rlp_encoded)` during decoding
- revm's `CacheDB` already caches state keys and values; we remove the duplicate caching in the witness db by simplifying it
- add a keccak cache of addresses and storage slots, so we avoid hashing some addresses and storage slots more than once
@shayanh merged commit 0b92abe into main Sep 10, 2025
2 checks passed
@shayanh deleted the shayanh/perf-opt branch September 10, 2025 22:54
@jonathanpwang added the `input-format` label ("The input format of the host binary changed") Sep 16, 2025