feat: small performance optimizations #493
Conversation
Force-pushed from 80b7216 to 24d2ad3
"Parallel proof time" regression might be because it's less segments now. If so, you could compare with constant amount of segments (by decreasing "Total main trace cells (excluding memory)" accordingly) I had similar results in my initial optimization PR and I believe the reason was segment amount |
@Qumeric where do you see the total number of segments in the benchmark results?
I think it's not reported (it would be nice to add), but I believe it always hits roughly this maximum count within a reasonably small range, except for the last segment, which has some padding. I think if you decrease max cells proportionally (-4.77%), you would likely get performance matching the reductions in cells/segments.
LGTM
Good find about the revm cache; I guess they added it at some point after the initial version of MPT.
```diff
 // More advanced improvement: either pre-execute block at guest to know exact allocations in
 // advance, or allocate a separate arena specifically for updates.
-let capacity = num_nodes + num_nodes / 10;
+let capacity = num_nodes + (num_nodes / 2);
```
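A minimal sketch of the intent behind this change (illustrative only; `new_arena` and the generic element type are not from the PR): over-reserving roughly 1.5x the witness node count up front means later inserts during trie updates should not trigger a reallocation.

```rust
// Illustrative sketch, not the PR's exact code: over-reserve the node
// arena once so inserts during updates avoid reallocation and memcpy.
fn new_arena<T>(num_nodes: usize) -> Vec<T> {
    let capacity = num_nodes + (num_nodes / 2);
    Vec::with_capacity(capacity)
}
```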
Maybe worth trying to tune it. One way is to run with dhat on some block, find roughly the optimal value, and set it to slightly more than that.
Unlikely to change much, but maybe we could get something like -0.5% for ~free.
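For reference, a minimal way to collect such a heap profile with the `dhat` crate (the workload function here is a placeholder, not the benchmark's actual entry point):

```rust
// Heap profiling with the dhat crate: writes dhat-heap.json on exit,
// which can be inspected in the DHAT viewer to spot reallocations.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    let _profiler = dhat::Profiler::new_heap();
    run_block_execution();
}

fn run_block_execution() {
    // placeholder for the actual benchmark workload
}
```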
I ran dhat but I didn't find any sign of vector doubling. What do you look for in dhat's output?
I think there is just some variance on the order of 100ms in parallel proving time.
```rust
let rlp_node = &rlp_node_header_start[..rlp_node_length];

let padding_len = (MIN_ALIGN - (rlp_node_length % MIN_ALIGN)) % MIN_ALIGN;
unsafe { advance_unchecked(bytes, padding_len) };
```
Best to add a `// SAFETY` comment for every unsafe use.
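For example (hypothetical wording; the exact invariant depends on how the encoder pads the buffer):

```rust
// SAFETY: padding_len < MIN_ALIGN, and the encoder appends at least
// padding_len padding bytes after each RLP node, so advancing the
// cursor by padding_len stays within the bounds of `bytes`.
unsafe { advance_unchecked(bytes, padding_len) };
```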
LGTM, please add SAFETY comments
- Reserve a bigger vector for trie nodes from the beginning; remove the capacity increase in inserts.
- Use an uninit array for branch children during decoding.
Encode the RLP representation of each node with extra padding to make them word-aligned. This lets us avoid memcpy operations when calculating `keccak(rlp_encoded)` during decoding.
- Revm's `CacheDB` already caches state's keys and values, so we remove the duplicate caching in the witness db by simplifying it.
- Add a keccak cache of addresses and storage slots. With this we avoid hashing some addresses and storage slots more than once.
Force-pushed from da3af64 to 9ca2fc2
This PR implements several small optimizations for the reth benchmark program.
Optimizations

- Encode the RLP representation of each node with extra padding to make them word-aligned. This lets us avoid memcpy operations when calculating `keccak(rlp_encoded)` during decoding.
- Revm's `CacheDB` already caches state's keys and values. We remove this duplicate caching in the witness db by simplifying it. This has a trade-off: we might now calculate the keccak of some addresses and storage slots more than once. I tried caching the keccak hashes of addresses and storage slots, but it turned out to be worse, so I removed it.
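A minimal sketch of the alignment trick (hedged: `MIN_ALIGN` and `push_aligned` are illustrative assumptions, not the PR's actual code). Padding each encoded node up to the word size means the next node starts aligned, so the decoder can hash the bytes in place:

```rust
// Illustrative only: pad each RLP-encoded node so the following node
// starts word-aligned, avoiding a copy into an aligned scratch buffer
// before hashing.
const MIN_ALIGN: usize = core::mem::size_of::<usize>();

fn push_aligned(buf: &mut Vec<u8>, rlp_node: &[u8]) {
    buf.extend_from_slice(rlp_node);
    let padding_len = (MIN_ALIGN - (rlp_node.len() % MIN_ALIGN)) % MIN_ALIGN;
    buf.resize(buf.len() + padding_len, 0);
}
```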
Results

Benchmark Runs
- shayanh/perf-opt
- main

Summary
- shayanh/perf-opt
- main
Overall this seems to improve all metrics except parallel proof time, which is surprising. I'm not sure where the extra 160ms comes from; it might be an unrelated performance regression. We would get a better answer by running the benchmarks on each branch multiple times.