-
Notifications
You must be signed in to change notification settings - Fork 695
[AOT] Memory Info Restore Mechanism with Better Performance #4113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
Jiax-cn
wants to merge
2
commits into
bytecodealliance:main
Choose a base branch
from
Jiax-cn:feature/optimize_memory_reload
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the main purpose of this PR is to minimize or eliminate
load
instructions in memory operations. However, the changes eliminated all the fast accesses(if
branch) but retained the slow ones(else
branch). Does this actually address the original problem?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two
else
branches are different. In this PR,mem_info[0].mem_base_addr
comes fromLLVMBuildAlloca
, which allocates memory on the stack (often in registers due tomem2reg
optimization or in cache), whereas the previous version loads from global memory.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's become more confusing.
There is a critical need to pass the base address of linear memory in various functions. Therefore, it's necessary to have some sort of global variable to hold this address.
In the original code, the process is as follows: the base address is always retrieved from the global variable, stored in a temporary variable
mem_base_addr
, and then this temporary variable is used to compute the final address.In the PR, a new local variable,
mem_base_addr
, is introduced to hold the base address after obtaining it from the temporary variablemem_base_addr1
. Although loading from a local variable isn't a significant issue, it does raise a small question as to why this is necessary. After all, the base address is already present in the temporary variable(if using the original design).And after changes to linear memory, such as
memory.grow
, it is still necessary to reload values from the global variable, then from the temporary variable, and finally save them to the local variable. This doesn't appear to be more optimized, in my view.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your thought is correct, but you have overlooked the optimization of
LLVMBuildAlloca
variables bymem2reg
.For example:
The
load/store
of%y
be optimized:Similarly,
can be optimized to:
In this PR, if you dump the IR of substr.wasm above, you will see that the load/store operations of
%mem_base_addr
are also optimized. As a result, the final outcome remains the same as in the original code.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've devised a straightforward test case to assess the enhancement. Here is the code:
And I used the following command:
--bounds-checks=1 --format=llvmir-op
, to create optimized llvmir. The--bounds-checks=1
is employed to apply thenoinline
attribute.Several intriguing findings emerged from the comparison of the before and after scenarios:
🆙 If I'm mistaken, please correct me.
This leads to my confusion: if there is no difference for basic cases and no enhancement for redundant loading, what is the rationale for changing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is to eliminate redundant memory info loads between multiple load/store instructions when the memory remains unchanged.
I use the following WAST example to compare the (optimized) IR generated by this PR with the original code, identifying specific scenarios where this optimization applies. I don’t need the
noinline
attribute, so I only used the--format=llvmir-op
option.Same IR in f0 and f1.

In f3 and f4, the IR generated by this PR is different.

This PR primarily optimizes store-store and store-load scenarios. As for load-load and load-store scenarios, I believe they might have been optimized.
In your example, f0\f1 shows this PR won't produce no addtional IR compared to the original version. The memory in f2 is unchanged. f3 is the load-store case. f4 and f5 contains two call and one load instruction. Therefore, their IRs remains the same.