doc: Remove suggestion to build extensions in parallel #7899
Flamefire wants to merge 1 commit into deepspeedai:master
Conversation
As the extensions share the build folder, building them in parallel can cause failures or wrong results due to extensions overwriting each other's files.

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
@Flamefire thanks for unraveling this old mystery. However, I am concerned, since avoiding parallel builds will increase build times undesirably. Moreover, this problem applies to only two ops; for those we could change the op name instead, e.g.:

`def absolute_name(self): return f'deepspeed.ops.stochastic_transformer.{self.NAME}_op'`

This will require modifying the references as well. What do you think?
Unfortunately not. For the case you mentioned it would not be enough, because the conflict is not on the op name but on the files used: both ops compile the same source file into the same shared build folder. And finally, the `build.ninja` files created by PyTorch during the extension build ALL overwrite each other. A build only worked when it started soon enough that its `build.ninja` had not yet been overwritten by another op's build.
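The first failure mode can be sketched in plain Python. This is only a simulation of two extension builds sharing one build folder (the file names and compile flags are illustrative, not DeepSpeed's actual ones):

```python
import tempfile
from pathlib import Path

def fake_build(build_dir: Path, op_name: str, flags: str) -> Path:
    """Simulate compiling transformer.cpp: write an 'object file'
    whose contents depend on the compile flags used."""
    build_dir.mkdir(parents=True, exist_ok=True)
    obj = build_dir / "transformer.o"  # same output file name for both ops
    obj.write_text(f"compiled for {op_name} with {flags}")
    return obj

root = Path(tempfile.mkdtemp())

# Shared build folder: the second build overwrites the first op's object file.
shared = root / "shared"
a = fake_build(shared, "transformer", "-O3")
b = fake_build(shared, "stochastic_transformer", "-O3 -DSTOCHASTIC_MODE")
# a and b are the SAME path; the transformer op now picks up the
# object compiled with the stochastic op's flags.

# Per-op build folders: no collision, each op keeps its own object.
a2 = fake_build(root / "transformer", "transformer", "-O3")
b2 = fake_build(root / "stochastic_transformer",
                "stochastic_transformer", "-O3 -DSTOCHASTIC_MODE")
```

With a shared folder the "wrong results" case is silent: the build succeeds, but one op links an object file compiled with the other op's options.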
FWIW: I opened an issue for this bug/unsupported use case with setuptools: pypa/setuptools#5196
As the extensions share the build folder, building them in parallel can cause failures or wrong results due to extensions overwriting each other's files.
Closes #949, which is one instance of an actual failure: the transformer op AND the stochastic_transformer op compile the same file into the same folder, in parallel, with different options.
Other issues include mangled `build.ninja` files (created by PyTorch):
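The `build.ninja` race can likewise be sketched with a last-writer-wins simulation (hypothetical ninja content; the point is only that both builds write the same file path):

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
ninja_file = root / "build.ninja"  # single file in the shared build folder

def emit_ninja(op_name: str) -> None:
    # The extension builder writes a fresh build.ninja for each op build.
    ninja_file.write_text(
        f"rule compile_{op_name}\n  command = c++ ...  # {op_name} flags\n"
    )

# Interleaving of two parallel builds:
emit_ninja("transformer")             # build A writes its rules
emit_ninja("stochastic_transformer")  # build B overwrites them before A runs ninja
rules_seen_by_A = ninja_file.read_text()
# Build A now invokes ninja against build B's rules.
```

Build A only works correctly if ninja is invoked before build B rewrites the file, which is exactly the timing-dependent behavior described above.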