
doc: Remove suggestion to build extensions in parallel#7899

Open
Flamefire wants to merge 1 commit into deepspeedai:master from Flamefire:doc-parallel

Conversation

@Flamefire
Contributor

As the extensions share the build folder, building them in parallel can cause failures or wrong results because extensions overwrite each other's files.

Closes #949, which is one instance of an actual failure: transformer_op and stochastic_transformer_op compile the same file in the same folder with different options in parallel.

Other issues include mangled ninja build files (created by PyTorch):

      ninja: error: build.ninja:31: expected '=', got lexing error
      on3.12/site-packages/torch/include/torch/csrc/api/include -I/software/...
            ^ near here
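The object-file collision can be illustrated with a small, self-contained Python sketch (the `object_path` helper is hypothetical, written only for illustration; it is not DeepSpeed's or PyTorch's actual build code). The key point is that the object path is derived from the source file name alone, so two ops that compile the same source in a shared build directory target the same `.o`:

```python
import os
import tempfile

def object_path(build_dir, source):
    # Hypothetical model of how ninja/setuptools name object files:
    # the name comes from the source file, not from the extension
    # (op) being built.
    base = os.path.splitext(os.path.basename(source))[0]
    return os.path.join(build_dir, base + ".o")

build_dir = tempfile.mkdtemp()  # the shared build folder
# transformer_op and stochastic_transformer_op both compile this
# source, but with different compiler options:
src = "csrc/transformer/gelu_kernels.cu"

obj_a = object_path(build_dir, src)  # transformer_op's output
obj_b = object_path(build_dir, src)  # stochastic_transformer_op's output
print(obj_a == obj_b)  # True: parallel builds overwrite the same file
```

Because both paths are identical, whichever build finishes last silently wins, and the other op can link against an object compiled with the wrong options.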


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a997f7366f


@Flamefire force-pushed the doc-parallel branch 2 times, most recently from b4b7f0d to 88e6ba8, on March 12, 2026 at 16:02
As the extensions share the build folder, building them in parallel can
cause failures or wrong results due to extensions overwriting
the files of other extensions.

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
@sfc-gh-truwase
Collaborator

@Flamefire thanks for unraveling this old mystery. However, I am concerned that avoiding parallel builds will undesirably increase build times. Moreover, this problem applies to only two ops: transformer_op and stochastic_transformer_op. Would it make sense to move the ops into different modules, e.g. by renaming here to

    def absolute_name(self):
        return f'deepspeed.ops.stochastic_transformer.{self.NAME}_op'

This will require modifying the references as well. What do you think?

@Flamefire
Contributor Author

Unfortunately not. For the case you mentioned it would not be enough, because the conflict is not on the op name but on the files used: the build compiles e.g. csrc/transformer/gelu_kernels.cu to gelu_kernels.o, and both ops do this in the same build folder. The collision is trivially guaranteed because both ops literally use the exact same sources.
Other ops have similar issues as they share some files, e.g. csrc/aio/common/deepspeed_aio_common.cpp

And finally, the build.ninja files created by PyTorch during the extension builds ALL overwrite each other. Parallel builds only worked when each build started soon enough that its build.ninja had not yet been overwritten by another op's build.
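The build.ninja clobbering described above is plain last-writer-wins on a single shared path. A minimal sketch (file contents are assumed placeholders, not the real manifests PyTorch generates):

```python
import os
import tempfile

# Both extension builds write their manifest to the same fixed path
# inside the shared build directory.
build_dir = tempfile.mkdtemp()
ninja_file = os.path.join(build_dir, "build.ninja")

# The first op's build writes its rules...
with open(ninja_file, "w") as f:
    f.write("# rules for transformer_op\n")

# ...then the second op's build overwrites the same file. A ninja
# process that reads the file mid-rewrite sees a truncated or mixed
# manifest, producing errors like "expected '=', got lexing error".
with open(ninja_file, "w") as f:
    f.write("# rules for stochastic_transformer_op\n")

content = open(ninja_file).read()
print(content.strip())  # only the second op's rules survive
```

This is why the failures were timing-dependent: a build succeeded only if it consumed its manifest before the next op's build replaced it.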

@Flamefire
Contributor Author

FWIW: I opened an issue for this bug/unsupported use case with setuptools: pypa/setuptools#5196



Development

Successfully merging this pull request may close these issues.

ImportError: dynamic module does not define module export function (PyInit_transformer_op)

2 participants