Skip to content

[DeepSpeed] Fix evaluate()/predict() before train()#44889

Open
roycho96 wants to merge 7 commits intohuggingface:mainfrom
roycho96:fix/evaluate-before-train
Open

[DeepSpeed] Fix evaluate()/predict() before train()#44889
roycho96 wants to merge 7 commits intohuggingface:mainfrom
roycho96:fix/evaluate-before-train

Conversation

@roycho96
Copy link

What does this PR do?

Calling trainer.evaluate() before trainer.train() with DeepSpeed is broken in three ways:

  1. ZeRO-3 stale state crash: evaluate() creates an inference engine. train() starts with accelerator.free_memory() which destroys it and sets deepspeed_engine_wrapped = None, but Trainer's model_wrapped/deepspeed still hold stale refs, causing AttributeError: 'NoneType' has no attribute 'backward'.

  2. ZeRO-3 shared config mutation: deepspeed_init(inference=True) mutates the shared config in-place: trainer_config_finalize(num_training_steps=0) bakes scheduler "auto" values to 0 (causing mismatch on subsequent train()), and del_config_sub_tree("optimizer") removes the optimizer config (fallback to HF optimizer).

  3. ZeRO-1/2 evaluate blocked: evaluation_loop() unconditionally calls deepspeed_init(inference=True) which raises ValueError for non-ZeRO-3 stages. ZeRO-1/2 have full parameters on each GPU and don't need an inference engine.

evaluate() before train() is a valid pattern. For example, baseline metric collection, conditional training based on eval results, or data pipeline validation. eval_on_start=True covers some cases but doesn't help when the user needs eval metrics before deciding whether to train.

Fix

  • Reset stale Trainer refs at start of _prepare_for_training() when deepspeed_engine_wrapped is already None but model_wrapped still points to a destroyed engine. Follows existing pattern in _update_auto_batch_size().

  • In evaluation_loop(), gate deepspeed_init(inference=True) on is_zero3() only. For ZeRO-1/2, use prepare_model(evaluation_mode=True) to bypass DS engine creation entirely. Protect shared config with backup/restore in try/finally.

eval_on_start=True is unaffected.

Related issues

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@SunMarc @3outeille

@github-actions
Copy link
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44889&sha=c1f161

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant