[DeepSpeed] Fix evaluate()/predict() before train() by roycho96 · Pull Request #44889 · huggingface/transformers

roycho96 · 2026-03-20T15:08:32Z

What does this PR do?

Calling trainer.evaluate() before trainer.train() with DeepSpeed is broken in three ways:

ZeRO-3 stale state crash: evaluate() creates an inference engine. train() then starts with accelerator.free_memory(), which destroys that engine and sets deepspeed_engine_wrapped = None, but the Trainer's model_wrapped and deepspeed attributes still hold stale references, causing AttributeError: 'NoneType' object has no attribute 'backward'.
ZeRO-3 shared config mutation: deepspeed_init(inference=True) mutates the shared config in place: trainer_config_finalize(num_training_steps=0) bakes the scheduler "auto" values to 0 (causing a mismatch on the subsequent train()), and del_config_sub_tree("optimizer") removes the optimizer config (so training falls back to the HF optimizer).
ZeRO-1/2 evaluate blocked: evaluation_loop() unconditionally calls deepspeed_init(inference=True) which raises ValueError for non-ZeRO-3 stages. ZeRO-1/2 have full parameters on each GPU and don't need an inference engine.
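
The stale-state crash can be illustrated with a toy model. This is only a sketch under assumptions: ToyEngine, ToyAccelerator, and ToyTrainer are hypothetical stand-ins, not the real Trainer, Accelerate, or DeepSpeed classes, and the "skip re-initialisation" check is one plausible way a stale reference leads to the error, not a copy of the actual code path.

```python
class ToyEngine:
    """Hypothetical stand-in for a DeepSpeed engine."""

    def backward(self, loss):
        return f"backward({loss})"


class ToyAccelerator:
    """Hypothetical stand-in for the accelerator, which owns the engine."""

    def __init__(self):
        self.deepspeed_engine_wrapped = None

    def free_memory(self):
        # As described in the PR: the engine reference is dropped.
        self.deepspeed_engine_wrapped = None

    def backward(self, loss):
        # Raises AttributeError if the engine was dropped and never rebuilt.
        return self.deepspeed_engine_wrapped.backward(loss)


class ToyTrainer:
    """Hypothetical trainer that keeps its own reference to the engine."""

    def __init__(self):
        self.accelerator = ToyAccelerator()
        self.model_wrapped = None

    def _init_engine(self):
        engine = ToyEngine()
        self.accelerator.deepspeed_engine_wrapped = engine
        self.model_wrapped = engine

    def evaluate(self):
        self._init_engine()

    def train(self):
        self.accelerator.free_memory()
        # Assumption: a non-None model_wrapped is taken to mean "already set up".
        if self.model_wrapped is None:
            self._init_engine()
        return self.accelerator.backward(1.0)


def reproduce():
    """Return the error message raised by evaluate() followed by train()."""
    trainer = ToyTrainer()
    trainer.evaluate()
    try:
        trainer.train()
    except AttributeError as err:
        return str(err)
    return None
```

Calling train() on a fresh ToyTrainer succeeds, while reproduce() returns the AttributeError message, because the trainer-side reference outlives the engine that the accelerator dropped.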

Calling evaluate() before train() is a valid pattern, for example for baseline metric collection, conditional training based on eval results, or data pipeline validation. eval_on_start=True covers some cases but doesn't help when the user needs eval metrics before deciding whether to train.

Fix

Reset stale Trainer refs at the start of _prepare_for_training() when deepspeed_engine_wrapped is already None but model_wrapped still points to a destroyed engine. This follows the existing pattern in _update_auto_batch_size().
In evaluation_loop(), gate deepspeed_init(inference=True) on is_zero3() only. For ZeRO-1/2, use prepare_model(evaluation_mode=True) to bypass DS engine creation entirely. Protect the shared config with a backup/restore in try/finally.
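
The two fixes above can be sketched in isolation. This is a minimal illustration, not the code in this PR: ToyDSConfig, toy_inference_init, toy_prepare_for_eval, reset_stale_refs, and the attribute layout of the toy trainer are all hypothetical, and the config is a plain dict rather than the real DeepSpeed config object.

```python
import copy
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def preserved_config(holder):
    """Restore holder.config in place after the block, even if it raises."""
    backup = copy.deepcopy(holder.config)
    try:
        yield holder.config
    finally:
        # Restore in place so other holders of the same dict see the original values.
        holder.config.clear()
        holder.config.update(backup)


class ToyDSConfig:
    """Hypothetical stand-in for the shared DeepSpeed config wrapper."""

    def __init__(self, config, stage):
        self.config = config
        self.stage = stage

    def is_zero3(self):
        return self.stage == 3


def toy_inference_init(ds_config):
    """Mimic the in-place mutations described for inference initialisation."""
    ds_config.config["scheduler"]["params"]["total_num_steps"] = 0
    ds_config.config.pop("optimizer", None)
    return "inference-engine"


def toy_prepare_for_eval(ds_config):
    """Build an inference engine only for ZeRO-3, and protect the config."""
    if ds_config.is_zero3():
        with preserved_config(ds_config):
            return toy_inference_init(ds_config)
    # ZeRO-1/2: full parameters are on each GPU, so no engine is needed.
    return "plain-model"


def reset_stale_refs(trainer):
    """Drop trainer-side engine references once the accelerator's engine is gone."""
    engine_gone = trainer.accelerator.deepspeed_engine_wrapped is None
    if engine_gone and trainer.model_wrapped is not trainer.model:
        trainer.model_wrapped = trainer.model
        trainer.deepspeed = None


def toy_trainer_with_stale_engine():
    """Build a hypothetical trainer whose engine was already destroyed."""
    model = object()
    return SimpleNamespace(
        accelerator=SimpleNamespace(deepspeed_engine_wrapped=None),
        model=model,
        model_wrapped=object(),  # stale reference to a destroyed engine
        deepspeed=object(),
    )
```

With a ZeRO-3 toy config, toy_prepare_for_eval() returns the engine placeholder and leaves the "auto" scheduler value and the optimizer section intact afterwards. With stage 1 or 2 it returns the plain model without touching the config.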

eval_on_start=True is unaffected.

Related issues

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

@SunMarc @3outeille

github-actions · 2026-03-21T11:06:07Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44889&sha=c1f161

roycho96 added 7 commits March 20, 2026 23:54

fix: reset stale DeepSpeed inference engine refs before training setup

3ac3a11

fix: fix stale state conditions

438b0f9

add test

f2da58d

fix: allow evaluation before train for DeepSpeed ZeRO-2

d9f6f3d

add test to verify DS config survives evaluate() before train()

95b6360

format test file

d65a30e

Merge branch 'main' into fix/evaluate-before-train

c1f1618

roycho96 wants to merge 7 commits into huggingface:main from roycho96:fix/evaluate-before-train