Fix: filesystem edge case in ModelCheckpoint #18252

Open
wants to merge 27 commits into
base: master
from

Conversation

schmidt-ai
Contributor

Fixes #17912.

github-actions bot added the label "pl" (Generic label for PyTorch Lightning package) on Aug 7, 2023
schmidt-ai changed the title from "fix filesystem bug" to "Fix: filesystem bug in ModelCheckpoint" on Aug 7, 2023
schmidt-ai changed the title from "Fix: filesystem bug in ModelCheckpoint" to "Fix: filesystem edge case in ModelCheckpoint" on Aug 7, 2023
schmidt-ai marked this pull request as ready for review on August 8, 2023 18:52
schmidt-ai requested a review from Borda as a code owner on August 8, 2023 19:22
Contributor

@awaelchli awaelchli left a comment

Thanks for sending the PR 🎉

Comment on lines 717 to 720
if final_path is None:
    assert trainer.ckpt_path == final_path
else:
    assert mc._fs._strip_protocol(trainer.ckpt_path) == Path(final_path).as_posix()
Contributor

The test here isn't failing without the fix. It never gets a path with a protocol. We need to write a new test where we pretend to be loading from a path with a protocol.

A unit test in test_model_checkpoint.py that calls _find_last_checkpoints should be enough to cover this case. Should I help with that?
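
One possible shape for such a test, as a library-free sketch: the candidate list deliberately contains paths that carry a protocol, which is the input the existing test never sees. `strip_protocol` and `pick_last_checkpoint` are hypothetical stand-ins written for this illustration. They are not the Lightning or fsspec implementations, and a real test would presumably call `_find_last_checkpoints` instead.

```python
from __future__ import annotations


def strip_protocol(path: str) -> str:
    """Hypothetical helper: drop a leading '<protocol>://' if there is one."""
    _, sep, rest = path.partition("://")
    return rest if sep else path


def pick_last_checkpoint(candidates: list[str]) -> set[str]:
    """Hypothetical stand-in: keep candidates whose file name starts with 'last'.

    Paths are normalized by stripping the protocol before they are returned.
    """
    found = set()
    for candidate in candidates:
        bare = strip_protocol(candidate)
        name = bare.rsplit("/", 1)[-1]
        if name.startswith("last"):
            found.add(bare)
    return found


def test_pick_last_checkpoint_with_protocol() -> None:
    # The inputs carry a protocol on purpose; bare local paths would not
    # exercise the normalization step at all.
    candidates = [
        "s3://bucket/run/epoch=1.ckpt",
        "s3://bucket/run/last.ckpt",
    ]
    assert pick_last_checkpoint(candidates) == {"bucket/run/last.ckpt"}


test_pick_last_checkpoint_with_protocol()
```

The point of the sketch is only the test input: if every candidate were a bare path, the test would pass with or without protocol handling.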

Contributor Author

Yeah I definitely got caught up here, if you can tell from the commit history 😂 Any help would be much appreciated.

I was also concerned about the impact on ModelCheckpoint.file_exists. I think the core question is: should the selected fsspec filesystem be an attribute of ModelCheckpoint or of Trainer? Maybe I'm thinking about it wrong...

Contributor Author

@schmidt-ai schmidt-ai Aug 9, 2023

> The test here isn't failing without the fix. It never gets a path with a protocol.

Hmm, it does fail for me locally. For local filepaths, LocalFileSystem._unstrip_protocol adds the file:// prefix to paths.
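
A small stdlib-only sketch of why the assertion is sensitive to this: once a `file://` prefix has been added to a local path, plain string equality against the POSIX form of that path no longer holds until the prefix is stripped again. The helpers `add_file_protocol` and `strip_protocol` are hypothetical and only mimic the behaviour described above; they are not fsspec's `_unstrip_protocol` or `_strip_protocol`, and the example path is made up.

```python
from pathlib import PurePosixPath


def add_file_protocol(path: str) -> str:
    """Hypothetical helper: prefix a bare local path with 'file://'."""
    return path if "://" in path else f"file://{path}"


def strip_protocol(path: str) -> str:
    """Hypothetical helper: drop a leading '<protocol>://' if there is one."""
    _, sep, rest = path.partition("://")
    return rest if sep else path


final_path = "/tmp/run/last.ckpt"           # assumed example path
ckpt_path = add_file_protocol(final_path)   # what the trainer would report

expected = PurePosixPath(final_path).as_posix()
print(ckpt_path)                              # file:///tmp/run/last.ckpt
print(ckpt_path == expected)                  # False: the prefix breaks equality
print(strip_protocol(ckpt_path) == expected)  # True: compare after stripping
```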

awaelchli added the labels "callback: model checkpoint" and "community" (This PR is from the community) on Aug 9, 2023
awaelchli added this to the 2.1 milestone on Aug 9, 2023
awaelchli added the label "bug" (Something isn't working) on Aug 9, 2023
awaelchli modified the milestones: 2.1, 2.0.x on Aug 9, 2023
@schmidt-ai
Contributor Author

@awaelchli I could take a stab at adding a test case if you'd prefer? Though I may need a little additional detail on your idea for it!

@codecov

codecov bot commented Oct 12, 2023

Codecov Report

Merging #18252 (8a760f2) into master (f52fedc) will decrease coverage by 35%.
The diff coverage is 100%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #18252      +/-   ##
==========================================
- Coverage      84%      49%     -35%     
==========================================
  Files         439      431       -8     
  Lines       34484    34341     -143     
==========================================
- Hits        28877    16890   -11987     
- Misses       5607    17451   +11844     

@Borda
Member

Borda commented Nov 18, 2023

@schmidt-ai @awaelchli how is it going here? :)


gitguardian bot commented Jan 16, 2024

✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.

awaelchli modified the milestones: 2.1.x, 2.2.x on Feb 8, 2024
mergify bot removed the label "has conflicts" on Feb 16, 2024
awaelchli modified the milestones: 2.2.x, 2.3.x on Jun 13, 2024
awaelchli modified the milestones: 2.3.x, 2.4.x on Aug 7, 2024
Labels
bug (Something isn't working)
callback: model checkpoint
community (This PR is from the community)
has conflicts
pl (Generic label for PyTorch Lightning package)
Development

Successfully merging this pull request may close these issues.

Edge case causes incorrect filesystem to be selected for finding cloud checkpoints
3 participants