-
Notifications
You must be signed in to change notification settings - Fork 2
Closed
Description
When using Checkpoint() as a context manager, the checkpoint data gets deleted after the scope is exited. This prevents accessing the same checkpoint data multiple times.
Environment
Python 3.11.9
Metaflow 2.12.29+obcheckpoint(0.2.1)
Reproducing the bug
- Save this as 'TestFlow.py'
from metaflow import FlowSpec, step, checkpoint
import os
class TestFlow(FlowSpec):
@step
def start(self):
self.next(self.checkpoint_step)
@checkpoint
@step
def checkpoint_step(self):
from metaflow import current
cp_path = os.path.join(current.checkpoint.directory, "testing.txt")
with open(cp_path, "w", encoding="utf8") as f:
f.write("TESTING")
current.checkpoint.save()
self.next(self.end)
@step
def end(self):
pass
if __name__ == "__main__":
run = TestFlow()
print(run)-
Run
python TestFlow.py run -
Run this script to list the checkpoint folder twice:
from metaflow import Checkpoint, Flow
import os
task = Flow("TestFlow").latest_successful_run["checkpoint_step"].task
def list_checkpoint_dir(task):
with Checkpoint() as cp:
latest = cp.list(task=task)[0]
cp.load(latest)
print(f"Files in the directory: {cp.directory}")
print(os.listdir(cp.directory))
# First listing - show the testing.txt
# Outputs:
# > Files in the directory: /var/folders/ql/...
# > ['testing.txt']
list_checkpoint_dir(task)
# Second listing - the testing.txt is missing
# Outputs:
# > Files in the directory: /var/folders/ql/....
# > []
list_checkpoint_dir(task)When I look into the checkpoint storage location under .metaflow directory, thetesting.txt is missing, so it is indeed removed from the datastore.
Diagnosis
I traced the bug to this function:
Lines 379 to 412 in aff2bc5
| def _load_objects( | |
| self, | |
| key, | |
| # The key here is ideally a key from the datastore that we want to load | |
| # This can be a checkpoint key or a model key. | |
| local_directory, | |
| # Its assumed here that the local path where everything is getting | |
| # extracted is a directory. | |
| ): | |
| list_path_results = list(self.list_paths([key], recursive=True)) | |
| # print(list_path_results) | |
| keys = [p.key for p in list_path_results] | |
| # We directly call load bytes here because `self.get_file` will add the root of the datastore | |
| # to the path and we don't want that. | |
| with self._backend.load_files(keys) as get_results: | |
| for list_key, path, meta in get_results: | |
| if path is None: | |
| continue | |
| path_within_dir = os.path.relpath(list_key, self.resolve_key_path(key)) | |
| # We need to construct the right path over here based on the | |
| # where the key is present in the object. | |
| # Figure a relative path from the end of the key | |
| # to the actual file/directory within it. | |
| # We do this because we want the entire directory structure | |
| # to be preserved when we download the objects on local. | |
| move_to_path = os.path.join(local_directory, path_within_dir) | |
| Path(move_to_path).parent.mkdir(parents=True, exist_ok=True) | |
| shutil.move( | |
| path, | |
| move_to_path, | |
| ) | |
| return local_directory |
If I change the shutil.move call to shutil.copy, this issue disappeared. However, this copy operation might be too expensive for some. Does anyone have an idea for a proper fix?
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels