Checkpoint context manager deletes underlying data after exiting scope #2

@ccstanley

Description

When `Checkpoint()` is used as a context manager, the underlying checkpoint data is deleted from the datastore once the scope is exited. This makes it impossible to load the same checkpoint more than once.

Environment

Python 3.11.9
Metaflow 2.12.29+obcheckpoint(0.2.1)

Reproducing the bug

  1. Save this as 'TestFlow.py'
from metaflow import FlowSpec, step, checkpoint
import os


class TestFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.checkpoint_step)

    @checkpoint
    @step
    def checkpoint_step(self):
        from metaflow import current

        cp_path = os.path.join(current.checkpoint.directory, "testing.txt")
        with open(cp_path, "w", encoding="utf8") as f:
            f.write("TESTING")

        current.checkpoint.save()

        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    run = TestFlow()
    print(run)
  2. Run `python TestFlow.py run`

  3. Run this script to list the checkpoint folder twice:

from metaflow import Checkpoint, Flow
import os

task = Flow("TestFlow").latest_successful_run["checkpoint_step"].task


def list_checkpoint_dir(task):
    with Checkpoint() as cp:
        latest = cp.list(task=task)[0]
        cp.load(latest)

        print(f"Files in the directory: {cp.directory}")
        print(os.listdir(cp.directory))


# First listing - show the testing.txt
# Outputs:
#     > Files in the directory: /var/folders/ql/...
#     > ['testing.txt']
list_checkpoint_dir(task)

# Second listing - the testing.txt is missing
# Outputs:
#     > Files in the directory: /var/folders/ql/....
#     > []
list_checkpoint_dir(task)

When I look into the checkpoint storage location under the .metaflow directory, the testing.txt is missing, so it is indeed removed from the datastore.

Diagnosis

I traced the bug to this function:

def _load_objects(
    self,
    key,
    # The key here is ideally a key from the datastore that we want to load.
    # This can be a checkpoint key or a model key.
    local_directory,
    # It's assumed here that the local path where everything is getting
    # extracted is a directory.
):
    list_path_results = list(self.list_paths([key], recursive=True))
    keys = [p.key for p in list_path_results]
    # We directly call `load_files` here because `self.get_file` would add
    # the root of the datastore to the path and we don't want that.
    with self._backend.load_files(keys) as get_results:
        for list_key, path, meta in get_results:
            if path is None:
                continue
            path_within_dir = os.path.relpath(list_key, self.resolve_key_path(key))
            # We need to construct the right path over here based on where
            # the key is present in the object: figure out a relative path
            # from the end of the key to the actual file/directory within it.
            # We do this because we want the entire directory structure to be
            # preserved when we download the objects locally.
            move_to_path = os.path.join(local_directory, path_within_dir)
            Path(move_to_path).parent.mkdir(parents=True, exist_ok=True)
            shutil.move(
                path,
                move_to_path,
            )
    return local_directory
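
The relevant detail is the final `shutil.move`: it relocates the cached file rather than duplicating it, so after loading, the source no longer exists at its original location. A minimal standard-library illustration of the difference (temporary paths only, nothing Metaflow-specific):

```python
import os
import shutil
import tempfile

src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()

# A file standing in for a stored checkpoint object.
src = os.path.join(src_dir, "testing.txt")
with open(src, "w", encoding="utf8") as f:
    f.write("TESTING")

# shutil.move removes the source after relocating it...
shutil.move(src, os.path.join(dst_dir, "testing.txt"))
print(os.path.exists(src))  # False: the original is gone

# ...whereas shutil.copy leaves the source in place.
src2 = os.path.join(src_dir, "testing2.txt")
with open(src2, "w", encoding="utf8") as f:
    f.write("TESTING")
shutil.copy(src2, os.path.join(dst_dir, "testing2.txt"))
print(os.path.exists(src2))  # True: the source survives
```

This matches the observed behavior: the first `list_checkpoint_dir` call moves the stored file out, and the second call finds nothing left to load.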

If I change the shutil.move call to shutil.copy, the issue disappears. However, copying may be too expensive for large checkpoints. Does anyone have an idea for a proper fix?
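One possible middle ground (a sketch only, not tested against Metaflow internals; `link_or_copy` is a hypothetical helper I'm naming here): try a hard link first, which is nearly free and leaves the original file intact, and fall back to a real copy only when linking fails, e.g. across filesystems:

```python
import os
import shutil


def link_or_copy(src: str, dst: str) -> None:
    """Place `src`'s contents at `dst` while keeping `src` intact.

    A hard link costs no extra disk space or I/O, but only works
    within a single filesystem; on failure (e.g. EXDEV for a
    cross-device link, or an unsupported filesystem) fall back
    to a full copy that preserves metadata.
    """
    try:
        os.link(src, dst)
    except OSError:
        shutil.copy2(src, dst)
```

In `_load_objects`, `shutil.move(path, move_to_path)` would then become `link_or_copy(path, move_to_path)`: the stored original survives, so a second `Checkpoint()` context can still load it, and the common same-filesystem case stays cheap.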
