
Inconsistent usage of paths in the experimental loggers #20521

@expnn


Outline & Motivation

The usage of log_dir and save_dir is confusing.

Imagine we are implementing a new logger; let's see how we can define the save_dir property.

```python
import os

from typing_extensions import override

from lightning.pytorch.loggers import Logger


class CustomLogger(Logger):
    def __init__(self, all_experiment_path: str):
        self._root_dir = all_experiment_path

    @property
    @override
    def save_dir(self) -> str:
        # option 1: include experiment name and version as subdirectories
        option1 = os.path.join(self._root_dir, self.name, str(self.version))
        # option2 = self._root_dir                           # root only
        # option3 = os.path.join(self._root_dir, self.name)  # root + name
        return option1  # or option2 or option3
```

The first option includes the experiment name and version as subdirectories; the second returns the root as-is, and the third appends only the name.

Since CustomLogger is not a subclass of TensorBoardLogger or CSVLogger, Trainer.log_dir is defined to be logger.save_dir when the first logger is a CustomLogger, according to here.
The config.yaml file will then be saved to logger.save_dir/config.yaml, while the checkpoint files will be saved to logger.save_dir/logger.name/logger.version, which
expands to logger._root_dir/logger.name/logger.version/logger.name/logger.version, a duplicated directory hierarchy, according to the __resolve_ckpt_dir() function.
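The duplication can be reproduced with plain path arithmetic. This is only a sketch of the effect, not Lightning's actual code, and the directory names are placeholders:

```python
import os

# Placeholder values standing in for a real experiment layout.
root = "all_experiment_root"
name, version = "experiment_name", "version_0"

# Option 1: save_dir already contains name/version.
save_dir = os.path.join(root, name, version)

# Checkpoint resolution then appends name/version/checkpoints on top of
# save_dir, which duplicates the hierarchy.
ckpt_dir = os.path.join(save_dir, name, version, "checkpoints")
print(ckpt_dir)
```

Running this prints the same doubled `experiment_name/version_0` nesting shown in the tree below.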

I DO NOT like it. The following tree view shows this clearly.

<all_experiment_root>
└── experiment_name
    ├── version_0
    │   ├── config.yaml
    │   └── experiment_name
    │       └── version_0
    │           └── checkpoints
    │               └── epoch=7-step=3000.ckpt
    └── version_1
        ├── config.yaml
        └── experiment_name
            └── version_1
                └── checkpoints
                    └── epoch=28-step=10875.ckpt

However, there is currently no way to remove the duplicated hierarchy. If the second or third option is adopted, the config.yaml file will be saved in a directory that does not depend on the experiment version, and will be overwritten by the next experiment. This is not acceptable.
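A quick sketch of why options 2 and 3 lose the config file (placeholder names, not Lightning code):

```python
import os

root = "all_experiment_root"

# Option 2: save_dir is the bare root, so config.yaml resolves to the same
# path regardless of which version is running.
config_for_version_0 = os.path.join(root, "config.yaml")
config_for_version_1 = os.path.join(root, "config.yaml")

# The second run clobbers the first run's config.
assert config_for_version_0 == config_for_version_1
```

Option 3 has the same problem one level deeper: the path contains the name but still not the version.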

Pitch

I'm expecting a resulting hierarchy like this:

<all_experiment_root>
└── experiment_name
    ├── version_0
    │   ├── config.yaml
    │   └── checkpoints
    │       └── epoch=7-step=3000.ckpt
    └── version_1
        ├── config.yaml
        └── checkpoints
            └── epoch=28-step=10875.ckpt
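One way to get this layout would be for checkpoint resolution to append only the checkpoints folder to a version-specific directory. The resolve_ckpt_dir function below is hypothetical, not current Lightning behavior:

```python
import os


def resolve_ckpt_dir(log_dir: str) -> str:
    # Hypothetical: trust the logger's version-specific log_dir and nest
    # only the checkpoints folder under it, instead of re-appending
    # name/version.
    return os.path.join(log_dir, "checkpoints")


ckpt_dir = resolve_ckpt_dir(
    os.path.join("all_experiment_root", "experiment_name", "version_0")
)
print(ckpt_dir)
```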

I tried to fix this problem and was willing to submit a PR, but fixing the issue turned out to be more complicated than I imagined.

Additional context

Because I'm new to Lightning, I'm not sure I fully understand the Logger design. In the current code, several attributes such as root_dir, log_dir, and save_dir are not clearly defined. For example, in the definition of the lightning.fabric.loggers.tensorboard.TensorBoardLogger class, the experiment name is not included in the root_dir path, but for lightning.pytorch.loggers.tensorboard.TensorBoardLogger, it is.

That the root_dir and log_dir properties can be None introduces unnecessary complexity. The documentation of log_dir says:

Return the root directory where all versions of an experiment get saved, or `None` if the logger does not save data locally.

If I understand it correctly, the properties root_dir and log_dir should never be None for any concrete logger, because config.yaml should always be saved somewhere like root_dir/name/version. Even if a logger does not save config.yaml locally, we could still initialize root_dir to the current directory ('.') without actually using it.
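The suggestion could look like this. SketchLogger and its defaults are made up for illustration, not an existing Lightning API:

```python
import os


class SketchLogger:
    """Illustrative only: root_dir and log_dir always return a concrete path."""

    def __init__(self, root_dir=None, name="default", version="version_0"):
        # Fall back to the current directory instead of allowing None.
        self._root = root_dir if root_dir is not None else "."
        self._name = name
        self._version = version

    @property
    def root_dir(self) -> str:
        # Always <root>/name.
        return os.path.join(self._root, self._name)

    @property
    def log_dir(self) -> str:
        # Always <root>/name/version.
        return os.path.join(self.root_dir, self._version)


logger = SketchLogger()
print(logger.log_dir)
```

With this invariant, downstream code never has to branch on whether the logger saves locally.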

I have two questions:

  • Why do we need the save_dir property? Why isn't it sufficient to always set root_dir to <root>/name and log_dir to <root>/name/version?

  • What makes TensorBoardLogger and CSVLogger special, such that Trainer.log_dir becomes logger.log_dir instead of logger.save_dir?

cc @justusschock @awaelchli
