Outline & Motivation
The usage of log_dir and save_dir is confusing. Imagine we are implementing a new logger; let's see how we could define the save_dir property.
import os

from lightning.pytorch.loggers import Logger
from typing_extensions import override

class CustomLogger(Logger):
    # (name, version, log_metrics, log_hyperparams omitted for brevity)
    def __init__(self, all_experiment_path):
        self._root_dir = all_experiment_path

    @property
    @override
    def save_dir(self) -> str:
        option1 = os.path.join(self._root_dir, self.name, str(self.version))
        # option2 = self._root_dir
        # option3 = os.path.join(self._root_dir, self.name)
        return option1  # or option2 or option3
The first option includes the experiment name and version as subdirectories; the second and third do not include the version (option 2 is the root alone, option 3 appends only the name).
Since CustomLogger is not a subclass of TensorBoardLogger or CSVLogger, Trainer.log_dir is defined to be logger.save_dir when the first logger is a CustomLogger, according to here.
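Roughly, the resolution looks like this (a paraphrased sketch of the Trainer.log_dir property, not the exact Lightning source; details may differ between versions):

from typing import Optional

import lightning.pytorch as pl
from lightning.pytorch.loggers import CSVLogger, TensorBoardLogger

def resolve_trainer_log_dir(trainer: "pl.Trainer") -> Optional[str]:
    # Paraphrased sketch: only the first logger is consulted, and only
    # TensorBoardLogger/CSVLogger contribute their log_dir; every other
    # logger falls back to save_dir.
    if len(trainer.loggers) > 0:
        first = trainer.loggers[0]
        if isinstance(first, (TensorBoardLogger, CSVLogger)):
            return first.log_dir  # <save_dir>/<name>/<version>
        return first.save_dir     # whatever the custom logger returns
    return trainer.default_root_dir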
The config.yaml file will be saved to logger.save_dir/config.yaml. The checkpoint files will be saved to logger.save_dir/logger.name/logger.version/checkpoints, which with the first option expands to logger._root_dir/logger.name/logger.version/logger.name/logger.version/checkpoints, a duplicated directory hierarchy, according to the __resolve_ckpt_dir() function.
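For reference, the checkpoint directory is rebuilt from scratch (a paraphrased sketch of ModelCheckpoint.__resolve_ckpt_dir(); the exact code may vary between versions):

import os

def resolve_ckpt_dir(trainer, dirpath=None):
    # Paraphrased sketch, not the exact Lightning source: when no explicit
    # dirpath is given, the path is reassembled from save_dir + name +
    # version, even if save_dir already ends in name/version.
    if dirpath is not None:
        return dirpath
    if len(trainer.loggers) > 0:
        logger = trainer.loggers[0]
        save_dir = logger.save_dir if logger.save_dir is not None else trainer.default_root_dir
        version = logger.version
        version = version if isinstance(version, str) else f"version_{version}"
        return os.path.join(save_dir, str(logger.name), version, "checkpoints")
    return os.path.join(trainer.default_root_dir, "checkpoints")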
I DO NOT like it. The following tree view shows this clearly.
<all_experiment_root>
└── experiment_name
├── version_0
│ ├── config.yaml
│ └── experiment_name
│ └── version_0
│ └── checkpoints
│ └── epoch=7-step=3000.ckpt
└── version_1
├── config.yaml
└── experiment_name
└── version_1
└── checkpoints
└── epoch=28-step=10875.ckpt
However, there is currently no way to remove the duplicated hierarchy. If the second or third option is adopted, the config.yaml file is saved in a directory unrelated to the experiment version and will be overwritten by the next experiment. This is not acceptable.
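To make the trade-off concrete, here is a small illustration (the experiment name and version are hypothetical):

import os

root = "<all_experiment_root>"
name, version = "experiment_name", "version_0"

options = {
    "option1": os.path.join(root, name, version),
    "option2": root,
    "option3": os.path.join(root, name),
}
for label, save_dir in options.items():
    # Where the config lands (Trainer.log_dir == save_dir for a custom logger)
    config_path = os.path.join(save_dir, "config.yaml")
    # Where __resolve_ckpt_dir() puts checkpoints
    ckpt_dir = os.path.join(save_dir, name, version, "checkpoints")
    print(f"{label}:\n  {config_path}\n  {ckpt_dir}")

Option 1 duplicates the hierarchy under the checkpoint path; options 2 and 3 reuse the same config.yaml path across versions.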
Pitch
I'm expecting the result hierarchy like this:
<all_experiment_root>
└── experiment_name
├── version_0
│ ├── config.yaml
│ └── checkpoints
│ └── epoch=7-step=3000.ckpt
└── version_1
├── config.yaml
└── checkpoints
└── epoch=28-step=10875.ckpt
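The only workaround I found is to hard-code the checkpoint location so that __resolve_ckpt_dir() never rebuilds it (a hedged sketch, assuming option 1's save_dir and that CustomLogger implements the remaining abstract methods); it defeats the purpose of resolving the path from the logger, so it is not a real fix:

import os

from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

logger = CustomLogger("<all_experiment_root>")
# An explicit dirpath takes precedence over the save_dir/name/version logic.
checkpoint_callback = ModelCheckpoint(dirpath=os.path.join(logger.save_dir, "checkpoints"))
trainer = Trainer(logger=logger, callbacks=[checkpoint_callback])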
I tried to fix this problem and was willing to submit a PR, but fixing this issue turned out to be more complicated than I imagined.
Additional context
Because I'm new to Lightning, I'm not sure I fully understand the Logger's design. From the current code, several attributes such as root_dir, log_dir, and save_dir are not clearly defined. For example, in the definition of the lightning.fabric.loggers.tensorboard.TensorBoardLogger class, the experiment name is not included in the root_dir path, but for lightning.pytorch.loggers.tensorboard.TensorBoardLogger, it is included.
The fact that the root_dir and log_dir properties can be None introduces unnecessary complexity. The documentation of log_dir says:
Return the root directory where all versions of an experiment get saved, or `None` if the logger does not save data locally.
If I understand it correctly, the root_dir and log_dir properties should never be None for any concrete logger, because config.yaml should always be saved to somewhere like root_dir/name/version. Even if a logger does not save config.yaml locally, we can still initialize root_dir to the current directory ('.') and simply never use it.
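As a sketch of what I mean (this logger class is hypothetical, not an existing Lightning API):

from lightning.pytorch.loggers import Logger

class RemoteOnlyLogger(Logger):
    # Hypothetical logger that only sends data to a remote server;
    # name, version, log_metrics, log_hyperparams omitted for brevity.
    def __init__(self, root_dir: str = "."):
        # root_dir is never None: it defaults to the current directory
        # and simply goes unused when nothing is written locally.
        self._root_dir = root_dir

    @property
    def root_dir(self) -> str:
        return self._root_dir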
I have two questions:
- Why do we need the save_dir property? Why isn't always setting root_dir to <root>/name and log_dir to <root>/name/version sufficient?
- What makes TensorBoardLogger and CSVLogger special, such that Trainer.log_dir resolves to logger.log_dir instead of logger.save_dir?