- 
                Notifications
    
You must be signed in to change notification settings  - Fork 75
 
Basic implementation for mlos_benchd service #949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| self._opt_targets = opt_targets | ||
| self._ts_start = ts_start or datetime.now(UTC) | ||
| self._ts_end: datetime | None = None | ||
| self._status = Status.PENDING | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should match what was stored in the backend for resumable Experiments, right?
| This method is called by `Storage.Experiment.__enter__()`. | ||
| """ | ||
| self._status = Status.RUNNING | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some asserts on expected status to check for invalid state transitions.
Wouldn't be terrible to document the expected state transitions in a README.md or docstring either.
| """ | ||
| self._status = Status.RUNNING | ||
| self._driver_name = platform.node() | ||
| self._driver_pid = os.getpid() | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem ok to initialize the values from None, but might be problematic for resuming an Experiment.
There are a few cases I can think of:
- An individual 
mlos_benchdriver process fails and needs to be restarted. - The 
mlos_benchdservice dies, but themlos_benchprocess is still running. - The whole driver host fails and either needs to be restarted on that same backend or else picked up by a new one (for simplicity, let's assume we only support the former for now. We can add a heartbeat mechanism later to support the latter).
 
| 
               | 
          ||
| def _setup(self) -> None: | ||
| super()._setup() | ||
| def _ensure_persisted(self) -> None: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def _ensure_persisted(self) -> None: | |
| def _save(self) -> None: | 
or
| def _ensure_persisted(self) -> None: | |
| def _try_save(self) -> None: | 
or
| def _ensure_persisted(self) -> None: | |
| def _persist(self) -> None: | 
| def _setup(self) -> None: | ||
| super()._setup() | ||
| self._ensure_persisted() | ||
| with self._engine.begin() as conn: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need to separate that out to an _update method, or just incorporate it into _save
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also rename _save to _create to separate the INSERT from UPDATE calls
| except exc.SQLAlchemyError: | ||
| # probably a conflict | ||
| trans.rollback() | ||
| 
               | 
          
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| 
               | 
          ||
| 
               | 
          ||
| if __name__ == "__main__": | ||
| parser = argparse.ArgumentParser(description="mlos_benchd") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a long text description, similar to mlos_bench.launcher
| help="Number of workers to use. Default is 1.", | ||
| ) | ||
| parser.add_argument( | ||
| "--poll_interval", | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also provide poll-interval variants, like mlos_bench.launcher
| "--environment", | ||
| root_env_config, | ||
| "--experiment_id", | ||
| exp_id, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll eventually need to include other things here too:
- cli config
 - globals (could be subsumed by cli config)
 - working dir (maybe we adjust the storage backend to store the target directory and cli-config and optionally globals instead of root_env_config)
 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start! Thanks so much ❤️
I left a few comments for initial changes.
We'll also need tests.
| This script is responsible for polling the storage for runnable experiments and | ||
| executing them in parallel. | ||
| See the current ``--help`` `output for details. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| See the current ``--help`` `output for details. | |
| See the current ``--help`` output for details. | 
| """ | ||
| mlos_bench background execution daemon. | ||
| This script is responsible for polling the storage for runnable experiments and | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| This script is responsible for polling the storage for runnable experiments and | |
| This script is responsible for polling the :py:mod:`~mlos_bench.storage` for runnable :py:class:`.Experiment`s and | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor tweaks to help make all of the docstring generation cross referencing. Might need some tweaks.
| default=1, | ||
| help="Polling interval in seconds. Default is 1.", | ||
| ) | ||
| _main(parser.parse_args()) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For testing, may also want some sort of hidden argument or environment variable used to set the MAX_ITERATIONS.
Pull Request
Basic implementation for mlos_benchd service
This PR introduces:
Experimentvia the storage API, separate from running via CLI.mlos_benchdscript that polls storage for runnable experiments, then executes them.This allows for the separation of experiment creation (e.g. scheduling), and their execution (on potentially multiple hosts).
The new
mlos_benchdscript can run on any host to poll for new experiments, as long as they monitor the right storage backend.Builds upon schema changes from #931
Example usage
Create new experiment, via notebook or any python environment:
mlos_benchd:
Limitations:
Description
Type of Change
Indicate the type of change by choosing one (or more) of the following:
Testing
Manual testing, unit tests to come.
Additional Notes (optional)
Add any additional context or information for reviewers.