Basic implementation for mlos_benchd service #949

eujing · 2025-02-05T23:11:56Z

Pull Request

Basic implementation for mlos_benchd service

This PR introduces:

Creation of Experiment via the storage API, separate from running via CLI.
A new mlos_benchd script that polls storage for runnable experiments, then executes them.

This separates experiment creation (e.g. scheduling) from experiment execution (potentially on multiple hosts).
The new mlos_benchd script can run on any host to poll for new experiments, as long as it monitors the right storage backend.
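
For reviewers who want the shape of the loop without reading the diff, here is a minimal sketch of the polling behaviour described above. It is not the code in this PR: `fetch_runnable`, `run_experiment`, and `max_iterations` are hypothetical names, and the storage lookup and the launch of mlos_bench are stubbed out as plain callables.

```python
import time
from typing import Callable, Optional


def poll_and_run(
    fetch_runnable: Callable[[], Optional[str]],
    run_experiment: Callable[[str], None],
    poll_interval: float = 1.0,
    max_iterations: Optional[int] = None,
) -> int:
    """Poll for runnable experiments and run each one; return how many ran."""
    executed = 0
    iteration = 0
    # max_iterations=None means "poll forever", as a daemon would.
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        exp_id = fetch_runnable()  # hypothetical: ask storage for a runnable experiment ID
        if exp_id is None:
            print(f"No runnable experiment found. Sleeping for {poll_interval} second(s).")
            time.sleep(poll_interval)
            continue
        run_experiment(exp_id)  # hypothetical: launch mlos_bench for this experiment
        executed += 1
    return executed
```

Bounding the loop with `max_iterations` is only there to make the sketch testable; a real daemon would presumably leave it unset.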

Builds upon schema changes from #931

Example usage

Create new experiment, via notebook or any python environment:

from mlos_bench.storage import from_config
storage = from_config(config="mlos_bench/mlos_bench/config/storage/sqlite.jsonc")
exp = storage.experiment(
    experiment_id="test-exp-01",
    trial_id=0,
    root_env_config="mlos_bench/mlos_bench/config/environments/apps/fake/test_local_env.jsonc",
    description="description",
    tunables={},
    opt_targets={"score": "min"},
)
exp.save()

mlos_benchd:

/workspaces/MLOS (main) $ python mlos_bench/mlos_bench/mlos_benchd.py --storage "mlos_bench/mlos_bench/config/storage/sqlite.jsonc"
...
No runnable experiment found. Sleeping for 1 second.
No runnable experiment found. Sleeping for 1 second.
No runnable experiment found. Sleeping for 1 second.
...
2025-02-05 20:35:06,208 launcher.py:51 __init__ INFO Launch: mlos_bench
...
2025-02-05 20:35:08,270 base_scheduler.py:275 get_best_observation INFO Env: Local Shell Test Environment best score: {'score': 123.4}
2025-02-05 20:35:08,271 run.py:74 _main INFO Final score: {'score': 123.4}
No runnable experiment found. Sleeping for 1 second.

Limitations:

Assumes all experiment-related config files are already available on the host
Does not support per-experiment global overrides yet

Description

Issue link: mlos_bench_service #732

Type of Change

Indicate the type of change by choosing one (or more) of the following:

✨ New feature

Testing

Manual testing, unit tests to come.

bpkroth · 2025-02-06T16:43:29Z

mlos_bench/mlos_bench/storage/base_storage.py

            self._opt_targets = opt_targets
+            self._ts_start = ts_start or datetime.now(UTC)
+            self._ts_end: datetime | None = None
+            self._status = Status.PENDING


This should match what was stored in the backend for resumable Experiments, right?

bpkroth · 2025-02-06T16:45:11Z

mlos_bench/mlos_bench/storage/base_storage.py


            This method is called by `Storage.Experiment.__enter__()`.
            """
+            self._status = Status.RUNNING


Maybe add some asserts on expected status to check for invalid state transitions.
Wouldn't be terrible to document the expected state transitions in a README.md or docstring either.
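
To make the suggestion concrete, here is a rough sketch of a transition check. The `Status` enum below is a hypothetical, reduced stand-in for the real one, and the allowed transitions (including FAILED -> RUNNING for resuming) are my assumption of what we want, not something the PR defines.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical subset of experiment states, for illustration only."""

    PENDING = 1
    RUNNING = 2
    SUCCEEDED = 3
    FAILED = 4


# Assumed transitions: PENDING -> RUNNING -> SUCCEEDED | FAILED,
# plus FAILED -> RUNNING so that a failed experiment can be resumed.
ALLOWED_TRANSITIONS = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.SUCCEEDED, Status.FAILED},
    Status.SUCCEEDED: set(),
    Status.FAILED: {Status.RUNNING},
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Invalid state transition: {current.name} -> {new.name}")
    return new
```

A plain `assert` would also work here; raising `ValueError` just keeps the check active when Python runs with `-O`. The table doubles as the documentation of expected transitions.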

bpkroth · 2025-02-06T16:50:22Z

mlos_bench/mlos_bench/storage/base_storage.py

            """
+            self._status = Status.RUNNING
+            self._driver_name = platform.node()
+            self._driver_pid = os.getpid()


Initializing these values from None seems OK, but overwriting them might be problematic when resuming an Experiment.

There are a few cases I can think of:

An individual mlos_bench driver process fails and needs to be restarted.

The mlos_benchd service dies, but the mlos_bench process is still running.

The whole driver host fails and either needs to be restarted on that same backend or else picked up by a new one. (For simplicity, let's assume we only support the former for now. We can add a heartbeat mechanism later to support the latter.)
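
For the first two cases, one possible check before overwriting the recorded driver is whether that driver is still alive on this host. This is only a sketch under assumptions: `driver_is_alive` is a hypothetical helper, it relies on POSIX semantics for `os.kill(pid, 0)` (do not use it on Windows, where `os.kill` can terminate the process), and it cannot detect PID reuse.

```python
import os
import platform
from typing import Optional


def driver_is_alive(driver_name: str, driver_pid: int) -> Optional[bool]:
    """Return True/False if the recorded driver ran on this host, None if it ran elsewhere."""
    if driver_name != platform.node():
        # A remote host cannot be checked without something like a heartbeat.
        return None
    try:
        os.kill(driver_pid, 0)  # POSIX: signal 0 only checks that the process exists.
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True
```

A `None` result corresponds to the third case (a different host), which we are not trying to support yet.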

bpkroth · 2025-02-06T16:51:39Z

mlos_bench/mlos_bench/storage/sql/experiment.py


-    def _setup(self) -> None:
-        super()._setup()
+    def _ensure_persisted(self) -> None:


Suggested change

-    def _ensure_persisted(self) -> None:
+    def _save(self) -> None:

or

Suggested change

-    def _ensure_persisted(self) -> None:
+    def _try_save(self) -> None:

or

Suggested change

-    def _ensure_persisted(self) -> None:
+    def _persist(self) -> None:

bpkroth · 2025-02-06T16:52:59Z

mlos_bench/mlos_bench/storage/sql/experiment.py

+    def _setup(self) -> None:
+        super()._setup()
+        self._ensure_persisted()
+        with self._engine.begin() as conn:


Might need to separate that out to an _update method, or just incorporate it into _save

Could also rename _save to _create to separate the INSERT from UPDATE calls
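
Roughly what I mean by splitting the INSERT from the UPDATE, as a standalone sketch. It uses the standard-library `sqlite3` module instead of SQLAlchemy to stay self-contained, and the `experiment` table, its columns, and both function names are hypothetical, not the real schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical experiment table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (exp_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    return conn


def create_experiment(conn: sqlite3.Connection, exp_id: str) -> bool:
    """INSERT a new PENDING row; return False if the exp_id already exists."""
    try:
        with conn:  # Commits on success, rolls back if an exception is raised.
            conn.execute(
                "INSERT INTO experiment (exp_id, status) VALUES (?, 'PENDING')",
                (exp_id,),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key conflict: the experiment was already created.
        return False


def update_status(conn: sqlite3.Connection, exp_id: str, status: str) -> int:
    """UPDATE the status of an existing row; return the number of rows changed."""
    with conn:
        cur = conn.execute(
            "UPDATE experiment SET status = ? WHERE exp_id = ?",
            (status, exp_id),
        )
    return cur.rowcount
```

Keeping the two paths separate means a conflict on create is distinguishable from other database errors, and `_setup` only ever needs the update path.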

bpkroth · 2025-02-06T16:54:37Z

mlos_bench/mlos_bench/storage/sql/storage.py

+                except exc.SQLAlchemyError:
+                    # probably a conflict
+                    trans.rollback()
+


Suggested change

bpkroth · 2025-02-06T16:57:52Z

mlos_bench/mlos_bench/mlos_benchd.py

+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="mlos_benchd")


Add a long text description, similar to mlos_bench.launcher
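
Something along these lines, as a sketch only. The description text and the `make_parser` name are placeholders I made up, not wording taken from mlos_bench.launcher.

```python
import argparse

# Placeholder text; the real description would be written for mlos_benchd.
LONG_DESCRIPTION = """\
mlos_benchd: background execution daemon for mlos_bench.

Polls the storage backend for runnable experiments and executes them.
"""


def make_parser() -> argparse.ArgumentParser:
    """Build a parser whose --help output keeps the multi-line description as written."""
    return argparse.ArgumentParser(
        prog="mlos_benchd",
        description=LONG_DESCRIPTION,
        # Preserve line breaks in the description instead of re-wrapping it.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
```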

bpkroth · 2025-02-06T16:58:45Z

mlos_bench/mlos_bench/mlos_benchd.py

+        help="Number of workers to use. Default is 1.",
+    )
+    parser.add_argument(
+        "--poll_interval",


Also provide a hyphenated --poll-interval variant of this option, like mlos_bench.launcher does.
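
With argparse this is just a second option string on the same argument. A sketch, assuming the value should parse as a float (the diff only shows `default=1`) and using a hypothetical `make_poll_parser` wrapper:

```python
import argparse


def make_poll_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts both spellings of the polling interval option."""
    parser = argparse.ArgumentParser(prog="mlos_benchd")
    parser.add_argument(
        "--poll_interval",
        "--poll-interval",  # hyphenated alias for the same option
        dest="poll_interval",
        type=float,  # assumption: the PR may use a different type
        default=1,
        help="Polling interval in seconds. Default is 1.",
    )
    return parser
```

Both `--poll_interval 5` and `--poll-interval 5` then end up in `args.poll_interval`.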

bpkroth · 2025-02-06T17:00:41Z

mlos_bench/mlos_bench/mlos_benchd.py

+                    "--environment",
+                    root_env_config,
+                    "--experiment_id",
+                    exp_id,


We'll eventually need to include other things here too:

cli config

globals (could be subsumed by cli config)

working dir (maybe we adjust the storage backend to store the target directory and cli-config and optionally globals instead of root_env_config)
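
As a sketch of where this could go, the command could be assembled by a small helper. Everything beyond `--environment` and `--experiment_id` is an assumption here: the `mlos_bench` executable name, the `--config` and `--globals` flags, and the `build_command` helper itself are illustrative and should be checked against the launcher.

```python
from typing import List, Optional, Sequence


def build_command(
    root_env_config: str,
    exp_id: str,
    cli_config: Optional[str] = None,
    globals_configs: Sequence[str] = (),
) -> List[str]:
    """Assemble the argument list for one experiment run."""
    # Assumed executable name; the PR may invoke the launcher differently.
    cmd = ["mlos_bench", "--environment", root_env_config, "--experiment_id", exp_id]
    if cli_config:
        cmd += ["--config", cli_config]  # assumed flag for the cli config
    if globals_configs:
        cmd += ["--globals", *globals_configs]  # assumed flag for global overrides
    return cmd
```

The working directory would not need a flag at all; it could be passed as `cwd=` to `subprocess.run` when the command is launched.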

bpkroth

This is a great start! Thanks so much ❤️

I left a few comments for initial changes.

We'll also need tests.

bpkroth · 2025-02-06T17:02:46Z

mlos_bench/mlos_bench/mlos_benchd.py

+This script is responsible for polling the storage for runnable experiments and
+executing them in parallel.
+
+See the current ``--help`` `output for details.


Suggested change

-See the current ``--help`` `output for details.
+See the current ``--help`` output for details.

bpkroth · 2025-02-06T17:03:46Z

mlos_bench/mlos_bench/mlos_benchd.py

+"""
+mlos_bench background execution daemon.
+
+This script is responsible for polling the storage for runnable experiments and


Suggested change

-This script is responsible for polling the storage for runnable experiments and
+This script is responsible for polling the :py:mod:`~mlos_bench.storage` for runnable :py:class:`.Experiment`s and

A minor tweak to help the generated docs cross-reference these names. It might need some adjusting.

bpkroth · 2025-02-06T17:07:32Z

mlos_bench/mlos_bench/mlos_benchd.py

+        default=1,
+        help="Polling interval in seconds. Default is 1.",
+    )
+    _main(parser.parse_args())


For testing, may also want some sort of hidden argument or environment variable used to set the MAX_ITERATIONS.
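
One way this could look, as a sketch: a hidden CLI flag with an environment-variable fallback. The `--max_iterations` flag, the `MLOS_BENCHD_MAX_ITERATIONS` variable, and the `parse_max_iterations` helper are all hypothetical names.

```python
import argparse
import os
from typing import Optional, Sequence


def parse_max_iterations(argv: Sequence[str]) -> Optional[int]:
    """Read the iteration limit from a hidden flag, falling back to an environment variable."""
    parser = argparse.ArgumentParser(prog="mlos_benchd")
    parser.add_argument(
        "--max_iterations",
        type=int,
        default=None,
        help=argparse.SUPPRESS,  # keeps the option out of the --help output
    )
    args = parser.parse_args(argv)
    if args.max_iterations is not None:
        return args.max_iterations
    # Hypothetical environment variable, consulted only when the flag is absent.
    env_value = os.environ.get("MLOS_BENCHD_MAX_ITERATIONS")
    return int(env_value) if env_value else None
```

A return value of `None` would mean "no limit", so normal daemon behaviour is unchanged unless a test sets one of the two.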

Eu Jing Chua added 2 commits February 5, 2025 22:57

Initial changes

94d7e03

Update cli args

336fa91
