[AutoTuner] Restore resume check #3070

luarss · 2025-04-13T10:43:21Z

This pull request includes several changes to the AutoTuner tests to re-enable previously disabled tests, improve the test setup, and add new utility methods.

Fixes #3005

Re-enabling tests and improving setup:

flow/test/test_autotuner.sh: Re-enabled the test_tune_resume test in the AutoTuner script by uncommenting the test execution line.

Codebase simplification and improvements:

tools/AutoTuner/test/resume_check.py: Removed unnecessary directory changes and imported the glob module to facilitate file pattern matching.
tools/AutoTuner/test/resume_check.py: Introduced the check_trial_times method to check the modification times of trial iterations, improving the robustness of the test_tune_resume test.

vvbandeira · 2025-04-21T16:25:29Z

tools/AutoTuner/test/resume_check.py

+                 If no folders are found, returns a default value of 9e99.
+        """
+        if iteration < 0 or iteration >= self.iterations:
+            raise ValueError("Iteration must be between 0 and iterations - 1")


Suggested change

-            raise ValueError("Iteration must be between 0 and iterations - 1")
+            raise ValueError("Iteration must be between 0 and (iterations - 1)")

vvbandeira · 2025-04-21T16:35:52Z

tools/AutoTuner/test/resume_check.py

+        folders = glob.glob(os.path.join(experiment_dir, f"variant-*-or-{iteration}"))
+        return max((os.path.getmtime(folder) for folder in folders), default=9e99)


This is a little obfuscated:

1. The function's name does not match the expected return value type/format. A check should return True/False.

2. Returning 9e99 is not clear on what is going on.

3. Creating a folder does not confirm the run status. For this test to be true to its purpose, we need to guarantee that it has not finished, not just started.

4. We should consider using a get_experiment_status function to check if all iterations have finished running; this function would return at least "RUNNING" and "FINISHED", other states might be helpful but are not required now.
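To make the get_experiment_status suggestion concrete, here is a minimal sketch, not code from this PR. It reuses the folder pattern from the diff above and assumes, purely for illustration, that a trial counts as finished once its folder has not been modified for quiet_seconds. The ExperimentStatus enum and the quiet_seconds and now parameters are hypothetical names. Note that a directory's modification time only changes when entries directly inside it are created, removed, or renamed, so this heuristic may need a different signal in practice.

```python
import glob
import os
import time
from enum import Enum


class ExperimentStatus(Enum):
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"


def get_experiment_status(experiment_dir, iterations, quiet_seconds=30.0, now=None):
    """Return FINISHED only if every iteration has a folder that has been quiet.

    Assumption (not from the PR): "quiet for quiet_seconds" stands in for
    "finished". `now` can be injected to make the function testable.
    """
    now = time.time() if now is None else now
    for iteration in range(iterations):
        pattern = os.path.join(experiment_dir, f"variant-*-or-{iteration}")
        folders = glob.glob(pattern)
        if not folders:
            # This iteration has not produced a folder yet.
            return ExperimentStatus.RUNNING
        newest = max(os.path.getmtime(folder) for folder in folders)
        if now - newest < quiet_seconds:
            # Something touched this iteration's folder recently.
            return ExperimentStatus.RUNNING
    return ExperimentStatus.FINISHED
```

A caller would compare the result against ExperimentStatus.FINISHED rather than against a raw timestamp, which also addresses the naming concern in point 1.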

luarss · 2025-04-21

> The function's name does not match the expected return value type/format. A check should return True/False.

We could rename it to get_trial_times.

> Returning 9e99 is not clear on what is going on.

It is just a dummy value, used when comparing against the latest modified time in the while loop at lines 121-128:

    # Check if first config is complete
    while True:
        cur_modified_time = self.check_trial_times()
        print(f"Current modified time: {cur_modified_time}")
        print(f"Latest modified time: {latest_modified_time}")
        if abs(cur_modified_time - latest_modified_time) < 1e-3:
            break
        latest_modified_time = cur_modified_time
        time.sleep(10)
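The loop above waits until two consecutive readings of the modification time agree. A standalone sketch of the same idea, with a timeout added so a stuck run cannot block the test forever, might look like the following. The function name wait_until_stable and its parameters are hypothetical and not part of resume_check.py.

```python
import time
from typing import Callable


def wait_until_stable(
    get_mtime: Callable[[], float],
    interval: float = 10.0,
    tolerance: float = 1e-3,
    timeout: float = 600.0,
) -> float:
    """Poll get_mtime() until two consecutive readings differ by < tolerance.

    Raises TimeoutError if the readings keep changing past the deadline.
    """
    deadline = time.monotonic() + timeout
    previous = get_mtime()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = get_mtime()
        if abs(current - previous) < tolerance:
            return current
        previous = current
    raise TimeoutError("modification time did not stabilize before the timeout")
```

With the defaults this mirrors the 10-second sleep and 1e-3 tolerance of the loop above; passing the reading function as an argument keeps the polling logic testable without touching the filesystem.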

> Creating a folder does not confirm the run status. For this test to be true to its purpose, we need to guarantee that it has not finished, not just started.

This function returns the latest modification time for a given iteration (folders are matched with a glob on the iteration number), so once a run has completed, its folder should no longer be modified.

> We should consider using a get_experiment_status function to check if all iterations have finished running; this function would return at least "RUNNING" and "FINISHED", other states might be helpful but are not required now.

get_experiment_status would be helpful in general, but it might not be very useful in resume_check, because we need to check the completion of a single iteration rather than of the whole experiment (that is, of all iterations).
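Two points from this thread, the opaque 9e99 default and the need to inspect a single iteration, can be combined in one per-iteration helper. The sketch below is an illustration under assumptions, not the code merged in this PR: it uses the proposed name get_trial_times, takes experiment_dir and iterations as plain arguments instead of reading them from self, and returns None when no folder matches. With a numeric default such as 9e99, two consecutive polls of a still-empty directory compare equal, so a stabilization loop like the one quoted earlier could not distinguish "nothing has started" from "finished".

```python
import glob
import os
from typing import Optional


def get_trial_times(experiment_dir: str, iteration: int, iterations: int) -> Optional[float]:
    """Return the newest modification time among one iteration's trial folders.

    Returns None (instead of a sentinel such as 9e99) when no folder for
    this iteration exists yet, so callers must handle that case explicitly.
    """
    if iteration < 0 or iteration >= iterations:
        raise ValueError("Iteration must be between 0 and (iterations - 1)")
    pattern = os.path.join(experiment_dir, f"variant-*-or-{iteration}")
    folders = glob.glob(pattern)
    return max((os.path.getmtime(folder) for folder in folders), default=None)
```

A caller would then keep polling while the result is None and only start comparing timestamps once a real value is available.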

luarss added the autotuner label Apr 13, 2025

luarss requested a review from vvbandeira April 14, 2025 01:00

luarss force-pushed the topic/resume-unit-test branch from 99ea816 to 8e18ff7 April 17, 2025 13:07

vvbandeira requested changes Apr 21, 2025

luarss closed this Apr 22, 2025

luarss reopened this Apr 22, 2025

luarss added 4 commits May 7, 2025 16:31

restore resume check using last modified filetime

75f399c

Signed-off-by: Jack Luar <jluar@precisioninno.com>

refactor resume check: rename exec variable and improve subprocess error handling

0dab7e3

Signed-off-by: Jack Luar <jluar@precisioninno.com>

make error clearer

3009894

Signed-off-by: Jack Luar <jluar@precisioninno.com>

clarify function name

cf5ed3d

Signed-off-by: Jack Luar <jluar@precisioninno.com>

luarss force-pushed the topic/resume-unit-test branch from 8e18ff7 to cf5ed3d May 7, 2025 16:33

luarss added 2 commits May 9, 2025 16:55

fix function call

ab31e29

Signed-off-by: Jack Luar <jluar@precisioninno.com>

revert list comprehension into for loop for better readability

e8cdcfe

Signed-off-by: Jack Luar <jluar@precisioninno.com>

luarss requested a review from vvbandeira May 15, 2025 12:53
