Skip to content

[AutoTuner] Restore resume check #3070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

luarss
Copy link
Contributor

@luarss luarss commented Apr 13, 2025

This pull request includes several changes to the AutoTuner tests to re-enable previously disabled tests, improve the test setup, and add new utility methods.

Fixes #3005

Re-enabling tests and improving setup:

Codebase simplification and improvements:

@luarss luarss added the autotuner Flow autotuner label Apr 13, 2025
@luarss luarss requested a review from vvbandeira April 14, 2025 01:00
@luarss luarss force-pushed the topic/resume-unit-test branch from 99ea816 to 8e18ff7 Compare April 17, 2025 13:07
If no folders are found, returns a default value of 9e99.
"""
if iteration < 0 or iteration >= self.iterations:
raise ValueError("Iteration must be between 0 and iterations - 1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError("Iteration must be between 0 and iterations - 1")
raise ValueError("Iteration must be between 0 and (iterations - 1)")

Comment on lines 108 to 109
folders = glob.glob(os.path.join(experiment_dir, f"variant-*-or-{iteration}"))
return max((os.path.getmtime(folder) for folder in folders), default=9e99)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little obfuscated:

  1. The function's name does not match the expected return value type/format. A check should return True/False.
  2. Returning 9e99 is not clear on what is going on.
  3. Creating a folder does not confirm the run status. For this test to be true to its purpose, we need to guarantee that it has not finished, not just started.

We should consider using a get_experiment_status function to check if all iterations have finished running; this function would return at least "RUNNING" and "FINISHED", other states might be helpful but are not required now.

Copy link
Contributor Author

@luarss luarss Apr 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The function's name does not match the expected return value type/format. A check should return True/False.

Can possibly change it to get_trial_times

  1. Returning 9e99 is not clear on what is going on.

It is just a dummy value, to compare for latest modified time in while loop line 121-128

            # Check if first config is complete
            while True:
                cur_modified_time = self.check_trial_times()
                print(f"Current modified time: {cur_modified_time}")
                print(f"Latest modified time: {latest_modified_time}")
                if abs(cur_modified_time - latest_modified_time) < 1e-3:
                    break
                latest_modified_time = cur_modified_time
                time.sleep(10)
  1. Creating a folder does not confirm the run status. For this test to be true to its purpose, we need to guarantee that it has not finished, not just started.

This function returns the latest modified time of a given iteration (folder names are matched using iteration glob) - so if a run is completed the folder should no longer be modified.

4, We should consider using a get_experiment_status function to check if all iterations have finished running; this function would return at least "RUNNING" and "FINISHED", other states might be helpful but are not required now.

get_experiment_status is helpful in general, but might not be too useful in resume_check because we need to check iteration completion, as opposed to experiment completion (or all_iterations completion)

@luarss luarss closed this Apr 22, 2025
@luarss luarss reopened this Apr 22, 2025
luarss added 4 commits May 7, 2025 16:31
Signed-off-by: Jack Luar <jluar@precisioninno.com>
…ror handling

Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
@luarss luarss force-pushed the topic/resume-unit-test branch from 8e18ff7 to cf5ed3d Compare May 7, 2025 16:33
luarss added 2 commits May 9, 2025 16:55
Signed-off-by: Jack Luar <jluar@precisioninno.com>
Signed-off-by: Jack Luar <jluar@precisioninno.com>
@luarss luarss requested a review from vvbandeira May 15, 2025 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autotuner Flow autotuner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Re-enable AutoTuner ResumeCheck tests
2 participants