AutoTuner failing in distributed mode run #449

Open
vijayank88 opened this issue May 9, 2022 · 1 comment · May be fixed by #2858
Labels
autotuner Flow autotuner

Comments

@vijayank88 (Collaborator)

Describe the bug
I have tried the AutoTuner feature locally on a single machine, and it works fine.

With a recent update, I tried to run AutoTuner in distributed mode using the following command:
python3.7 distributed.py --design fuserisc_v1 --platform sky130hd --config ../designs/sky130hd/fuserisc_v1/autotuner.json --jobs 2000 --server localhost tune --samples 200
But the flow failed to complete:

Log:

(run pid=825) ... 180 more trials not shown (180 TERMINATED)
(run pid=825) 
(run pid=825) 
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2022-05-03 13:13:37,973	WARNING dataclient.py:221 -- Encountered connection issues in the data channel. Attempting to reconnect.
2022-05-03 13:14:08,189	WARNING dataclient.py:226 -- Failed to reconnect the data channel
Traceback (most recent call last):
  File "distributed.py", line 947, in <module>
    analysis = tune.run(TrainClass, **tune_args)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/tune/tune.py", line 363, in run
    while ray.wait([remote_future], timeout=0.2)[1]:
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/api.py", line 61, in wait
    return self.worker.wait(*args, **kwargs)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 435, in wait
    resp = self._call_stub("WaitObject", req, metadata=self.metadata)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 291, in _call_stub
    raise ConnectionError("Client is shutting down.")
ConnectionError: Client is shutting down.

Expected behavior
The flow should complete successfully in distributed mode.

@dralabeing FYI

@luarss (Contributor)

luarss commented Mar 29, 2024

@vijayank88 Is this still an issue? If so, is it possible to share the necessary files for reproduction?

Edit: After trying it out, it appears that the issue is the --server localhost switch. Ray only requires the --server and --port switches when connecting to an existing Ray Cluster [1]; for a local run they should be omitted.

Correct usage:

python3 distributed.py --design fuserisc_v1 --platform sky130hd --config ../designs/sky130hd/fuserisc_v1/autotuner.json --jobs 2000 tune --samples 200

[1] https://docs.ray.io/en/latest/cluster/key-concepts.html#ray-cluster
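To illustrate the distinction above, here is a minimal sketch (hypothetical, not the actual distributed.py code) of how a script might decide between local Ray execution and connecting to a Ray Cluster via the Ray Client, based on whether --server was supplied. The function name and defaults are assumptions for illustration only:

```python
import argparse

def ray_init_kwargs(argv):
    """Return the kwargs such a script might pass to ray.init().

    If --server (and optionally --port) is given, connect to an existing
    Ray Cluster through the Ray Client using a ray://<host>:<port> address;
    otherwise return no kwargs, so ray.init() starts a local Ray instance.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--server", default=None)
    parser.add_argument("--port", type=int, default=10001)
    args, _ = parser.parse_known_args(argv)
    if args.server is None:
        # No --server: plain ray.init(), runs everything on this machine.
        return {}
    # --server given: build a Ray Client address for the remote head node.
    return {"address": f"ray://{args.server}:{args.port}"}

print(ray_init_kwargs([]))
print(ray_init_kwargs(["--server", "localhost"]))
```

Passing --server localhost without a Ray head actually listening on that address would make the client connection fail, which matches the "Client is shutting down" traceback above.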

@vvbandeira vvbandeira removed their assignment Jun 28, 2024
@luarss luarss linked a pull request Feb 16, 2025 that will close this issue