AutoTuner failing in distributed mode run #449

Open
vijayank88 opened this issue May 9, 2022 · 1 comment · May be fixed by #2858
Labels
autotuner Flow autotuner

Comments

@vijayank88 (Collaborator)

Describe the bug
I have tried the AutoTuner feature locally on a single machine, and it works fine.

With a recent update, I tried to run AutoTuner in distributed mode using the following command:
python3.7 distributed.py --design fuserisc_v1 --platform sky130hd --config ../designs/sky130hd/fuserisc_v1/autotuner.json --jobs 2000 --server localhost tune --samples 200
But the flow failed to complete:

Log:

(run pid=825) ... 180 more trials not shown (180 TERMINATED)
(run pid=825) 
(run pid=825) 
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2022-05-03 13:13:37,973	WARNING dataclient.py:221 -- Encountered connection issues in the data channel. Attempting to reconnect.
2022-05-03 13:14:08,189	WARNING dataclient.py:226 -- Failed to reconnect the data channel
Traceback (most recent call last):
  File "distributed.py", line 947, in <module>
    analysis = tune.run(TrainClass, **tune_args)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/tune/tune.py", line 363, in run
    while ray.wait([remote_future], timeout=0.2)[1]:
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/api.py", line 61, in wait
    return self.worker.wait(*args, **kwargs)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 435, in wait
    resp = self._call_stub("WaitObject", req, metadata=self.metadata)
  File "/home/vijayan/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 291, in _call_stub
    raise ConnectionError("Client is shutting down.")
ConnectionError: Client is shutting down.

Expected behavior
The flow should complete successfully in distributed mode.

@dralabeing FYI

@luarss (Contributor)

luarss commented Mar 29, 2024

@vijayank88 Is this still an issue? If so, is it possible to share the necessary files for reproduction?

Edit: After trying it out, it appears that the issue is the --server localhost switch. Ray only requires the --server and --port switches when connecting to an existing Ray Cluster [1]; for a local run they should be omitted.

Correct usage:

python3 distributed.py --design fuserisc_v1 --platform sky130hd --config ../designs/sky130hd/fuserisc_v1/autotuner.json --jobs 2000 tune --samples 200

[1] https://docs.ray.io/en/latest/cluster/key-concepts.html#ray-cluster
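To illustrate the distinction above, here is a minimal sketch (hypothetical, not the actual distributed.py code) of how a script might decide between local Ray execution and connecting to a Ray Cluster via the Ray Client, based on whether --server was supplied. The function name and defaults are assumptions for illustration only:

```python
import argparse

def ray_init_kwargs(argv):
    """Return the kwargs such a script might pass to ray.init().

    If --server (and optionally --port) is given, connect to an existing
    Ray Cluster through the Ray Client using a ray://<host>:<port> address;
    otherwise return no kwargs, so ray.init() starts a local Ray instance.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--server", default=None)
    parser.add_argument("--port", type=int, default=10001)
    args, _ = parser.parse_known_args(argv)
    if args.server is None:
        # No --server: plain ray.init(), runs everything on this machine.
        return {}
    # --server given: build a Ray Client address for the remote head node.
    return {"address": f"ray://{args.server}:{args.port}"}

print(ray_init_kwargs([]))
print(ray_init_kwargs(["--server", "localhost"]))
```

Passing --server localhost without a Ray head actually listening on that address would make the client connection fail, which matches the "Client is shutting down" traceback above.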

@vvbandeira vvbandeira removed their assignment Jun 28, 2024
@luarss luarss linked a pull request Feb 16, 2025 that will close this issue