Description
Filing this issue based on reports from @msilvafe and @aashrita because I have a proposed fix.
Since we split the OCS agent docker containers into a separate compose file (following the implementation of #451), jackhammer no longer interacts with the OCS agents directly. As a result, `jackhammer hammer` will hang until the pysmurf-controller agents are restarted manually by the user via Host Manager.
This was first reported by @msilvafe on June 5th on satp1, and @aashrita provided output from the same issue occurring on satp3 on June 8th. Running the following command, it hung at the point shown below until the agents were restarted via Host Manager:
```
cryo@smurf-so10-satp3:~/docker/pcie/v2.1.1$ jackhammer hammer 4
You are hard-resetting slots [4]. Are you sure (y/n)? y
Dumping docker logs to /data/logs/17494/1749402654
Saving 'docker ps' to /data/logs/17494/1749402654/docker_state.log
Saving ocs-det-controller-c2s5 logs to /data/logs/17494/1749402654/ocs-det-controller-c2s5.log
Saving ocs-det-controller-c2s4 logs to /data/logs/17494/1749402654/ocs-det-controller-c2s4.log
Saving ocs-det-crate-2 logs to /data/logs/17494/1749402654/ocs-det-crate-2.log
Saving ocs-daq-sync-smurf-so10 logs to /data/logs/17494/1749402654/ocs-daq-sync-smurf-so10.log
Saving ocs-det-monitor-so10 logs to /data/logs/17494/1749402654/ocs-det-monitor-so10.log
```
Proposed Solution
I think the best solution here would be for jackhammer to resume restarting any required OCS agents, but via the Host Manager rather than by using docker compose directly. This may require adding the Host Manager instance-id to sys_config.yaml, or perhaps just assuming there is only one Host Manager (which there should be now, in all cases on site).
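To make the idea concrete, here is a minimal sketch of what the jackhammer side could look like. The `host_manager_id` key in sys_config.yaml, the fallback naming convention, and the exact shape of the Host Manager `update` call are all assumptions for illustration, not confirmed API; only `ocs.ocs_client.OCSClient` itself is the real client entry point.

```python
def get_host_manager_id(sys_config, host):
    """Look up the Host Manager instance-id for a host from a parsed
    sys_config.yaml dict. The 'host_manager_id' key is hypothetical;
    fall back to a conventional 'hm-<host>' name if it is absent."""
    host_cfg = sys_config.get("hosts", {}).get(host, {})
    return host_cfg.get("host_manager_id", f"hm-{host}")


def restart_agents(host_manager_id, agent_ids):
    """Ask the Host Manager to bounce the given agents instead of
    touching docker compose. The 'update' task and its 'requests'
    parameter are assumed here; check the OCS HostManager docs for
    the actual signature."""
    from ocs.ocs_client import OCSClient  # needs a running OCS network

    hm = OCSClient(host_manager_id)
    hm.update(requests=[(agent, "down") for agent in agent_ids])
    hm.update(requests=[(agent, "up") for agent in agent_ids])


if __name__ == "__main__":
    import yaml

    with open("sys_config.yaml") as f:
        cfg = yaml.safe_load(f)
    hm_id = get_host_manager_id(cfg, "smurf-so10-satp3")
    restart_agents(hm_id, ["ocs-det-controller-c2s4", "ocs-det-controller-c2s5"])
```

If we assume a single Host Manager per smurf-server, the `get_host_manager_id` lookup collapses to a constant and no sys_config.yaml change is needed at all.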