Description
Issue description
- issue description: We encountered an "Internal Server Error" during joint training across 3 nodes. Training had completed 21 epochs and about 20 merge rounds before the error appeared.
- occurrence - consistent or rare: consistent
- error messages:
2024-02-24 23:04:51,605 : SwarmCallback : INFO : Starting Swarm merging round ...
2024-02-25 04:16:08,067 : SwarmCallback : ERROR : Sync Swarm call to SL container failed - SL error: (500)
Reason: INTERNAL SERVER ERROR
HTTP response headers: HTTPHeaderDict({'Server': 'TwistedWeb/21.7.0', 'Date': 'Sun, 25 Feb 2024 04:16:04 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '251'})
HTTP response body: {
"detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
"status": 500,
"title": "Internal Server Error",
"type": "about:blank"
}
- commands used for starting containers:
- docker logs [APLS, SPIRE, SN, SL, SWCI]:
 
Run the SN container
sudo "$script_dir/../../swarm_learning_scripts/run-sn" \
  -d --rm \
  --name=sn_node \
  --network=host-net \
  --host-ip="$ip_addr" \
  "$sn_command" \
  --sn-p2p-port=30303 \
  --sn-api-port=30304 \
  --key=cert/sn-"$host_index"-key.pem \
  --cert=cert/sn-"$host_index"-cert.pem \
  --capath=cert/ca/capath \
  --apls-ip="$sentinel"
Run the SWOP container
sudo "$script_dir/../../swarm_learning_scripts/run-swop" --rm -d \
  --name=swop"$ip_addr" \
  --network=host-net \
  --sn-ip="$sentinel" \
  --sn-api-port=30304 \
  --usr-dir=workspace/"$workspace"/swop \
  --profile-file-name=swop_profile_"$ip_addr".yaml \
  --key=cert/swop-"$host_index"-key.pem \
  --cert=cert/swop-"$host_index"-cert.pem \
  --capath=cert/ca/capath \
  -e http_proxy= -e https_proxy= \
  --apls-ip="$sentinel" \
  -e SWOP_KEEP_CONTAINERS=True
Start the SWCI container
sudo "$script_dir/../../swarm_learning_scripts/run-swci" \
  -d --rm --name="swci-$ip_addr" \
  --network="host-net" \
  --usr-dir="workspace/$workspace/swci" \
  --init-script-name="swci-init" \
  --key="cert/swci-$host_index-key.pem" \
  --cert="cert/swci-$host_index-cert.pem" \
  --capath="cert/ca/capath" \
  -e "http_proxy=" -e "https_proxy=" --apls-ip="$sentinel" \
  -e "SWCI_RUN_TASK_MAX_WAIT_TIME=5000" \
  -e "SWCI_GENERIC_TASK_MAX_WAIT_TIME=5000"
Swarm Learning Version:
- Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
2.2.0 
OS and ML Platform
- details of host OS: Ubuntu 22.04.4 LTS
- details of ML platform used: Quadro RTX 6000
- details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): 3 machines, all of them running SN and SWCI nodes. We are hosting the SWCI node.
Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses? /
- If Multiple systems are used, can each system access every other system? Yes
- Is Password-less SSH configuration setup for all the systems?
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them?
- Is the user id a member of the docker group?
 
Additional notes
- Are you running documented example without any modification? Yes
- Add any additional information about use case or any notes that support issue investigation:
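
One detail in the error messages above may help the investigation: the merge round is logged as starting at 23:04:51 on 2024-02-24, and the 500 response is logged at 04:16:08 on 2024-02-25. The following is a minimal Python sketch that computes the gap between those two log lines. It uses only the two timestamps copied from the log; the variable names are illustrative and not part of Swarm Learning.

```python
from datetime import datetime

# Timestamp layout used by the SwarmCallback log lines quoted in this report
# (date, time, then milliseconds after a comma).
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Values copied verbatim from the two log lines in the error messages.
merge_started = datetime.strptime("2024-02-24 23:04:51,605", LOG_FORMAT)
sync_failed = datetime.strptime("2024-02-25 04:16:08,067", LOG_FORMAT)

# Time between the start of the merge round and the failed sync call.
elapsed = sync_failed - merge_started
print(elapsed)
```

The gap is a little over five hours, which may indicate that the sync call was blocked for a long time before the SL container returned the error, rather than failing immediately.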