Internal Server Error encountered during training process #239

# Issue description
- issue description: We hit an "Internal Server Error" during joint training across 3 nodes. Training completed 21 epochs and around 20 merge rounds successfully before the error appeared.
- occurrence - consistent or rare: consistent
- error messages: 
2024-02-24 23:04:51,605 : SwarmCallback : INFO : Starting Swarm merging round ...
2024-02-25 04:16:08,067 : SwarmCallback : ERROR : Sync Swarm call to SL container failed - SL error: (500)
Reason: INTERNAL SERVER ERROR
HTTP response headers: HTTPHeaderDict({'Server': 'TwistedWeb/21.7.0', 'Date': 'Sun, 25 Feb 2024 04:16:04 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '251'})
HTTP response body: {
  "detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
  "status": 500,
  "title": "Internal Server Error",
  "type": "about:blank"
}

- docker logs [APLS, SPIRE, SN, SL, SWCI]: 
- commands used for starting containers: 
# Run the SN container
sudo "$script_dir/../../swarm_learning_scripts/run-sn" \
     -d --rm \
     --name=sn_node \
     --network=host-net \
     --host-ip="$ip_addr" \
     "$sn_command" \
     --sn-p2p-port=30303 \
     --sn-api-port=30304 \
     --key=cert/sn-"$host_index"-key.pem \
     --cert=cert/sn-"$host_index"-cert.pem \
     --capath=cert/ca/capath \
     --apls-ip="$sentinel"

# Run the SWOP container
sudo "$script_dir/../../swarm_learning_scripts/run-swop" --rm -d \
  --name=swop"$ip_addr" \
  --network=host-net \
  --sn-ip="$sentinel" \
  --sn-api-port=30304 \
  --usr-dir=workspace/"$workspace"/swop \
  --profile-file-name=swop_profile_"$ip_addr".yaml \
  --key=cert/swop-"$host_index"-key.pem \
  --cert=cert/swop-"$host_index"-cert.pem \
  --capath=cert/ca/capath \
  -e http_proxy= -e https_proxy= \
  --apls-ip="$sentinel" \
  -e SWOP_KEEP_CONTAINERS=True

# Start the SWCI container
sudo "$script_dir/../../swarm_learning_scripts/run-swci" \
  -d --rm --name="swci-$ip_addr" \
  --network="host-net" \
  --usr-dir="workspace/$workspace/swci" \
  --init-script-name="swci-init" \
  --key="cert/swci-$host_index-key.pem" \
  --cert="cert/swci-$host_index-cert.pem" \
  --capath="cert/ca/capath" \
  -e "http_proxy=" -e "https_proxy=" --apls-ip="$sentinel" \
  -e "SWCI_RUN_TASK_MAX_WAIT_TIME=5000" \
  -e "SWCI_GENERIC_TASK_MAX_WAIT_TIME=5000"
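
Going back to the error log in the issue description: the merge round starts at 23:04:51 and the 500 response is only logged at 04:16:08 the next day. The sketch below works out that gap from the two `SwarmCallback` timestamps. It assumes GNU `date` (as shipped with Ubuntu) and treats both timestamps as being in the same time zone; the variable names are only illustrative.

```shell
# Timestamps copied from the two SwarmCallback log lines (milliseconds dropped).
start_ts="2024-02-24 23:04:51"
end_ts="2024-02-25 04:16:08"

# Convert to epoch seconds; -u avoids any local DST adjustment.
start_s=$(date -u -d "$start_ts" +%s)
end_s=$(date -u -d "$end_ts" +%s)

# Elapsed time between "Starting Swarm merging round" and the error.
elapsed=$(( end_s - start_s ))
printf '%dh %dm %ds\n' $(( elapsed / 3600 )) $(( elapsed % 3600 / 60 )) $(( elapsed % 60 ))
# prints "5h 11m 17s"
```

So the failing sync call was pending for a little over five hours before the SL container answered with the 500.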

# Swarm Learning Version:
- Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
2.2.0

# OS and ML Platform
- details of host OS: Ubuntu 22.04.4 LTS
- details of ML platform used: Quadro RTX 6000
- details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): 
3 machines, all of them running SN and SWCI nodes. We are hosting the SWCI node.

# Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses? /
- If Multiple systems are used, can each system access every other system? Yes
- Is Password-less SSH configuration setup for all the systems? 
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them?
- Is the user id a member of the docker group?
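
For the docker group item in the checklist, a quick way to check is to look for the group in the output of `id -nG`. This is a minimal sketch; `has_group` is a hypothetical helper name, not part of the Swarm Learning scripts.

```shell
# Succeeds if the current user belongs to the group named in $1.
has_group() {
    # id -nG prints the user's group names separated by spaces;
    # put one per line and look for an exact match.
    id -nG | tr ' ' '\n' | grep -qxF -- "$1"
}

if has_group docker; then
    echo "current user is in the docker group"
else
    echo "current user is NOT in the docker group"
fi
```

Note that the launch commands above are run with `sudo`, so this item may matter less for that setup.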

# Additional notes
- Are you running documented example without any modification? Yes
- Add any additional information about use case or any notes which supports for issue investigation: 

## NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.
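
One possible way to build such an archive is to dump `docker logs` for every container on a host into a directory and tar it up. This is only a sketch: `collect_logs` and the `./swarm-logs` directory are hypothetical names, and since the containers above are started with `--rm`, it would need to run while they are still up (removed containers no longer have logs).

```shell
# Write the logs of every container on this host to $1/<name>.log,
# then pack the directory into $1.tar.gz.
collect_logs() {
    out_dir="$1"
    mkdir -p "$out_dir"

    # List container names (running and stopped), one per line.
    docker ps -a --format '{{.Names}}' | while read -r name; do
        # docker logs writes to both stdout and stderr; capture both.
        docker logs "$name" > "$out_dir/$name.log" 2>&1 < /dev/null
    done

    tar -czf "$out_dir.tar.gz" -C "$(dirname "$out_dir")" "$(basename "$out_dir")"
}
```

Running `collect_logs ./swarm-logs` on each of the 3 machines would produce a `swarm-logs.tar.gz` per host that can be attached to the issue.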

