Issue description
- issue description: We observed inconsistency in the final models trained by Swarm Learning. Two nodes are involved in the swarm. However, whether we pick the last or the best checkpoint for prediction, the two nodes' results differ significantly from each other.
- occurrence - consistent or rare: Consistent
- error messages: None
- commands used for starting containers:
- docker logs [APLS, SPIRE, SN, SL, SWCI]:
swop_u.log
swci_u.log
sn_u.log
sl_u.log
ml_u.log
Python scripts used to reproduce this problem:
base_model.txt
main.txt
Swarm Learning Version:
2.2.0
- Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
2.2.0
OS and ML Platform
- details of host OS:
- details of ML platform used: PyTorch Lightning
- details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes):
2 machines, 2 SL-ML node pairs
Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses?
- If Multiple systems are used, can each system access every other system?
- Is Password-less SSH configuration setup for all the systems?
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them?
- Is the user id a member of the docker group?
Additional notes
- Are you running documented example without any modification?
- Add any additional information about the use case, or any notes which support the issue investigation:
NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.
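To make the reported inconsistency concrete for investigation, it can help to quantify how far the two nodes' final checkpoints have drifted apart parameter-by-parameter. Below is a minimal sketch, assuming each checkpoint's state dict has been flattened to plain `{param_name: list_of_floats}` dicts (with PyTorch Lightning, `torch.load(path)["state_dict"]` would supply real tensors instead); the names `ckpt_node1` and `ckpt_node2` are hypothetical:

```python
# Sketch: quantify divergence between two nodes' checkpoints by
# comparing parameters element-wise. Assumes state dicts have been
# flattened to {param_name: list_of_floats} for illustration.

def max_param_divergence(state_a, state_b):
    """Return {param_name: max abs element-wise difference}."""
    assert state_a.keys() == state_b.keys(), "checkpoints must share layer names"
    return {
        name: max(abs(x - y) for x, y in zip(state_a[name], state_b[name]))
        for name in state_a
    }

# Hypothetical tiny "checkpoints" from node 1 and node 2:
ckpt_node1 = {"fc.weight": [0.10, -0.20, 0.30], "fc.bias": [0.05]}
ckpt_node2 = {"fc.weight": [0.10, -0.25, 0.30], "fc.bias": [0.05]}

diffs = max_param_divergence(ckpt_node1, ckpt_node2)
for name, d in diffs.items():
    print(f"{name}: max |diff| = {d:.6f}")
```

If the swarm merge step completed on both nodes, the final merged weights should be near-identical (differences on the order of floating-point noise); large per-layer differences would suggest the nodes continued training or checkpointed after the last synchronization point.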