Issue description
- issue description: Sentinel node fails to acquire a floating license from AutoPass License Server (APLS). The license server is reachable and the TLS handshake is successful, but the license entitlement for Swarm Learning is not being served. Logs show PDFile not found 50089_2.0_HPE-Swarm_2.0, indicating the .pd file is either missing, corrupted, or not registered.
- occurrence - consistent or rare: Consistent across all Sentinel launches.
- error messages:
  com.hp.autopassj.common.exception.CommunicationException: 6002 – Unable to connect to server. Server might be wrongly configured or down.
  APLS logs: PDFile not found 50089_2.0_HPE-Swarm_2.0
- commands used for starting containers:
  sudo /opt/hpe/swarm-learning/scripts/bin/run-sn \
    --host-ip sentinel-fixed \
    --sentinel \
    --sn-api-port 8443 \
    --cert $CERT_PATH/sentinel-signed.pem \
    --key $CERT_PATH/sentinel.key \
    --capath $CERT_PATH/capath \
    --network apls-net \
    -e SWARM_LICENSE_AUTOPASSJ_SERVER_PRIMARY_IP=$APLS_IP \
    -e JAVA_TOOL_OPTIONS="-DproxySet=false ..." \
    -e http_proxy="" -e https_proxy="" \
    -e no_proxy="localhost,127.0.0.1,apls-fixed,$APLS_IP" \
    -v $CERT_PATH:/trusted \
    -v /tmp/blockchain:/platform/swarm/SMLNODE
- docker logs [APLS, SPIRE, SN, SL, SWCI]:
  APLS:
  PDFile not found 50089_2.0_HPE-Swarm_2.0
  LicenseLockCodeHandler :: Valid LockCode value found C269D67-9081DBA
  LicenseServerLicenseManagement : getAutopassInstance() :: License file path is /usr/local/tomcat/webapps/autopass/WEB-INF/classes/hpaplslicfile.txt
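For anyone trying to reproduce this, the following is a minimal sketch of how the missing product-definition identifiers could be pulled out of the APLS log stream. It only assumes that the log line has the "PDFile not found <id>" shape quoted above. The function name `missing_pd_files` and the container name `apls` in the usage comment are hypothetical and not part of Swarm Learning or APLS.

```shell
#!/usr/bin/env bash
# Sketch only: list each distinct PD identifier that the log reports as missing.
# Reads log text on stdin, so it works with `docker logs` output or a saved file.
missing_pd_files() {
  # Keep only the "PDFile not found <id>" fragment, then print the 4th word (<id>).
  grep -oE 'PDFile not found [^[:space:]]+' | awk '{print $4}' | sort -u
}

# Usage (the container name "apls" is an assumption; substitute your own):
#   docker logs apls 2>&1 | missing_pd_files
```

An empty result would suggest the license server is no longer complaining about a missing PD file, which would narrow the problem to the entitlement itself.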
Swarm Learning Version:
- $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning 2.0.0
OS and ML Platform
- details of host OS: Ubuntu 20.04 LTS (Docker CE installed)
- details of ML platform used: Custom ML pipeline using Swarm Learning Sentinel node for distributed training coordination
- details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes):
  - 3 machines total
  - 1 Sentinel node
  - 2 Swarm Learning nodes
  - 1 AutoPass License Server container
  - Docker bridge network: apls-net
Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses? No
- If Multiple systems are used, can each system access every other system? Yes
- Is Password-less SSH configuration setup for all the systems? Yes
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? Yes
- Is the user id a member of the docker group? Yes
Additional notes
- Are you running documented example without any modification? No. This is a custom deployment with validated certs and entitlement injection.
- Add any additional information about use case or any notes which supports for issue investigation:
  - TLS handshake to APLS is successful
  - SAN validation confirmed
  - License server logs show clean startup but no active Swarm entitlement
  - .pd file 50089_2.0_HPE-Swarm_2.0.pd is present but not parsed
  - License ID 1100000380:1 is not served
  - Sentinel fails consistently with CommunicationException: 6002
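Since the notes above say the .pd file is present but not parsed, a filesystem-level sanity check may help separate "missing" from "unreadable" from "empty". The sketch below is a hypothetical helper (`check_pd_file` is not an APLS tool), and the directory where APLS expects PD files is not stated in this report, so it is passed in as an argument. It does not validate the PD file format itself.

```shell
#!/usr/bin/env bash
# Sketch only: classify a PD file as missing, unreadable, empty, or ok.
# $1 = directory to look in (assumption: wherever your APLS keeps PD files)
# $2 = PD identifier without the .pd suffix
check_pd_file() {
  local dir="$1" id="$2" f
  f="$dir/$id.pd"
  if [ ! -e "$f" ]; then echo "missing: $f"; return 1; fi
  if [ ! -r "$f" ]; then echo "unreadable: $f"; return 2; fi
  if [ ! -s "$f" ]; then echo "empty: $f"; return 3; fi
  echo "ok: $f"
}

# Usage (the directory is a placeholder, not a documented APLS path):
#   check_pd_file /path/to/pd-dir 50089_2.0_HPE-Swarm_2.0
```

A result of "ok" would mean the file passes these basic checks, which points toward registration or parsing on the APLS side rather than the file being absent.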