Sentinel Fails with CommunicationException: 6002 Due to Missing Swarm License Definition in AutoPass Server #256

@gendxx

Issue description

  • issue description: The Sentinel node fails to acquire a floating license from the AutoPass License Server (APLS). The license server is reachable and the TLS handshake succeeds, but the license entitlement for Swarm Learning is not served. APLS logs show "PDFile not found 50089_2.0_HPE-Swarm_2.0", indicating that the .pd file is missing, corrupted, or not registered.

  • occurrence - consistent or rare: Consistent across all Sentinel launches.

  • error messages:
    Sentinel: com.hp.autopassj.common.exception.CommunicationException: 6002 – Unable to connect to server. Server might be wrongly configured or down.
    APLS logs: PDFile not found 50089_2.0_HPE-Swarm_2.0

  • commands used for starting containers:
    sudo /opt/hpe/swarm-learning/scripts/bin/run-sn \
      --host-ip sentinel-fixed \
      --sentinel \
      --sn-api-port 8443 \
      --cert $CERT_PATH/sentinel-signed.pem \
      --key $CERT_PATH/sentinel.key \
      --capath $CERT_PATH/capath \
      --network apls-net \
      -e SWARM_LICENSE_AUTOPASSJ_SERVER_PRIMARY_IP=$APLS_IP \
      -e JAVA_TOOL_OPTIONS="-DproxySet=false ..." \
      -e http_proxy="" -e https_proxy="" \
      -e no_proxy="localhost,127.0.0.1,apls-fixed,$APLS_IP" \
      -v $CERT_PATH:/trusted \
      -v /tmp/blockchain:/platform/swarm/SMLNODE
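
The command above depends on $CERT_PATH and $APLS_IP being set and on the certificate files existing. A minimal pre-flight sketch that checks those inputs before launching is shown here; check_sn_inputs is a hypothetical helper name, not part of Swarm Learning, and it only tests for the file names that appear in the command.

```shell
# Sketch only: check_sn_inputs is a hypothetical helper, not a Swarm Learning tool.
# It verifies the files and values that the run-sn command above refers to.
check_sn_inputs() {
    local cert_path="$1" apls_ip="$2" rc=0 f

    # SWARM_LICENSE_AUTOPASSJ_SERVER_PRIMARY_IP is taken from APLS_IP.
    if [ -z "$apls_ip" ]; then
        echo "MISSING: APLS_IP is empty"
        rc=1
    fi

    # Certificate and key passed via --cert and --key.
    for f in sentinel-signed.pem sentinel.key; do
        if [ ! -r "$cert_path/$f" ]; then
            echo "MISSING: $cert_path/$f"
            rc=1
        fi
    done

    # CA directory passed via --capath.
    if [ ! -d "$cert_path/capath" ]; then
        echo "MISSING: $cert_path/capath (directory)"
        rc=1
    fi

    if [ "$rc" -eq 0 ]; then
        echo "OK: run-sn inputs look complete"
    fi
    return "$rc"
}

# Example call (run before the run-sn command):
#   check_sn_inputs "$CERT_PATH" "$APLS_IP"
```

This does not validate the certificates themselves; it only rules out an empty variable or a missing file as the cause of the failure.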

  • docker logs [APLS, SPIRE, SN, SL, SWCI]:
    APLS:
    PDFile not found 50089_2.0_HPE-Swarm_2.0
    LicenseLockCodeHandler :: Valid LockCode value found C269D67-9081DBA
    LicenseServerLicenseManagement : getAutopassInstance() :: License file path is /usr/local/tomcat/webapps/autopass/WEB-INF/classes/hpaplslicfile.txt
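
To see which product definition files APLS reports as missing, the log lines above can be filtered as in the following sketch. missing_pd_files is a hypothetical helper, and the container name in the usage comment is a placeholder; the only assumption about the log format is the "PDFile not found <id>" pattern quoted above.

```shell
# Sketch only: missing_pd_files is a hypothetical helper.
# Reads log text on stdin and prints each distinct PD file ID
# that appears in a "PDFile not found <id>" message.
missing_pd_files() {
    sed -n 's/.*PDFile not found \([^[:space:]]*\).*/\1/p' | sort -u
}

# Example call (replace <apls-container> with the actual container name):
#   docker logs <apls-container> 2>&1 | missing_pd_files
```

For the log excerpt in this issue, the only ID reported would be 50089_2.0_HPE-Swarm_2.0.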

Swarm Learning Version:

  • $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning
    hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning 2.0.0

OS and ML Platform

  • details of host OS: Ubuntu 20.04 LTS (Docker CE installed)
  • details of ML platform used: Custom ML pipeline using Swarm Learning Sentinel node for distributed training coordination
  • details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes):
    3 machines total
    1 Sentinel node
    2 Swarm Learning nodes
    1 AutoPass License Server container
    Docker bridge network: apls-net

Quick Checklist: Respond [Yes/No]

  • APLS server web GUI shows available Licenses? No
  • If Multiple systems are used, can each system access every other system? Yes
  • Is Password-less SSH configuration setup for all the systems? Yes
  • If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? Yes
  • Is the user id a member of the docker group? Yes

Additional notes

  • Are you running documented example without any modification? No. Custom deployment with validated certs and entitlement injection.
  • Add any additional information about use case or any notes which supports for issue investigation:
    TLS handshake to APLS is successful
    SAN validation confirmed
    License server logs show clean startup but no active Swarm entitlement
    .pd file 50089_2.0_HPE-Swarm_2.0.pd is present but not parsed
    License ID 1100000380:1 is not served
    Sentinel fails consistently with CommunicationException: 6002
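
For anyone reproducing the SAN validation mentioned in the notes above, one possible check is sketched here. cert_has_san is a hypothetical helper, and which certificate and hostname to check is an assumption; the example call uses the Sentinel certificate and the sentinel-fixed name from the launch command.

```shell
# Sketch only: cert_has_san is a hypothetical helper.
# Succeeds if the certificate lists the given DNS name in its
# Subject Alternative Name extension.
cert_has_san() {
    local cert="$1" name="$2"
    openssl x509 -in "$cert" -noout -text \
        | grep -A1 'Subject Alternative Name' \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//' \
        | grep -xF "DNS:$name" >/dev/null
}

# Example call (certificate path and hostname are assumptions):
#   cert_has_san "$CERT_PATH/sentinel-signed.pem" sentinel-fixed && echo "SAN ok"
```

A passing SAN check is consistent with the successful TLS handshake reported above and points away from certificates as the cause of the missing entitlement.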

NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.
