Issue description
- issue description: the mnist-pyt example fails to run
- occurrence - consistent or rare: always
- error messages:
When I ran SWCI and assigned a task to build the user Docker image, the task failed with an error saying it failed to copy the build context.
SWCI:7 > ASSIGN TASK build_test1 TO defaulttaskbb.taskdb.sml.hpe WITH 1 PEERS
Task assigned to TaskRunner
SWCI:8 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE - Maximum wait time is : 120 mins
###
TASKRUNNER FINISHED
STATE : ERROR
TIME : 2024-06-09 06:47:53
SWCI:8 > ERROR : Task has failed, check TASKRUNNER PEER STATUS for Error description
SWCI:9 > GET TASKRUNNER PEER STATUS defaulttaskbb.taskdb.sml.hpe 0
NAME : demo
SWOP_UID : e57f35f3-a1ad-4aec-a704-066363f55808
OPERATION_ID : 9575480544290973496
PEER_COUNT : 1
UPDATE_TS : 2024-06-09 06:47:53
SWOP_PEER_INDEX : 0
SWOP_PEER_STATUS : ERROR
SWOP_PEER_STATUS_DESC : Failed to copy build context
SWCI:10 > EXIT
The task definition (taskdef) is:
######################################################################
# (C)Copyright 2021-2023 Hewlett Packard Enterprise Development LP
######################################################################
Name: build_test1
TaskType: MAKE_USER_CONTAINER
Author: HPESwarm
Prereq: ROOTTASK
Outcome: user-image-pyt1.5
Body:
    BuildContext: sl-cli-lib
    BuildType: INLINE
    BuildSteps:
    - FROM pytorch/pytorch
    - ' '
    - RUN apt-get update && apt-get install \
    - ' build-essential python3-dev python3-pip \'
    - ' python3-setuptools --no-install-recommends -y'
    - ' '
    - RUN conda install pip
    - ' '
    - RUN pip3 install --upgrade pip protobuf==3.15.6 && pip3 install \
    - ' torchvision matplotlib opencv-python pandas torchmetrics'
    - ' '
- commands used for starting containers:
- docker logs [APLS, SPIRE, SN, SL, SWCI]:
Logs from SWOP:
2024-06-09 06:21:48,907 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2024-06-09 06:21:55,512 : swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
2024-06-09 06:21:55,516 : swarm.swop : INFO : SWOPBuildTask: Temp Container Creation Failed
2024-06-09 06:21:55,516 : swarm.swop : INFO : 400 Client Error for http+docker://localhost/v1.40/images/create?tag=c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972&fromImage=sha256: Bad Request ("failed to resolve image name: short-name "sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972" did not resolve to an alias and no unqualified-search registries are defined in "/etc/containers/registries.conf"")
2024-06-09 06:21:55,516 : swarm.swop : INFO : SWOPBuildTask: Temp Container Creation Failed
2024-06-09 06:21:55,516 : swarm.swop : WARNING : SWOPBuildTask: Failed to copy build context
I tried to fix this problem myself by editing "/etc/containers/registries.conf" as follows:
unqualified-search-registries = ['registry.fedoraproject.org', 'registry.access.redhat.com', 'registry.centos.org', 'docker.io']
[[registry]]
location = "docker.io"
But building the user Docker image still failed after I edited the configuration. Here is the error after I added these lines:
2024-06-09 06:47:44,631 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2024-06-09 06:47:51,232 : swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
2024-06-09 06:47:51,236 : swarm.swop : INFO : SWOPBuildTask: Temp Container Creation Failed
2024-06-09 06:47:51,236 : swarm.swop : INFO : 404 Client Error for http+docker://localhost/v1.40/images/sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972/json: Not Found ("failed to find image sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972: sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972: No such image")
2024-06-09 06:47:51,236 : swarm.swop : INFO : SWOPBuildTask: Temp Container Creation Failed
2024-06-09 06:47:51,236 : swarm.swop : WARNING : SWOPBuildTask: Failed to copy build context
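To compare the two runs, it can help to pull the HTTP status and the image digest out of the SWOP log lines. The following is a minimal sketch, assuming the log format shown above (`timestamp : logger : LEVEL : message`); the function name `summarize` and the regular expressions are illustrative and not part of Swarm Learning.

```python
import re

# Assumed layout, taken from the log excerpts above:
#   2024-06-09 06:47:51,236 : swarm.swop : INFO : <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) : "
    r"(?P<logger>\S+) : (?P<level>[A-Z]+) : (?P<msg>.*)$"
)
CLIENT_ERROR = re.compile(r"\b(?P<status>\d{3}) Client Error\b")
DIGEST = re.compile(r"sha256:(?P<hex>[0-9a-f]{64})")


def summarize(lines):
    """Collect HTTP client-error statuses and image digests from SWOP log lines."""
    statuses = []
    digests = set()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a SWOP log line
        msg = match.group("msg")
        error = CLIENT_ERROR.search(msg)
        if error:
            statuses.append(int(error.group("status")))
        digests.update(d.group("hex") for d in DIGEST.finditer(msg))
    return statuses, digests


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_swop.py < swop.log
    found_statuses, found_digests = summarize(sys.stdin)
    print("client errors:", found_statuses)
    print("image digests:", sorted(found_digests))
```

Applied to the two excerpts above, this would show that the status changes from 400 to 404 after the configuration edit while the same image digest is referenced in both runs.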
Swarm Learning Version:
- Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
2.2.0
OS and ML Platform
- details of host OS: Ubuntu22.04
- details of ML platform used: pytorch
- details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): just the example from mnist-pyt, 1 machine, 1 SL node, 1 SN node
Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses? YES
- If Multiple systems are used, can each system access every other system? No, only a single system is used
- Is Password-less SSH configuration setup for all the systems? No
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? No
- Is the user id a member of the docker group? Yes
Additional notes
- Are you running documented example without any modification?
I use ngrok to map the local APLS to another domain and port, but I don't think that matters.
- Add any additional information about use case or any notes that support issue investigation:
NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.
Maybe this is an issue with podman-docker?
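The path "/etc/containers/registries.conf" in the error message is a configuration file used by Podman's container tooling, which is consistent with the `docker` command on this host being the podman-docker wrapper. A minimal sketch for checking this, assuming the wrapper reports "podman" in its `docker --version` output; the helper names `looks_like_podman` and `docker_cli_version` are hypothetical:

```python
import shutil
import subprocess


def looks_like_podman(version_output: str) -> bool:
    """Return True if the text printed by `docker --version` mentions podman."""
    return "podman" in version_output.lower()


def docker_cli_version() -> str:
    """Run `docker --version` and return stdout plus stderr, or '' if docker is absent."""
    if shutil.which("docker") is None:
        return ""
    result = subprocess.run(
        ["docker", "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    output = docker_cli_version()
    if not output:
        print("no docker command found on PATH")
    elif looks_like_podman(output):
        print("the docker command appears to be provided by podman")
    else:
        print("the docker command does not mention podman")
```

If the check reports podman, the failure is likely related to the Docker API emulation rather than to the taskdef itself, although this sketch alone does not confirm that.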