Refactor download (container) by VannTen · Pull Request #12937 · kubernetes-sigs/kubespray

VannTen · 2026-01-30T15:48:47Z

What type of PR is this?
/kind design

What this PR does / why we need it:
Refactor the downloading of container images to:

- work with multi-arch clusters and download_delegate
- be much faster

Which issue(s) this PR fixes:
Fixes #12677
Fixes #11663
Fixes #9094

Special notes for your reviewer:
Release note still to be written; there are some breaking changes in inventory variables.
This is on top of #12299
~~@tico88612 @rptaylor if you want an early peek, but this is still in rough shape.~~

Missing currently:

- fetching images downloaded on one node for another to localhost
- copying images to the nodes where they are needed
- loading images into the container engine
- The above has been made into a role; plug it in everywhere needed
- plug into the correct places in the playbooks
- delete the old implementation

Does this PR introduce a user-facing change?:

k8s-ci-robot · 2026-01-30T15:48:50Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2026-01-30T15:48:50Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

VannTen · 2026-01-30T15:49:07Z

/label ci-short

(for now)

VannTen · 2026-03-13T14:02:26Z

/label ci-extended

rptaylor · 2026-03-13T18:27:23Z

-    remote_src: true
-  with_items:
-    - "{{ crio_libexec_files }}"
+- name: Cri-o | Install artefacts


Trivial nitpick: it's spelled "artifact" in most other places, but not totally consistent ...

VannTen · 2026-03-15

Hum, yeah, given there are other instances (including in variables, so changing it would be breaking), let's keep it.

rptaylor · 2026-03-13T18:40:39Z

@@ -0,0 +1,134 @@
+---
+- name: Download | Check localhost is in play


The other assertion checks `'localhost' in ansible_play_hosts_all`, whereas this one checks `'localhost' in ansible_play_hosts`. Isn't this one redundant anyway, since meta/main.yml has a dependency on download/common?
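
To make the difference between the two assertions concrete, here is a minimal sketch. It is not the PR's actual code: the task names and messages are illustrative. `ansible_play_hosts_all` lists every host targeted by the play, while `ansible_play_hosts` excludes hosts that have failed or become unreachable, so the two checks can diverge in the middle of a play.

```yaml
---
# Sketch only: task names and fail_msg texts are assumptions, not taken from the PR.
- name: Download | Check localhost is targeted by the play
  ansible.builtin.assert:
    that:
      - "'localhost' in ansible_play_hosts_all"
    fail_msg: "localhost must be part of the play (check your --limit)"
  run_once: true

- name: Download | Check localhost is still active in the play
  ansible.builtin.assert:
    that:
      - "'localhost' in ansible_play_hosts"
    fail_msg: "localhost failed or became unreachable earlier in the play"
  run_once: true
```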

rptaylor · 2026-03-13T18:44:42Z

+                             map('extract', hostvars, morekeys=['downloads_list']) |
+                             flatten | unique(attribute='value.dest') }}"
+  block:
+


Although it's just a matter of style / consistency, a newline at the start of the block seems odd.
Using a consistent style for newlines (or not) between tasks in the block would be nice.

VannTen · 2026-03-15

Fixed the task without the separating newlines.
The newline at the start felt easier to read to me, as it isolates the first task from the block. Is that odd?

rptaylor · 2026-03-13T18:49:57Z

+  group_by:
+    key: download_delegate_{{ download_delegate }}
+
+- name: Download_file | Set downloads delegated to localhost for all hosts


Should the first part of the task name (before |) be consistent? It alternates between "Download" and "Download_file":

- Download | Check localhost is in play
- Download | instantiate download dir var
- Download | group hosts by download_delegate
- Download_file | Set downloads delegated to localhost for all hosts
- Download | Download files
- Download | Create download directory
- Download_file | Download file

VannTen · 2026-03-15

There was some copy-pasting here 🤣
Honestly, that naming scheme kinda does not make sense for kubespray, but I'll fix it for consistency.

rptaylor

At a cursory glance it looks fine to me. I just commented on some style/consistency things.
Thanks for the big effort @VannTen !

Marking etcd as external and configuring the CRI socket were only needed because kubespray used to let kubeadm download images directly (with kubeadm config images pull):

- etcd: 061f5a3 (Explicitely set etcd endpoint in kubeadm-images.yaml (kubernetes-sigs#4063), 2019-02-13)
- cri socket: 62a8961 (Fix installation using CRIO about download images failed, 2018-12-23)

This is no longer the case since 23c9071 (Added file and container image caching (kubernetes-sigs#4828), 2019-06-10), which passes the actual download responsibility to the download role, so remove those workarounds.

Put the kubeadm_images definition into its own role (as we need to refer to the definition even if download was skipped).

Compute the kubeadm images on localhost, and wire them up in the wider downloads for the node selection part (we define a skeleton for kubeadm images matching the download structure).

Move the minimal kubeadm config file used for image listing to local_release_dir (because localhost does not use kube_config_dir).
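
For illustration, a minimal kubeadm configuration that is sufficient for `kubeadm config images list --config <file>` might look like the sketch below. The API version, Kubernetes version and repository shown are assumptions chosen for the example, not the contents of the file used by the PR.

```yaml
# Sketch only: values are illustrative, not the PR's actual config file.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.33.0        # illustrative version
imageRepository: registry.k8s.io  # default upstream repository
```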

Nodes used as downloaders (i.e., present in the `download_delegate` variable of other nodes or of themselves) need skopeo to download images.

An unfortunate side effect is that we need to define the dynamic download_delegate_* groups early, because they are now needed for the evaluation of the download variable.

This role is supposed to be invoked by other roles to load specific images into nodes. It's designed to do the least amount of work, and only loads images if they aren't already present.

In order to allow decoupling the image download itself (possibly on another node) from the loading, it will list the images if the list from the download/container role is not available.

(We're using skopeo for docker, as the docker CLI apparently can't import OCI archives correctly.)
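
As a rough sketch of the "load only if absent" idea for docker, a task could look like the one below. The variables `image_ref` (a full reference including a tag), `image_archive` and `present_images` are hypothetical names introduced for this example; only skopeo's `oci-archive:` and `docker-daemon:` transports are existing features.

```yaml
---
# Sketch only: image_ref, image_archive and present_images are hypothetical variables.
- name: Load image into docker from an OCI archive
  ansible.builtin.command:
    argv:
      - skopeo
      - copy
      - "oci-archive:{{ image_archive }}"
      - "docker-daemon:{{ image_ref }}"
  # Skip the work entirely when the image is already known to the engine.
  when: image_ref not in present_images
```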

ansible-lint is pretty opinionated, and a lot of its opinions are not particularly useful for Kubespray, so delete the comment about not adding skip rule entries. We should not add them willy-nilly, but some rules just do not make sense for us, in this case name[casing] when prefixing task names with the role name.

VannTen · 2026-04-03T15:22:24Z

/retest-failed

VannTen · 2026-04-07T07:22:53Z

/cc @rptaylor
(in particular the docs part)

VannTen · 2026-04-07T07:53:35Z

@bbaassssiiee I think you're using the offline script right ? Would you mind providing some feedback on this ?

In particular I think this should make the offline script unnecessary by using kubespray directly to create a single archive which can be transferred to the offline env.
But as I'm not using that I might be missing some usage pattern.

bbaassssiiee · 2026-04-07T09:42:55Z

Dropping a tarball over the fence??

I don't know where you work, but a single archive is simply unacceptable in regulated industries. We need to store and scan each binary and each container image for vulnerabilities and malware. For that I use the contrib/offline scripts, and even contributed upload2artifactory.py

I have staging environments for that process, one runs in the cloud to download the originals, it can push the images to our private registry and the binaries to our private repository. In the next environment I can test completeness, because that one is restricted to use the private registry and private repository exclusively. Once the images and binaries are scanned I can test applications on top of the 'clean' cluster.

VannTen · 2026-04-07T12:20:54Z

What I'm trying to understand is whether the offline scripts / process rely on the artifact filenames, as the PR substantially changes those (embedding the architecture in the name for uniqueness, in particular), or on any other assumption which this PR could break.

AFAICT, upload2artifactory.py just uploads all files in a directory recursively, right? So it shouldn't depend on the file names?

I'm imagining the process to be:

```
# on the online environment which has access to internet
$ ansible-playbook -i <your_inventory> --tags download
-> only download binaries and artefacts to `local_release_dir`
# would require:
# - download_delegate: 'localhost' in the inventory
# - ansible_architecture explicitly defined in inventory to not rely on
#   facts collection
$ transfer-process local_release_dir
# with transfer-process being a placeholder for whatever process is used to
# transfer from the online to the offline env;
# the result should be a directory with the same contents in the offline
# env

# in the offline env
$ ansible-playbook -i <your_inventory> --skip-tags download
```

This would make the manage-offline-* scripts themselves obsolete, hopefully.

Does that make sense? Do you see any problems this would cause for your process, for instance?
This would probably need some workflow adjustments anyway, but hopefully those adjustments would simplify things overall.
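
The two inventory requirements mentioned above could be expressed roughly as follows. This is a sketch: the host names and architecture values are made up for the example, and the exact variable semantics are those of the PR, not verified here.

```yaml
# inventory.yml (sketch; node1/node2 and their architectures are illustrative)
all:
  vars:
    # Download everything through the Ansible controller
    download_delegate: localhost
  hosts:
    node1:
      # Set explicitly so the download step does not depend on fact gathering
      ansible_architecture: x86_64
    node2:
      ansible_architecture: aarch64
```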

bbaassssiiee · 2026-04-07T12:45:48Z

At the moment our offline.yml maps *_image_repo to our registry_host, except for this local_path_provisioner_helper_image_repo (Don't know why that edge case is there).
All the file URLs map to files_repo/sitename like shown below, in Kubespray the path to the files is maintained.

I guess being able to recurse two directories, or directory trees (images and files) would make creating a script for offline copying a task for your favorite LLM.

---
# https://kubespray.io/#/docs/operations/offline-environment
# For /etc/containerd/config.toml
containerd_registries_mirrors:
  - prefix: docker.io
    mirrors:
      - host: https://registry-1.docker.io
        capabilities: ["pull", "resolve"]
        skip_verify: false
  - prefix: "{{ registry_host }}"
    mirrors:
      - host: "https://{{ registry_host }}"
        capabilities: ["pull", "resolve"]
        skip_verify: false
        header:
          Authorization: ["Basic {{ (registry_user + ':' + registry_pass) | b64encode }}"]

# Registry overrides
kube_image_repo: "{{ registry_host }}"
gcr_image_repo: "{{ registry_host }}"
docker_image_repo: "{{ registry_host }}"
quay_image_repo: "{{ registry_host }}"
github_image_repo: "{{ registry_host }}"
local_path_provisioner_helper_image_repo: "{{ registry_host }}/busybox"
github_url: "{{ files_repo }}/github.com"
dl_k8s_io_url: "{{ files_repo }}/dl.k8s.io"
storage_googleapis_url: "{{ files_repo }}/storage.googleapis.com"
get_helm_url: "{{ files_repo }}/get.helm.sh"

rptaylor · 2026-04-07T19:09:35Z

> /cc @rptaylor (in particular the docs part)

If you tag me I assume you want more nit-picking hah ;p
Looks good, just a few small grammar comments. Thanks for writing this up!

k8s-ci-robot requested review from cyclinder and tico88612 on January 30, 2026 15:48

k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files.) on Jan 30, 2026

k8s-ci-robot added the ci-short label (Run a quick CI pipeline) on Jan 30, 2026

k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Feb 8, 2026

VannTen mentioned this pull request Feb 13, 2026

Refactor(defaults): centralize remove node defaults #13004

Merged

Srishti-j18 mentioned this pull request Feb 13, 2026

eliminate all instances of default filter in roles => var should have a default defined only once #11822

Open

VannTen mentioned this pull request Mar 11, 2026

download: cache node image list before download loop #13083

Closed

VannTen force-pushed the cleanup/sane_download_container branch from 329aea7 to 2d82181 on March 13, 2026 09:21

k8s-ci-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Mar 13, 2026

VannTen force-pushed the cleanup/sane_download_container branch 2 times, most recently from f627fa9 to 2416578, on March 13, 2026 10:54

This was referenced Mar 13, 2026

Refactor download (file) #12299

Open

Copy file from cache to nodes fails under SELinux #12508

Open

VannTen force-pushed the cleanup/sane_download_container branch from 2416578 to e849f34 on March 13, 2026 14:01

k8s-ci-robot added the ci-extended label (Run additional tests) on Mar 13, 2026

rptaylor reviewed Mar 13, 2026

rptaylor approved these changes Mar 13, 2026

VannTen added 21 commits April 3, 2026 16:09

Document and assert requirement to include localhost in limits

2e74c18

Only download container images in legacy download role

5809b8c

The file artefacts are handled by the new download/file role

CI: add helm for kubelet-csr-approver/custom-cni-helm tests

77ba8aa

CI: Test other download_delegate options

9a4b2e0

CI/molecule: use common vars as well and simplify run

a0ed780

The molecule_run.sh isn't really needed since the switch to gitlab-ci matrix runs.

download: merge repo and tag key in downloads container dict

c04e5f0

Those keys are only used as part of the downloads role, which will be removed, and it's more convenient to use a single key directly in Jinja pipelines.

Download container images needed on 'download_delegated' nodes

ea8d30f

- compute the images present on each node from the k8s API
- subtract the present images from the needed images
- compute the unique set of images per downloader (distinguishing different archs)
- copy them to oci-archive format
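
The steps above could be sketched as Ansible tasks along these lines. This is an illustration under assumptions, not the PR's code: `needed_images`, `present_images`, `missing_images`, `image_arch` (a Go-style architecture such as `amd64` or `arm64`) and the archive naming are hypothetical. The `difference`, `unique` and `regex_replace` filters, and skopeo's `docker://` and `oci-archive:` transports, are existing features.

```yaml
---
# Sketch only: variable names and the archive path layout are assumptions.
- name: Compute images missing on this node
  ansible.builtin.set_fact:
    missing_images: "{{ needed_images | difference(present_images) | unique }}"

- name: Copy each missing image to an OCI archive on the downloader
  ansible.builtin.command:
    argv:
      - skopeo
      - --override-arch
      - "{{ image_arch }}"
      - copy
      - "docker://{{ item }}"
      - "oci-archive:{{ archive_path }}"
    # Do nothing if the archive was already produced by a previous run.
    creates: "{{ archive_path }}"
  vars:
    # Embed the architecture in the file name so multi-arch archives do not collide.
    archive_path: "{{ local_release_dir }}/images/{{ item | regex_replace('[/:]', '_') }}-{{ image_arch }}.tar"
  loop: "{{ missing_images }}"
  delegate_to: "{{ download_delegate }}"
```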

download/container: copy logic from download/file for localhost fetch

d9ac4c6

The logic for images to fetch back on localhost is the same as for download/file. (== copy images delegated from node A to B, with neither being localhost) Copy it for now.

CI: test download/container pre pull

685a6ca

Patch the control plane to never pull; this will fail if the images are not correctly pre-loaded, which is the point.
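
One way to express "never pull" for a control-plane static pod is a kubeadm strategic-merge patch such as the sketch below. Whether the CI job uses this exact mechanism is an assumption; the file name follows kubeadm's `target+patchtype.extension` convention for patch files.

```yaml
# kube-apiserver+strategic.yaml (sketch; a similar file would be needed per component)
spec:
  containers:
    - name: kube-apiserver
      # The kubelet never pulls this image; the pod fails to start
      # if the image has not been pre-loaded on the node.
      imagePullPolicy: Never
```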

download: convert container download to "free-from enabled"

7dd02fd

Replace remaining download invocations with download/container

3a402ac

download: create container and binaries download dir in one step

3f8e84c

install kubeadm on localhost for kubeadm_images

e094a7e

Remove download role

58aed67

All the functionality has now been ported to download/file and download/container

download: remove old variables which are no longer used after refactor

dda2829

Update docs wrt to new downloads semantics

ae4e107

VannTen mentioned this pull request Apr 7, 2026

Wrap the image pull command with timeout to prevent indefinite hangs #13120

Open

VannTen mentioned this pull request Apr 7, 2026

CI: test download/container pre pull #13152

Merged
