Conversation

glennpratt

@glennpratt glennpratt commented May 19, 2025

Proposed Changes

The experimental k3s flag --disable-agent is already passed through; however, the parts of rke2 server that contact containerd directly do not respect it.

Supporting this flag allows running rke2 server independently of local static pods. This might be useful for:

  • Customizing the apiserver execution environment
  • A vcluster distro like k3s
  • Running rke2 server as a pod for quick testing

This is a hidden experimental feature. We could copy the corresponding server configuration docs from k3s, or leave it undocumented.

Types of Changes

Fix for hidden experimental feature.

Verification

I verified this change by applying the patch onto v1.32.4-rke2r1 and running that build as a StatefulSet. Etcd and the apiserver were not rebuilt; the published images were used.

https://gist.github.com/glennpratt/6272c94db3093127a948a37c5a378a0e

Containers(rke2-bootstrap-system/rke2-server-0)[4]

NAME              READY     STATE         IMAGE                                                                        
rke2-server-init  true      Completed     docker.local/rke2:tilt-7412b621e3de84bd                                      
rke2-server       true      Running       docker.local/rke2:tilt-7412b621e3de84bd                                      
kube-apiserver    true      Running       index.docker.io/rancher/hardened-kubernetes:v1.32.4-rke2r1-build20250423     
etcd              true      Running       index.docker.io/rancher/hardened-etcd:v3.5.21-k3s1-build20250411  

Testing

This change is not currently covered by unit tests. I'm happy to add any tests you'd like, though I may need some assistance with anything beyond unit tests.

Linked Issues

User-Facing Change

This is a hidden feature, undocumented in rke2 but documented in k3s.

Improved support for `--disable-agent` (experimental, hidden server flag)

Further Comments

@glennpratt glennpratt requested a review from a team as a code owner May 19, 2025 23:14
@brandond
Member

brandond commented May 19, 2025

This is highly unlikely to be accepted.

--disable-agent is not supported for RKE2 and will never work as it does in K3s. RKE2 is built around deploying the datastore and control-plane components as static pods. Static pods require the kubelet and container runtime, which is exactly what is excluded from startup when you use the --disable-agent flag.

The closest you'd probably ever get with RKE2 is starting up the kubelet and container runtime, but then leaving the kubelet disconnected from the apiserver so that all it handles is static pods. And I don't really get why you'd want to do that, as you're cutting out nothing except the ability to manage the server as a Kubernetes node. You might as well just taint the node, instead of breaking a bunch of other things.

@brandond
Member

brandond commented May 19, 2025

I guess I'm not even sure what you're trying to accomplish here. Do you want to start up just the RKE2 supervisor API (the bit that is exposed on port 9345) without any of the control plane components running?

If that's what you want, you might take a look at something I was tinkering with a while back: https://github.com/brandond/s8r - but note that this hasn't been updated for the recent reorganization of some packages in k3s.

@brandond
Member

brandond commented May 20, 2025

If having ONLY the supervisor API isn't what you wanted, then perhaps this is closer to what you want:
master...brandond:rke2:disable-agent

Although even this will have issues, as there are still things that will expect a Node reference to eventually become available. As noted in the K3s docs, --disable-agent is incompatible with embedded etcd.

@glennpratt
Author

glennpratt commented May 20, 2025

Thanks for the quick response @brandond. Yes, having only the Supervisor API started by the long-running rke2 server command was my goal.

I think I want it for reasons similar to those in k3s, and similar to vcluster: to have a Node-less control plane that uses an external etcd (e.g. kine), has no root privileges on the machine it's running on, and starts quickly. Normal full-VM RKE2 agent nodes would join this.

Granted, with this simple change, I have to orchestrate the apiserver myself. In my example, the flags could still be consumed from generated manifests, but the host bind mounts no longer make any sense.

The example gist uses ephemeral etcd just to be self-contained. I will attach a full agent node to it and see how that goes. With the linked gist, it at least runs without errors and responds to kubectl requests. In testing with Tilt, it's also just nice how quickly I can iterate with it.
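
To make the earlier point about host bind mounts concrete, here is a minimal sketch (in Go, using the core/v1 types; the Secret and mount names are hypothetical) of how the kube-apiserver container might keep consuming flags taken from the generated manifest while getting its TLS material from a Secret instead of host paths:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// apiserverContainer sketches a kube-apiserver container whose flags could be
// copied from the manifest rke2 generates, but whose TLS material is mounted
// from a Secret instead of host bind mounts. All names here are hypothetical.
func apiserverContainer() (corev1.Container, corev1.Volume) {
	tls := corev1.Volume{
		Name: "apiserver-tls",
		VolumeSource: corev1.VolumeSource{
			Secret: &corev1.SecretVolumeSource{SecretName: "rke2-apiserver-tls"},
		},
	}
	c := corev1.Container{
		Name:  "kube-apiserver",
		Image: "index.docker.io/rancher/hardened-kubernetes:v1.32.4-rke2r1-build20250423",
		Command: []string{
			"kube-apiserver",
			"--tls-cert-file=/tls/tls.crt",
			"--tls-private-key-file=/tls/tls.key",
			// ...remaining flags copied from the generated static pod manifest.
		},
		VolumeMounts: []corev1.VolumeMount{
			{Name: "apiserver-tls", MountPath: "/tls", ReadOnly: true},
		},
	}
	return c, tls
}
```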

@brandond
Member

To have a Node-less control plane that uses an external etcd (e.g. kine), has no root privileges on the machine it's running on and starts quickly.
Granted, with this simple change, I have to orchestrate the apiserver myself.

If you want a functional control-plane with RKE2 you'll need the kubelet and containerd. Is there any particular reason you're not just using K3s?

@glennpratt
Author

glennpratt commented May 20, 2025

What I want is to run control planes as Pods, without VMs. This already works well with other Kubernetes distros. I would like to make RKE2 available as well, for the reasons written on the tin, some of which are not claimed by k3s, and because RKE2 is the Kubernetes my specific internal product already provides to our users.

Ideally, most parts of the RKE2 control plane (e.g. not DaemonSets) could run as unprivileged Pods in an orchestration cluster that are invisible to the resulting user cluster. I realize this is not supported and requires duplication of effort on my end; that's fine and expected.

I opened this PR somewhat early in the evaluation phase to see how receptive you might be. I was actually pleasantly surprised by how easy it was to get running without containerd - presumably because of the k3s inheritance and previous efforts there around vcluster.

@brandond
Member

brandond commented May 20, 2025

What I want is part of an internal project to unify K8s control plane hosting internally and externally at my company.

What exactly do you mean by this? What does this look like, in practice?

Ideally, most parts of the RKE2 control plane (e.g. not DaemonSets) could run as unprivileged pods in an orchestration cluster that are invisible to the resulting user cluster.

There are no DaemonSets. The RKE2 control-plane is composed entirely of static pods, using manifests created by the RKE2 supervisor process and executed by the kubelet.

While I would love to enable some variety of an agentless server, even in an unsupported capacity - we're unlikely to move forward with any approach that doesn't result in a working cluster. So, an agentless server using kine or external etcd, where RKE2 continues to manage the control-plane pods through containerd + kubelet, is something that we could enable. An agentless server where the supervisor comes up and does nothing until someone externally provides the control-plane components using some external automation, is not something we'd be interested in.

Note that neither K3s nor RKE2 is architected to support heterogeneous clusters. They both expect that the correct distro's supervisor API is available, and that all nodes are running the same distro. If you're attempting to mix and match distros, or run RKE2 agents without RKE2 servers and supervisor controllers, things are unlikely to work well.

What you're talking about would probably require a custom executor that integrated with the host cluster's Kubernetes API to create pods alongside the RKE2 server pod, instead of relying on the kubelet+containerd to run the pods. As far as I know the kubelet cannot be run within an unprivileged pod.

https://github.com/rancher/rke2/tree/master/pkg/podexecutor
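
As a rough sketch of that idea (this is not the real rke2/k3s executor interface; the function and its arguments are hypothetical), a custom executor would create the control-plane pods through the host cluster's API with client-go instead of writing static pod manifests for the local kubelet:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// runControlPlanePod stands in for the step where the pod executor would
// normally drop a static pod manifest for the kubelet. Here the pod is
// created directly in the host ("orchestration") cluster via its API instead.
func runControlPlanePod(ctx context.Context, host kubernetes.Interface, namespace string, pod *corev1.Pod) error {
	_, err := host.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
	return err
}
```

Everything around this call - watching the pods, wiring up certificates and kubeconfigs - is the hard part described above.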

@glennpratt
Author

glennpratt commented May 21, 2025

What does this look like, in practice?

K8s control planes run inside orchestration Kubernetes clusters as Pods.

There are no DaemonSets

I know; it was just an example. There are DaemonSets on a typical RKE2 control-plane Node. These aren't part of Pod-based control planes, which I suppose is obvious.

heterogeneous clusters

Clusters are homogeneous; it's just that the control plane is Node-less, is provided a datastore, and runs as Pods with minimal privileges rather than VMs.

custom executor that integrated with the host cluster's Kubernetes API

This is interesting and I could work towards it, but there are complexities like mounts that may be awkward to wire up unless it's end-user pluggable. These control planes are typical Kubernetes applications: they have no host mounts, and things like configs, certs, and tokens are generated in an operator and then provided in typical k8s fashion, e.g. mounted from Secrets. Rotations are performed by rollout (Pod replacement).

It's also a little weird for scheduling: I'm not sure you can elegantly express that the rke2 Pod is going to create an apiserver Pod next to it. Two containers in a Pod fit more naturally there, but you would generate that in advance (a rough sketch of that shape follows below). I'd be happy to even use rke2 in the manifest generation phase, but that would be in something like an operator or a Job spawned by an operator.

If rke2 server doesn't insist on creating the apiserver Pod, perhaps only validating it, that could work.
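
A minimal sketch of that two-containers-in-a-Pod shape (images and flags are illustrative assumptions, not a supported configuration); both containers share the Pod's network namespace, so anything that expects loopback still works:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// supervisorPodSpec sketches the rke2 supervisor and the kube-apiserver
// running as sibling containers in one Pod, sharing localhost.
// Purely illustrative; images and args are hypothetical.
func supervisorPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		Containers: []corev1.Container{
			{
				Name:    "rke2-server",
				Image:   "docker.local/rke2:tilt-example", // hypothetical local build
				Command: []string{"rke2", "server", "--disable-agent"},
			},
			{
				Name:  "kube-apiserver",
				Image: "index.docker.io/rancher/hardened-kubernetes:v1.32.4-rke2r1-build20250423",
			},
		},
	}
}
```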

@brandond
Member

I'm tinkering with the outline of a new executor implementation that would do something like what you're describing. It would basically convert the RKE2 supervisor to a cluster operator that hosts the supervisor API and manages control-plane pod deployments.

The current code for managing etcd is very closely tied to the idea of each supervisor managing a local etcd member, so I'm ignoring that and assuming there is an existing etcd operator that could be deployed to manage it, with the cluster simply pointed at a service endpoint. You'd lose the ability to manage snapshots and such via the RKE2 CLI, but rewiring the current logic to work with StatefulSets or the like is way more involved than I want to get for a proof of concept.

@glennpratt
Author

glennpratt commented May 21, 2025

An example would be a Pod-based cluster-api control plane provider. (I'm not sure Pod-based is anything more than an idea one might implement, from cluster-api's perspective; it's not defined by types in the code, AFAIK.)

At a glance, anything that is part of the "workload" cluster's steady operation or that listens for remote traffic, e.g. the Supervisor API, is something I'd want isolated from an operator. I'd want it to have little or no access to the management apiserver, and to keep concerns like eviction, scaling, and versioning separate.

@brandond
Member

brandond commented May 22, 2025

I've been poking at this a bit more and it's... more complicated than I initially anticipated. It is easy enough to get the supervisor to run the control-plane pods as deployments in a host cluster; I have that working. Other than the previously mentioned difficulties around managing the etcd cluster pods, the supervisor API is proving somewhat difficult to break out from the apiserver.

The control-plane pods normally run with host network, so the components can always find each other on the loopback address, and clients can rely on every apiserver IP also hosting the supervisor API on a different port. This will either need to be broken out on the k3s side, or the apiserver pods will need to proxy the supervisor port back to the supervisor pods.

There are also complications around the agent tunnels that the apiserver uses to connect up to the kubelet for kubectl exec and kubectl logs. Normally the agents directly connect to the supervisor port on all apiserver nodes so that the apiservers can dial them through the tunnel, but this won't work if the pods are all hidden behind a single endpoint. Only the single pod that the agent is connected to will be able to connect to the agent, unless we do some extra work to peer the remotedialer instances between pods. This will also require changes on the k3s side.
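
To illustrate the proxy option from the paragraph before last: a sidecar in each apiserver pod could simply forward the supervisor port to a supervisor Service. A minimal TCP-level sketch, with a hypothetical Service name, ignoring the tunnel/remotedialer peering problem entirely:

```go
package sketch

import (
	"io"
	"log"
	"net"
)

// proxySupervisorPort listens on the supervisor port (9345) locally, e.g. as a
// sidecar in an apiserver pod, and forwards each connection to a supervisor
// Service. The Service name below is hypothetical.
func proxySupervisorPort() error {
	ln, err := net.Listen("tcp", ":9345")
	if err != nil {
		return err
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			return err
		}
		go func(c net.Conn) {
			defer c.Close()
			upstream, err := net.Dial("tcp", "rke2-supervisor.rke2-bootstrap-system.svc:9345")
			if err != nil {
				log.Println("dial supervisor:", err)
				return
			}
			defer upstream.Close()
			go io.Copy(upstream, c) // client -> supervisor
			io.Copy(c, upstream)    // supervisor -> client
		}(conn)
	}
}
```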

@glennpratt
Author

glennpratt commented May 23, 2025

I think in my own trials that's addressed by a couple of things:

  • Things that expect to talk on loopback go in the same Pod.
  • Each Pod of an STS either gets host ports or its own IP that is visible to agents, so they can be addressed directly.

A LoadBalancer Service doesn't work if addressing a specific Pod matters.
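
For the "own IP visible to agents" case, one common way to make each StatefulSet Pod directly addressable (assuming the Pod network is reachable from agents) is a headless Service; the names here are hypothetical:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headlessService sketches a headless Service for an rke2-server StatefulSet,
// giving each pod a stable DNS name such as
// rke2-server-0.rke2-server.rke2-bootstrap-system.svc that agents could dial
// directly instead of going through a load-balanced VIP.
func headlessService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "rke2-server", Namespace: "rke2-bootstrap-system"},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: per-pod DNS records
			Selector:  map[string]string{"app": "rke2-server"},
			Ports: []corev1.ServicePort{
				{Name: "supervisor", Port: 9345},
				{Name: "apiserver", Port: 6443},
			},
		},
	}
}
```

The StatefulSet's serviceName would need to reference this Service for the per-pod DNS records to exist.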

@brandond
Member

brandond commented May 23, 2025

If things are going to be properly abstracted, they need to work behind a LoadBalancer or Ingress. I wouldn't personally settle for anything that required agents to be able to connect directly to one or more pods.

It is unfortunate that the apiserver only supports setting a single value for its advertise-address flag, and ensures that only the advertised addresses of running apiservers are present in the kubernetes Service endpoint list. This makes it hard to hide multiple apiservers behind a load balancer that may itself have a dynamic list of IPs.
