🌱 Kubeconfig-Based Provider #45
Conversation
…oviders. Remove extra unnecessary aspects like isReady, hashing, etc. Add script to create kubeconfig secrets using remote serviceaccounts.
…ime into kubeconfig-based-provider
…uster of secret data hasn't changed
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: FourFifthsCode. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
Welcome @FourFifthsCode!
4d8f5cb to 389fe13
closes #26
@sttts @embik @corentone does this address your concerns from #26?
* Adds a multi-cluster controller capable of watching remote clusters and reconciling their DNSRecord resources
* Uses https://github.com/kubernetes-sigs/multicluster-runtime and the kubeconfig provider from kubernetes-sigs/multicluster-runtime#45
Signed-off-by: Michael Nairn <mnairn@redhat.com>
Hi @FourFifthsCode, thanks for the PR, sorry for missing it. I'll assign myself to review it ASAP.
@embik no worries, thank you!
Sorry, some questions may come from my ignorance of the details of multicluster-runtime. It does feel like a few of the things you do could be handled by the runtime library instead of reimplemented here.
Command line options:
- `-c, --context`: Kubeconfig context to use (required)
- `--name`: Name for the secret (defaults to context name)
- `-n, --namespace`: Namespace to create the secret in (default: "default")
Do we really want to use the default namespace for credentials?
Maybe make it required?
I think the default namespace is OK, but I did change the default service account to use a different name; I don't think we should use "default" for that.
## How It Works

1. The kubeconfig provider watches for secrets with a specific label in a namespace
Shouldn't it always be a secret within the same namespace as the controller?
Namespace is configurable on the provider.
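For orientation, here is a rough sketch (not taken from the PR) of the kind of Secret the provider watches in its configured namespace. The label key and the "kubeconfig" data key are assumptions for illustration only, not necessarily the values this PR uses:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleKubeconfigSecret builds the kind of Secret the provider might watch.
// The label key and "kubeconfig" data key are illustrative assumptions.
func exampleKubeconfigSecret(name, namespace string, kubeconfig []byte) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace, // the namespace the provider is configured to watch
			Labels: map[string]string{
				// the provider selects secrets carrying a label like this one
				"example.com/kubeconfig": "true",
			},
		},
		Data: map[string][]byte{
			"kubeconfig": kubeconfig, // serialized kubeconfig for the remote cluster
		},
	}
}
```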
examples/kubeconfig/README.md (outdated)
- Creates a new controller-runtime cluster
- Makes the cluster available to your controllers
3. Your controllers can access any cluster through the manager
4. RBAC rules ensure your operator has the necessary permissions in each cluster
Can you clarify where those rules are? e.g. "RBAC rules on the remote clusters ensure the operator SA on the controller's cluster has the necessary permissions in the remote clusters."
updated :)
examples/kubeconfig/README.md (outdated)
- `-n, --namespace`: Namespace to create the secret in (default: "default")
- `-a, --service-account`: Service account name to use from the remote cluster (default: "default")

### 2. Customizing RBAC Rules
This step is not too clear to me. Who sets that? Also, what is the Binding?
Also, maybe we want to suggest a Role and not a ClusterRole?
Added ability to create a Role or ClusterRole in the script.
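To make the Role vs. ClusterRole discussion concrete, a minimal sketch of what a namespaced variant could look like; the resource names, namespace, and rules below are illustrative assumptions, not the script's actual output:

```go
package example

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// remoteRBAC sketches a namespaced Role and RoleBinding that could be created
// on a remote cluster for the service account whose token ends up in the
// kubeconfig secret.
func remoteRBAC(namespace, serviceAccount string) (*rbacv1.Role, *rbacv1.RoleBinding) {
	role := &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: "mc-operator", Namespace: namespace},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""},
			Resources: []string{"configmaps"},
			Verbs:     []string{"get", "list", "watch"},
		}},
	}
	binding := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "mc-operator", Namespace: namespace},
		Subjects: []rbacv1.Subject{{
			Kind:      "ServiceAccount",
			Name:      serviceAccount,
			Namespace: namespace,
		}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "Role",
			Name:     "mc-operator",
		},
	}
	return role, binding
}
```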
examples/kubeconfig/main.go (outdated)
}

// Create the provider first, then the manager with the provider
entryLog.Info("Creating provider")
Can we log providerOpts here?
great idea
I'm thinking of removing this log entry since it will be logged with options on provider.Run.
// handleSecret processes a secret containing kubeconfig data
func (p *Provider) handleSecret(ctx context.Context, secret *corev1.Secret, mgr mcmanager.Manager) error {
	if secret == nil {
this could be a SecretToActiveCluster() func
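For illustration, a rough sketch of what such a helper might look like; the function name, the "kubeconfig" data key, and the error handling are assumptions, not the PR's code:

```go
package kubeconfigsketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/cluster"
)

// secretToCluster sketches the suggested refactor: turn a kubeconfig secret
// into a cluster.Cluster in one place instead of inside handleSecret.
func secretToCluster(secret *corev1.Secret) (cluster.Cluster, error) {
	if secret == nil {
		return nil, fmt.Errorf("secret is nil")
	}
	raw, ok := secret.Data["kubeconfig"] // assumed data key
	if !ok {
		return nil, fmt.Errorf("secret %s/%s has no kubeconfig key", secret.Namespace, secret.Name)
	}
	cfg, err := clientcmd.RESTConfigFromKubeConfig(raw)
	if err != nil {
		return nil, fmt.Errorf("failed to parse kubeconfig: %w", err)
	}
	return cluster.New(cfg)
}
```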
providers/kubeconfig/provider.go (outdated)
}

log.Info("Cluster already exists, updating it")
if err := p.removeCluster(clusterName); err != nil {
Does the cluster need to be removed? Could it be dangerous? Because it's going to shut down and restart the whole controller for this cluster.
I don't think there is a clean way to update a cluster.Cluster (e.g. restart its underlying cache), and the change in the kubeconfig that is detected here could be anything (from a credentials refresh to a completely different server URL). Stopping controllers and re-starting them is probably the best way to make sure changes from the kubeconfig get reflected.
yeah, my initial feeling is that if a secret is deleted, I would expect to stop reconciling on the associated cluster without having to restart a pod to re-initialize. If that means restarting the controller I would be inclined to accept that trade-off. Are there any other implications or trade-offs I'm missing with controller restarts?
Assuming a kubeconfig is only establishing the pipe, is there a mode where we could swap the credentials without flushing the whole Cluster? We could have it assume everything is the same on the other end? (The side problem is that if the endpoint suddenly points to another cluster, it would be broken.)
@FourFifthsCode I do agree, if a secret is deleted we should NOT flush the whole MC-Controller. Clusters should be able to come and go.
One thing I'm slightly worried about with removing the cluster is if the MC-Controller relies on that signal. Imagine a controller that looks at service endpoints, for example: when the cluster is removed and added, there is a risk that the controller removes the endpoints globally? I wouldn't consider a creds refresh safe if that's possible.
@embik are we limited by the current cluster.Cluster definition at this point? Maybe we need our own lightweight Cluster?
> We could have it assume everything is the same on the other end? (The side problem is that if the endpoint suddenly points to another cluster, it would be broken.)

That's a big assumption I think we cannot guarantee, unfortunately. I would value correctness (i.e. temporarily drop the cluster from global endpoints) over potential breakage (the Kubernetes API endpoint changed and now the running controller is blind to changes because the cluster.Cluster and its active watches are totally broken).

> @embik are we limited by the current cluster.Cluster definition at this point? Maybe we need our own lightweight Cluster?

I don't think it's just cluster.Cluster; none of the controller-runtime types we build on are really meant to be restarted (other than "stop cluster, start cluster"). We can look at whether it would be possible for us to provide implementations of the interfaces that can be restarted, but I would maybe argue that is out of scope for this particular PR.
providers/kubeconfig/provider.go (outdated)
return fmt.Errorf("failed to create cluster: %w", err)
}

// Copy indexers to avoid holding lock.
Why is this necessary?
I feel this would be common to all providers. Should we move such logic into the main runtime code?
I feel that the provider should only have to do:
cl := cluster.New(config)
then mgr.Engage(cl)
CC @embik
This could well be an over-optimization, but the only reason for copying the array is to reduce deadlock risk by shortening how long the lock is held.
Anything the main runtime code can do so the provider doesn't need to implement it sounds great, as long as it doesn't impact the abstraction flexibility too much.
The part that worries me with the locks is that they are a bit scattered around: you have some in the top-level reconcile and some in the internal removeCluster.
I do have some love for the defer pattern:
xx.Lock()
defer xx.Unlock()
because I'm sure every error path is automatically covered, so there's no risk of a lock being left locked.
Also, I think it'd be good to have locking only at a single level, either in "inside funcs" or in "top level ones". It's purely a readability recommendation; we could either:
1. Lock more coarsely, for example only locking the top-level Reconcile and IndexAllFields:
   1.1 Reconcile would lock RO until it has made a decision on what to do, then lock RW to proceed with the full edit.
   1.2 IndexField would grab a RW lock for the entirety of its update of indexers.
2. Hide the locking in helper functions (GetCluster, SetCluster, RemoveCluster, GetIndexers, SetIndexers, RemoveIndexers) and make the copy there. It would make the code a bit less readable, but the lock would be tucked away (a sketch of this approach follows below).
Another thought: is there a risk of removing indexed fields on a cluster being deleted? I wonder if the granular locks could create some weird half in-between situations? (I locked for read but now I need to write and things have changed?)
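A rough sketch of option 2, assuming simplified Provider fields (illustrative only, not the PR's actual struct or method names):

```go
package locksketch

import (
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/cluster"
)

// index records an IndexField call so it can be replayed on newly added
// clusters (assumed bookkeeping; not the PR's exact types).
type index struct {
	obj       client.Object
	field     string
	extractor client.IndexerFunc
}

// Provider fields shown only as far as needed for this sketch.
type Provider struct {
	lock     sync.RWMutex
	clusters map[string]cluster.Cluster
	indexers []index
}

// The accessors below keep every Lock/Unlock pair inside one small function
// using the defer pattern, so callers never touch the mutex directly.

func (p *Provider) getCluster(name string) (cluster.Cluster, bool) {
	p.lock.RLock()
	defer p.lock.RUnlock()
	cl, ok := p.clusters[name]
	return cl, ok
}

func (p *Provider) setCluster(name string, cl cluster.Cluster) {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.clusters[name] = cl
}

func (p *Provider) removeClusterEntry(name string) {
	p.lock.Lock()
	defer p.lock.Unlock()
	delete(p.clusters, name)
}

// snapshotIndexers copies the slice under the read lock so callers can
// iterate over it without holding the lock.
func (p *Provider) snapshotIndexers() []index {
	p.lock.RLock()
	defer p.lock.RUnlock()
	return append([]index(nil), p.indexers...)
}
```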
providers/kubeconfig/provider.go (outdated)
log.Info("Successfully added cluster")

// Engage the manager if provided
if mgr != nil {
Can the manager be nil? If the manager is nil, shouldn't it be an error?
Theoretically speaking, you could just start the provider and call Get on it whenever you want to fetch a cluster. It would then not interact with the wider mc-runtime primitives, but it would still be functional in its own way. I'm inclined to allow such a use case for flexibility (maybe you want to use the provider in something other than a typical mc-runtime controller setup).
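An illustrative sketch of that standalone use, under the assumption that the provider exposes a New constructor, an Options struct, a Run(ctx, mgr) method that tolerates a nil manager, and a Get method; the import path and all signatures here are assumptions based on the discussion, not the PR's exact API:

```go
package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"

	// Assumed import path for the provider added in this PR.
	kubeconfigprovider "sigs.k8s.io/multicluster-runtime/providers/kubeconfig"
)

func main() {
	ctx := ctrl.SetupSignalHandler()

	// Assumed constructor and options; see the PR for the real signature.
	provider := kubeconfigprovider.New(kubeconfigprovider.Options{Namespace: "default"})

	go func() {
		// Run without a multicluster manager (nil): the provider still watches
		// kubeconfig secrets and tracks clusters on its own.
		if err := provider.Run(ctx, nil); err != nil {
			log.Fatal(err)
		}
	}()

	// Fetch a tracked cluster directly whenever needed.
	cl, err := provider.Get(ctx, "my-remote-cluster")
	if err != nil {
		log.Fatal(err)
	}
	_ = cl.GetClient() // a plain controller-runtime client for that cluster
}
```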
What are the use cases for providers not in the MC-runtime context? And why would they be in the MC-runtime repo?
Could we maybe have this provider (maybe others too?) use standard controller-runtime code then?
The whole provider here really looks like a controller (predicate, HandlerFunc, client and informer cache).
That's a very interesting point, I wouldn't be opposed to using controller runtime code at all if it covers most of the use case. I think I'll give that a try and see how far I get.
What I meant was that you can use an mc-runtime provider (that usually would be used for mc-runtime controllers) without the full mc-runtime machinery if you have a problem to solve that mc-runtime is perhaps not flexible enough for. We were close to using that approach in kcp-dev/api-syncagent, and I'm pretty confident others might also hit limitations of the mcManager in mc-runtime while wanting to build on top of the primitives offered in mc-runtime and controller-runtime.
It's not meant to say that some providers should be used without mc-runtime, but it gives implementers the flexibility to use them in other scenarios.
I don't think such providers can and/or should be implemented in standard controller-runtime code then; there isn't much of a difference here (providers don't use too much mc-runtime specific code anyway).
// Start the cluster
go func() {
	if err := cl.Start(clusterCtx); err != nil {
@embik why are the providers starting the clusters? Could it be made the responsibility of the runtime?
Then a provider could focus on: cl = cluster.New; mgr.Engage.
Then cl.Delete or cl.Update?
As hinted at in https://github.com/kubernetes-sigs/multicluster-runtime/pull/45/files#r2149539056, this makes providers functional in their own right if you want to use them outside of an mcmanager.
In addition, some provider implementations might be constructing cluster.Cluster in a special way. The kcp provider, for example, creates something called a scopedCluster that provides restricted access to a larger cache, and such a scoped cluster cannot be started: https://github.com/kcp-dev/multicluster-provider/blob/9c638a4a0047f86564e00bbbd527c386bfabc378/apiexport/cluster.go#L134
@sttts any further thoughts on this maybe?
…etInformer from struct Also removed kubeconfig provider go.mod to simplify testing with `make test` Disabled metrics server with manager options Signed-off-by: Codey Jenkins <FourFifthsCode@users.noreply.github.com>
… watching secrets Signed-off-by: Codey Jenkins <FourFifthsCode@users.noreply.github.com>
@corentone, @embik I converted the provider to use a controller instead of an informer as suggested... great idea! This also helps reduce some of the concurrency complexity from before and moves it to the controller-runtime logic instead. Now we don't need an additional goroutine to run the provider. Curious what you all think; I can always go back to the informer if you all feel better about that.
thank you for making it a controller, it seems that made your life a lot easier!
I think my only question now is around locking; I think you covered all cases but I was slightly worried about readability. I think this piece is not really performance critical so we may be able to afford slightly coarser locks to improve readability and reduce bug potential?
Yeah, I like the defer pattern a lot too on the locks for readability, which we were mostly using before. I ran a code analysis tool on it to look for race/deadlock conditions and it made most of those suggestions. I don't have a huge opinion either way, but that was the reasoning behind them.
I like the idea of refactoring into methods that handle more of the CRUD logic around cluster lifecycle; I'll take a shot at refactoring with that and see how it goes. This also feels a little more toward single responsibility as well. Probably worth a little more verbosity.
…ller responsibilities. Improved readability in lock handling. Added more tests around kubeconfigs updating. Signed-off-by: Codey Jenkins <FourFifthsCode@users.noreply.github.com>
Refactored the controller logic into smaller functions. I do like this direction better! Also added a couple more tests around the kubeconfig changing (update with new config, update with same config). Removed the copy logic for now, as I think it was an over-optimization, and removing it avoids those half in-between situations.
}

// removeCluster removes a cluster by name with write lock and cleanup
func (p *Provider) removeCluster(clusterName string) {
This method looks fine, two small questions:
1. Does the order between removal from clusters and cancellation matter? Can stoppage fail or something?
2. Should the remove wait for the cluster to fully stop? (Should there be some channel in the start method to wait on, to know it's fully returned?) I'm not actually sure what "Start" on the Cluster actually does, so it likely doesn't matter.
Looks like Start on the Cluster is mostly involved with starting the cache. I explored possibly using the 'waitForCacheSync' methods on the cache, and followed it all the way back to the informer cache's waitForStarted func, and can see that it is also just using the context to watch for cancels. So I don't think there is currently any mechanism that allows us to wait; we can only trigger the cancel and report that the cluster no longer exists when fetched.
> So I don't think there is currently any mechanism that allows us to wait; we can only trigger the cancel and report that the cluster no longer exists when fetched.

That's correct as far as I can say; we also added #32 to better address this.
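For illustration of the "wait for the cluster to fully stop" idea discussed above, a hypothetical sketch (not what the PR does) of tracking when Start returns via a done channel; all names here are assumptions:

```go
package clustersketch

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/cluster"
)

// activeCluster tracks a started cluster plus the signals needed to stop it
// and to observe that Start has actually returned.
type activeCluster struct {
	cluster cluster.Cluster
	cancel  context.CancelFunc
	done    chan struct{} // closed once Start returns
}

func startCluster(ctx context.Context, cl cluster.Cluster) *activeCluster {
	clusterCtx, cancel := context.WithCancel(ctx)
	ac := &activeCluster{cluster: cl, cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(ac.done)
		// Start blocks until clusterCtx is cancelled (it mainly runs the cache).
		_ = cl.Start(clusterCtx)
	}()
	return ac
}

// stopAndWait cancels the cluster context and waits, up to timeout, for
// Start to return before the cluster is reported as gone.
func (ac *activeCluster) stopAndWait(timeout time.Duration) {
	ac.cancel()
	select {
	case <-ac.done:
	case <-time.After(timeout):
	}
}
```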
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the main place the cluster is fetched looks like it's through the multicluster manager's GetCluster func, which means that if the cluster doesn't exist, the consumer needs to handle that case. I did see that I'm not returning ErrClusterNotFound; I updated Get() to return that type.
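A minimal sketch of the Get behaviour described above, with simplified Provider fields; the import path for ErrClusterNotFound and the struct layout are assumptions, not the PR's exact code:

```go
package providersketch

import (
	"context"
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/cluster"
	// Assumed import path for the ErrClusterNotFound sentinel.
	"sigs.k8s.io/multicluster-runtime/pkg/multicluster"
)

// Provider fields shown only as far as needed for this sketch.
type Provider struct {
	lock     sync.RWMutex
	clusters map[string]cluster.Cluster
}

// Get returns a previously engaged cluster, or ErrClusterNotFound so that
// consumers of mcmanager.GetCluster can handle removed clusters explicitly.
func (p *Provider) Get(_ context.Context, clusterName string) (cluster.Cluster, error) {
	p.lock.RLock()
	defer p.lock.RUnlock()
	cl, ok := p.clusters[clusterName]
	if !ok {
		return nil, multicluster.ErrClusterNotFound
	}
	return cl, nil
}
```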
…when cluster isn't found Signed-off-by: Codey Jenkins <FourFifthsCode@users.noreply.github.com>
…ice account creation Signed-off-by: Codey Jenkins <FourFifthsCode@users.noreply.github.com>
Updated the script/README to include RBAC and service account creation. Hopefully this helps clarify usage and how things work in better detail.
Follow-up to @christensenjairus's PR.
Kubeconfig-based multi-cluster provider. It reads secrets from the operator's namespace with a given label and engages the clusters that the secrets apply to.
This works in real time for engaging and disengaging clusters, as the provider watches all secrets in the operator's namespace, or can be configured to watch a specific namespace.
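For context, a rough sketch of how a provider like this might be wired into a multicluster-runtime manager; the provider's import path, constructor, options, and Run signature are assumptions for illustration, and the mcmanager call is approximate rather than a definitive usage of the PR's API:

```go
package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"

	mcmanager "sigs.k8s.io/multicluster-runtime/pkg/manager"
	// Assumed import path and constructor for the provider in this PR.
	kubeconfigprovider "sigs.k8s.io/multicluster-runtime/providers/kubeconfig"
)

func main() {
	ctx := ctrl.SetupSignalHandler()
	cfg := ctrl.GetConfigOrDie()

	// Assumed options: watch kubeconfig secrets in the operator's namespace.
	provider := kubeconfigprovider.New(kubeconfigprovider.Options{Namespace: "operator-system"})

	// The multicluster manager engages and disengages clusters as the
	// provider reports them.
	mgr, err := mcmanager.New(cfg, provider, manager.Options{})
	if err != nil {
		log.Fatal(err)
	}

	// Multicluster-aware controllers would be registered with mgr here
	// (e.g. via the mc-runtime builder) before starting everything.

	go func() {
		if err := provider.Run(ctx, mgr); err != nil {
			log.Fatal(err)
		}
	}()
	if err := mgr.Start(ctx); err != nil {
		log.Fatal(err)
	}
}
```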
Related issue: #22
Original PR: #26