🐛 Fix flaky Helm installations by separating provider CRs from operator deployment #832
This commit splits the cluster-api-operator Helm chart into two separate charts to resolve flaky installations caused by webhook validation timing issues.

Problem:
- Provider Custom Resources (CoreProvider, BootstrapProvider, etc.) were applied at the same time as the operator deployment.
- The webhook service was not yet ready, leading to validation errors: "no endpoints available for service 'capi-operator-webhook-service'".

Solution:
- Create two charts:
  1. cluster-api-operator: contains only the operator deployment and its resources.
  2. cluster-api-operator-providers: contains all provider Custom Resources.
- Installing the operator first allows the webhook to start before provider CRs are applied.

Installation now requires:
1. Install operator:
   helm install capi-operator capi-operator/cluster-api-operator \
     --create-namespace -n capi-operator-system --wait --timeout 90s
2. Install providers:
   helm install capi-providers \
     capi-operator/cluster-api-operator-providers \
     -n capi-operator-system \
     --set infrastructure.docker.enabled=true \
     --set cert-manager.enabled=true \
     --set configSecret.name=${CREDENTIALS_SECRET_NAME} \
     --set configSecret.namespace=${CREDENTIALS_SECRET_NAMESPACE}

Fixes: kubernetes-sigs#534

Signed-off-by: kahirokunn <okinakahiro@gmail.com>
Hi @dtzar and @furkatgofurov7,

When you have a moment, could you please review?

This update splits the chart into operator and provider components, ensuring the operator’s webhook is fully ready before provider CRs are applied. As a result, the

Also, since this PR would result in a breaking change, I think it would be best to either create a new helm chart and gradually migrate to it, or continue to use both. I would appreciate your feedback on this as well.

Thank you for your time and feedback.

Best regards,
Add a new GitHub Actions workflow for smoke testing. Signed-off-by: kahirokunn <okinakahiro@gmail.com>
Keywords that can automatically close issues, as well as at (@) or hashtag (#) mentions, are not allowed in commit messages.
Fixes: #534
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes the flaky Helm installation issue where provider Custom Resources fail to install due to webhook validation errors. The root cause is that provider CRs are being applied at the same time as the operator deployment, before the webhook service is ready.
Problem:
When installing the cluster-api-operator Helm chart, users frequently encounter validation errors such as "no endpoints available for service 'capi-operator-webhook-service'".
Solution:
Split the Helm chart into two separate charts:
1. cluster-api-operator - Contains only the operator deployment and its resources
2. cluster-api-operator-providers - Contains all provider Custom Resources

This ensures the operator is fully running before any provider CRs are applied.
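As an optional extra check between the two chart installs, the readiness of the webhook can be polled directly. The following is only a sketch: the function names `webhook_endpoint_ips` and `wait_for_webhook` are hypothetical, the retry defaults are arbitrary, and it assumes that `kubectl` is configured for the target cluster and that the `capi-operator-webhook-service` Service lives in the `capi-operator-system` namespace.

```shell
#!/bin/sh
# Sketch: poll until the webhook Service has at least one endpoint address.
# Assumptions (not part of this PR): kubectl points at the target cluster and
# the Service is in the capi-operator-system namespace.

# Print the IPs currently backing the webhook Service (empty if none).
webhook_endpoint_ips() {
  kubectl get endpoints capi-operator-webhook-service \
    -n capi-operator-system \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null
}

# wait_for_webhook [attempts] [delay_seconds]
# Returns 0 as soon as an endpoint address appears, 1 if attempts run out.
wait_for_webhook() {
  attempts="${1:-30}"
  delay="${2:-3}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ -n "$(webhook_endpoint_ips)" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A caller could run `wait_for_webhook 30 3` after installing the operator chart and install the providers chart only if it succeeds. With `--wait` on the operator install this may be redundant, so treat it as a diagnostic aid rather than a required step.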
Which issue(s) this PR fixes:
Fixes #534
Special notes for your reviewer:
The installation process now requires two steps:
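The two steps listed in the commit message can be sketched as a small script. The `helm` arguments are the ones given in the commit message; the `run` helper, the `DRY_RUN` switch, and the fallback values for the two credential variables are assumptions added purely for illustration.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default in this sketch) only prints each command;
# set DRY_RUN=0 to actually invoke helm.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: install the operator and wait until it is ready.
install_operator() {
  run helm install capi-operator capi-operator/cluster-api-operator \
    --create-namespace -n capi-operator-system --wait --timeout 90s
}

# Step 2: install the provider Custom Resources.
# The fallback values below are hypothetical placeholders, not project defaults.
install_providers() {
  run helm install capi-providers \
    capi-operator/cluster-api-operator-providers \
    -n capi-operator-system \
    --set infrastructure.docker.enabled=true \
    --set cert-manager.enabled=true \
    --set "configSecret.name=${CREDENTIALS_SECRET_NAME:-example-credentials}" \
    --set "configSecret.namespace=${CREDENTIALS_SECRET_NAMESPACE:-default}"
}

install_operator
install_providers
```

Because step 1 uses `--wait`, Helm returns only after the operator release's resources report ready, which is what lets the provider CRs in step 2 pass webhook validation.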
Release note: