-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat(infra): Adding new KEDA Helm templates #5289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR introduces KEDA (Kubernetes Event Driven Autoscaling) Helm templates to enable advanced autoscaling capabilities for various Onyx components. KEDA provides more sophisticated scaling triggers compared to the existing Kubernetes HPA, supporting event-driven scaling based on metrics like Redis queue depth and custom Prometheus metrics.
The changes add 7 new KEDA ScaledObject templates:
- API Server & Web Server: CPU-based scaling similar to existing HPA but with KEDA's enhanced capabilities
- Slackbot: CPU utilization-based scaling for the Slack integration component
- Model Server: Multi-type scaling template supporting different model server configurations
- Celery Workers: Three specialized templates for different worker types (common, docprocessing, docfetching) that scale based on Redis queue depth metrics via Prometheus
Each template includes conditional rendering ({{- if .Values.keda.enabled }}
) to prevent conflicts with existing HPA setups. The Celery worker templates are particularly sophisticated, using Prometheus queries to monitor Redis queue sizes and implementing different scaling strategies - aggressive scaling for docprocessing (4x up when queue > 20) and moderate scaling for docfetching (2x up when queue > 5). The Chart.yaml version is bumped from 0.2.9 to 0.2.10 to reflect this new feature addition.
These templates integrate with the existing Onyx infrastructure by targeting the same deployments that current HPA configurations manage, but offer more granular control over scaling behavior and support for custom metrics beyond CPU/memory utilization.
Confidence score: 1/5
- This PR has critical implementation issues that will prevent successful deployment and require immediate attention before merging
- Score reflects missing KEDA dependency in Chart.yaml, logical bugs in trigger conditions, and undefined configuration values that will cause template rendering failures
- Pay close attention to all KEDA template files, Chart.yaml dependencies section, and verify that corresponding values.yaml configurations exist
8 files reviewed, 10 comments
scaleTargetRef: | ||
apiVersion: apps/v1 | ||
kind: Deployment | ||
name: {{ include "onyx-stack.fullname" . }}-web-server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Deployment name inconsistency: This references -web-server
but the actual deployment uses -webserver
(without hyphen). This will cause the ScaledObject to fail finding its target.
@@ -0,0 +1,24 @@ | |||
{{- if and .Values.keda.enabled .Values.keda.webServer .Values.keda.webServer.enabled }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Risk of conflict: Both KEDA ScaledObject and existing HPA (webserver.autoscaling.enabled) can target the same deployment simultaneously, causing scaling conflicts. Consider adding mutual exclusion logic.
minReplicaCount: {{ $workerConfig.minReplicas | default 1 }} | ||
maxReplicaCount: {{ $workerConfig.maxReplicas | default 10 }} | ||
triggers: | ||
{{- if $workerConfig.triggers }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Logic bug: This condition prevents default triggers from being applied. Should be {{- if not $workerConfig.triggers }}
since the comment indicates this is the default trigger when none are specified.
{{- if $workerConfig.triggers }} | |
{{- if not $workerConfig.triggers }} |
@@ -0,0 +1,24 @@ | |||
{{- if and .Values.keda.enabled .Values.keda.apiServer .Values.keda.apiServer.enabled }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The triple conditional check will fail if any of the nested values don't exist. Consider adding the corresponding KEDA configuration section to values.yaml to prevent template rendering failures.
triggers: | ||
- type: cpu | ||
metadata: | ||
type: Utilization | ||
value: {{ .Values.keda.apiServer.cpuThreshold | default "70" | quote }} | ||
{{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: KEDA ScaledObjects and HPA cannot coexist on the same deployment target. If both api.autoscaling.enabled and keda.apiServer.enabled are true, this will cause conflicts. Consider adding mutual exclusion logic.
minReplicaCount: {{ .Values.keda.celeryWorkers.docprocessing.minReplicas | default 1 }} | ||
maxReplicaCount: {{ .Values.keda.celeryWorkers.docprocessing.maxReplicas | default 50 }} | ||
triggers: | ||
{{- if .Values.keda.celeryWorkers.docprocessing.triggers }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Logic error: triggers section only renders when .Values.keda.celeryWorkers.docprocessing.triggers
is truthy, but comment suggests this provides 'default' behavior. Should this condition be reversed (when triggers are NOT specified) to provide defaults?
{{- if .Values.keda.celeryWorkers.docprocessing.triggers }} | |
{{- if not .Values.keda.celeryWorkers.docprocessing.triggers }} |
{{- if .Values.keda.slackbot.triggers }} | ||
- type: cpu | ||
metadata: | ||
type: Utilization | ||
value: {{ .Values.keda.slackbot.cpuThreshold | default "70" | quote }} | ||
{{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Logic issue: CPU trigger is only applied when .Values.keda.slackbot.triggers
is truthy, but the trigger configuration itself is hardcoded. This means if triggers is false/undefined, no scaling will occur. Consider making the trigger conditional on cpuThreshold existence instead.
scaleTargetRef: | ||
apiVersion: apps/v1 | ||
kind: Deployment | ||
name: {{ include "onyx-stack.fullname" . }}-slackbot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Deployment name inconsistency: This targets {{ include "onyx-stack.fullname" . }}-slackbot
but existing HPA templates (api-hpa.yaml) target {{ include "onyx-stack.fullname" . }}
. Verify the actual slackbot deployment name to ensure correct targeting.
minReplicaCount: {{ .Values.keda.celeryWorkers.docfetching.minReplicas | default 1 }} | ||
maxReplicaCount: {{ .Values.keda.celeryWorkers.docfetching.maxReplicas | default 10 }} | ||
triggers: | ||
{{- if .Values.keda.celeryWorkers.docfetching.triggers }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Logic error: the condition checks if triggers are defined, but the comment says 'if none specified'. This will only apply the default Prometheus trigger when custom triggers are already configured, which is backwards.
{{- if .Values.keda.celeryWorkers.docfetching.triggers }} | |
{{- if not .Values.keda.celeryWorkers.docfetching.triggers }} |
{{- if $serverConfig.triggers }} | ||
- type: cpu | ||
metadata: | ||
type: Utilization | ||
value: {{ $serverConfig.cpuThreshold | default "70" | quote }} | ||
{{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: CPU trigger logic is inverted - triggers should be created by default with an option to disable them. Currently CPU scaling only works when triggers
is explicitly set to true.
{{- if $serverConfig.triggers }} | |
- type: cpu | |
metadata: | |
type: Utilization | |
value: {{ $serverConfig.cpuThreshold | default "70" | quote }} | |
{{- end }} | |
{{- if not (eq $serverConfig.triggers false) }} | |
- type: cpu | |
metadata: | |
type: Utilization | |
value: {{ $serverConfig.cpuThreshold | default "70" | quote }} | |
{{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
15 issues found across 8 files
React with 👍 or 👎 to teach cubic. You can also tag @cubic-dev-ai
to give feedback, ask questions, or re-run the review.
# Returns 4 when queue depth > 20, 0.25 when <= 20 | ||
# This creates a clear scaling decision boundary for high-volume processing | ||
( | ||
(sum(redis_key_size{key=~"connector_docprocessing.*"}) > 20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PromQL comparison lacks 'bool', returning the metric value instead of 1/0; this makes the query scale by 4×queue depth rather than a constant factor.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/celery-worker-docprocessing-scaledobject.yaml at line 37:
<comment>PromQL comparison lacks 'bool', returning the metric value instead of 1/0; this makes the query scale by 4×queue depth rather than a constant factor.</comment>
<file context>
@@ -0,0 +1,46 @@
+ # Returns 4 when queue depth > 20, 0.25 when <= 20
+ # This creates a clear scaling decision boundary for high-volume processing
+ (
+ (sum(redis_key_size{key=~"connector_docprocessing.*"}) > 20)
+ * 4
+ )
</file context>
minReplicaCount: {{ .Values.keda.celeryWorkers.docprocessing.minReplicas | default 1 }} | ||
maxReplicaCount: {{ .Values.keda.celeryWorkers.docprocessing.maxReplicas | default 50 }} | ||
triggers: | ||
{{- if .Values.keda.celeryWorkers.docprocessing.triggers }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default trigger condition is inverted: defaults render when custom triggers exist, and nothing renders when none are provided. Flip the condition to match the comment/intent.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/celery-worker-docprocessing-scaledobject.yaml at line 20:
<comment>Default trigger condition is inverted: defaults render when custom triggers exist, and nothing renders when none are provided. Flip the condition to match the comment/intent.</comment>
<file context>
@@ -0,0 +1,46 @@
+ minReplicaCount: {{ .Values.keda.celeryWorkers.docprocessing.minReplicas | default 1 }}
+ maxReplicaCount: {{ .Values.keda.celeryWorkers.docprocessing.maxReplicas | default 50 }}
+ triggers:
+ {{- if .Values.keda.celeryWorkers.docprocessing.triggers }}
+ # Default Prometheus-based trigger for Redis queue depth if none specified
+ # Scaling Logic:
</file context>
apiVersion: apps/v1 | ||
kind: Deployment | ||
name: {{ include "onyx-stack.fullname" . }}-api-server | ||
pollingInterval: {{ .Values.keda.apiServer.pollingInterval | default 30 }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using default here overrides an explicit value of 0 for pollingInterval by falling back to 30; prefer preserving user-provided zero values.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/api-server-scaledobject.yaml at line 15:
<comment>Using default here overrides an explicit value of 0 for pollingInterval by falling back to 30; prefer preserving user-provided zero values.</comment>
<file context>
@@ -0,0 +1,24 @@
+ apiVersion: apps/v1
+ kind: Deployment
+ name: {{ include "onyx-stack.fullname" . }}-api-server
+ pollingInterval: {{ .Values.keda.apiServer.pollingInterval | default 30 }}
+ cooldownPeriod: {{ .Values.keda.apiServer.cooldownPeriod | default 300 }}
+ minReplicaCount: {{ .Values.keda.apiServer.minReplicas | default 1 }}
</file context>
kind: Deployment | ||
name: {{ include "onyx-stack.fullname" . }}-api-server | ||
pollingInterval: {{ .Values.keda.apiServer.pollingInterval | default 30 }} | ||
cooldownPeriod: {{ .Values.keda.apiServer.cooldownPeriod | default 300 }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using default here overrides an explicit value of 0 for cooldownPeriod, changing semantics (e.g., immediate scale-down) by forcing 300 instead.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/api-server-scaledobject.yaml at line 16:
<comment>Using default here overrides an explicit value of 0 for cooldownPeriod, changing semantics (e.g., immediate scale-down) by forcing 300 instead.</comment>
<file context>
@@ -0,0 +1,24 @@
+ kind: Deployment
+ name: {{ include "onyx-stack.fullname" . }}-api-server
+ pollingInterval: {{ .Values.keda.apiServer.pollingInterval | default 30 }}
+ cooldownPeriod: {{ .Values.keda.apiServer.cooldownPeriod | default 300 }}
+ minReplicaCount: {{ .Values.keda.apiServer.minReplicas | default 1 }}
+ maxReplicaCount: {{ .Values.keda.apiServer.maxReplicas | default 10 }}
</file context>
name: {{ include "onyx-stack.fullname" . }}-api-server | ||
pollingInterval: {{ .Values.keda.apiServer.pollingInterval | default 30 }} | ||
cooldownPeriod: {{ .Values.keda.apiServer.cooldownPeriod | default 300 }} | ||
minReplicaCount: {{ .Values.keda.apiServer.minReplicas | default 1 }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using default here prevents minReplicaCount from being set to 0 (scale-to-zero), because 0 is considered empty and falls back to 1. Prefer checking key presence to preserve 0.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/api-server-scaledobject.yaml at line 17:
<comment>Using default here prevents minReplicaCount from being set to 0 (scale-to-zero), because 0 is considered empty and falls back to 1. Prefer checking key presence to preserve 0.</comment>
<file context>
@@ -0,0 +1,24 @@
+ name: {{ include "onyx-stack.fullname" . }}-api-server
+ pollingInterval: {{ .Values.keda.apiServer.pollingInterval | default 30 }}
+ cooldownPeriod: {{ .Values.keda.apiServer.cooldownPeriod | default 300 }}
+ minReplicaCount: {{ .Values.keda.apiServer.minReplicas | default 1 }}
+ maxReplicaCount: {{ .Values.keda.apiServer.maxReplicas | default 10 }}
+ triggers:
</file context>
minReplicaCount: {{ .Values.keda.slackbot.minReplicas | default 1 }} | ||
maxReplicaCount: {{ .Values.keda.slackbot.maxReplicas | default 3 }} | ||
triggers: | ||
{{- if .Values.keda.slackbot.triggers }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The triggers list is conditionally rendered and becomes empty/null when no custom triggers are provided, producing an invalid ScaledObject; invert the condition so the default CPU trigger is used when custom triggers are absent.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/slackbot-scaledobject.yaml at line 22:
<comment>The triggers list is conditionally rendered and becomes empty/null when no custom triggers are provided, producing an invalid ScaledObject; invert the condition so the default CPU trigger is used when custom triggers are absent.</comment>
<file context>
@@ -0,0 +1,28 @@
+ minReplicaCount: {{ .Values.keda.slackbot.minReplicas | default 1 }}
+ maxReplicaCount: {{ .Values.keda.slackbot.maxReplicas | default 3 }}
+ triggers:
+ {{- if .Values.keda.slackbot.triggers }}
+ - type: cpu
+ metadata:
</file context>
@@ -0,0 +1,24 @@ | |||
{{- if and .Values.keda.enabled .Values.keda.webServer .Values.keda.webServer.enabled }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard the ScaledObject so it does not render when the webserver HPA is enabled to prevent conflicting autoscalers on the same Deployment.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/web-server-scaledobject.yaml at line 1:
<comment>Guard the ScaledObject so it does not render when the webserver HPA is enabled to prevent conflicting autoscalers on the same Deployment.</comment>
<file context>
@@ -0,0 +1,24 @@
+{{- if and .Values.keda.enabled .Values.keda.webServer .Values.keda.webServer.enabled }}
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
</file context>
minReplicaCount: {{ $serverConfig.minReplicas | default 1 }} | ||
maxReplicaCount: {{ $serverConfig.maxReplicas | default 5 }} | ||
triggers: | ||
{{- if $serverConfig.triggers }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The triggers block conditionally renders the default CPU trigger and leaves 'triggers:' empty otherwise, leading to invalid YAML and ignoring user-provided triggers.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/model-server-scaledobject.yaml at line 23:
<comment>The triggers block conditionally renders the default CPU trigger and leaves 'triggers:' empty otherwise, leading to invalid YAML and ignoring user-provided triggers.</comment>
<file context>
@@ -0,0 +1,31 @@
+ minReplicaCount: {{ $serverConfig.minReplicas | default 1 }}
+ maxReplicaCount: {{ $serverConfig.maxReplicas | default 5 }}
+ triggers:
+ {{- if $serverConfig.triggers }}
+ - type: cpu
+ metadata:
</file context>
scaleTargetRef: | ||
apiVersion: apps/v1 | ||
kind: Deployment | ||
name: {{ include "onyx-stack.fullname" $ }}-{{ $serverType }}-model-server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ScaledObject scaleTargetRef.name points to a non-existent Deployment; use the actual deployment name suffix "-model" instead of "-model-server".
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/model-server-scaledobject.yaml at line 17:
<comment>ScaledObject scaleTargetRef.name points to a non-existent Deployment; use the actual deployment name suffix "-model" instead of "-model-server".</comment>
<file context>
@@ -0,0 +1,31 @@
+ scaleTargetRef:
+ apiVersion: apps/v1
+ kind: Deployment
+ name: {{ include "onyx-stack.fullname" $ }}-{{ $serverType }}-model-server
+ pollingInterval: {{ $serverConfig.pollingInterval | default 30 }}
+ cooldownPeriod: {{ $serverConfig.cooldownPeriod | default 300 }}
</file context>
minReplicaCount: {{ .Values.keda.celeryWorkers.docfetching.minReplicas | default 1 }} | ||
maxReplicaCount: {{ .Values.keda.celeryWorkers.docfetching.maxReplicas | default 10 }} | ||
triggers: | ||
{{- if .Values.keda.celeryWorkers.docfetching.triggers }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default trigger is rendered only when custom triggers are set; invert the condition to avoid producing an empty triggers list when none are provided.
Prompt for AI agents
Address the following comment on deployment/helm/charts/onyx/templates/keda/celery-worker-docfetching-scaledobject.yaml at line 20:
<comment>Default trigger is rendered only when custom triggers are set; invert the condition to avoid producing an empty triggers list when none are provided.</comment>
<file context>
@@ -0,0 +1,46 @@
+ minReplicaCount: {{ .Values.keda.celeryWorkers.docfetching.minReplicas | default 1 }}
+ maxReplicaCount: {{ .Values.keda.celeryWorkers.docfetching.maxReplicas | default 10 }}
+ triggers:
+ {{- if .Values.keda.celeryWorkers.docfetching.triggers }}
+ # Default Prometheus-based trigger for Redis queue depth if none specified
+ # Scaling Logic:
</file context>
Description
[Provide a brief description of the changes in this PR]
Creating KEDA Helm templates to introduce KEDA to our clusters.
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Ran helm template commands locally to ensure that this is valid template code to get this to work.
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.