ci: split chart-testing CI into per-chart matrix jobs [RHIDP-14729] by rm3l · Pull Request #437 · redhat-developer/rhdh-chart

rm3l · 2026-06-16T14:12:55Z

Description of the change

Currently, both the PR (test.yaml) and nightly (nightly.yaml) workflows run ct install against all applicable charts in a single job. When a failure occurs:

It is difficult to determine which chart caused the failure without scrolling through lengthy logs.
A failure in one chart blocks visibility into the results of other charts.
Re-running the workflow re-tests all charts, not just the one that failed.

Example: https://github.com/redhat-developer/rhdh-chart/actions/runs/27310055619/job/80677740243

This PR converts both workflows into dynamic per-chart matrix jobs so that failures are immediately attributable to a specific chart.
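As a rough illustration of what a dynamic per-chart matrix involves, the sketch below turns a list of chart directories into a JSON matrix that a discovery job could publish for a downstream test job. It is not taken from this PR: the function name `build_matrix`, the `chart`/`path` keys, and the `example-chart` directory are all hypothetical, and the real workflows may well build the matrix in shell instead.

```python
import json


def build_matrix(chart_dirs):
    """Turn chart directories into a GitHub Actions style matrix definition.

    Hypothetical helper for illustration only; key names are assumptions.
    """
    # Normalise trailing slashes and drop duplicates before sorting.
    paths = sorted({d.rstrip("/") for d in chart_dirs})
    include = []
    for path in paths:
        # Use the bare chart name (without the 'charts/' prefix) so that the
        # job name shown in the UI is short, e.g. "Test (backstage)".
        name = path.split("/")[-1]
        include.append({"chart": name, "path": path})
    return {"include": include}


# 'charts/example-chart' is an invented directory used only for this sketch.
matrix = build_matrix(["charts/backstage", "charts/example-chart"])

# A discovery job could append a line like this to the file named by
# $GITHUB_OUTPUT so that a later job can expand it with fromJSON().
print(f"matrix={json.dumps(matrix)}")
```

With one matrix entry per chart, each chart gets its own job, so a failure shows up against a single named job rather than inside one long combined log.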

This will help with the new standalone chart that will be added soon.

Which issue(s) does this PR fix or relate to

RHIDP-14729

How to test changes / Special notes to the reviewer

Example workflows:

PR
Nightly

Checklist

N/A — this PR only changes CI workflows and config, no chart code was modified.

When a chart test fails in CI, the single-job approach makes it hard to tell which chart broke — you have to scroll through lengthy logs to find the culprit. A failure in one chart also blocks visibility into the results of the others, and re-running re-tests everything. By giving each chart its own matrix job, failures are immediately attributable from the job name in the GitHub Actions UI, unrelated charts keep running, and only the broken chart needs to be re-run. Assisted-by: Claude

Helm operations (test, uninstall) on the main branch consistently hit the 500s timeout ceiling while completing in seconds on release branches. Helm produces zero output during these waits, making it impossible to determine what it's blocking on. Adding --debug to helm-extra-args will surface Helm-level details (hooks, resource waits, etc.) to help diagnose the root cause. Assisted-by: Claude

Helm waits silently during install/test/uninstall operations, producing no output for up to 500s. This makes it impossible to see what the cluster is doing during those stalls. Add a background loop that prints pod status and recent events every 30s while ct install runs, giving visibility into what Kubernetes resources are stuck or pending. Assisted-by: Claude

Background processes die when their parent step exits, so the monitoring loop from a separate step only ran once. Move it into the ct install step so it stays alive for the duration of the test run. Assisted-by: Claude
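The point about keeping the monitoring loop alive alongside the test command can be sketched as follows. This is a Python analogue, not the shell code used in the workflow: `run_with_heartbeat` and `print_cluster_status` are hypothetical names, and the `kubectl` invocations are one plausible way to print pod status and recent events.

```python
import subprocess
import threading


def run_with_heartbeat(command, heartbeat, interval_seconds=30.0):
    """Run `command`, calling `heartbeat()` every `interval_seconds` until it exits.

    The monitor lives in the same process as the command, so it keeps
    running for the whole duration and stops as soon as the command ends.
    """
    done = threading.Event()

    def loop():
        # Event.wait() returns False on timeout and True once `done` is set,
        # so the loop ends promptly when the command finishes.
        while not done.wait(interval_seconds):
            heartbeat()

    monitor = threading.Thread(target=loop, daemon=True)
    monitor.start()
    try:
        return subprocess.run(command, check=False).returncode
    finally:
        done.set()
        monitor.join()


def print_cluster_status():
    """Assumed heartbeat: dump pod status and recent events via kubectl."""
    for args in (
        ["kubectl", "get", "pods", "--all-namespaces"],
        ["kubectl", "get", "events", "--all-namespaces", "--sort-by=.lastTimestamp"],
    ):
        subprocess.run(args, check=False)


# Possible usage (requires ct and kubectl on PATH, so it is left commented out):
# exit_code = run_with_heartbeat(
#     ["ct", "install", "--config", "ct-install.yaml"], print_cluster_status
# )
```

The design choice mirrored here is that the heartbeat and the long-running command share one parent, so the heartbeat cannot be cut off while the command is still running.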

The status job only checked for "failure", letting "cancelled" and other non-success states pass as green. Check for success/skipped instead, so any unexpected result correctly fails the job. Assisted-by: Claude
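The difference between the two checks can be shown with a small sketch. GitHub Actions reports a job result as one of `success`, `failure`, `cancelled`, or `skipped`; the function names below are hypothetical and only model the logic described in this commit message, not the workflow expression itself.

```python
# Results that should let the aggregate status job pass.
ACCEPTED_RESULTS = frozenset({"success", "skipped"})


def old_status_ok(result: str) -> bool:
    """Previous behaviour: only an explicit 'failure' fails the status job."""
    return result != "failure"


def new_status_ok(result: str) -> bool:
    """New behaviour: anything other than success/skipped fails the status job."""
    return result in ACCEPTED_RESULTS


for result in ("success", "skipped", "failure", "cancelled"):
    print(f"{result}: old={old_status_ok(result)} new={new_status_ok(result)}")
```

An allow-list of accepted results fails closed: any result that is not explicitly accepted, including ones not anticipated when the check was written, makes the status job fail.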

Helm v4.2.1 causes helm install to hang indefinitely, ignoring the --timeout flag entirely. No chart pods are ever created and helm produces no output until the runner kills it after 2+ hours. This reverts the Helm version change from commit 26c43ea. Assisted-by: Claude

Make the background cluster monitoring conditional on the TEST_MONITORING_HEARTBEAT_ENABLED repo variable (default: false) to avoid noisy logs in normal runs. Add a 2-hour timeout to test jobs to prevent runaway runs like the Helm v4 hang that ran until the runner killed it. Assisted-by: Claude

Changes to the test-charts action or test workflow were not triggering any chart tests because discover-charts only looked at charts/ changes. This is how the Helm v4 regression went undetected. Now detect changes to .github/actions/test-charts/, .github/workflows/test.yaml, ct.yaml, and ct-install.yaml and test all charts when they are modified. Assisted-by: Claude
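A minimal sketch of that selection rule, assuming the changed-file list comes from a diff against the base branch and that `ct.yaml` and `ct-install.yaml` live at the repository root (the commit message does not say where they are). `charts_to_test` is a hypothetical name, and `example-chart` is an invented chart used only for illustration.

```python
from pathlib import PurePosixPath

# Paths named in the commit message; a change to any of them tests every chart.
# A trailing slash marks a directory prefix, otherwise the match is exact.
TEST_INFRA_PATHS = (
    ".github/actions/test-charts/",
    ".github/workflows/test.yaml",
    "ct.yaml",
    "ct-install.yaml",
)


def _touches_test_infra(path: str) -> bool:
    for infra in TEST_INFRA_PATHS:
        if infra.endswith("/"):
            if path.startswith(infra):
                return True
        elif path == infra:
            return True
    return False


def charts_to_test(changed_files, all_charts):
    """Return the sorted chart names to test for a set of changed files."""
    changed = list(changed_files)
    if any(_touches_test_infra(f) for f in changed):
        # Test tooling changed, so every chart is exercised.
        return sorted(all_charts)
    touched = set()
    for f in changed:
        parts = PurePosixPath(f).parts
        # Expect charts/<name>/<file...>; ignore anything else.
        if len(parts) >= 3 and parts[0] == "charts" and parts[1] in all_charts:
            touched.add(parts[1])
    return sorted(touched)


charts = {"backstage", "example-chart"}  # 'example-chart' is hypothetical
print(charts_to_test(["charts/backstage/values.yaml"], charts))
print(charts_to_test([".github/workflows/test.yaml"], charts))
```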

sonarqubecloud · 2026-06-17T07:22:36Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

rm3l · 2026-06-17T10:02:28Z

Traceback (most recent call last):
  File "/opt/app-root/src/install-dynamic-plugins.py", line 1556, in <module>
    main()
  File "/opt/app-root/src/install-dynamic-plugins.py", line 1510, in main
    merge_plugin(plugin, all_plugins, dynamic_plugins_file, level=1)
  File "/opt/app-root/src/install-dynamic-plugins.py", line 151, in merge_plugin
    return OciPackageMerger(plugin, dynamic_plugins_file, all_plugins).merge_plugin(level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/src/install-dynamic-plugins.py", line 611, in merge_plugin
    raise InstallException(
InstallException: Cannot use {{inherit}} for oci://registry.access.redhat.com/rhdh/red-hat-developer-hub-backstage-plugin-lightspeed: no existing plugin configuration found. Ensure a plugin from this image is defined in an included file with an explicit version.

======= Cleaning up temporary catalog index directory
======= Removed lock file: /dynamic-plugins-root/install-dynamic-plugins.lock

The test failures on the backstage chart seem similar to https://redhat.atlassian.net/browse/RHDHBUGS-3374, caused by wrong references to the Lightspeed plugins in the 1.10 catalog index image, so they are not related to this PR. The same failures are also seen on the nightly jobs: https://github.com/redhat-developer/rhdh-chart/actions/runs/27651979368/job/81777873172#step:4:1369
The same tests passed yesterday before the plugin catalog index image update.

rm3l · 2026-06-17T10:03:36Z

/override "Test (backstage)"
/override "Test Latest Release"

openshift-ci · 2026-06-17T10:03:42Z

@rm3l: Overrode contexts on behalf of rm3l: Test (backstage), Test Latest Release

rm3l added 11 commits June 11, 2026 18:16

ci: drop 'charts/' prefix from matrix job names

144007e

Assisted-by: Claude

ci: scope backstage-specific helm args to the backstage chart

d2107f6

Assisted-by: Claude

Merge remote-tracking branch 'upstream/main' into RHIDP-14729--rhdh-chart-split-chart-testing-ci-into-per-chart-matrix-jobs-for-better-failure-debuggability

05a753f

fix(ci): run cluster monitoring in same shell as ct install

2ef559a

fix(ci): fail status job on any non-successful test result

c2c9a07

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 16, 2026

rm3l changed the title ~~RHIDP-14729: split chart-testing CI into per-chart matrix jobs~~ ci: split chart-testing CI into per-chart matrix jobs [RHIDP-14729] Jun 16, 2026

rm3l marked this pull request as ready for review June 16, 2026 15:49

rm3l requested a review from a team as a code owner June 16, 2026 15:49

openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 16, 2026

openshift-ci Bot requested review from OpinionatedHeron and zdrapela June 16, 2026 15:49

rm3l mentioned this pull request Jun 17, 2026

revert(ci): revert "chore(deps): update dependency helm to v4" #439

Merged

5 tasks

Merge branch 'main' into RHIDP-14729--rhdh-chart-split-chart-testing-ci-into-per-chart-matrix-jobs-for-better-failure-debuggability

8a9163b

rm3l wants to merge 12 commits into redhat-developer:main from rm3l:RHIDP-14729--rhdh-chart-split-chart-testing-ci-into-per-chart-matrix-jobs-for-better-failure-debuggability

1 participant
