Sporadic Keycloak issues #573

Open
3 tasks
jchristgit opened this issue Mar 24, 2025 · 0 comments
Labels
  • component: networking: An issue relating to a host networking (e.g. DNS, WireGuard, SSH)
  • component: services: An issue relating to a Python Discord service (e.g. Bot, Site, Lancebot)
  • group: docs: Issues and pull requests related to our documentation
  • group: kubernetes: Issues and pull requests related to the Kubernetes setup

For a while, we have been receiving sporadic reports about Keycloak not working properly, both via Alertmanager and various other communication channels.

Investigation today revealed that this is likely related to the vault-agent sidecar container that runs in every Keycloak pod. This container regularly crashes with the following error:

2025-03-24T19:23:58.026Z [ERROR] agent: runtime error encountered:
  error=
  | template server: vault.write(internal-tls/issue/internal-tls -> fb6ab102): vault.write(internal-tls/issue/internal-tls -> fb6ab102): Error making API request.
  |
  | URL: PUT http://vault.vault.svc:8200/v1/internal-tls/issue/internal-tls
  | Code: 400. Errors:
  |
  | * cannot satisfy request, as TTL would result in notAfter of 2025-07-22T19:23:58.023842036Z that is beyond the expiration of the CA certificate at 2025-06-26T23:39:49Z
   exitCode=1
Error encountered during run, refer to logs for more details.

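Presumably, the Vault CA certificate is the problem here: according to the error, the requested certificate would expire on 2025-07-22, after the CA certificate itself expires on 2025-06-26, so Vault refuses to issue it. The CA certificate might have been configured with an expiration of 1 year when Vault was installed.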
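To make the mismatch concrete, here is a rough sketch of the arithmetic using only the three timestamps from the error above. It assumes that the requested TTL equals the rejected notAfter minus the time of the log line, and that the TTL has not changed recently; neither has been checked against the actual Vault role configuration.

```python
import re
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a UTC timestamp like the ones in the log, dropping fractional seconds."""
    ts = re.sub(r"\.\d+", "", ts)
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# All three timestamps are copied from the vault-agent error.
requested_at = parse_utc("2025-03-24T19:23:58.026Z")
not_after = parse_utc("2025-07-22T19:23:58.023842036Z")
ca_expiry = parse_utc("2025-06-26T23:39:49Z")

# Assumption: the requested TTL is notAfter minus the time of the request.
requested_ttl = not_after - requested_at

# How far the requested certificate would outlive the CA certificate.
overshoot = not_after - ca_expiry

# The longest TTL that Vault could still have granted at the time of the request.
longest_ttl_now = ca_expiry - requested_at

# Assumption: the TTL was the same earlier on. This is then the last moment
# at which a request with that TTL would still have been accepted.
last_ok_request = ca_expiry - requested_ttl

print("requested TTL:     ", requested_ttl)
print("overshoot past CA: ", overshoot)
print("longest TTL today: ", longest_ttl_now)
print("last OK request at:", last_ok_request.isoformat())
```

Under these assumptions the requested TTL comes out at 120 days, while only about 94 days of CA lifetime were left on Mar 24, which would suggest that requests with this TTL have been failing since late February 2025.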

The Keycloak pod was created 43 days ago and has been restarted 3892 times since then.

Keycloak itself has no logs indicating major problems during the same timeframe.

Action items

  • Fix the current issue
  • Document how to fix this issue in the future in a runbook
  • Expand the documentation in kubernetes/namespaces/vault/README.md as applicable

Out of scope for now

  • Configure metrics endpoint for Vault to monitor for CA certificate lifetime (will open separate issue)
  • Police DevOps members to take alerts seriously (electric shock therapy conflicts with Chris' pacemaker)
  • Remove Keycloak
  • Remove Vault
Projects
Status: Up next