autohttp's secret-sync container restarting leads to unready pod and disruption of network traffic #2190

@matthew-brett

Description

My z2jh GKE cluster stalled badly under fairly mild load today, giving "service refused" errors.

The autohttps-.... pod reported many restarts, and kubectl describe showed that this was entirely due to restarts in the secret-sync container, with no restarts in the traefik container.

kubectl logs --previous autohttps-9fdcfc86c-9jdwx secret-sync started with these lines:

2021-05-10 09:29:19,247 INFO /usr/local/bin/acme-secret-sync.py watch-save --label=app=jupyterhub --label=release=jhub --label=chart=jupyterhub-0.11.1 --label=heritage=secret-sync proxy-public-tls-acme acme.json /etc/acme/acme.json
2021-05-10 09:30:24,876 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff5883de2b0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/jhub/secrets/proxy-public-tls-acme

I noticed similar errors, and some restarts in the hub pod:

WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f258683d370>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/jhub/pods?fieldSelector=&labelSelector=component%3Dsingleuser-server

(leading to an error), and also in the user-scheduler pods (restarts and errors):

E0510 09:27:20.654039       1 leaderelection.go:325] error retrieving resource lock jhub/user-scheduler-lock: Get "https://10.92.0.1:443/api/v1/namespaces/jhub/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.92.0.1:443: connect: connection refused

The Traefik image is reported as traefik:v2.3.7.

The Helm chart version is 0.11.1.

Erik Sundell commented over on Gitter:

Note that in practice it may not be a problem that the container restarts.

It is a problem if the pod isn't ready during that process, though.

Then no network traffic is accepted.

But the container is only relevant on startup.
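The point above is why a restarting sidecar disrupts traffic: a pod's Ready condition requires every container in it to be ready, and a Service only routes to Ready pods. A simplified model of that rule (not the real kubelet logic; the container names are just the ones from this pod):

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    ready: bool


def pod_is_ready(containers):
    # Simplified model of the pod Ready condition: all containers must
    # be ready. A restarting secret-sync container therefore takes the
    # whole autohttps pod (traefik included) out of the Service's
    # endpoints, and with it all proxied traffic.
    return all(c.ready for c in containers)


containers = [
    Container("traefik", ready=True),
    Container("secret-sync", ready=False),  # mid-restart
]
print(pod_is_ready(containers))  # False: no traffic routed to this pod
```

So even though secret-sync only matters at startup, its restarts make the pod NotReady, which is the outage observed here.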
