Description
My z2jh GKE cluster stalled badly under fairly mild load today, giving "service refused" errors.
The autohttps-.... pod reported many restarts, and kubectl describe showed that this was entirely due to restarts in the secret-sync container, with no restarts in the traefik container.
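For anyone wanting to check the same thing without reading through the describe output, here is a minimal sketch that pulls the per-container restart counts out of a pod's JSON. It assumes the standard Pod fields `status.containerStatuses[].name` and `.restartCount`; the function name `restart_counts` and the sample numbers are made up for illustration and are not taken from my cluster.

```python
import json


def restart_counts(pod: dict) -> dict:
    """Return {container name: restart count} for a Pod object (parsed JSON)."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return {s["name"]: s.get("restartCount", 0) for s in statuses}


# Illustrative sample only: the restart counts below are invented.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "traefik", "restartCount": 0},
      {"name": "secret-sync", "restartCount": 7}
    ]
  }
}
""")

for name, count in restart_counts(sample_pod).items():
    print(f"{name}: {count} restarts")
```

In practice the input would come from something like `kubectl get pod autohttps-9fdcfc86c-9jdwx -n jhub -o json` rather than an inline sample.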
The output of kubectl logs --previous autohttps-9fdcfc86c-9jdwx secret-sync started with these lines:
2021-05-10 09:29:19,247 INFO /usr/local/bin/acme-secret-sync.py watch-save --label=app=jupyterhub --label=release=jhub --label=chart=jupyterhub-0.11.1 --label=heritage=secret-sync proxy-public-tls-acme acme.json /etc/acme/acme.json
2021-05-10 09:30:24,876 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff5883de2b0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/jhub/secrets/proxy-public-tls-acme
I noticed similar errors, and some restarts in the hub pod:
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f258683d370>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/jhub/pods?fieldSelector=&labelSelector=component%3Dsingleuser-server
(leading to an error), and also in the user-scheduler pods (restarts, errors):
E0510 09:27:20.654039 1 leaderelection.go:325] error retrieving resource lock jhub/user-scheduler-lock: Get "https://10.92.0.1:443/api/v1/namespaces/jhub/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.92.0.1:443: connect: connection refused
Traefik image reported as traefik:v2.3.7.
Helm chart is 0.11.1.
Erik Sundell commented over on Gitter:
Note that it may not be a problem that the container restarts in practice.
It is a problem if the pod isn't ready during that process though
Then no network traffic is accepted
But the container is only relevant on startup
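To check the readiness point Erik raises, a rough sketch along these lines could report whether the pod as a whole is Ready and which containers are holding it back. It assumes the standard Pod fields `status.conditions` (an entry with type `Ready` and status `"True"`) and `status.containerStatuses[].ready`; the helper names and the sample data are hypothetical.

```python
def pod_is_ready(pod: dict) -> bool:
    """True if the Pod's 'Ready' condition has status 'True'."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def unready_containers(pod: dict) -> list:
    """Names of containers whose 'ready' flag is false or missing."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return [s["name"] for s in statuses if not s.get("ready", False)]


# Illustrative sample only: a pod whose secret-sync container is not ready.
sample_pod = {
    "status": {
        "conditions": [{"type": "Ready", "status": "False"}],
        "containerStatuses": [
            {"name": "traefik", "ready": True},
            {"name": "secret-sync", "ready": False},
        ],
    }
}

print("pod ready:", pod_is_ready(sample_pod))
print("unready containers:", unready_containers(sample_pod))
```

As far as I understand it, a pod only counts as Ready when all of its containers are ready, so a sidecar that is mid-restart would take the whole pod out of the Service endpoints until it comes back, even if traefik itself is fine.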