I think we were bitten by the lack of a meaningful TCP connect() timeout in apt-get when we recently saw the driver container hang in places like
Installing Linux kernel headers...
Installing Linux kernel module files...
The pod would hang there for a long time and eventually restart (after about 20 minutes).
It took quite a bit of back and forth before we understood that this is probably networking-related and not filesystem-related. We would have understood that faster if we had seen the stdout of apt-get: it would have revealed that we're running apt-get in the first place (the log messages shown above don't show that), and that apt is trying to fetch data from remote infra.
Looking at the code, in many places we invoke apt-get with this pattern:
apt-get -qq install ... > /dev/null
Changes that we should make:
- never suppress stdout
- wrap the command in a meaningful overall timeout, and retry it (a good timeout constant could be: maximum expected execution time on a slow system, times ~5)
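The two changes above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper name `run_with_retry`, the `RETRY_DELAY_S` variable, and the numbers in the usage comment are hypothetical placeholders. It assumes bash and the coreutils `timeout` command are available in the container image.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): run a command under an
# overall timeout and retry it when it fails or times out.
# Usage: run_with_retry <timeout_seconds> <max_attempts> <command> [args...]
run_with_retry() {
    local timeout_s="$1" max_attempts="$2"
    shift 2
    # Pause between attempts; RETRY_DELAY_S is an assumed, optional override.
    local delay_s="${RETRY_DELAY_S:-5}"
    local attempt rc=0

    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        # Log what is about to run, so a hang can be located in the pod log.
        echo "attempt ${attempt}/${max_attempts} (timeout ${timeout_s}s): $*" >&2
        rc=0
        # stdout is deliberately not redirected: the command's output stays visible.
        timeout "${timeout_s}" "$@" || rc=$?
        if [ "${rc}" -eq 0 ]; then
            return 0
        fi
        if [ "${rc}" -eq 124 ]; then
            # coreutils timeout exits with 124 when the time limit was hit.
            echo "command timed out after ${timeout_s}s" >&2
        else
            echo "command failed with exit code ${rc}" >&2
        fi
        if (( attempt < max_attempts )); then
            sleep "${delay_s}"
        fi
    done
    # All attempts failed: report the exit code of the last one.
    return "${rc}"
}

# Example call (placeholder values; the package list is elided as in the issue):
#   run_with_retry 600 3 apt-get install -y ...
```

With this, a stalled download would show up in the pod log as a logged attempt followed by a timeout message, rather than as a silent hang. apt also appears to have its own `Acquire::http::Timeout` and `Acquire::Retries` options; whether they cover the connect() case seen here would need to be checked against the apt.conf documentation.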
A bit of background / big picture about this type of problem: I deeply care about debuggability. That often means that we want to support humans in their debugging efforts, so that they don't spend unnecessary time.
One of the less fun, unnecessary time sinks during debugging is when there's no log output and something just hangs in an operation that
- could have been logged (so that one knows where something hangs)
- could have been timeout-controlled and retried (often leading to success, instead of hanging forever)
Rather often that relates to deliberately setting a TCP connect() timeout. For the record, I have frequently talked and written elsewhere about the immense importance and power of combining i) timeout control, ii) retries, and iii) logging, such as in