Firehose data loss during cf-deployments #204
We need to differentiate between the nozzle failing to connect to the firehose and a long-term outage of the firehose. I'd expect the nozzle to be down and not reporting for ~1 minute; anything beyond that is not expected. A few questions:
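(To make the expected behaviour concrete: a capped-backoff reconnect loop along the following lines would keep the nozzle down for roughly a minute across a router restart, while still retrying through a longer firehose outage. This is a minimal, hypothetical Go sketch; connectFirehose and the timings are assumptions, not the nozzle's actual code.)

```go
// Hypothetical reconnect loop with capped exponential backoff.
package main

import (
	"errors"
	"log"
	"time"
)

// connectFirehose is a stand-in for dialling the trafficcontroller websocket
// and streaming envelopes until the connection closes.
func connectFirehose() error {
	return errors.New("websocket: close 1006 (abnormal closure)")
}

func main() {
	backoff := time.Second
	const maxBackoff = 30 * time.Second

	for {
		err := connectFirehose()
		log.Printf("firehose connection closed: %v; reconnecting in %s", err, backoff)
		time.Sleep(backoff)

		// Double the wait after each failure, but never beyond maxBackoff,
		// so a persistent outage is still retried (and logged) every 30s.
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```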
Hi, the stackdriver-nozzle connects to the firehose via a GCP load balancer.

"Did the CF API/Firehose go down for this hour-long period?" I cannot check in retrospect, but on a CF deployment (an upgrade to 1.15.0) that is currently in the "no log" period, I get the following via a health check of the traffic controller (which shows no firehose connections):

log-api/e2ca3f18-d010-4f0b-aada-0deb0e8863b8:~$ curl http://localhost:14825/health

Note that this particular cf-deployment only rolled out stackdriver-tools across all cf components (so fluentd etc.).
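For correlating that window with the deployment timeline, here is a minimal Go sketch that polls the same health endpoint as the curl command above (the URL and the 10-second interval are assumptions for illustration) and logs each response, so the "no firehose connections" period can be lined up against the bosh task output:

```go
// Poll the loggregator trafficcontroller health endpoint during a deployment
// and log each response, so gaps in firehose connectivity can be timestamped.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// URL taken from the curl command in the comment above; adjust as needed.
	const healthURL = "http://localhost:14825/health"

	for {
		resp, err := http.Get(healthURL)
		if err != nil {
			log.Printf("health check failed: %v", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			log.Printf("health check response: %s", body)
		}
		time.Sleep(10 * time.Second)
	}
}
```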
Does the nozzle continue to spit out logs after the first websocket closure and oauth error? Is there an error that continues to repeat itself, or does it just hang until it comes back online?
The following continues to be shown repeatedly in the logs until the point at which the nozzle is updated:
bosh deployment.txt

Firehose data (logs and metrics) is not showing in the Stackdriver project while performing bosh cf-deployments (either the cf-deployment itself or the stackdriver-tools deployment). The loss starts when the router (gorouter) upgrade begins.
The attached bosh deployment timeline shows that firehose data flows until the deployment reaches the 2 router instances (firehose data stops at 09:32:23) and does not resume until the stackdriver-nozzle upgrade completes at 10:22:55:
Task 3026 | 09:31:52 | Updating instance router: router/d99294ac-35f9-40e2-9933-57d3e464c7cf (0) (canary) (00:01:41)
Task 3026 | 09:33:33 | Updating instance router: router/90ae7733-8ec0-4382-98b0-de2680634d51 (1) (00:01:46)
Task 3026 | 10:22:21 | Updating instance stackdriver-nozzle: stackdriver-nozzle/42560ccc-9fe4-4b01-a1eb-6715610f7d92 (0) (canary) (00:01:00)
I have also attached a Stackdriver metrics screenshot of the remaining diego-cell memory (across both cf-deployments), which illustrates the loss of metrics for the deployment being upgraded. From the left-hand side, you will notice that the remaining available memory for diego is roughly 100 GB until 10:22, when metric data returns post-upgrade. You will also see remaining memory drop off again around 11:00; that coincides with the stackdriver-tools upgrade to the other cf-deployment (the point at which that upgrade hits the routers).
For reference:
we have 2 cf-deployments, each running a single stackdriver-nozzle
both nozzles report to the same single Stackdriver project
stackdriver-tools 2.0.1
cf-deployment 1.14.0
GCP IaaS