
Firehose data loss during cf-deployments #204

Open

anthonysroberts opened this issue Feb 26, 2018 · 4 comments
@anthonysroberts

Attachments: bosh deployment.txt (deployment timeline), metrics loss (screenshot)

Firehose data (logs, metrics) is not showing in the Stackdriver project while performing bosh cf-deployments (either the cf-deployment itself or the stackdriver-tools deployment). The loss starts when the router (gorouter) upgrade begins.

The bosh deployment timeline is attached; it shows that firehose data is flowing until the deployment reaches the 2 router instances (firehose data stops at 09:32:23) and does not resume until the stackdriver-nozzle upgrade completes at 10:22:55:

Task 3026 | 09:31:52 | Updating instance router: router/d99294ac-35f9-40e2-9933-57d3e464c7cf (0) (canary) (00:01:41)
Task 3026 | 09:33:33 | Updating instance router: router/90ae7733-8ec0-4382-98b0-de2680634d51 (1) (00:01:46)

Task 3026 | 10:22:21 | Updating instance stackdriver-nozzle: stackdriver-nozzle/42560ccc-9fe4-4b01-a1eb-6715610f7d92 (0) (canary) (00:01:00)

I have also attached a Stackdriver metrics screenshot of the remaining diego-cell memory (across both cf-deployments). It illustrates the loss of metrics for the deployment being upgraded. From the left-hand side, you will notice that the remaining available memory for Diego is circa 100 GB until 10:22, when metric data returns post-upgrade. You will also see remaining memory drop off again around 11 am; that coincides with the stackdriver-tools upgrade on the other cf-deployment (the point at which that upgrade hits the routers).

For reference:

  • we have 2 cf-deployments, each running a single stackdriver-nozzle
  • both nozzles reporting up to the same single stackdriver project
  • stackdriver-tools 2.0.1
  • cf-deployment 1.14.0
  • GCP IaaS

@johnsonj
Contributor

We need to differentiate between the nozzle failing to connect to the firehose and a long-term outage of the firehose. I'd expect the nozzle to be down and not reporting for ~1 minute; anything beyond that is not expected.

A few questions:

  • Can you pull the logs for the stackdriver-nozzle and take a look to see what errors it was hitting? (See the sketch after this list.)
  • Did the CF API/firehose go down during this hour long period?
  • Is the stackdriver-nozzle connecting to firehose via a GCP load balancer? Did that endpoint change during this deployment?
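
For the first question, a minimal sketch of pulling the nozzle job logs with the BOSH CLI v2. The deployment name (stackdriver-tools) and instance group name (stackdriver-nozzle) are placeholders; substitute whatever your deployment actually uses. The path is the conventional BOSH job log location.

# Fetch the nozzle job logs as a local tarball
bosh -d stackdriver-tools logs stackdriver-nozzle

# Or tail them live on the VM
bosh -d stackdriver-tools ssh stackdriver-nozzle \
  -c 'sudo tail -n 200 -f /var/vcap/sys/log/stackdriver-nozzle/*.log'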

@anthonysroberts
Author

anthonysroberts commented Feb 27, 2018

Hi,
Errors in the nozzle logs at 09:32, when the router upgrade started (i.e. when the firehose data loss began):
{"timestamp":"1519637543.238842964","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"websocket: close 1006 (abnormal closure): unexpected EOF"}}
{"timestamp":"1519637544.205738306","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {"error":"unauthorized","error_description":"No client authentication found. Remember to put a filter upstream of the TokenEndpointAuthenticationFilter."}"}}

The stackdriver-nozzle is connecting to firehose via a GCP load balancer
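
For the third question, a hedged sketch for confirming which firehose endpoint the nozzle discovers and whether it stays reachable through the load balancer during the deploy. doppler.SYSTEM_DOMAIN is a placeholder for whatever endpoint /v2/info actually returns.

# Find the doppler/firehose endpoint advertised by the CF API
cf curl /v2/info | grep doppler_logging_endpoint

# Check that the endpoint behind the GCP LB keeps accepting connections during the deploy
# (an HTTP-level error is fine here; we only care that the TLS connection is accepted)
curl -vk https://doppler.SYSTEM_DOMAIN:443/ -o /dev/null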

Did the CF API/firehose go down for this hour-long period? I cannot check retrospectively, but on a CF deployment (upgrading to 1.15.0) that is currently in the "no log" period, I get the following from a health check of the traffic controller (which shows no open firehose connections):

log-api/e2ca3f18-d010-4f0b-aada-0deb0e8863b8:~$ curl http://localhost:14825/health
# HELP loggregator_trafficcontroller_appStreamCount Number of open app streams
# TYPE loggregator_trafficcontroller_appStreamCount gauge
loggregator_trafficcontroller_appStreamCount 0
# HELP loggregator_trafficcontroller_firehoseStreamCount Number of open firehose streams
# TYPE loggregator_trafficcontroller_firehoseStreamCount gauge
loggregator_trafficcontroller_firehoseStreamCount 0
# HELP loggregator_trafficcontroller_slowConsumerCount Number of stream consumers disconnected to avoid backpressure on the Loggregator system
# TYPE loggregator_trafficcontroller_slowConsumerCount gauge
loggregator_trafficcontroller_slowConsumerCount 0

FYI, the firehoseStreamCount increments to 1 once the deployment completes its last task, which is the stackdriver-nozzle update. That is when the firehose data reappears.
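
A quick way to keep an eye on that counter during a deploy (a sketch; it assumes the traffic controller health endpoint is on port 14825 as above):

while true; do
  curl -s http://localhost:14825/health | grep '^loggregator_trafficcontroller_firehoseStreamCount'
  sleep 10
done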

Note that this particular cf-deployment only rolled out stackdriver-tools across all CF components (fluentd etc.).

@johnsonj
Contributor

Does the nozzle continue to spit out logs after the first websocket closure and OAuth error? Is there an error that continues to repeat itself, or does it just hang until it comes back online?

@anthonysroberts
Author

anthonysroberts commented Mar 2, 2018

The following error continues to be shown repeatedly in the logs until the point at which the nozzle is updated:
"error":"Error getting token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {"error":"unauthorized","error_description":"No client authentication found. Remember to put a filter upstream of the TokenEndpointAuthenticationFilter."}"}}

@mattysweeps mattysweeps added the stackdriver-nozzle For issues pertaining to the Stackdriver Nozzle project label Oct 23, 2018
@mattysweeps mattysweeps added this to the PCF 2.4.0 milestone Dec 27, 2018