
Firehose data loss during cf-deployments #204

Open

anthonysroberts opened this issue Feb 26, 2018 · 4 comments
@anthonysroberts

Attachments: bosh deployment.txt (deployment timeline), metrics loss (screenshot)

Firehose data (logs, metrics) is not showing in the Stackdriver project while performing bosh cf-deployments (either the cf-deployment itself or the stackdriver-tools deployment). The loss starts when the router (gorouter) upgrade begins.

The bosh deployment timeline is attached; it shows that firehose data is flowing until the deployment reaches the 2 router instances (firehose data stops at 09:32:23) and does not resume until the stackdriver-nozzle upgrade completes at 10:22:55:

Task 3026 | 09:31:52 | Updating instance router: router/d99294ac-35f9-40e2-9933-57d3e464c7cf (0) (canary) (00:01:41)
Task 3026 | 09:33:33 | Updating instance router: router/90ae7733-8ec0-4382-98b0-de2680634d51 (1) (00:01:46)

Task 3026 | 10:22:21 | Updating instance stackdriver-nozzle: stackdriver-nozzle/42560ccc-9fe4-4b01-a1eb-6715610f7d92 (0) (canary) (00:01:00)

I have also attached a Stackdriver metrics screenshot of the remaining diego-cell memory (across both cf-deployments). It illustrates the loss of metrics for the deployment being upgraded. From the left-hand side, you will notice that the remaining available memory for Diego is circa 100 GB until 10:22, when metric data returns post-upgrade. You will also see remaining memory drop off again around 11 am; that coincides with the stackdriver-tools upgrade on the other cf-deployment (the point at which that upgrade hits the routers).

For reference:

  • we have 2 cf-deployments, each running a single stackdriver-nozzle
  • both nozzles reporting up to the same single stackdriver project
  • stackdriver-tools 2.0.1
  • cf-deployment 1.14.0
  • GCP IaaS

@johnsonj
Contributor

We need to differentiate between the nozzle failing to connect to the firehose and a long-term outage of the firehose. I'd expect the nozzle to be down and not reporting for ~1 minute; anything beyond that is not expected.

A few questions:

  • Can you pull the logs for the stackdriver-nozzle and take a look to see what errors it was hitting? (See the sketch after this list.)
  • Did the CF API/firehose go down during this hour long period?
  • Is the stackdriver-nozzle connecting to firehose via a GCP load balancer? Did that endpoint change during this deployment?
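
For the first question, a minimal sketch of pulling the nozzle job logs with the BOSH CLI v2. The deployment name (stackdriver-tools) and instance group name (stackdriver-nozzle) are placeholders; substitute whatever your deployment actually uses. The path is the conventional BOSH job log location.

# Fetch the nozzle job logs as a local tarball
bosh -d stackdriver-tools logs stackdriver-nozzle

# Or tail them live on the VM
bosh -d stackdriver-tools ssh stackdriver-nozzle \
  -c 'sudo tail -n 200 -f /var/vcap/sys/log/stackdriver-nozzle/*.log'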

@anthonysroberts
Author

anthonysroberts commented Feb 27, 2018

Hi,
Errors in the nozzle logs at 09:32, when the router upgrade started (i.e. when the firehose data loss began):
{"timestamp":"1519637543.238842964","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"websocket: close 1006 (abnormal closure): unexpected EOF"}}
{"timestamp":"1519637544.205738306","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {"error":"unauthorized","error_description":"No client authentication found. Remember to put a filter upstream of the TokenEndpointAuthenticationFilter."}"}}

The stackdriver-nozzle is connecting to firehose via a GCP load balancer
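
For the third question, a hedged sketch for confirming which firehose endpoint the nozzle discovers and whether it stays reachable through the load balancer during the deploy. doppler.SYSTEM_DOMAIN is a placeholder for whatever endpoint /v2/info actually returns.

# Find the doppler/firehose endpoint advertised by the CF API
cf curl /v2/info | grep doppler_logging_endpoint

# Check that the endpoint behind the GCP LB keeps accepting connections during the deploy
# (an HTTP-level error is fine here; we only care that the TLS connection is accepted)
curl -vk https://doppler.SYSTEM_DOMAIN:443/ -o /dev/null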

Did the CF API/firehose go down for this hour-long period? I cannot check retrospectively, but on a CF deployment (upgrading to 1.15.0) that is currently in the "no log" period, I get the following from a health check of the traffic controller (which shows no open firehose connections):

log-api/e2ca3f18-d010-4f0b-aada-0deb0e8863b8:~$ curl http://localhost:14825/health
# HELP loggregator_trafficcontroller_appStreamCount Number of open app streams
# TYPE loggregator_trafficcontroller_appStreamCount gauge
loggregator_trafficcontroller_appStreamCount 0
# HELP loggregator_trafficcontroller_firehoseStreamCount Number of open firehose streams
# TYPE loggregator_trafficcontroller_firehoseStreamCount gauge
loggregator_trafficcontroller_firehoseStreamCount 0
# HELP loggregator_trafficcontroller_slowConsumerCount Number of stream consumers disconnected to avoid backpressure on the Loggregator system
# TYPE loggregator_trafficcontroller_slowConsumerCount gauge
loggregator_trafficcontroller_slowConsumerCount 0

FYI, the firehoseStreamCount increments to 1 once the deployment completes its last task, which is the stackdriver-nozzle update. That is when the firehose data reappears.
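
A quick way to keep an eye on that counter during a deploy (a sketch; it assumes the traffic controller health endpoint is on port 14825 as above):

while true; do
  curl -s http://localhost:14825/health | grep '^loggregator_trafficcontroller_firehoseStreamCount'
  sleep 10
done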

Note that this particular cf-deployment only rolled out stackdriver-tools across all CF components (fluentd etc.).

@johnsonj
Contributor

Does the nozzle continue to spit out logs after the first websocket closure and OAuth error? Is there an error that continues to repeat itself, or does it just hang until it comes back online?

@anthonysroberts
Author

anthonysroberts commented Mar 2, 2018

The following error continues to be shown repeatedly in the logs until the point at which the nozzle is updated:
"error":"Error getting token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {"error":"unauthorized","error_description":"No client authentication found. Remember to put a filter upstream of the TokenEndpointAuthenticationFilter."}"}}

@mattysweeps mattysweeps added the stackdriver-nozzle For issues pertaining to the Stackdriver Nozzle project label Oct 23, 2018
@mattysweeps mattysweeps added this to the PCF 2.4.0 milestone Dec 27, 2018