Skip to content

Add :clean_dir option and clean_pid method to DirectFileStore #173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

stefansundin
Copy link
Contributor

Hi again,

As hinted in my other PR (#172), here's another one.

Currently, it is a bit of a problem that the gem doesn't help with keeping the directory clean. I'm sure everyone who attempts to use DirectFileStore will soon realize that metrics from old processes hang around when collecting the metrics. So I wouldn't be surprised if many people have worked around this by simply cleaning the directory first.

So the first change, adding a new :clean_dir option, will let me remove this piece of code from my app: stefansundin/github-activity@936d4d6#diff-d21913acbe4d1528a62c64cdec6d43d7R11-R14

# Clean up old metric files
Dir["#{app_path}/tmp/prometheus/*.bin"].each do |file_path|
  File.unlink(file_path)
end

I was even thinking of making this the default, but I'm afraid to accidentally delete someone's files.. even though it should be unlikely. Is there any use case of possibly initializing two DirectFileStore's that use the same directory? I feel like that may be a bad idea for other reasons..

The second piece, the new clean_pid method, will let me simplify this part of my app: stefansundin/github-activity@936d4d6#diff-3f0c524806c4374c99a509efaffd3e8aR29-R32

So I can change this:

on_worker_shutdown do |index|
  # Delete stale metric files on worker shutdown
  Dir["#{app_path}/tmp/prometheus/*___#{Process.pid}.bin"].each do |file_path|
    File.unlink(file_path)
  end
end

to:

on_worker_shutdown do |index|
  # Delete stale metric files on worker shutdown
  Prometheus::Client.config.data_store.clean_pid(Process.pid)
end

Thanks!

@stefansundin stefansundin force-pushed the DirectFileStore-unlink branch from 831dda8 to 049948b Compare December 28, 2019 02:09
@@ -68,6 +80,7 @@ def validate_metric_settings(metric_settings)
end

class MetricStore
FILENAME_PREFIX = "prometheus_#{ Prometheus::Client::VERSION }"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this from metric to prometheus to reduce the chance of deleting unrelated files.. thoughts?

@store_settings = { dir: dir }
FileUtils.mkdir_p(dir)

if clean_dir
Dir.glob(File.join(dir, "#{ MetricStore::FILENAME_PREFIX }_*___*.bin")).each do |file_path|
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't delete files from old versions of the gem.. thoughts?

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 049948b on stefansundin:DirectFileStore-unlink into 7129453 on prometheus:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 049948b on stefansundin:DirectFileStore-unlink into 7129453 on prometheus:master.

@coveralls
Copy link

coveralls commented Dec 28, 2019

Coverage Status

Coverage remained the same at 100.0% when pulling 8537f5a on stefansundin:DirectFileStore-unlink into 7129453 on prometheus:master.

Signed-off-by: Stefan Sundin <stefan@stefansundin.com>
@dmagliola
Copy link
Collaborator

Hi @stefansundin, thank you for this PR!

This looks quite good, and it seems to solve Issue #109. (@Sinjo RTYI)
I have one main concern, however, which is multiple processes running this cleanup code on app start, and racing to delete each other's files...

You mention you are replacing some currently existing code in your app that deletes these files...
Where is that code in your app, currently?
Does it run before Unicorn forks?
If not, does it run on all the client processes?
I find that last idea a bit scary, in case the first processes to start have already created metrics files by the time the following ones start deleting them...

So i'm not sure whether this should be an option in the constructor, and instead it should be a static method that you call in your Unicorn startup script, before forking.

We should also definitely add some documentation on this, things like your on_worker_shutdown example.

Thoughts?

@stefansundin
Copy link
Contributor Author

You mention you are replacing some currently existing code in your app that deletes these files...
Where is that code in your app, currently?

The relevant samples are linked in the PR description.

Does it run before Unicorn forks?

My puma config is using preload_app!, so my initializers run before puma forks. If I remove preload_app! then prometheus is initialized multiple times and may cause undefined behavior. I haven't really tried to see what problems may arise. Here is my puma config file: https://github.com/stefansundin/github-activity/blob/17ab554695411e702f664feacf1cc2d8494f0880/config/puma.rb

So i'm not sure whether this should be an option in the constructor, and instead it should be a static method that you call in your Unicorn startup script, before forking.

I don't think it matters that much. You can also conditionally set this new argument the same way you would conditionally call a new method.

I actually think we should consider making this the default behavior. As I noted above, I think everyone with forking processes will encounter this issue. As I wrote above, my hesitation is:

I was even thinking of making this the default, but I'm afraid to accidentally delete someone's files.. even though it should be unlikely. Is there any use case of possibly initializing two DirectFileStore's that use the same directory? I feel like that may be a bad idea for other reasons..

So if there is no use case to do this, current or future, then I would at least consider making it the default. You guys are the experts so I will let you make the call. :)

@dmagliola
Copy link
Collaborator

@stefansundin After discussing this quite a bit, I don't think we should have this change.
The reason is that it's quite dangerous / quite easy for someone to shoot themselves in the foot if they don't make sure that this has definitely ran before forking.
As such, it's a bit of an "advanced" usage mode.

Having that flag in the constructor also feels a bit awkward as an interface.
I'd prefer to have a method you call on the store to clean the directory, that people can call from some hook that definitely runs before forking.

Having looked into that option, since we can't do it automatically because you may be calling post-fork, and thus since you still have to call a method manually on startup, we don't seem to be saving users much work. There's not that much difference in my opinion between having to call Store.clean_dir and the 3-line snippet of glob + unlink.

For this same reason, we can't make this the default, as much as i'd like to. It's too easy to inadvertently run after forking.

What I think we should do instead is document this in the README. Let's have a section where we propose "this is some code you can use to make sure you start with an empty dir, and make sure you run that before forking!".

Would you prefer if I close this PR, and we tackle the documentation when we have time, or would you like to write that documentation section?


As for your worker shutdown code, there might be a pretty big issue with it.
I may be wrong, because I don't know Puma well enough to understand when exactly that gets called.
If a certain worker gets recycled (like Unicorn allows you to do after a certain number of requests, for example), will that hook run?
Or does it only run when the whole server is being shutdown?

If that runs when a worker ends but the whole server keep running... That code can be quite problematic. Unless all of your metrics are tagged with their pid (in which case you get the result you want, the time series disappears), you will see counters going down if you delete the files for some processes but not others. This works in your current code because, if i'm reading that correctly, you only have one gauge. But I really don't think we should encourage that behaviour, it's very easy to get wrong.

If the hook only runs on server shutdown, however, that's safe, but i'm not sure I understand what's the use case for each worker to clean after itself, if the whole thing gets wiped after the server restarts.

@stefansundin
Copy link
Contributor Author

@dmagliola Ok. I think it would be better if you would update the README and other documentation. If you do it then there's less need to go back-and-forth between the two of us.

As for the worker shutdown code, yes there are some gotchas. In my opinion, people already have to deal with most of these gotchas in a multi-server environment, where servers come and go very frequently.

Feel free to close this after you've updated the docs.

@dmagliola
Copy link
Collaborator

Hi Stefan,
I'm sorry for my slow reply.

I agree with you. I'm not sure what's proper etiquette in terms of taking over other people's PRs, so I was suggesting some changes.
I'll close this and make those changes in another branch, which is easier.
Thank you for bringing this up!
Daniel

@dmagliola dmagliola closed this Feb 24, 2020
dmagliola pushed a commit that referenced this pull request Feb 24, 2020
Add an example on how to solve the "files are already there on
startup" problem, based on @stefansundin 's proposal in PR #173
dmagliola pushed a commit that referenced this pull request Feb 24, 2020
Add an example on how to solve the "files are already there on
startup" problem, based on @stefansundin 's proposal in PR #173

Signed-off-by: Daniel Magliola <danielmagliola@gocardless.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants