
fix: Recover periodic metric readers after forking #1823

Open · wants to merge 5 commits into main from recover-periodic-metric-readers-from-forking

Conversation

chrisholmes
Contributor

@chrisholmes commented Mar 6, 2025

This PR provides a fix to PeriodicMetricReader so that it continues working after a Ruby process is forked. This problem currently surfaces in applications that use forking web servers.

When a Ruby process forks, only the thread that performed the fork is copied from the parent into the child process, so background threads must be re-created for monitoring tools to keep working. Before this fix, PeriodicMetricReader therefore did not work in the child process, because its background export thread is not copied.

The fix is implemented by overriding the Process._fork method available in Ruby 3.1 and later. Overriding this method lets authors of monitoring tools run callbacks on fork events.
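As a rough illustration of that mechanism, here is a minimal sketch of the Process._fork override pattern; the module and callback names are assumptions for the example, not necessarily what this PR uses.

```ruby
# Minimal sketch of the Process._fork override pattern available on Ruby 3.1+.
# Module and method names are illustrative, not the ones used by this PR.
module ForkHooks
  def self.after_fork_callbacks
    @after_fork_callbacks ||= []
  end

  def _fork
    pid = super
    if pid.zero?
      # We are in the child process: re-create background work here,
      # e.g. restart the PeriodicMetricReader's export thread.
      ForkHooks.after_fork_callbacks.each(&:call)
    end
    pid
  end
end

Process.singleton_class.prepend(ForkHooks)
```

A PeriodicMetricReader (or the SDK's configuration step) would then register a callback that restarts the reader's export thread in the child.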

An alternative approach, which would be required prior to Ruby 3.1, is for clients to call hooks directly from their forking web server's configuration, such as in on_worker_boot in puma.rb (see the sketch below). Overriding Process._fork removes the need for this boilerplate.
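For comparison, that pre-3.1 boilerplate would look roughly like this in puma.rb; the reader reference and the restart_after_fork method are hypothetical names used purely for illustration.

```ruby
# puma.rb -- the per-application boilerplate that overriding Process._fork avoids.
# Assumptions: `$periodic_metric_reader` was saved when the SDK was configured,
# and `restart_after_fork` is a hypothetical reset hook on the reader.
on_worker_boot do
  reader = $periodic_metric_reader
  reader.restart_after_fork if reader.respond_to?(:restart_after_fork)
end
```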

When resetting the PeriodicMetricReader, the after_fork hook first reads and discards the metrics that would have been published by the parent process. This avoids publishing duplicates of metrics that the parent process will handle.
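A sketch of the intended shape of that reset (the method names are placeholders; only the ordering of the two steps matters here):

```ruby
# Inside PeriodicMetricReader (illustrative only; names are placeholders).
def after_fork
  # Read and discard anything aggregated before the fork; the parent
  # process will export that data itself, so the child must not re-send it.
  collect

  # Re-create the periodic export thread, which was not copied into the child.
  start_export_thread
end
```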

The parent process's PeriodicMetricReader is left running after forking, as it is assumed that clients will still want to publish metrics from it.

@chrisholmes changed the title from "Recover periodic metric readers from forking" to "fix: Recover periodic metric readers from forking" on Mar 6, 2025
@chrisholmes force-pushed the recover-periodic-metric-readers-from-forking branch 2 times, most recently from 312d1df to 213d166 on March 7, 2025 at 10:42
@chrisholmes force-pushed the recover-periodic-metric-readers-from-forking branch from 213d166 to 857b773 on March 7, 2025 at 10:55
@chrisholmes changed the title from "fix: Recover periodic metric readers from forking" to "fix: Recover periodic metric readers after forking" on Mar 7, 2025
@chrisholmes marked this pull request as ready for review on March 7, 2025 at 12:22
@xuan-cao-swi
Contributor

Thanks @chrisholmes, just curious whether it's possible to do something like reset_on_fork instead of prepending to Process.

@chrisholmes
Contributor Author

> Thanks @chrisholmes, just curious whether it's possible to do something like reset_on_fork instead of prepending to Process.

Hi @xuan-cao-swi, as I understand it, the BSP (BatchSpanProcessor) implementation works well because it is able to recover its thread when it receives a new span.

PeriodicMetricReader itself doesn't receive any signals, so it wouldn't be able to recover itself this way. One possibility would be to use the recording of metrics to recover the PeriodicMetricReader, perhaps by adding a patch that resets the metric readers in MetricStream [here](https://github.com/open-telemetry/opentelemetry-ruby/blob/main/metrics_sdk/lib/opentelemetry/sdk/metrics/state/metric_stream.rb#L55). This would add a bit of coupling, but maybe that is preferable to the prepend approach.
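A sketch of that trigger-based alternative (the hook placement in MetricStream#update, the metric_readers accessor, and the restart_after_fork method are all assumptions made for illustration, not real SDK API):

```ruby
# Illustrative patch: lazily restart readers when a value is recorded in a
# forked child. Accessor and hook names below are assumptions.
module MetricStreamForkRecovery
  def update(value, attributes)
    @recorded_pid ||= Process.pid
    if @recorded_pid != Process.pid
      # First recording after a fork: restart the readers' export threads.
      @recorded_pid = Process.pid
      OpenTelemetry.meter_provider.metric_readers.each do |reader|
        reader.restart_after_fork if reader.respond_to?(:restart_after_fork)
      end
    end
    super
  end
end

OpenTelemetry::SDK::Metrics::State::MetricStream.prepend(MetricStreamForkRecovery)
```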

A consequence of this would be that forked processes would not publish any metrics until a value is recorded. That would probably be fine in most circumstances, but as an example it might be an issue with gauges that you would expect to be published continuously. From my own experience, though, I can't think of a use for metrics derived from a parent process.

What's your view, given this?

@xuan-cao-swi
Contributor

Hi @chrisholmes, thanks for the explanation. My thinking is that as long as the thread (in the child) keeps running and exports data every period, there is no need to recover it. Please correct me if I am misunderstanding something.

@chrisholmes
Contributor Author

> Hi @chrisholmes, thanks for the explanation. My thinking is that as long as the thread (in the child) keeps running and exports data every period, there is no need to recover it. Please correct me if I am misunderstanding something.

Hi @xuan-cao-swi, I'll try to explain my understanding:

Recovery is required because, when we fork, only the thread that performed the fork is copied from the parent into the child process. This means that any background-thread behaviour, such as the PeriodicMetricReader's export loop, is stalled until the thread is re-created.
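A quick plain-Ruby illustration of that behaviour (CRuby semantics):

```ruby
# Only the thread that calls fork survives into the child process.
bg = Thread.new { sleep } # stands in for PeriodicMetricReader's export thread

pid = fork do
  puts Thread.list.size # => 1 (just the thread that forked)
  puts bg.alive?        # => false: the background thread was not copied
end
Process.wait(pid)
```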

We have a few options for recovery:

  1. Override Process._fork, which was introduced for this purpose
  2. Ask developers to add a recovery hook into their codebase (such as in on_worker_boot in puma)
  3. Use a trigger, such as the creation of a metric or the recording of a metric value, to recreate the thread

For options 1 & 2, there would be immediate recovery. For option 3, there would be a delay in recovery until the triggering event.

One scenario where immediate recovery is desired is when a developer creates gauges prior to forking. With option 3, these gauges will not be reported by the child process until the thread is recovered by the trigger, whose timing is not guaranteed, whereas a user of the metrics would probably expect them to be reported continuously, without gaps.

Does this scenario help?

@xuan-cao-swi
Contributor

Hi @chrisholmes, thanks for the explanation—you're right that there's no recovery point like BSP's on_finish for spans.

Option 3 would delay metric export, but could the recorded timestamps still help correlate metrics on the representation side? As you mentioned, recovery from the metric_stream record could work. That said, I’m not a fan of checking whether the forked periodic reader needs to start its thread on every record—though BSP does something similar per span finish.

The main concern with Option 1 is the monkey patching of the Process module. Option 2 seems more reasonable, but we’re aiming to minimize direct user interaction with the SDK. We’ll bring this up in the SIG meeting next week.
