
Metrics feedback #1662

Open
kaylareopelle opened this issue Jul 31, 2024 · 26 comments

@kaylareopelle
Contributor

We want your feedback on the opentelemetry-metrics-sdk and opentelemetry-metrics-api gems! Understanding how these libraries are used by the community will help us prioritize future work on Metrics.

@chloe-meister

Hi there 👋

I've started a prototype implementation of metrics in a Rails app, and I was able to get some metrics exported using the SDK example.

However, from the README it isn't clear where and when I should pull the metrics and shut down the provider using these two lines:

OpenTelemetry.meter_provider.metric_readers.each(&:pull)
OpenTelemetry.meter_provider.shutdown

What we'd like to do is increment a counter each time a specific controller action is called. I can do that with a standard before_action filter, but from there I don't understand when the metrics should be pulled from the reader. Any advice?

@kaylareopelle
Contributor Author

Hi @chloe-meister! Thanks for reaching out!

The README provides an example of a script that runs once, pulls the metric, and exits the program.

If you're open to it, I think a periodic metric reader is a better fit for a Rails application. It'll collect the metrics and pass them to the exporter on a set interval. The process will live as long as the Rails application does, or until OpenTelemetry.meter_provider.shutdown is manually called.

Assuming you're using the OTLP exporter, rather than calling:

otlp_metric_exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
OpenTelemetry.meter_provider.add_metric_reader(otlp_metric_exporter)

You would call something like:

# config/initializers/opentelemetry.rb
reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
          exporter: OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new,
          export_interval_millis: 3000,
          export_timeout_millis: 10000)

OpenTelemetry.meter_provider.add_metric_reader(reader)

You can create the meter and the counter the same way as in the example.
If you prefer the console exporter, just swap OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter with OpenTelemetry::SDK::Metrics::Export::ConsoleMetricPullExporter.

With the arguments above, your metrics would be exported every three seconds with a 10-second timeout.

Keep in mind that if you're using the OTLP exporter, you also need to install the opentelemetry-exporter-otlp-metrics gem.
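
To connect this back to your before_action question: you can create the counter once in the initializer and increment it from the filter, and the periodic reader takes care of pulling and exporting, so nothing needs to call pull or shutdown inside the request cycle. Here's a rough sketch with placeholder controller and metric names (these aren't from the README, just illustrative):

# config/initializers/opentelemetry.rb (after the reader setup above)
APP_METER = OpenTelemetry.meter_provider.meter('my_rails_app')
ACTION_COUNTER = APP_METER.create_counter(
  'controller_action_calls',
  unit: 'calls',
  description: 'Number of times the action was called'
)

# app/controllers/widgets_controller.rb -- placeholder controller
class WidgetsController < ApplicationController
  before_action :count_action_call, only: :show

  def show
    # ...
  end

  private

  def count_action_call
    # Increment the counter; the PeriodicMetricReader exports it on its own schedule.
    ACTION_COUNTER.add(1, attributes: { 'action' => action_name })
  end
end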

@chloe-meister

Hi @kaylareopelle! Thank you very much for the detailed answer. This helps a lot 🙏

@chloe-meister

Hi @kaylareopelle! Thanks to your guidance, I managed to get something exported. We use Grafana and I can see something matching my meter provider's attributes. However, no metrics seem to be attached: no matter how the @data_points in the @metric_store object are updated, none of that information reaches Grafana.

Do you have an idea of what could be wrong here?

Below you can find my setup:

# config/initializers/open_telemetry.rb

OpenTelemetry::SDK.configure do |config|
  config.service_name = ENV.fetch('SERVICE_NAME', 'unknown')

  group = { 'service.group' => ENV.fetch('SERVICE_GROUP', 'unknown') }
  config.resource = OpenTelemetry::SDK::Resources::Resource.create(group)
end

otlp_metric_exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
  exporter: otlp_metric_exporter,
  export_interval_millis: 15_000,
  export_timeout_millis: 30_000
)
OpenTelemetry.meter_provider.add_metric_reader(reader)

queue_latency_meter = OpenTelemetry.meter_provider.meter('QUEUE_LATENCY')
QueueLatency = queue_latency_meter.create_histogram('queue_latency', unit: 'job_latency', description: 'cumulative latency for all queues')

I use this in a background job, collecting this information:

QueueLatency.record(queue.latency, attributes: { queue_name: queue.name })

@kaylareopelle
Contributor Author

Hi @chloe-meister! Thanks for sharing these details! Sorry things aren't working as expected.

Everything you've shared looks good to me. I copied your setup and created a simple Rails app using Sidekiq for the background jobs and was able to record data points to the QueueLatency histogram and export them via OTLP.

I was surprised that two resources were created: one for the Rails process and one for the Sidekiq process. The QueueLatency metric on the Rails process never had any data points. Instead, they were all recorded in the Sidekiq process.

I haven't worked with Grafana before, but I can do some testing with that observability backend next week. Perhaps there's something that environment needs that we're missing.

Earlier this week, we released a fix for a bug in the metrics SDK that left users who set the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable without a valid aggregation_temporality, which prevented some observability backends from ingesting the metrics. Are you setting that environment variable? If so, updating to the latest version (if you're not already on it) may help.

It might also be helpful to see the raw data for the metric you're sending, either by creating a periodic metric reader with a console exporter, or by setting up a collector with the debug exporter set to detailed verbosity.

Here's how to set up the console exporter:

# config/initializers/opentelemetry.rb

# ... 

console_metric_exporter = OpenTelemetry::SDK::Metrics::Export::ConsoleMetricPullExporter.new
console_reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
  exporter: console_metric_exporter,
  export_interval_millis: 15_000,
  export_timeout_millis: 30_000
)

OpenTelemetry.meter_provider.add_metric_reader(console_reader)

Either of these options should provide plain-text output of the data being sent over the wire. Would you be able to update your application to generate this output and paste it (with sensitive information removed) on this issue? Seeing the structure of your data might help us pinpoint any malformed or missing elements that could be preventing the data points from being ingested.

@eduAntequera

Hello @kaylareopelle! I am a co-worker of @chloe-meister.
First of all thanks so much for the help and the guidance provided so far 😄

We did not set the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE env var, so I think that should not be a problem.

We followed your advice and used the ConsoleMetricPullExporter, which helped us see the raw data points for the metric.
Here are some screenshots I took from it:
[Screenshot: otl_data_debug]
[Screenshot: otl_data_debug2]

Right after that, I also get a line saying "Flushed 2 metrics", since I am also sending a second set of histogram metrics.

A bit of extra information: some data does seem to reach Grafana and we can see it there, namely the attributes shown here:

[Screenshot from 2024-10-28 12-16-41]

Thank you for all the help provided!

@kaylareopelle
Contributor Author

Hi @eduAntequera, thank you for these screenshots! Everything looks good in this context, so it doesn't yet seem like the problem is in the metrics SDK. This is puzzling!

For the next phase of debugging, I think we need to focus on the OTLP exporter specifically. To do that, I recommend we move to the collector setup with a debug exporter set to detailed verbosity.

Could you give that a try and send me the output for a payload that isn't showing up in Grafana as expected?

Here's some example output from my reproduction app:

Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> service.name: Str(unknown)
     -> process.pid: Int(2732)
     -> process.command: Str(<redacted>/bin/sidekiq)
     -> process.runtime.name: Str(ruby)
     -> process.runtime.version: Str(3.3.1)
     -> process.runtime.description: Str(ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-darwin23])
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.language: Str(ruby)
     -> telemetry.sdk.version: Str(1.5.0)
     -> service.group: Str(unknown)
     -> host.name: Str(ce5b8509a987)
     -> os.type: Str(linux)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope QUEUE_LATENCY 
Metric #0
Descriptor:
     -> Name: queue_latency
     -> Description: cumulative latency for all queues
     -> Unit: job_latency
     -> DataType: Histogram
     -> AggregationTemporality: Delta
HistogramDataPoints #0
Data point attributes:
     -> queue_name: Str(queue.name)
StartTimestamp: 2024-10-29 22:23:54.933843 +0000 UTC
Timestamp: 2024-10-29 22:24:54.942435 +0000 UTC
Count: 9
Sum: 37.000000
Min: 0.000000
Max: 9.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
Buckets #0, Count: 1
Buckets #1, Count: 5
Buckets #2, Count: 3
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0

@eduAntequera

Hi @kaylareopelle, sorry for the delay and thanks for the guidance.
Here are some screenshots, as requested:

[Screenshot from 2024-10-31 17-15-53]
[Screenshot from 2024-10-31 17-16-37]
[Screenshot from 2024-10-31 17-17-08]

I hope they help

@kaylareopelle
Contributor Author

Hi @eduAntequera! Thanks for the screenshots! I'm about to take some time off through next Monday, so perhaps someone else can jump in here to help in the meantime.

These screenshots helped me realize that both exporters are recording the metric only once, and it seems to have a value of 0.0. Is that accurate? For these samples, how many times has QueueLatency.record(queue.latency, attributes: { queue_name: queue.name }) been called? What are some values you expect queue.latency to have recorded?

@eduAntequera

Hi @kaylareopelle thanks for the reply!
I checked, and 0.0 could indeed be a value. I have changed the code to generate random values instead and will update here with screenshots showing values other than 0.0. I also decided to change the metric names, since I was seeing a metric in Grafana for queue_size but none for queue_latency, which made me think the names might be colliding. So I changed them to sidekiq_queue_size and sidekiq_queue_latency.
With those names and randomly generated values, the metrics still did not appear in Grafana (the queue_size metric still appeared).

Here are some of the results when sending randomly generated values:
[Screenshot from 2024-11-06 10-50-07]
[Screenshot from 2024-11-06 10-50-21]
[Screenshot from 2024-11-06 10-50-45]

Thanks for all the help and hope you enjoy your time off 😄

@kaylareopelle
Contributor Author

Hi @eduAntequera! Thank you for this test and for your patience. I took another look and unfortunately I'm not seeing any issues with the output.

This might be an issue related to Grafana's ingest process rather than the OpenTelemetry Ruby metrics gems. I reached out on the CNCF Slack to someone who works for Grafana and they're going to take a look at the issue. If you're able to open a support ticket with a representative from the company, that may also be a good next step.

If they have anything they'd like us to test or change with the metrics code to help with troubleshooting, let us know!

@eduAntequera

@kaylareopelle Thank you so much for looking into it and contacting them 🙇
We will let you know if we get any news :)
Sorry for the late reply, the notification about your reply got lost in my inbox 😓

@seizethedave

seizethedave commented Jan 21, 2025

For those having issues sending metrics to Grafana/Prometheus with the Ruby SDK: try setting the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable to cumulative, since Prometheus doesn't yet support delta temporality. One way to do it:

bash% OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative ./my_ruby_app
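
If you'd rather keep it in Ruby, setting the variable in the initializer before the exporter is created may also work. I haven't verified exactly when the exporter reads it, so treat this as a sketch:

# config/initializers/opentelemetry.rb
# Assumption: the OTLP metrics exporter reads this variable when it is created,
# so set it before instantiating the exporter.
ENV['OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE'] ||= 'cumulative'

exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new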

@akahn

akahn commented Jan 31, 2025

I'm debugging an issue where metrics produced by my app aren't visible in Grafana (we use Alloy as the OpenTelemetry Collector). Now I'm using the debug collector in development, as recommended here, to verify that my setup is correct.

In a Rails initializer I do this:

exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(exporter: exporter)
OpenTelemetry.meter_provider.add_metric_reader(reader)

Then, in a controller action I do this:

meter = OpenTelemetry.meter_provider.meter('my_meter')
scores_posted = meter.create_counter('my_counter')
scores_posted.add(1, attributes: { attributes: 'my_attributes' })

In the debug collector logs I see this:

Metric #10
Descriptor:
     -> Name: my_counter
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
     -> attributes: Str(my_attributes)
StartTimestamp: 2025-01-31 16:41:01.763851223 +0000 UTC
Timestamp: 2025-01-31 16:42:01.775156029 +0000 UTC
Value: 1

Is this expected, for NumberDataPoints to be 0? Is the relevant information conveyed just by Value?

@akahn

akahn commented Feb 8, 2025

@kaylareopelle, any chance you could take a look at this? Any suggestions for what to investigate?

@kaylareopelle
Contributor Author

Hi @akahn, thanks for the ping! I'm sorry about the delay here!

In this scenario, I believe NumberDataPoints #0 is expected. The output is from the array of NumberDataPoints and lists the index for the NumberDataPoint being displayed. If you had another NumberDataPoint, it should have the heading NumberDataPoints #1. The relevant information for your counter's value should be the Value: 1.

For the Grafana ingest issue, I wonder if multiple exporters may be causing some sort of block. It should be possible to have multiple exporters, but perhaps we have a bug.

The Metrics SDK gem now creates and configures an OTLP exporter by default. This can be controlled using OTEL_METRICS_EXPORTER. The valid options are otlp (the default), console, in-memory, and none.

For debugging purposes, could you run some tests using the following scenarios:

  1. Comment out the three lines you shared from your Rails initializer and start up your application
  2. Keep the lines you have in your Rails initializer and run your app with the environment variable OTEL_METRICS_EXPORTER=none.

Can you let me know if either of those changes allow your data to reach Grafana?

@akahn

akahn commented Mar 10, 2025

I'm not sure what changed, but I am now seeing my metrics in Grafana. Thank you for looking at this!


Some unrelated feedback:

We're noticing that when shutting down our app, we have to wait the full metrics export interval before the app exits. That's 60 seconds by default, which slows down our deploys considerably, since they run a few rake tasks. I would expect that on app shutdown, we'd wait up to 60 seconds for any pending metrics to be exported, but it seems like the delay is always 60 seconds (due to this sleep call). As a workaround, we've set our export interval to 3 seconds, to minimize the slowdown of running these tasks, but that seems quite a bit more frequent than what we'd want once we're up and running with many custom metrics.

@kaylareopelle
Contributor Author

@akahn - Hooray! 🎉 Happy to hear the ingest is working now!


As far as the shutdown goes, thanks for letting us know! That's not how we'd like it to behave. We'll take a closer look.

In the meantime, could you try adding the following to your opentelemetry.rb file:

at_exit do
  OpenTelemetry.meter_provider.shutdown
end

I haven't tested this, but I believe it should call force_flush on the Periodic Exporting Metric Reader and break the cycle, though the sleep might still get in the way.

@xuan-cao-swi
Contributor

I think it's caused by calling shutdown, which sets @continue to false while @thread is still sleeping (e.g., in sleep(@export_interval)). As a result, the thread can't detect the new @continue value and break out of the loop. We could try killing @thread directly (@thread.kill), or abandon the use of sleep for the interval check so that the change to @continue can be detected earlier.
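
For illustration, the sleep-free idea could look roughly like this: the export loop waits on a ConditionVariable with a timeout, so shutdown can wake it immediately instead of waiting out the full interval. This is only a sketch of the pattern, not the actual PeriodicMetricReader internals:

# Sketch of the pattern only; class and method names are illustrative.
class IntervalLoop
  def initialize(interval_seconds)
    @interval = interval_seconds
    @mutex = Mutex.new
    @condition = ConditionVariable.new
    @continue = true
    @thread = Thread.new { run }
  end

  def shutdown
    @mutex.synchronize do
      @continue = false
      @condition.signal # wake the waiting thread right away
    end
    @thread.join
  end

  private

  def run
    @mutex.synchronize do
      while @continue
        # Returns when the interval elapses OR as soon as shutdown signals.
        @condition.wait(@mutex, @interval)
        break unless @continue

        export_metrics
      end
    end
  end

  def export_metrics
    # Collection and export would happen here.
  end
end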

@akahn

akahn commented Mar 11, 2025

That would be awesome, @xuan-cao-swi. We are hesitant to start using this in production because of the shutdown delay.

@akahn

akahn commented Mar 17, 2025

@kaylareopelle we already have that block running in our app, so it doesn't seem to help.

@akahn

akahn commented Mar 20, 2025

@kaylareopelle @xuan-cao-swi here is an attempt at faster shutdown, using the thread.kill approach you suggested. I'm looking forward to hearing your feedback. #1827

@xuan-cao-swi
Contributor

Hi @akahn, thank you for the PR. During the SIG meeting, we discussed using a ConditionVariable to replace the sleep (a second approach; see the PR), but let's keep both PRs open for further discussion.
cc @kaylareopelle, @arielvalentin

@akahn

akahn commented Apr 14, 2025

We are starting to use histogram metrics more in our Rails app and are seeing confusing issues where metrics do not report accurately. We see requests going over the wire on metrics export, but the data seems to be empty, as no data points end up visible in Prometheus. We will continue investigating on our end, but at this point we're wondering if our Puma web server is part of the issue. We use clustered mode, so there are multiple processes, each of which spawns multiple threads. Can you provide any guidance on the correct way to set up metrics reporting in a Rails app that uses Puma in this way?

@xuan-cao-swi
Contributor

Hi @akahn, these issues have more details: #1800 and #1823.
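
For anyone else who lands here with a forking server before those issues are resolved: one pattern to experiment with is creating the reader after each worker process forks, for example in Puma's on_worker_boot hook. This is only a sketch of that idea, not guidance confirmed in this thread; see the linked issues for the maintainers' discussion of fork behavior.

# config/puma.rb -- sketch only
on_worker_boot do
  exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
  reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(exporter: exporter)
  OpenTelemetry.meter_provider.add_metric_reader(reader)
end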
