
Metrics feedback #1662

Open
kaylareopelle opened this issue Jul 31, 2024 · 26 comments

@kaylareopelle
Contributor

We want your feedback on the opentelemetry-metrics-sdk and opentelemetry-metrics-api gems! Understanding how these libraries are used by the community will help us prioritize future work on Metrics.

@chloe-meister

Hi there 👋

I've started a prototype implementation of metrics in a Rails app, and I was able to get some metrics exported using the SDK example.

However, from the README it isn't clear where and when I should pull the metrics and shut down the provider using these two lines:

OpenTelemetry.meter_provider.metric_readers.each(&:pull)
OpenTelemetry.meter_provider.shutdown

What we'd like to do is increment a counter each time a specific controller action is called. I can do that with a standard before_action filter, but from there I don't understand when the metrics should be pulled from the reader. Any advice?

@kaylareopelle
Contributor Author

Hi @chloe-meister! Thanks for reaching out!

The README provides an example of a script that runs once, pulls the metric, and exits the program.

If you're open to it, I think a periodic metric reader is a better fit for a Rails application. It'll collect the metrics and pass them to the exporter on a set interval. The process will live as long as the Rails application does, or until OpenTelemetry.meter_provider.shutdown is manually called.

Assuming you're using the OTLP exporter, rather than calling:

otlp_metric_exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
OpenTelemetry.meter_provider.add_metric_reader(otlp_metric_exporter)

You would call something like:

# config/initializers/opentelemetry.rb
reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
          exporter: OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new,
          export_interval_millis: 3000,
          export_timeout_millis: 10000)

OpenTelemetry.meter_provider.add_metric_reader(reader)

You can create the meter and the counter the same way as in the example.
If you prefer the console exporter, just swap OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter with OpenTelemetry::SDK::Metrics::Export::ConsoleMetricPullExporter.

With the arguments above, your metrics would be exported every three seconds with a 10-second timeout.

Keep in mind that if you're using the OTLP exporter, you also need to install the opentelemetry-exporter-otlp-metrics gem.
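
To connect this back to your before_action question: you can create the counter once in the initializer and increment it from the filter, and the periodic reader takes care of pulling and exporting, so nothing needs to call pull or shutdown inside the request cycle. Here's a rough sketch with placeholder controller and metric names (these aren't from the README, just illustrative):

# config/initializers/opentelemetry.rb (after the reader setup above)
APP_METER = OpenTelemetry.meter_provider.meter('my_rails_app')
ACTION_COUNTER = APP_METER.create_counter(
  'controller_action_calls',
  unit: 'calls',
  description: 'Number of times the action was called'
)

# app/controllers/widgets_controller.rb -- placeholder controller
class WidgetsController < ApplicationController
  before_action :count_action_call, only: :show

  def show
    # ...
  end

  private

  def count_action_call
    # Increment the counter; the PeriodicMetricReader exports it on its own schedule.
    ACTION_COUNTER.add(1, attributes: { 'action' => action_name })
  end
end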

@chloe-meister

Hi @kaylareopelle! Thank you very much for the detailed answer. This helps a lot 🙏

@chloe-meister

Hi @kaylareopelle! Thanks to your guidance, I managed to get something exported. We use Grafana and I can see something matching my meter provider's attributes. However, no metrics seem to be attached: no matter how the @data_points in the @metric_store object are updated, none of that information reaches Grafana.

Do you have an idea of what could be wrong here?

Below you can find my setup:

# config/initializers/open_telemetry.rb

OpenTelemetry::SDK.configure do |config|
  config.service_name = ENV.fetch('SERVICE_NAME', 'unknown')

  group = { 'service.group' => ENV.fetch('SERVICE_GROUP', 'unknown') }
  config.resource = OpenTelemetry::SDK::Resources::Resource.create(group)
end

otlp_metric_exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
  exporter: otlp_metric_exporter,
  export_interval_millis: 15_000,
  export_timeout_millis: 30_000
)
OpenTelemetry.meter_provider.add_metric_reader(reader)

queue_latency_meter = OpenTelemetry.meter_provider.meter('QUEUE_LATENCY')
QueueLatency = queue_latency_meter.create_histogram('queue_latency', unit: 'job_latency', description: 'cumulative latency for all queues')

I use this in a background job, collecting this information:

QueueLatency.record(queue.latency, attributes: { queue_name: queue.name })

@kaylareopelle
Contributor Author

Hi @chloe-meister! Thanks for sharing these details! Sorry things aren't working as expected.

Everything you've shared looks good to me. I copied your setup and created a simple Rails app using Sidekiq for the background jobs and was able to record data points to the QueueLatency histogram and export them via OTLP.

I was surprised that two resources were created: one for the Rails process and one for the Sidekiq process. The QueueLatency metric on the Rails process never had any data points. Instead, they were all recorded in the Sidekiq process.

I haven't worked with Grafana before, but I can do some testing with that observability backend next week. Perhaps there's something that environment needs that we're missing.

Earlier this week, we released a fix for a bug in the metrics SDK that left users who set the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable without a valid aggregation_temporality, which prevented some observability backends from ingesting the metrics. Are you setting that environment variable? If so, updating to the latest version (if you're not already on it) may help.

It might also be helpful to see the raw data for the metric you're sending, either by creating a periodic metric reader with a console exporter, or by setting up a collector with the debug exporter set to detailed verbosity.

Here's how to set up the console exporter:

# config/initializers/opentelemetry.rb

# ... 

console_metric_exporter = OpenTelemetry::SDK::Metrics::Export::ConsoleMetricPullExporter.new
console_reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
  exporter: console_metric_exporter,
  export_interval_millis: 15_000,
  export_timeout_millis: 30_000
)

OpenTelemetry.meter_provider.add_metric_reader(console_reader)

Either of these options should provide plain-text output of the data being sent over the wire. Would you be able to update your application to generate this output and paste it (with sensitive information removed) on this issue? Seeing the structure of your data might help us pinpoint any malformed or missing elements that could be preventing the data points from being ingested.

@eduAntequera

Hello @kaylareopelle! I am a co-worker of @chloe-meister.
First of all thanks so much for the help and the guidance provided so far 😄

We did not set the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE env var, so I think that should not be a problem.

We followed your advice and used the ConsoleMetricPullExporter, which helped us see the raw data points for the metric.
Here are some screenshots I took from it:
[Screenshot: otl_data_debug]
[Screenshot: otl_data_debug2]

Right after that, I also get a line saying "Flushed 2 metrics", since I am also sending a second set of histogram metrics.

A bit of extra information: some data does seem to reach Grafana and we can see it there, namely the attributes shown here:

[Screenshot from 2024-10-28 12-16-41]

Thank you for all the help provided!

@kaylareopelle
Contributor Author

Hi @eduAntequera, thank you for these screenshots! Everything looks good in this context, so it doesn't yet seem like the problem is in the metrics SDK. This is puzzling!

For the next phase of debugging, I think we need to focus on the OTLP exporter specifically. To do that, I recommend we move to the collector setup with a debug exporter set to detailed verbosity.

Could you give that a try and send me the output for a payload that isn't showing up in Grafana as expected?

Here's some example output from my reproduction app:

Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> service.name: Str(unknown)
     -> process.pid: Int(2732)
     -> process.command: Str(<redacted>/bin/sidekiq)
     -> process.runtime.name: Str(ruby)
     -> process.runtime.version: Str(3.3.1)
     -> process.runtime.description: Str(ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-darwin23])
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.language: Str(ruby)
     -> telemetry.sdk.version: Str(1.5.0)
     -> service.group: Str(unknown)
     -> host.name: Str(ce5b8509a987)
     -> os.type: Str(linux)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope QUEUE_LATENCY 
Metric #0
Descriptor:
     -> Name: queue_latency
     -> Description: cumulative latency for all queues
     -> Unit: job_latency
     -> DataType: Histogram
     -> AggregationTemporality: Delta
HistogramDataPoints #0
Data point attributes:
     -> queue_name: Str(queue.name)
StartTimestamp: 2024-10-29 22:23:54.933843 +0000 UTC
Timestamp: 2024-10-29 22:24:54.942435 +0000 UTC
Count: 9
Sum: 37.000000
Min: 0.000000
Max: 9.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
Buckets #0, Count: 1
Buckets #1, Count: 5
Buckets #2, Count: 3
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0

@eduAntequera

Hi @kaylareopelle, sorry for the delay and thanks for the guidance.
Here are some screenshots, as requested:

[Screenshot from 2024-10-31 17-15-53]
[Screenshot from 2024-10-31 17-16-37]
[Screenshot from 2024-10-31 17-17-08]

I hope they help

@kaylareopelle
Contributor Author

Hi @eduAntequera! Thanks for the screenshots! I'm about to take some time off through next Monday, so perhaps someone else can jump in here to help in the meantime.

These screenshots helped me realize that both exporters are recording the metric only once, and it seems to have a value of 0.0. Is that accurate? For these samples, how many times has QueueLatency.record(queue.latency, attributes: { queue_name: queue.name }) been called? What are some values you expect queue.latency to have recorded?

@eduAntequera

Hi @kaylareopelle thanks for the reply!
I checked, and 0.0 could indeed be a value. I have changed the code to generate random values instead and will update here with screenshots showing values other than 0.0. I also decided to change the metric names, since I was seeing a metric in Grafana for queue_size but none for queue_latency, which made me think the names might be colliding. So I changed them to sidekiq_queue_size and sidekiq_queue_latency.
With those names and randomly generated values, the metrics still did not appear in Grafana (the queue_size metric still appeared).

Here are some of the results when sending randomly generated values:
[Screenshot from 2024-11-06 10-50-07]
[Screenshot from 2024-11-06 10-50-21]
[Screenshot from 2024-11-06 10-50-45]

Thanks for all the help and hope you enjoy your time off 😄

@kaylareopelle
Contributor Author

Hi @eduAntequera! Thank you for this test and for your patience. I took another look and unfortunately I'm not seeing any issues with the output.

This might be an issue related to Grafana's ingest process rather than the OpenTelemetry Ruby metrics gems. I reached out on the CNCF Slack to someone who works for Grafana and they're going to take a look at the issue. If you're able to open a support ticket with a representative from the company, that may also be a good next step.

If they have anything they'd like us to test or change with the metrics code to help with troubleshooting, let us know!

@eduAntequera

@kaylareopelle Thank you so much for looking into it and contacting them 🙇
We will let you know if we get any news :)
Sorry for the late reply, the notification about your reply got lost in my inbox 😓

@seizethedave

seizethedave commented Jan 21, 2025

For those having issues sending metrics to Grafana/Prometheus with the Ruby SDK: try setting the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable to cumulative, since Prometheus doesn't yet support delta temporality. One way to do it:

bash% OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative ./my_ruby_app
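
If you'd rather keep it in Ruby, setting the variable in the initializer before the exporter is created may also work. I haven't verified exactly when the exporter reads it, so treat this as a sketch:

# config/initializers/opentelemetry.rb
# Assumption: the OTLP metrics exporter reads this variable when it is created,
# so set it before instantiating the exporter.
ENV['OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE'] ||= 'cumulative'

exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new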

@akahn

akahn commented Jan 31, 2025

I'm debugging an issue where metrics produced by my app aren't visible in Grafana (we use Alloy as the OpenTelemetry Collector). Now I'm using the debug collector in development, as recommended here, to verify that my setup is correct.

In a Rails initializer I do this:

exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(exporter: exporter)
OpenTelemetry.meter_provider.add_metric_reader(reader)

Then, in a controller action I do this:

meter = OpenTelemetry.meter_provider.meter('my_meter')
scores_posted = meter.create_counter('my_counter')
scores_posted.add(1, attributes: { attributes: 'my_attributes' })

In the debug collector logs I see this:

Metric #10
Descriptor:
     -> Name: my_counter
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
     -> attributes: Str(my_attributes)
StartTimestamp: 2025-01-31 16:41:01.763851223 +0000 UTC
Timestamp: 2025-01-31 16:42:01.775156029 +0000 UTC
Value: 1

Is this expected, for NumberDataPoints to be 0? Is the relevant information conveyed just by Value?

@akahn

akahn commented Feb 8, 2025

@kaylareopelle, any chance you could take a look at this? Any suggestions for what to investigate?

@kaylareopelle
Contributor Author

Hi @akahn, thanks for the ping! I'm sorry about the delay here!

In this scenario, I believe NumberDataPoints #0 is expected. The output is from the array of NumberDataPoints and lists the index for the NumberDataPoint being displayed. If you had another NumberDataPoint, it should have the heading NumberDataPoints #1. The relevant information for your counter's value should be the Value: 1.

For the Grafana ingest issue, I wonder if multiple exporters may be causing some sort of block. It should be possible to have multiple exporters, but perhaps we have a bug.

The Metrics SDK gem now creates and configures an OTLP exporter by default. This can be controlled using OTEL_METRICS_EXPORTER. The valid options are otlp (the default), console, in-memory, and none.

For debugging purposes, could you run some tests using the following scenarios:

  1. Comment out the three lines you shared from your Rails initializer and start up your application
  2. Keep the lines you have in your Rails initializer and run your app with the environment variable OTEL_METRICS_EXPORTER=none.

Can you let me know if either of those changes allow your data to reach Grafana?

@akahn

akahn commented Mar 10, 2025

I'm not sure what changed, but I am now seeing my metrics in Grafana. Thank you for looking at this!


Some unrelated feedback:

We're noticing that when shutting down our app, we have to wait the full metrics export interval before the app exits. That's 60 seconds by default, which slows down our deploys considerably, since they run a few rake tasks. I would expect that on app shutdown, we'd wait up to 60 seconds for any pending metrics to be exported, but it seems like the delay is always 60 seconds (due to this sleep call). As a workaround, we've set our export interval to 3 seconds, to minimize the slowdown of running these tasks, but that seems quite a bit more frequent than what we'd want once we're up and running with many custom metrics.

@kaylareopelle
Contributor Author

@akahn - Hooray! 🎉 Happy to hear the ingest is working now!


As far as the shutdown goes, thanks for letting us know! That's not how we'd like it to behave. We'll take a closer look.

In the meantime, could you try adding the following to your opentelemetry.rb file:

at_exit do
  OpenTelemetry.meter_provider.shutdown
end

I haven't tested this, but I believe it should call force_flush on the Periodic Exporting Metric Reader and break the cycle, though the sleep might still get in the way.

@xuan-cao-swi
Contributor

I think it's caused by calling shutdown, which sets @continue to false while @thread is still sleeping (e.g., in sleep(@export_interval)). As a result, the thread can't detect the new @continue value and break out of the loop. We could try killing @thread directly (@thread.kill), or abandon the use of sleep for the interval check so that the change to @continue can be detected earlier.
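
For illustration, the sleep-free idea could look roughly like this: the export loop waits on a ConditionVariable with a timeout, so shutdown can wake it immediately instead of waiting out the full interval. This is only a sketch of the pattern, not the actual PeriodicMetricReader internals:

# Sketch of the pattern only; class and method names are illustrative.
class IntervalLoop
  def initialize(interval_seconds)
    @interval = interval_seconds
    @mutex = Mutex.new
    @condition = ConditionVariable.new
    @continue = true
    @thread = Thread.new { run }
  end

  def shutdown
    @mutex.synchronize do
      @continue = false
      @condition.signal # wake the waiting thread right away
    end
    @thread.join
  end

  private

  def run
    @mutex.synchronize do
      while @continue
        # Returns when the interval elapses OR as soon as shutdown signals.
        @condition.wait(@mutex, @interval)
        break unless @continue

        export_metrics
      end
    end
  end

  def export_metrics
    # Collection and export would happen here.
  end
end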

@akahn

akahn commented Mar 11, 2025

That would be awesome, @xuan-cao-swi. We are hesitant to start using this in production because of the shutdown delay.

@akahn

akahn commented Mar 17, 2025

@kaylareopelle we already have that block running in our app, so it doesn't seem to help.

@akahn

akahn commented Mar 20, 2025

@kaylareopelle @xuan-cao-swi here is an attempt at faster shutdown, using the thread.kill approach you suggested. I'm looking forward to hearing your feedback. #1827

@xuan-cao-swi
Contributor

Hi @akahn, thank you for the PR. During the SIG meeting, we discussed using a ConditionVariable to replace the sleep (a second approach; see the PR), but let's keep both PRs open for further discussion.
cc @kaylareopelle, @arielvalentin

@akahn

akahn commented Apr 14, 2025

We are starting to use histogram metrics more in our Rails app and are seeing confusing issues where metrics do not report accurately. We see requests going over the wire on metrics export, but the data seems to be empty, as no data points end up visible in Prometheus. We will continue investigating on our end, but at this point we're wondering if our Puma web server is part of the issue. We use clustered mode, so there are multiple processes, each of which spawns multiple threads. Can you provide any guidance on the correct way to set up metrics reporting in a Rails app that uses Puma in this way?

@xuan-cao-swi
Contributor

Hi @akahn, these issues have more details: #1800 and #1823.
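
For anyone else who lands here with a forking server before those issues are resolved: one pattern to experiment with is creating the reader after each worker process forks, for example in Puma's on_worker_boot hook. This is only a sketch of that idea, not guidance confirmed in this thread; see the linked issues for the maintainers' discussion of fork behavior.

# config/puma.rb -- sketch only
on_worker_boot do
  exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
  reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(exporter: exporter)
  OpenTelemetry.meter_provider.add_metric_reader(reader)
end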
