Skip to content

Add OpenTracing module #418

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 5, 2021
Merged

Conversation

Spikhalskiy
Copy link
Contributor

@Spikhalskiy Spikhalskiy commented Apr 1, 2021

What was changed:

This PR adds a new temporal-opentracing module with support for OpenTracing context propagation from the client to workflow and to activity execution.

Why?

We want to be able to tie all the parts of the workflow execution under the same client parent OpenTracing span for tracing and debugging.

How it can be used?

See README provided with this module: https://github.com/temporalio/sdk-java/pull/418/files#diff-a87181e0e99e6147518982b133e66674da61be50e9898fe9d1eab9dc91ab2cc6R5

Closes

#258

@Spikhalskiy Spikhalskiy marked this pull request as draft April 1, 2021 06:15
@Spikhalskiy Spikhalskiy force-pushed the opentracing branch 6 times, most recently from ba3d259 to 7b2ecd5 Compare April 1, 2021 22:41
@Spikhalskiy Spikhalskiy marked this pull request as ready for review April 1, 2021 22:42
@Spikhalskiy
Copy link
Contributor Author

Spikhalskiy commented Apr 1, 2021

There will be a couple more things later today:

  1. moving options into optional input parameters
  2. logging failures into a Span

But this is ready for the main review and feedback.

Copy link
Contributor

@vitarb vitarb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Dmitry, I left some comments here, please ping me on slack if you have questions.

Comment on lines 55 to 59
openTracingConfig.getWorkflowRunOperationNamePrefix()
+ "-"
+ Workflow.getInfo().getWorkflowType(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add workflow type at the end? Should we add it into tags instead?

Copy link
Contributor Author

@Spikhalskiy Spikhalskiy Apr 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just WorkflowStart or WorkflowRun as an operation name is very undescriptive and pretty meaningless.

A definition/best practice from OpenTracing Specification:

An operation name, a human-readable string which concisely represents the work done by the Span (for example, an RPC method name, a function name, or the name of a subtask or stage within a larger computation). The operation name should be the most general string that identifies a (statistically) interesting class of Span instances. That is, "get_user" is better than "get_user/314159".

For example, here are potential operation names for a Span that gets hypothetical account information:

Operation Name Guidance
get Too general
get_account/792 Too specific
get_account Good, and account_id=792 would make a nice Span tag

Also from Open Tracing Guide:

The reason for choosing general operation names is to allow the tracing systems to do aggregations. For example, Jaeger tracer has an option of emitting metrics for all the traffic going through the application. Having a unique operation name for each span would make the metrics useless. For application or library developers who wish to capture the program arguments in the traces to distinguish them, the recommended solution is to annotate spans with tags or logs. These are discussed in the section below.

Each operation name should have some semantic meaning to it and should be chosen such that its cardinality is neither too high nor too low. The cardinality of an operation name is defined as follows:

card(operation X) = total number of spans of operation X

For example, when using http.url (as operation name) the cardinality would be too high, on the other hand if http.method is chosen as the operation name then the cardinality would be too low.

I think that my current solution is most in line with these guidelines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually goes against what you've quoted above, specifically "operation name should be the most general string" as we concatenate specific workflow type to operation.

Copy link
Contributor Author

@Spikhalskiy Spikhalskiy Apr 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you just took a part of the phrase out of the context. The full phrase is "most general string that identifies a (statistically) interesting class of Span instances". And there are more phrases around that explain what does it mean.
You don't suppose to just select the most general string. You select the most general string that makes sense semantically for the process that we trace and which cardinality is not too small or too large (because it would be useless).

  1. human-readable string which concisely represents the work done by the Span (for example, an RPC method name, a function name, or the name of a subtask or stage within a larger computation)

WorkflowStart, WorkflowRun, ActivityStart, ActivityRun don't "concisely represents the work done by the Span"

  1. For example, when using http.url (as operation name) the cardinality would be too high, on the other hand if http.method is chosen as the operation name then the cardinality would be too low.

Your offer is close to using http.method from this example.

  1. The first link gives us specific good and bad examples:
Operation Name Guidance
get Too general
get_account/792 Too specific
get_account Good, and account_id=792 would make a nice Span tag

Your offer will lead to making an operation name EVEN MORE generic than this get example. Your offer is comparable to "Just Some HTTP Call" level. Which doesn't even presented in the examples, because it's clearly too generic.

  1. Even temporal go sdk uses workflow/activity types as open tracing span operation names: https://github.com/temporalio/sdk-go/blob/cec615e1aaf798e1b153ba3918b4d3027edde808/internal/propagation.go#L50
    https://github.com/temporalio/sdk-go/blob/cec615e1aaf798e1b153ba3918b4d3027edde808/internal/propagation.go#L64

I stand on the fact that my current solution exactly implements the guidelines or close to them and it should be implemented this way.
I'm ready to compromise on removing the prefixes and putting them in tags (which will make Start and Run spans indistinguishable for the reader from the first sight). But I'm not ready to compromse on taking workflow/activity types out from span operation names. It will be an incorrect implementation of the spec. It will also be an implementation that is not in agreement with how other opentracing implementations (and even Temporal own Go SDK) are done.
I'm also a future user of this solution and I'm very motivated to keep it this way. Your proposal will make the trace look like a chain of indistinguishable "ActivityRun" spans without digging into tags and it will be a disaster for a user who actually will read the trace.

Copy link
Contributor Author

@Spikhalskiy Spikhalskiy Apr 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how OpenTelemetry UIs look like:
c6463a2b-trace-300
0_WDqZqICj-V23_1YW
aanuraag_Distributed-Tracing-OpenTelemetry_f1
To even get to tags you need to open the details of a specific span. How comfortable it will work with a sequence of indistinguishable ActivityRun spans?

Also, see OpenTracing implementation for JaxRS2 (as the closest framework to Temporal because it's aware about endpoint class and method names):
https://github.com/opentracing-contrib/java-jaxrs/blob/master/opentracing-jaxrs2/src/main/java/io/opentracing/contrib/jaxrs2/server/OperationNameProvider.java#L46

Default Microprofile operation name <HTTP method>:<package name>.<Class name>.<method name>

<HTTP method> is our StartWorkflow/RunActivity, etc.
<package name>.<Class name>.<method name> is our ActivityType/WorkflowType

They do provide an alternative strategy

  /**
   * Returns HTTP method as operation name
   */
  class HTTPMethodOperationName implements OperationNameProvider {

but it's not a default.

@Spikhalskiy
Copy link
Contributor Author

Spikhalskiy commented Apr 2, 2021

@vitarb Thanks for your review!

  • Addressed PR comments
  • Implemented logging of workflow/activity failure flags
  • Reworked and exposed OpenTracingOptions

This PR is ready to be revisited.

@Spikhalskiy Spikhalskiy force-pushed the opentracing branch 6 times, most recently from 1be097d to 6c01e14 Compare April 2, 2021 15:47
@Spikhalskiy Spikhalskiy force-pushed the opentracing branch 2 times, most recently from 6d21943 to a8587b9 Compare April 3, 2021 01:30
@Spikhalskiy Spikhalskiy force-pushed the opentracing branch 4 times, most recently from 4550a09 to 74a7941 Compare April 3, 2021 02:07
@Spikhalskiy Spikhalskiy requested a review from vitarb April 3, 2021 02:23
@Spikhalskiy Spikhalskiy force-pushed the opentracing branch 7 times, most recently from 188de6d to a2df830 Compare April 4, 2021 21:47
@vitarb vitarb merged commit d9a0c15 into temporalio:master Apr 5, 2021
@Spikhalskiy Spikhalskiy deleted the opentracing branch April 15, 2022 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants