Skip to content

Conversation

nsavoire
Copy link

@nsavoire nsavoire commented Sep 1, 2025

What does this PR do?

This PR stores the tracer process level context in a named anonymous mapping on Linux.

PROF-12406

Motivation

Project plan

  • Datadog proposal to OTel
  • Technical RFC on process-level information and why we chose named anonymous mapping over memfd

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors. You can check this by running ./scripts/lint.sh locally.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

Unsure? Have a question? Request a review!

@nsavoire nsavoire requested review from a team as code owners September 1, 2025 21:28
@nsavoire nsavoire force-pushed the nsavoire/process_level_context branch from 0cd2733 to 99d1142 Compare September 1, 2025 21:32
@pr-commenter
Copy link

pr-commenter bot commented Sep 1, 2025

Benchmarks

Benchmark execution time: 2025-09-08 10:24:49

Comparing candidate commit d14c0bc in PR branch nsavoire/process_level_context with baseline commit a3544b2 in branch main.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 5 metrics, 0 unstable metrics.

scenario:BenchmarkConfig/scenario_WithStartSpanConfig-24

  • 🟥 execution_time [+3.010ns; +3.740ns] or [+3.206%; +3.982%]

@nsavoire nsavoire changed the title [PROF-12406] feat: store OTEL process context in named anonymous mapping on Linux feat: store OTEL process context in named anonymous mapping on Linux Sep 2, 2025
@nsavoire nsavoire force-pushed the nsavoire/process_level_context branch 2 times, most recently from f4348e8 to 56e26fa Compare September 2, 2025 08:08
Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments!

My go-foo is not very strong so definitely worth getting a pass from someone that's not pretending to know go as well ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is cool! Much better than generating this by hand as we did in the C implementation ;)

copy(mappingBytes[headerSize:], data)
copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize))
// write the signature last to ensure that once a process validates the signature, it can safely read the whole data
copy(mappingBytes, otelContextSignature)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should become a https://pkg.go.dev/sync/atomic#Uint64.Store with the actual bytes turned into a single value so as to create a "synchronized before" that guarantees that the things before are visible to readers of this store? 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That kind of awkward to do in Go, but done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivoanjo could you say more about this? I don't follow how the atomic store would help here, at least based on the stuff in that document. Keep in mind this is like the second time I've read the memory model stuff :) Given this:

If a synchronizing read-like memory operation r observes a synchronizing write-like memory operation w (that is, if W(r) = w), then w is synchronized before r

combined with this

If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B

Are the regular old memory writes (the copy calls preceding this line) "synchronizing write-like operations"? And the bit about atomics seems like it only applies to synchronization between atomic operations, of which there would only be one here.

I understand the motivation, which is for an external process to not see partially-initialized data. But I don't feel 100% sure that changing this to an atomic helps. Especially since we're trying to synchronize with a completely different process reading this memory, which seems out of scope of the Go memory model. If we had a test that fails without this I'd feel better, but that's probably hard to do :)

Copy link
Member

@ivoanjo ivoanjo Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yesssss there's a bit to be said here. Here's my thinking:

Ideally what we want is for the signature to be the last thing, so that if the signature is observed, the rest is expected to be in place.

One additional note, is that, because we're asking the OS for memory directly using mmap we do get a nice property -- that memory is zeroed by the OS before giving it to us so at least we have this property that if we observe any field at zero, then things are weird.

Are the regular old memory writes (the copy calls preceding this line) "synchronizing write-like operations"? And the bit about atomics seems like it only applies to synchronization between atomic operations, of which there would only be one here.

Ok so let's talk the synchronization parts: I will preface this by stating that I'm going off my knowledge of other memory models (not being very familiar with Go), but this seems a reasonable assumption as https://go.dev/ref/mem?#atomic states (emphasis mine):

If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B. All the atomic operations executed in a program behave as though executed in some sequentially consistent order.

The preceding definition has the same semantics as C++’s sequentially consistent atomics and Java’s volatile variables.

The memory model page doesn't have an amazing quote for atomics for this (at least that I could see), but the way that "synchronizes-with" is used, my reading on it is that it means this synchronization not only affects the atomics but also for the things that happen before them in the program order.

For instance, if you see the channel communication definition:

Channel communication
Channel communication is the main method of synchronization between goroutines. Each send on a particular channel is matched to a corresponding receive from that channel, usually in a different goroutine.

A send on a channel is synchronized before the completion of the corresponding receive from that channel.

This program:

var c = make(chan int, 10)
var a string

func f() {
	a = "hello, world"
	c <- 0
}

func main() {
	go f()
	<-c
	print(a)
}

is guaranteed to print "hello, world". The write to a is sequenced before the send on c, which is synchronized before the corresponding receive on c completes, which is sequenced before the print.

You can spot that the fact this in action: the synchronization is on the writing/reading to the channel, but because in a single-threaded sequential execution a is written before c then this order is "propagated" to the other thread.

This is also what allows you do do something like in most languages (and I believe in Go, but again the example on this page is not amazing for locks)...

(do a bunch of writes)
lock()
finished_work = true
unlock()

and allows a reader that observes finished_work having grabbed the lock to also observe the writes that came before the flag was set to true.

Since they quote C++ sequentially consistency, here's from https://en.cppreference.com/w/cpp/atomic/memory_order.html :

Sequentially-consistent ordering
Atomic operations tagged memory_order_seq_cst [...] order memory the same way as release/acquire ordering (everything that happened-before a store in one thread becomes a visible side effect in the thread that did a load) [...]

Release-Consume ordering
If an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_consume, and the load in thread B reads a value written by the store in thread A, then the store in thread A is dependency-ordered before the load in thread B.

All memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects within those operations in thread B into which the load operation carries dependency, that is, once the atomic load is completed, those operators and functions in thread B that use the value obtained from the load are guaranteed to see what thread A wrote to memory.

(Java volatiles, also cited are not like C/C++ volatiles and behave the same way in terms of mandating an order of operations between threads IF they synchronize via the volatile)

I understand the motivation, which is for an external process to not see partially-initialized data. But I don't feel 100% sure that changing this to an atomic helps. Especially since we're trying to synchronize with a completely different process reading this memory, which seems out of scope of the Go memory model.

Yes, this is the hard part.

Usually memory models are defined in a more abstract way to try to allow for some flexibility in implementation, so while theoretically a go program would be correct if it could make the synchronization work only at the go level but somehow this didn't map to the expected ordering we intend at the hardware/memory level, in practice that's not very likely to happen. (I remember https://gee.cs.oswego.edu/dl/jmm/cookbook.html being an interesting resource of "what does the mumbo jumbo in the java concurrency model actually mean for actual architectures").

So here we're implicitly relying on this "we expect in practice using these atomics will mean the go compiler will a) Emit the assembly instructions in the order we intend -- the signature after everything else; and b) Emit a memory barrier that will tell the CPU that what came before does need to become visible together with the signature".

If we had a test that fails without this I'd feel better, but that's probably hard to do :)

The problem on these kinds of things is, sometimes in practice the compiler is already doing what we want, even if we didn't force its hand. So this is more about forcing the hand of the compiler, saying "really really really really pretty please don't touch on this".

Having said that, I suspect that, especially on arm64, if we did something like: a goroutine constantly publishes the information (let's say version = 1, then length = 1, then payload = 1, then signature = 1), then increments each one in this order; and another goroutine tries to read all 4 fields that we could see situations where signature was > than some of the other fields (meaning that there were reordered writes).

Finally, I'll add that the cost of getting this wrong is annoying but not catastrophic, since in the worst case we would see zeroes in the header, which are easy to detect. It's more annoying if somehow the writing of the payload gets reordered, but I'm not sure how likely it would be?

(Sorry for the TL;DR, you nerd-sniped me...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My two cents, which largely echo Ivo's comment, but shorter. We need two guarantees here:

a) The statements before the atomic operation can't be reordered to execute after the atomic execution by the Go compiler. This should be covered by Requirement 1 in the memory model.
b) The atomic operation must not be executed out of order by the processor. For this we rely on the mercy of the compiler to emit full memory fence instructions. But we can't imagine a compiler that wouldn't do that for atomics, so it's an acceptable risk.

So I think the current state of the patch is acceptable.

If we want stronger guarantees, we should probably replace the signature with a checksum. If another process reads the memory and find the checksum to not match, they need to retry. I'd be curious how y'all feel about that approach. I kinda like that it would work without any assumptions about the memory models of the language or processor architecture.

Copy link
Contributor

@nsrip-dd nsrip-dd Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want stronger guarantees, we should probably replace the signature with a checksum. If another process reads the memory and find the checksum to not match, they need to retry. I'd be curious how y'all feel about that approach. I kinda like that it would work without any assumptions about the memory models of the language or processor architecture.

I was going to suggest the same thing. My gut instinct is to avoid theoretical memory model-type arguments especially when they feel untested/untestable. Like the Go memory model reference says: "don't be clever" :) A checksum is more obviously correct and given that it'd be a one-time computation on a paltry 2 pages of memory I think it'd be basically free to add.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want stronger guarantees, we should probably replace the signature with a checksum. If another process reads the memory and find the checksum to not match, they need to retry. I'd be curious how y'all feel about that approach. I kinda like that it would work without any assumptions about the memory models of the language or processor architecture.

I was going to suggest the same thing. My gut instinct is to avoid theoretical memory model-type arguments especially when they feel untested/untestable. Like the Go memory model reference says: "don't be clever" :) A checksum is more obviously correct and given that it'd be a one-time computation on a paltry 2 pages of memory I think it'd be basically free to add.

I'm a bit divided on this one......

I was somehow hoping that the current approach was enough because we can take advantage of the whole "the mapping is zeroed by the kernel" thing. That is, a reader could rely on "if the mapping isn't blank" then it has observed the whole thing.

Y'all are right that the signature is stronger (no need for handwaving around memory models and weird reorderings). What gives me hesitation is complexity of implementation: now there's one more thing that all writers need to implement (and better not get it wrong!) as well as readers, and then we get into: what checksum should we use? Should we use a trivial one? crc32? A fancier crypto-secure one?

But... maybe I'm over-weighting the complexity of that and under-weighting the complexity of all of this concurrency mumbo-jumbo so... yeah please do chime in on this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I realize I've probably introduced quite a wrench in the works here... I agree that the signature would add a lot of implementation complexity. I wonder if the prctl later is already sufficient synchronization, like that all the copy calls happen before that, meaning the external process won't find our mapping before everything is written to it? That said, I had an idea for a way to do the synchronization we want that I personally feel a little better about. How about this?

diff --git a/internal/otelcontextmapping_linux.go b/internal/otelcontextmapping_linux.go
index 61c1c7f504..11072fae14 100644
--- a/internal/otelcontextmapping_linux.go
+++ b/internal/otelcontextmapping_linux.go
@@ -10,6 +10,7 @@
 import (
        "fmt"
        "os"
+       "sync"
        "unsafe"

        "github.com/DataDog/dd-trace-go/v2/internal/log"
@@ -71,14 +72,19 @@
        }

        addr := uintptr(unsafe.Pointer(&mappingBytes[0]))
-       header := processContextHeader{
-               Version:     1,
-               PayloadSize: uint32(len(data)),
-               PayloadAddr: addr + uintptr(headerSize),
-       }
-
-       copy(mappingBytes[headerSize:], data)
-       copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize))
+       var wg sync.WaitGroup
+       wg.Add(1)
+       go func() {
+               defer wg.Done()
+               header := processContextHeader{
+                       Version:     1,
+                       PayloadSize: uint32(len(data)),
+                       PayloadAddr: addr + uintptr(headerSize),
+               }
+               copy(mappingBytes[headerSize:], data)
+               copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize))
+       }()
+       wg.Wait()
        // write the signature last to ensure that once a process validates the signature, it can safely read the whole data
        copy(mappingBytes, otelContextSignature)

This, I think, achieves the same synchronization we'd want, without having to refer to the Go memory model and make arguments about "happens before" re: atomics. The wg.Wait ensures that we will observe all the other stuff in mappingBytes strictly before writing the signature. This is what I'd do if this was just memory shared within the same Go process, at least. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the prctl later is already sufficient synchronization, like that all the copy calls happen before that, meaning the external process won't find our mapping before everything is written to it?

Yeah this is one of those weird things where I'd expect it to be clearly stated in the memory model "if you're going to do IO this does impact the global ordering" but surprisingly it's not (I guess it's... implied? by the other things).

This, I think, achieves the same synchronization we'd want, without having to refer to the Go memory model and make arguments about "happens before" re: atomics. The wg.Wait ensures that we will observe all the other stuff in mappingBytes strictly before writing the signature. This is what I'd do if this was just memory shared within the same Go process, at least. WDYT?

I like it! I'd go for leaving the first two copies in the original goroutine, leaving only the signature copy on the separate goroutine -- I think this conveys even more the whole "this is done, then we sync with the other goroutine on the other part, then we come back.

Possibly with a comment that "this is possibly not needed but just in case this really should make sure the go compiler doesn't get any clever ideas"...

I agree that the signature would add a lot of implementation complexity.

Yeah, at this point I'm in the "I'd like to avoid it" side, but if there's overall concerns here we're still on time to "bite the bullet".

@nsavoire nsavoire requested a review from ivoanjo September 4, 2025 11:52
- Add link to reference implementation
- Atomically write the signature to the start of the mapping
- Ignore the return value of Prctl and do not log the error
@nsavoire nsavoire force-pushed the nsavoire/process_level_context branch from 2282839 to 3025072 Compare September 4, 2025 12:12
Copy link
Contributor

@nsrip-dd nsrip-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this with a simple Go program that just starts the APM tracer and then blocks forever. I've set DD_ENV, service name, and other config values, and then used Ivo's shell script to confirm that the memory mapping is there and has the expected contents.

Just left a few small comments.

// https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-version
ServiceVersion string `msg:"service.version"`
// https://opentelemetry.io/docs/specs/semconv/registry/attributes/telemetry/#telemetry-sdk-language
TelemetrySdkLanguage string `msg:"telemetry.sdk.language"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really small nitpick: the Go custom is to capitalize acronyms, like TelemetrySDKLanguage

copy(mappingBytes[headerSize:], data)
copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize))
// write the signature last to ensure that once a process validates the signature, it can safely read the whole data
copy(mappingBytes, otelContextSignature)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivoanjo could you say more about this? I don't follow how the atomic store would help here, at least based on the stuff in that document. Keep in mind this is like the second time I've read the memory model stuff :) Given this:

If a synchronizing read-like memory operation r observes a synchronizing write-like memory operation w (that is, if W(r) = w), then w is synchronized before r

combined with this

If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B

Are the regular old memory writes (the copy calls preceding this line) "synchronizing write-like operations"? And the bit about atomics seems like it only applies to synchronization between atomic operations, of which there would only be one here.

I understand the motivation, which is for an external process to not see partially-initialized data. But I don't feel 100% sure that changing this to an atomic helps. Especially since we're trying to synchronize with a completely different process reading this memory, which seems out of scope of the Go memory model. If we had a test that fails without this I'd feel better, but that's probably hard to do :)

existingMappingBytes []byte
)

type processContextHeader struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a _ structs.HostLayout field to the beginning of this struct. It seems to work fine without it, but adding that field would ensure that the memory layout matches the C layout, which is what I gather the consumer of this data will expect. Especially since we copy the raw memory for this struct below. Makes it more future-proof.

log.Error("failed to store the configuration: %s", err.Error())
}

processContext := otelProcessContext{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a doc that explains why were are not using the Memfd solution for OpenTelemetry? IIRC we had some good reason, but I forgot what it was. We should probably link to that doc here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reasons were:

  • memfd_create syscall might be blocked by some seccomp profiles
  • it requires more syscalls / IO since all symlinks in /proc/<pid>/fd/ need to be resolved until datadog-tracer-info-<id> file is found

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "not inherited by fork" is also a nice bonus, and connects back to the discussions happening internally on the meaning of "runtime-id" -- the anonymous mapping mechanism, because it's not inherited by forks won't have any issues with stale information (such as a runtime-id) being carried from parent to child process.

copy(mappingBytes[headerSize:], data)
copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize))
// write the signature last to ensure that once a process validates the signature, it can safely read the whole data
copy(mappingBytes, otelContextSignature)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My two cents, which largely echo Ivo's comment, but shorter. We need two guarantees here:

a) The statements before the atomic operation can't be reordered to execute after the atomic execution by the Go compiler. This should be covered by Requirement 1 in the memory model.
b) The atomic operation must not be executed out of order by the processor. For this we rely on the mercy of the compiler to emit full memory fence instructions. But we can't imagine a compiler that wouldn't do that for atomics, so it's an acceptable risk.

So I think the current state of the patch is acceptable.

If we want stronger guarantees, we should probably replace the signature with a checksum. If another process reads the memory and find the checksum to not match, they need to retry. I'd be curious how y'all feel about that approach. I kinda like that it would work without any assumptions about the memory models of the language or processor architecture.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants