-
Notifications
You must be signed in to change notification settings - Fork 484
feat: store OTEL process context in named anonymous mapping on Linux #3937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
0cd2733 to
99d1142
Compare
BenchmarksBenchmark execution time: 2025-09-08 10:24:49 Comparing candidate commit d14c0bc in PR branch Found 0 performance improvements and 1 performance regressions! Performance is the same for 5 metrics, 0 unstable metrics. scenario:BenchmarkConfig/scenario_WithStartSpanConfig-24
|
f4348e8 to
56e26fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of comments!
My go-foo is not very strong so definitely worth getting a pass from someone that's not pretending to know go as well ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is cool! Much better than generating this by hand as we did in the C implementation ;)
internal/otelcontextmapping_linux.go
Outdated
| copy(mappingBytes[headerSize:], data) | ||
| copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize)) | ||
| // write the signature last to ensure that once a process validates the signature, it can safely read the whole data | ||
| copy(mappingBytes, otelContextSignature) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should become a https://pkg.go.dev/sync/atomic#Uint64.Store with the actual bytes turned into a single value so as to create a "synchronized before" that guarantees that the things before are visible to readers of this store? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That kind of awkward to do in Go, but done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivoanjo could you say more about this? I don't follow how the atomic store would help here, at least based on the stuff in that document. Keep in mind this is like the second time I've read the memory model stuff :) Given this:
If a synchronizing read-like memory operation r observes a synchronizing write-like memory operation w (that is, if W(r) = w), then w is synchronized before r
combined with this
If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B
Are the regular old memory writes (the copy calls preceding this line) "synchronizing write-like operations"? And the bit about atomics seems like it only applies to synchronization between atomic operations, of which there would only be one here.
I understand the motivation, which is for an external process to not see partially-initialized data. But I don't feel 100% sure that changing this to an atomic helps. Especially since we're trying to synchronize with a completely different process reading this memory, which seems out of scope of the Go memory model. If we had a test that fails without this I'd feel better, but that's probably hard to do :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yesssss there's a bit to be said here. Here's my thinking:
Ideally what we want is for the signature to be the last thing, so that if the signature is observed, the rest is expected to be in place.
One additional note, is that, because we're asking the OS for memory directly using mmap we do get a nice property -- that memory is zeroed by the OS before giving it to us so at least we have this property that if we observe any field at zero, then things are weird.
Are the regular old memory writes (the copy calls preceding this line) "synchronizing write-like operations"? And the bit about atomics seems like it only applies to synchronization between atomic operations, of which there would only be one here.
Ok so let's talk the synchronization parts: I will preface this by stating that I'm going off my knowledge of other memory models (not being very familiar with Go), but this seems a reasonable assumption as https://go.dev/ref/mem?#atomic states (emphasis mine):
If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B. All the atomic operations executed in a program behave as though executed in some sequentially consistent order.
The preceding definition has the same semantics as C++’s sequentially consistent atomics and Java’s volatile variables.
The memory model page doesn't have an amazing quote for atomics for this (at least that I could see), but the way that "synchronizes-with" is used, my reading on it is that it means this synchronization not only affects the atomics but also for the things that happen before them in the program order.
For instance, if you see the channel communication definition:
Channel communication
Channel communication is the main method of synchronization between goroutines. Each send on a particular channel is matched to a corresponding receive from that channel, usually in a different goroutine.A send on a channel is synchronized before the completion of the corresponding receive from that channel.
This program:
var c = make(chan int, 10) var a string func f() { a = "hello, world" c <- 0 } func main() { go f() <-c print(a) }is guaranteed to print "hello, world". The write to a is sequenced before the send on c, which is synchronized before the corresponding receive on c completes, which is sequenced before the print.
You can spot that the fact this in action: the synchronization is on the writing/reading to the channel, but because in a single-threaded sequential execution a is written before c then this order is "propagated" to the other thread.
This is also what allows you do do something like in most languages (and I believe in Go, but again the example on this page is not amazing for locks)...
(do a bunch of writes)
lock()
finished_work = true
unlock()
and allows a reader that observes finished_work having grabbed the lock to also observe the writes that came before the flag was set to true.
Since they quote C++ sequentially consistency, here's from https://en.cppreference.com/w/cpp/atomic/memory_order.html :
Sequentially-consistent ordering
Atomic operations tagged memory_order_seq_cst [...] order memory the same way as release/acquire ordering (everything that happened-before a store in one thread becomes a visible side effect in the thread that did a load) [...]Release-Consume ordering
If an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_consume, and the load in thread B reads a value written by the store in thread A, then the store in thread A is dependency-ordered before the load in thread B.All memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects within those operations in thread B into which the load operation carries dependency, that is, once the atomic load is completed, those operators and functions in thread B that use the value obtained from the load are guaranteed to see what thread A wrote to memory.
(Java volatiles, also cited are not like C/C++ volatiles and behave the same way in terms of mandating an order of operations between threads IF they synchronize via the volatile)
I understand the motivation, which is for an external process to not see partially-initialized data. But I don't feel 100% sure that changing this to an atomic helps. Especially since we're trying to synchronize with a completely different process reading this memory, which seems out of scope of the Go memory model.
Yes, this is the hard part.
Usually memory models are defined in a more abstract way to try to allow for some flexibility in implementation, so while theoretically a go program would be correct if it could make the synchronization work only at the go level but somehow this didn't map to the expected ordering we intend at the hardware/memory level, in practice that's not very likely to happen. (I remember https://gee.cs.oswego.edu/dl/jmm/cookbook.html being an interesting resource of "what does the mumbo jumbo in the java concurrency model actually mean for actual architectures").
So here we're implicitly relying on this "we expect in practice using these atomics will mean the go compiler will a) Emit the assembly instructions in the order we intend -- the signature after everything else; and b) Emit a memory barrier that will tell the CPU that what came before does need to become visible together with the signature".
If we had a test that fails without this I'd feel better, but that's probably hard to do :)
The problem on these kinds of things is, sometimes in practice the compiler is already doing what we want, even if we didn't force its hand. So this is more about forcing the hand of the compiler, saying "really really really really pretty please don't touch on this".
Having said that, I suspect that, especially on arm64, if we did something like: a goroutine constantly publishes the information (let's say version = 1, then length = 1, then payload = 1, then signature = 1), then increments each one in this order; and another goroutine tries to read all 4 fields that we could see situations where signature was > than some of the other fields (meaning that there were reordered writes).
Finally, I'll add that the cost of getting this wrong is annoying but not catastrophic, since in the worst case we would see zeroes in the header, which are easy to detect. It's more annoying if somehow the writing of the payload gets reordered, but I'm not sure how likely it would be?
(Sorry for the TL;DR, you nerd-sniped me...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My two cents, which largely echo Ivo's comment, but shorter. We need two guarantees here:
a) The statements before the atomic operation can't be reordered to execute after the atomic execution by the Go compiler. This should be covered by Requirement 1 in the memory model.
b) The atomic operation must not be executed out of order by the processor. For this we rely on the mercy of the compiler to emit full memory fence instructions. But we can't imagine a compiler that wouldn't do that for atomics, so it's an acceptable risk.
So I think the current state of the patch is acceptable.
If we want stronger guarantees, we should probably replace the signature with a checksum. If another process reads the memory and find the checksum to not match, they need to retry. I'd be curious how y'all feel about that approach. I kinda like that it would work without any assumptions about the memory models of the language or processor architecture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want stronger guarantees, we should probably replace the signature with a checksum. If another process reads the memory and find the checksum to not match, they need to retry. I'd be curious how y'all feel about that approach. I kinda like that it would work without any assumptions about the memory models of the language or processor architecture.
I was going to suggest the same thing. My gut instinct is to avoid theoretical memory model-type arguments especially when they feel untested/untestable. Like the Go memory model reference says: "don't be clever" :) A checksum is more obviously correct and given that it'd be a one-time computation on a paltry 2 pages of memory I think it'd be basically free to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want stronger guarantees, we should probably replace the signature with a checksum. If another process reads the memory and find the checksum to not match, they need to retry. I'd be curious how y'all feel about that approach. I kinda like that it would work without any assumptions about the memory models of the language or processor architecture.
I was going to suggest the same thing. My gut instinct is to avoid theoretical memory model-type arguments especially when they feel untested/untestable. Like the Go memory model reference says: "don't be clever" :) A checksum is more obviously correct and given that it'd be a one-time computation on a paltry 2 pages of memory I think it'd be basically free to add.
I'm a bit divided on this one......
I was somehow hoping that the current approach was enough because we can take advantage of the whole "the mapping is zeroed by the kernel" thing. That is, a reader could rely on "if the mapping isn't blank" then it has observed the whole thing.
Y'all are right that the signature is stronger (no need for handwaving around memory models and weird reorderings). What gives me hesitation is complexity of implementation: now there's one more thing that all writers need to implement (and better not get it wrong!) as well as readers, and then we get into: what checksum should we use? Should we use a trivial one? crc32? A fancier crypto-secure one?
But... maybe I'm over-weighting the complexity of that and under-weighting the complexity of all of this concurrency mumbo-jumbo so... yeah please do chime in on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I realize I've probably introduced quite a wrench in the works here... I agree that the signature would add a lot of implementation complexity. I wonder if the prctl later is already sufficient synchronization, like that all the copy calls happen before that, meaning the external process won't find our mapping before everything is written to it? That said, I had an idea for a way to do the synchronization we want that I personally feel a little better about. How about this?
diff --git a/internal/otelcontextmapping_linux.go b/internal/otelcontextmapping_linux.go
index 61c1c7f504..11072fae14 100644
--- a/internal/otelcontextmapping_linux.go
+++ b/internal/otelcontextmapping_linux.go
@@ -10,6 +10,7 @@
import (
"fmt"
"os"
+ "sync"
"unsafe"
"github.com/DataDog/dd-trace-go/v2/internal/log"
@@ -71,14 +72,19 @@
}
addr := uintptr(unsafe.Pointer(&mappingBytes[0]))
- header := processContextHeader{
- Version: 1,
- PayloadSize: uint32(len(data)),
- PayloadAddr: addr + uintptr(headerSize),
- }
-
- copy(mappingBytes[headerSize:], data)
- copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize))
+ var wg sync.WaitGroup
+ wg.Add(1)
+ go func() {
+ defer wg.Done()
+ header := processContextHeader{
+ Version: 1,
+ PayloadSize: uint32(len(data)),
+ PayloadAddr: addr + uintptr(headerSize),
+ }
+ copy(mappingBytes[headerSize:], data)
+ copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize))
+ }()
+ wg.Wait()
// write the signature last to ensure that once a process validates the signature, it can safely read the whole data
copy(mappingBytes, otelContextSignature)
This, I think, achieves the same synchronization we'd want, without having to refer to the Go memory model and make arguments about "happens before" re: atomics. The wg.Wait ensures that we will observe all the other stuff in mappingBytes strictly before writing the signature. This is what I'd do if this was just memory shared within the same Go process, at least. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the prctl later is already sufficient synchronization, like that all the copy calls happen before that, meaning the external process won't find our mapping before everything is written to it?
Yeah this is one of those weird things where I'd expect it to be clearly stated in the memory model "if you're going to do IO this does impact the global ordering" but surprisingly it's not (I guess it's... implied? by the other things).
This, I think, achieves the same synchronization we'd want, without having to refer to the Go memory model and make arguments about "happens before" re: atomics. The wg.Wait ensures that we will observe all the other stuff in mappingBytes strictly before writing the signature. This is what I'd do if this was just memory shared within the same Go process, at least. WDYT?
I like it! I'd go for leaving the first two copies in the original goroutine, leaving only the signature copy on the separate goroutine -- I think this conveys even more the whole "this is done, then we sync with the other goroutine on the other part, then we come back.
Possibly with a comment that "this is possibly not needed but just in case this really should make sure the go compiler doesn't get any clever ideas"...
I agree that the signature would add a lot of implementation complexity.
Yeah, at this point I'm in the "I'd like to avoid it" side, but if there's overall concerns here we're still on time to "bite the bullet".
- Add link to reference implementation - Atomically write the signature to the start of the mapping - Ignore the return value of Prctl and do not log the error
2282839 to
3025072
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this with a simple Go program that just starts the APM tracer and then blocks forever. I've set DD_ENV, service name, and other config values, and then used Ivo's shell script to confirm that the memory mapping is there and has the expected contents.
Just left a few small comments.
ddtrace/tracer/otelprocesscontext.go
Outdated
| // https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-version | ||
| ServiceVersion string `msg:"service.version"` | ||
| // https://opentelemetry.io/docs/specs/semconv/registry/attributes/telemetry/#telemetry-sdk-language | ||
| TelemetrySdkLanguage string `msg:"telemetry.sdk.language"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really small nitpick: the Go custom is to capitalize acronyms, like TelemetrySDKLanguage
internal/otelcontextmapping_linux.go
Outdated
| copy(mappingBytes[headerSize:], data) | ||
| copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize)) | ||
| // write the signature last to ensure that once a process validates the signature, it can safely read the whole data | ||
| copy(mappingBytes, otelContextSignature) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivoanjo could you say more about this? I don't follow how the atomic store would help here, at least based on the stuff in that document. Keep in mind this is like the second time I've read the memory model stuff :) Given this:
If a synchronizing read-like memory operation r observes a synchronizing write-like memory operation w (that is, if W(r) = w), then w is synchronized before r
combined with this
If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B
Are the regular old memory writes (the copy calls preceding this line) "synchronizing write-like operations"? And the bit about atomics seems like it only applies to synchronization between atomic operations, of which there would only be one here.
I understand the motivation, which is for an external process to not see partially-initialized data. But I don't feel 100% sure that changing this to an atomic helps. Especially since we're trying to synchronize with a completely different process reading this memory, which seems out of scope of the Go memory model. If we had a test that fails without this I'd feel better, but that's probably hard to do :)
| existingMappingBytes []byte | ||
| ) | ||
|
|
||
| type processContextHeader struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a _ structs.HostLayout field to the beginning of this struct. It seems to work fine without it, but adding that field would ensure that the memory layout matches the C layout, which is what I gather the consumer of this data will expect. Especially since we copy the raw memory for this struct below. Makes it more future-proof.
| log.Error("failed to store the configuration: %s", err.Error()) | ||
| } | ||
|
|
||
| processContext := otelProcessContext{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a doc that explains why were are not using the Memfd solution for OpenTelemetry? IIRC we had some good reason, but I forgot what it was. We should probably link to that doc here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reasons were:
memfd_createsyscall might be blocked by some seccomp profiles- it requires more syscalls / IO since all symlinks in
/proc/<pid>/fd/need to be resolved untildatadog-tracer-info-<id>file is found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "not inherited by fork" is also a nice bonus, and connects back to the discussions happening internally on the meaning of "runtime-id" -- the anonymous mapping mechanism, because it's not inherited by forks won't have any issues with stale information (such as a runtime-id) being carried from parent to child process.
internal/otelcontextmapping_linux.go
Outdated
| copy(mappingBytes[headerSize:], data) | ||
| copy(mappingBytes[:headerSize], unsafe.Slice((*byte)(unsafe.Pointer(&header)), headerSize)) | ||
| // write the signature last to ensure that once a process validates the signature, it can safely read the whole data | ||
| copy(mappingBytes, otelContextSignature) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My two cents, which largely echo Ivo's comment, but shorter. We need two guarantees here:
a) The statements before the atomic operation can't be reordered to execute after the atomic execution by the Go compiler. This should be covered by Requirement 1 in the memory model.
b) The atomic operation must not be executed out of order by the processor. For this we rely on the mercy of the compiler to emit full memory fence instructions. But we can't imagine a compiler that wouldn't do that for atomics, so it's an acceptable risk.
So I think the current state of the patch is acceptable.
If we want stronger guarantees, we should probably replace the signature with a checksum. If another process reads the memory and find the checksum to not match, they need to retry. I'd be curious how y'all feel about that approach. I kinda like that it would work without any assumptions about the memory models of the language or processor architecture.
What does this PR do?
This PR stores the tracer process level context in a named anonymous mapping on Linux.
PROF-12406
Motivation
Project plan
Reviewer's Checklist
./scripts/lint.shlocally.Unsure? Have a question? Request a review!