
No-std support? #42


Open

VorpalBlade opened this issue Apr 23, 2025 · 13 comments

Comments

@VorpalBlade

VorpalBlade commented Apr 23, 2025

Would it be feasible to add optional no-std/no-alloc support to this crate? Or would it be easier to write a separate implementation entirely for my use case in embedded? And is this something you would even be interested in?

Looking at your dependencies, crossbeam-utils already supports disabling its std feature, though I don't know whether that would remove any API you depend on.

My use case is embedded: reading high-volume sensor data into buffers with DMA and then having another thread process the data. If I don't manage to keep up with the processing, I would rather drop the oldest data than the newest, and it seems this crate would solve that issue.

Everything will have to be statically allocated, as I don't have alloc, and I need control over which linker section the allocations go into (which can be done using #[link_section = ".section.name"] on the static).

I understand if this is out of scope for your crate or if it would be too disruptive to the way it is architected.

@HadrienG2
Owner

HadrienG2 commented Apr 23, 2025

There was a quick discussion of no-std support five years ago, which stalled because (1) we could not think of a nice way to replace the Arc<SharedState> in the input and output interfaces in environments without alloc, without sacrificing ergonomics (e.g. requiring unsafe for use), and (2) none of the people involved had a solid embedded background, so we were afraid of making rookie mistakes by overdesigning for a use case we didn't understand.

Both the language and my skills have improved since then, however, and I think that these days, given an explanation of how the no-std design should work (ideally via code, e.g. a rough prototype or an embedded-friendly fork of this crate), I should be able to figure out the dirty API/generics magic needed to keep the alloc interface as nice as it currently is while making the no-alloc interface viable (if possibly a bit uglier).

For your particular use case, though, I wonder if triple-buffer is the right design to begin with. Since we are talking about some kind of streaming data input, I am not sure that the "access the latest value" semantics of triple-buffer match your use case that well; spontaneously, I would look more into ring-buffer-ish constructs that let you access the last N values, as in my other rt-history crate. But perhaps I am misunderstanding the use case.

@VorpalBlade
Author

Ring buffers are a fair point. Though if you want to avoid copying data into or out of the ring buffer (doing DMA and processing directly on the data in the buffer), it seems difficult to me to discard old data if the consumer ends up lagging. The opposite, applying back pressure, is trivial.

The other aspect (which I didn't think to mention) is that I need to process in fixed chunks, since I'm doing FFT on the sound data that I'm reading in. And the third aspect, which I should perhaps have clarified, is that I don't have much RAM, so I need to keep the number of buffers down.

At that point, what is the difference between triple buffering and a ring buffer with three buffers that prefers discarding old values?

I think I could do this with three statically allocated buffers in memory that supports DMA (that's right, only some of the RAM I have available actually works for DMA) and a pair of atomic pointer-sized values. That would be the low-level unsafe approach, though. I need to sit down and figure such an implementation out to know for sure.

The ideal API for me would let me provide my own buffers, with the crate just handling the tricky transfer and juggling of three &mut [u8; BUFFER_SIZE] between the producer and the consumer (and not requiring dynamic allocation for anything).

@HadrienG2
Owner

HadrienG2 commented Apr 26, 2025

Thanks for clarifying! If you want to access data in place, then triple-buffer's synchronization protocol does indeed sound like a better starting point to me, because it gives you exclusive access to the current read buffer, which means that you can access it using normal memory accesses. That's actually the main way in which triple-buffer's synchronization protocol differs from that of nonblocking ring buffers that continuously overwrite old contents without waiting for the reader, like my rt-history crate does.

In those ring buffers, the writer thread may concurrently overwrite the ring elements that you are in the process of reading. This will result in UB if you access the ring buffer storage using normal memory reads and writes. To prevent this UB, you must access the ring buffer data in place using special memory operations. For simple inter-thread communication, Relaxed atomic reads/writes combined with Acquire/Release fences are enough, but DMA makes life more complicated: I think that, from the Rust memory model's perspective, you would need at least a combination of volatile accesses and Acquire/Release fences to handle it correctly, and on top of that the CPU µarch will likely require extra steps to make sure that you get fresh data from RAM and not a weird mixture of fresh data and stale CPU cache lines...
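To make those "special memory operations" concrete, here is a minimal sketch of what a torn-read-detecting read of one ring element could look like in the simple inter-thread case (illustrative only, not rt-history's actual code; Slot, SLOT_LEN and try_read are made up for the example):

use core::sync::atomic::{fence, AtomicU8, AtomicUsize, Ordering};

const SLOT_LEN: usize = 64;

// The writer may overwrite `data` at any time, so the reader must use
// element-wise atomic loads instead of normal reads to avoid UB.
struct Slot {
    seq: AtomicUsize,           // incremented before and after each write
    data: [AtomicU8; SLOT_LEN], // stands in for a plain [u8; SLOT_LEN]
}

fn try_read(slot: &Slot, out: &mut [u8; SLOT_LEN]) -> Result<(), ()> {
    let seq_before = slot.seq.load(Ordering::Acquire);
    if seq_before % 2 == 1 {
        return Err(()); // the writer is mid-update
    }
    for (dst, src) in out.iter_mut().zip(slot.data.iter()) {
        *dst = src.load(Ordering::Relaxed); // Relaxed element loads...
    }
    fence(Ordering::Acquire); // ...ordered before the sequence re-check
    if slot.seq.load(Ordering::Relaxed) == seq_before {
        Ok(()) // no concurrent write, the copy is consistent
    } else {
        Err(()) // the writer lapped us, the copy may be torn
    }
}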

All this to say, you likely don't want to patch your FFT routine so that it takes all these precautions :) Which is why I would never advise anyone to use this sort of ring buffer outside of scenarios where it is possible to simply copy data in and out, treating the ring buffer's inner storage as an encapsulated black box.


Now, since we agree that extending triple-buffer is probably a good approach for your use case, the next step is to clarify the "provide my own buffers" side of things.

triple-buffer's shared state is actually a bit more complicated than a [T; 3] because it needs to handle the following concerns:

  • Any shared state which can be modified from &self (which in turn is required for multi-threaded access) must be wrapped into an UnsafeCell in Rust.
  • To avoid false sharing, especially when T is small, it is better to pad buffers to a multiple of the CPU cache line length using something like crossbeam's CachePadded wrapper.
  • To make the basic synchronization protocol work, I need to track which of the three buffers holds the most recent write, and whether a write has occurred since the last read. This is done via an atomic bitfield (see the sketch after this list).
    • This bitfield does not need to be stored alongside the buffers, but doing so makes the triple-buffer implementation a bit simpler as the input/output interfaces only need to track one single shared state pointer, not a pair of pointers to hopefully related separate shared state bits.
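Putting these concerns together, the shared state is roughly shaped like this (a simplified sketch; the field names are illustrative, not the crate's actual ones):

use core::cell::UnsafeCell;
use core::sync::atomic::AtomicU8;
use crossbeam_utils::CachePadded;

struct SharedState<T> {
    // &self-mutable buffers, padded to avoid false sharing
    buffers: [CachePadded<UnsafeCell<T>>; 3],
    // Atomic bitfield: index of the last-written buffer + "dirty" bit
    back_info: AtomicU8,
}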

Obviously, I would prefer not to expose these implementation details if I can avoid it. So do you think an encapsulated API design along these lines could work for you?

#[link_section = ".whatever.you.want"]
// Shared state is built using encapsulated methods, content is not directly exposed
static SHARED: triple_buffer::SharedState<T> = SharedState::new(...);

// API option 1: Building the input/output is unsafe, by doing so you assert
// that you have only one input interface and one output interface pointing to
// this shared state.
let mut input = unsafe { Input::from_shared_state_unchecked(&SHARED) };

// API option 2: I can make the input/output building safe by tracking which has
// been already built inside of the shared state. I have enough unused bits
// remaining in the shared state's bitfield to do so, no extra storage required.
let mut input = Input::from_shared_state(&SHARED);

Since sane apps should not be building triple buffer input/output interfaces in a tight loop, I think API option 2 probably makes more sense, following the general Rust principle of keeping APIs safe when there's not a strong argument against it.
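For the record, a minimal sketch of how option 2 could claim one of those unused bits, reusing the back_info field from the layout sketch above (names still made up):

use core::sync::atomic::Ordering;

const INPUT_TAKEN: u8 = 1 << 6; // hypothetical unused bit of the bitfield

pub struct Input<T: 'static> {
    shared: &'static SharedState<T>,
}

impl<T> Input<T> {
    pub fn from_shared_state(shared: &'static SharedState<T>) -> Self {
        // Atomically claim the input side; only one claim can ever succeed
        let prev = shared.back_info.fetch_or(INPUT_TAKEN, Ordering::AcqRel);
        assert_eq!(prev & INPUT_TAKEN, 0, "Input already built for this state");
        Self { shared }
    }
}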

@VorpalBlade
Author

Yes, I would build the triple buffer structure once at startup, probably using https://lib.rs/crates/static_cell if the constructing function isn't const. So API option 2 should be fine.
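For reference, the static_cell route would look something like this (a sketch against the hypothetical API from your previous message, with an arbitrary buffer type):

use static_cell::StaticCell;

// Only needed if SharedState::new turns out not to be const:
static SHARED: StaticCell<SharedState<[u8; 1024]>> = StaticCell::new();

fn init() {
    // init() hands out a &'static mut, which coerces to the &'static we need
    let shared: &'static SharedState<[u8; 1024]> =
        SHARED.init(SharedState::new([0u8; 1024]));
    let mut input = Input::from_shared_state(shared);
    // ...
}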

As a side note: while cache lines are typically 64 bytes on anything remotely modern and desktop-ish, that is not so on embedded; I have seen both 16 and 32 bytes. I don't know if you can auto-detect that or not. Since I will be using power-of-2 buffers anyway, it shouldn't really affect me.

I don't actually know what exactly I need to do for DMA here, because the frameworks we use with Rust on embedded handle it for us:

I use the rather excellent Embassy framework on embedded. It is an async runtime, and it makes life so much easier (not what people usually say about async Rust, I know). Basically, things like interrupt handling, power states and DMA completion notifications all get optimally handled internally, and what the user sees is just awaitable futures. Embedded Rust and async work very well together.

So I will have one async task (let's call it A) waiting on DMA completion, starting the next DMA transfer and sending the old buffer off to the other task (B). As such, I don't think special synchronisation would actually be needed here in triple-buffer: that should already happen between Embassy and task A. Between A and B it should just work like normal (for cross-core communication, Embassy runs a separate single-threaded runtime on each core). Task A will share a core with various other tasks (WiFi control, LED matrix output, ...), while task B will essentially get a dedicated core.
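Here is the rough shape I have in mind for task A, with guessed method names on the hypothetical no-std Input, and dma_read standing in for whatever HAL-specific call actually performs the transfer:

use triple_buffer::Input;

// Stand-in for the HAL-specific DMA read; microcontroller-dependent.
async fn dma_read(_buffer: &mut [u8; 1024]) { /* ... */ }

#[embassy_executor::task]
async fn task_a(mut input: Input<[u8; 1024]>) {
    loop {
        let buffer = input.input_buffer_mut(); // exclusive access while filling
        dma_read(buffer).await;                // DMA into the current write buffer
        input.publish();                       // hand the filled buffer to task B
    }
}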

@HadrienG2
Owner

HadrienG2 commented Apr 27, 2025

> As a side note: while cache lines are typically 64 bytes on anything remotely modern and desktop-ish, that is not so on embedded; I have seen both 16 and 32 bytes. I don't know if you can auto-detect that or not. Since I will be using power-of-2 buffers anyway, it shouldn't really affect me.

Indeed, implementing something like CachePadded is hard.

Generally speaking, a CPU's precise cache layout may only be known at runtime (if known at all). It can even vary over time with SMP scheduling schemes that allow tasks to migrate across heterogeneous cores on big.LITTLE-ish architectures. Because the layout of a CachePadded must be known at compile time for the compiler to do its job, this means that we can only use a pessimistic upper bound of the cache line size based on knowledge of existing ISA implementations, which must be kept up to date as new hardware is released.

Since, alas, we can't have rustc/LLVM do this work for us, I offload this particular concern to the crossbeam team, who have already done reasonably thorough research on a number of micro-architectures and should accept patches for more.
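For the curious, crossbeam's compile-time approach boils down to per-target repr(align) attributes, roughly like this (simplified; the real CachePadded covers many more architectures):

// x86_64 and aarch64 get 128 bytes to account for adjacent-line prefetchers;
// everything else falls back to a pessimistic 64-byte default in this sketch.
#[cfg_attr(any(target_arch = "x86_64", target_arch = "aarch64"), repr(align(128)))]
#[cfg_attr(not(any(target_arch = "x86_64", target_arch = "aarch64")), repr(align(64)))]
pub struct CachePadded<T> {
    value: T,
}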


Regarding DMA, can you point me to the API documentation of one of the embassy methods that you might use to perform a DMA transaction on the embassy side? The main thing that I need to know is this: if triple-buffer provided you with an &mut [u8; N] inside of a static storage block allocated within the suitable memory address range, would this be enough for you to perform DMA into this storage block, or would you need lower-level memory access / more layout control?

@VorpalBlade
Author

Creating DMA buffers and doing DMA is a microcontroller-specific thing, unfortunately. While the Rust community has done a heroic job of trying to create standardised traits for things like pins, the I2C bus, etc. (which is way better than in C/C++, where porting to a new microcontroller might mean rewriting everything from scratch), there is still a bunch of things that are not abstracted away, since the capabilities vary so much. DMA is one of those. For example:

The RP2040 (the chip found on the Raspberry Pi Pico) only requires &[u8] from what I can tell: https://docs.embassy.dev/embassy-rp/git/rp2040/spi/struct.Spi.html (see the async DMA functions like write, read and transfer on that page, not the blocking_ variants).

The chip I'm using (ESP32) has a complicated enough setup that they provide helper macros for it. Looking through the multiple layers of macros, I get to declare_aligned_dma_buffer!, which says that buffers must be word-aligned as a safety invariant. There is also a bunch of other preconditions that the macros check, such as a maximum chunk size (<= 4095, apparently). The reason for this appears to be that DmaDescriptor has a more complicated structure: a linked list of *mut u8. A set of these is then used in DmaRxBuf (and similarly for Tx). There is also special support for cyclic transfers that you don't need to restart from the chip (though that is not going to play well with my use case).

The actual API to use these DMA buffers would be what is found in I2sRx in my case (since it is the I2S bus I'm doing DMA on). Looking at it, I'm not sure that API is actually sound, since it does seem to accept buffers not created with the above macros. Hm, I will have to investigate this.

This is why I prefer to bring my own buffers.

However, there is a saying that everything in computer science can be solved by adding a layer of indirection! I could just use triple_buffer to transfer a struct { data: *mut u8, length: usize } or maybe even a &mut [u8]. This puts me in control of the buffers again, which I think might be required for the ESP32.

@HadrienG2
Owner

HadrienG2 commented Apr 28, 2025

Overall, I can think of three possible degrees of "bring my own buffers" along the ergonomics-vs-flexibility tradeoff axis.


For maximal flexibility at the expense of rather poor ergonomics, I could extract the index management / synchronization protocol of triple-buffer into its own abstraction. This would give you input and output interfaces that do not provide references to buffers anymore, but instead just tell you the index of the buffer that you should be accessing in some hypothetical array of three buffers that you allocate yourself, ideally following some suggestions from my side. Accessing the buffers will then involve some unsafe code on your side, since there is no way for the Rust compiler to figure out that the index management protocol ensures absence of incorrect buffer aliasing between two threads.

This option is not ideal from my perspective as a triple-buffer maintainer, because it effectively requires maintaining two similar sets of abstractions. But I know how I could share some code between them.

In this scheme, the API would look like this:

use triple_buffer::{IndexingSharedState, IndexingInput};

// Set up buffers. I don't know what you are doing here, and I do not need to know.
static BUFFERS: [Buffer; 3] = [/* ... allocate three buffers somehow ... */];

// Set up shared state
static SHARED: IndexingSharedState = IndexingSharedState::new();

// Set up input interface
let mut input = IndexingInput::from_shared(&SHARED);

// Query current index of input buffer
let input_idx_1: usize = input.input_buffer_idx();
let input_buffer = &BUFFERS[input_idx_1];
/* TODO: Fill up input_buffer. This will require unsafe. */

// Publish current input buffer after filling it up
input.publish();
// SAFETY: input_idx_1 and input_buffer must not be used after this point

// Query index of new input buffer and move on
let input_idx_2: usize = input.input_buffer_idx();
assert_ne!(input_idx_1, input_idx_2);
let input_buffer = &BUFFERS[input_idx_2];
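For completeness, the output side would presumably mirror this (hypothetical names again):

// Consumer side of API option 1
let mut output = IndexingOutput::from_shared(&SHARED);
if output.update() { // check whether a new buffer was published
    let output_idx: usize = output.output_buffer_idx();
    let output_buffer = &BUFFERS[output_idx];
    /* TODO: Process output_buffer. This will also require unsafe. */
}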

At the other end, for optimal ergonomics but minimal flexibility, we have the option that I have discussed so far, where I fully manage the buffer allocation on my side and the only knob I give you is a way to put directives like link_section on the shared state variable. This is the option that I discussed in my previous message.

If I understand your last message correctly, this option might work on both RP2040 and ESP32 (via the I2sRx API for the latter), but we're not sure if that's true of all hardware.

In this scheme, the API would look like this:

use triple_buffer::{SharedState, Input};

// Set up shared state with internal buffers
#[link_section = ".dma.black.magic"]
static SHARED: SharedState<Buffer> = SharedState::new(/* ... initial T value that will be cloned 3 times ... */);

// Set up input interface
let mut input = Input::from_shared(&SHARED);

// Access current input buffer
let input_buffer: &mut Buffer = input.input_buffer_mut();
/* TODO: Fill up input_buffer. No unsafe required. */

// Publish current input buffer after filling it up
input.publish();
// SAFETY: Borrow checker automatically ensures that input_buffer cannot be used

// Query new input buffer and move on
let input_buffer: &mut Buffer = input.input_buffer_mut();

And as a middle ground between ergonomics and flexibility, we can have a scenario where SHARED does not hold actual inline [u8; N] buffers, but instead some kind of descriptor that points to a buffer allocated using some DMA-friendly, target-specific mechanism. This is what I understand you to be suggesting at the end of your last message.

use triple_buffer::{SharedState, Input};

// Set up shared state with internal buffers
// NOTE: No link_section required here, we are not allocating buffers inline
static SHARED: SharedState<BufferDescriptor> = SharedState::with_buffer_builder(|| /* ... allocate one buffer, return a descriptor to it ... */);

// Set up input interface
let mut input = Input::from_shared(&SHARED);

// Access current input buffer
let input_buffer: &mut BufferDescriptor = input.input_buffer_mut();
/* TODO: Fill up input_buffer. May or may not require unsafe depending on how BufferDescriptor works, probably hardware-specific. */

// Publish current input buffer after filling it up
input.publish();
// SAFETY: Borrow checker automatically ensures that input_buffer cannot be used

// Query new input buffer and move on
let input_buffer: &mut BufferDescriptor = input.input_buffer_mut();

As you can see, this third option is almost identical to the previous one from the perspective of triple_buffer; the main difference is that I need to expose the (currently private) triple buffer constructor that builds each buffer separately using a closure.

Do you agree that these look like the three main API designs that I could provide on the triple-buffer side, and if so do you have an opinion on which one(s) should be provided?

@VorpalBlade
Author

I'm on my phone right now (it is almost midnight here), so excuse the crude explanation. What I meant was that I could probably use triple_buffer as it is today (changed to work with no-std), but bring my own T that is not the buffer itself but something like:

struct BufferHandle {
    data: /* pointer or mutable reference to my actual buffer */,
    /// How much data is actually in the buffer, as not every read fills it completely.
    size: u16,
}

// Raw pointers aren't Send, so I might need this if I have to use those.
unsafe impl Send for BufferHandle {}

At that point, the requirements for the buffer themselves are not something that triple buffer needs to care about.

The BufferHandles themselves do of course need to be allocated somewhere, though. So it would be nice to do that in a static or, if alloc is used, to have an opt-in feature for the nightly allocator API so that I can specify where they get allocated from.

@HadrienG2
Owner

HadrienG2 commented May 2, 2025

Okay, I think I can get this working in a way that you can use by making the various structs from triple-buffer generic over the kind of shared state storage (inline vs heap-allocated):

  • TripleBuffer<T> would become TripleBuffer<T, Shared: Borrow<SharedState<T>> = Arc<SharedState<T>>>, where Shared can also be SharedState<T> (see the signature sketch after this list).
    • TripleBuffer::split(self) would only be implemented for TripleBuffer<T, Arc<SharedState<T>>>, not for all Shared.
    • A new API would be added to allow generating input and output interfaces that point to a static-allocated TripleBuffer without moving it, with appropriate safety checks added to make it safe by disallowing multiple input/output interfaces for a single TripleBuffer.
    • A new constructor would also be added to allow you to allocate buffers on your side using arbitrarily complex procedures and construct a TripleBuffer from three (pointers/handles to) preallocated buffers.
  • Input<T> would become Input<T, Shared: Deref<Target = SharedState<T>> = Arc<SharedState<T>>>, where Shared can also be &'a SharedState<T>. Output would be modified in a similar way.
  • SharedState should probably remain #[doc(hidden)], since that's an implementation detail that I am not super comfortable exposing. Instead, type aliases could be used e.g. TripleBufferAlloc, TripleBufferInline.
    • This doesn't look too great, though, and a tempting alternative would be to completely drop the current TripleBuffer layer (which doesn't do much), rename the SharedState struct to TripleBuffer, and carefully review/minimize its public interface to make it safe to expose. I need to check the code to see how practical that is, but if it is, I could re-express all of the above in terms of Arc<TripleBuffer<T>> instead of Arc<SharedState<T>> etc., which would be quite compelling.
  • Finally, an on-by-default alloc feature would be added, which support for Arc<SharedState<T>> would be gated on.
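In signature form, the above might look like this (a compilable sketch with stand-in types, not the final design):

extern crate alloc;

use alloc::sync::Arc;
use core::borrow::Borrow;
use core::marker::PhantomData;
use core::ops::Deref;

pub struct SharedState<T>(PhantomData<T>); // stand-in for the real thing

pub struct TripleBuffer<T, Shared: Borrow<SharedState<T>> = Arc<SharedState<T>>> {
    shared: Shared,
    _marker: PhantomData<T>,
}

pub struct Input<T, Shared: Deref<Target = SharedState<T>> = Arc<SharedState<T>>> {
    shared: Shared,
    _marker: PhantomData<T>,
}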

I need to find the time to actually implement this, of course, but if you can have a quick look and tell me whether that looks sensible on your side, that's much appreciated already.

@HadrienG2
Owner

HadrienG2 commented May 2, 2025

Oh, and bonus question: are you fine with me bundling this feature into the major release that I need to do at some point this year in order to move forward with issue #30?

I'm not a Rust semver hazard guru, but I seem to remember that adding a generic parameter to a struct is not semver-safe in Rust even if the parameter is defaulted, because default type parameters are a suboptimally implemented language feature which rustc only takes into account for type inference in specific situations.

@VorpalBlade
Author

I think that would work, but I'm having a hard time picturing the API in my head. I could test a pre-release version of this though to make sure.

> I'm not a Rust semver hazard guru

Neither am I, but have you tried cargo-semver-checks? It can catch a lot of semver issues for you.
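For reference, the basic workflow is just the following (assuming a previously published release to compare against):

cargo install cargo-semver-checks --locked
cargo semver-checks check-release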

@HadrienG2
Owner

Alright, I have pushed a first prototype in the no_alloc branch. The doc examples and tests have not yet been updated (and therefore the code has not been tested yet), but the changelog and API documentation from cargo doc should hopefully be enough to give you a reasonable taste of what this is all about.

@VorpalBlade
Author

That was quick! Unfortunately I won't be able to look at it right away; I should have some time to test it this weekend or the weekend after.
