
Add native lock-free dynamic heap allocator #4749


Draft · Feoramund wants to merge 43 commits into master from feoramalloc

Conversation

Feoramund (Contributor)

Native Odin-based Heap Allocator

After three months of intense development and rigorous testing, I am finally ready to share the results of my latest project.

In short, this is a lock-free dynamic heap allocator written solely in Odin, utilizing direct virtual memory access to the operating system where possible. Only system calls are used to ask the OS for virtual memory (except on operating systems where this is verboten, such as Darwin, where we use their libc API to get virtual memory), and the allocator handles everything else from there.

Rationale

Originally, I was working on porting all of core to use os2 when I found the experimental heap allocator in there. Having hooked my unreleased os2 test framework up to it, I found that it suffered from more race conditions than those already reported, as well as other synchronization issues, such as apparent misunderstandings of how atomic operations work.

The most confusing code that stood out to me was the following block:

	idx := sync.atomic_load(&alloc.idx)
	prev := sync.atomic_load(&alloc.prev)
	next := sync.atomic_load(&alloc.next)

All three of those fields are packed into the same u64. It would make more sense to atomically load the combined value once and then read each field individually. I spent a few days trying to make sense of heap_linux.odin, but it was a bit much for me at the time. The previous block and the warnings listed by TSan didn't give me great hope that it would be simple to fix.
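
For contrast, here is a minimal sketch of the access pattern I would have expected, assuming the three fields are bit-packed into one u64 behind a single combined field (the field name and bit widths here are illustrative, not the actual layout):

	packed := sync.atomic_load(&alloc.state) // hypothetical combined field
	idx  := (packed >>  0) & 0xFFFF
	prev := (packed >> 16) & 0xFFFF
	next := (packed >> 32) & 0xFFFF
	// All three values now come from one coherent snapshot, rather than
	// three separate loads that may interleave with concurrent updates.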

So, I did what I think most programmers do in this situation: I decided to try writing my own from nothing and hopefully come to a better appreciation of the problem as a whole.

I combed through the last 30 years of literature on allocators, along with some papers on parallelism.

For dynamic heap allocators, I found that contention, false sharing, and heap blowup were issues often mentioned. The Hoard paper (Berger et al., 2000) was particularly helpful in figuring out an overall design for solving those issues.

There is hopefully nothing too novel about the design I've put together here. We're all on well-trodden ground. I think the most exciting feature is the free bitmap, where most allocators appear to use free lists instead.

Goals

The design of this allocator was guided by three principles.

  • Robustness: A dynamic heap allocator cannot afford to have bugs. It must be as small as possible to reduce the number of possible places something could go wrong. Locks cannot be used, as this introduces the potential for deadlocking an entire program upon unexpected termination of a thread.
  • Clarity: A dynamic heap allocator is complicated enough already, and adding lock-free as a goal only makes it more so. To this end, as much of the internals as possible is commented, explicit and meaningful variable names are used, and there is plenty of testing and assertions throughout the code to indicate what the design expects.
  • Speed: Where it did not compromise the other goals, optimizations have been applied iteratively throughout development to make the allocator as fast as possible in comparison to the previous libc malloc-based allocator on Linux.

Features

  • Superpage-based Allocation, which allows theoretically faster pointer access by putting less pressure on the operating system's Translation Lookaside Buffer. Most MMUs have a base page size of 4KiB but can generally also distribute larger pages of 2MiB and above, and this is leveraged directly when possible.
  • Remote Free Bitmaps, which allow wait-free freeing from other threads through atomic bit-flips.
  • Superpage Orphanage implemented using a lock-free algorithm to prevent heap blowup and allow reuse of already-allocated superpages, reducing how often the allocator makes requests to the virtual memory subsystem.
  • Lock-Free Guaranteed: While the majority of the synchronization methods implemented in the allocator are wait-free, the strongest overall guarantee that can be made is that it is lock-free. This means that no thread, even if it is terminated or crashes, can take down the allocator. The worst that can happen is the thread's memory is leaked. At least one thread in the system is guaranteed to make progress, as the academic definition of lock-free goes.
  • Thread-local Heaps eliminate allocator-induced false sharing and most allocation contention. Each thread need not communicate with a global structure for permission to allocate.
  • Runtime Slab Reconfiguration: Superpages are set up with a fixed number of slabs, but the details of how those slabs are configured are left to runtime determination. A superpage can have a dozen slabs of one size category, or it can have a variety of sizes; the runtime needs of the program determine how the allocator partitions each slab.
  • Slab Allocation structures individual allocations into fixed-size slabs of memory, each containing fixed-size bins of the same size rank, adjacent to one another. This eliminates headers (including any need for alignment information), which saves space and encourages better cache locality and higher performance for accessing objects of similar sizes. There is no need for complicated coalescence logic either, and resizing can be a no-op if the new size is within the same category.
  • Superpage & Slab-masked Allocations, which allow constant-time lookup of the superstructures that house individual allocations through simple bitwise AND operations on the addresses distributed by the allocator (see the sketch after this list).
  • Heap Caching vastly increases allocation speed, as each heap has a cache of superpages with slabs available, as well as a map structure pointing to slabs of each available size category. These caches are implemented using space that would otherwise be wasted in each superpage due to the need to align each pointer to its owning superpage and slab.
  • Slab-wide Allocation: for larger-than-bin-sized allocations, the allocator may use entire slabs to house an allocation. These slab-wide allocations can span multiple slabs, allowing the allocator to make use of as much space as is available in a superpage.
  • Linear Allocation Strategy, which reduces tracking which bins have been used to a single counter and helps encourage packing new objects near old objects. A slab allocates linearly from the first free slot to the last.
  • Completely Native Implementation: The whole allocator is written in pure Odin which allows for the compiler to better reason about what is happening in the code. No longer is each call to malloc and free an opaque barrier. Given that the code is available right here in the runtime itself, it allows configuration to any programmer's needs. It provides uniform behavior across all platforms, as well; a programmer need not contemplate how heap allocation may impact performance on one system versus another.
  • 1k LoC: Excluding system-specific code, debug code, and assertions, this implementation comes in at just about one thousand lines of code. Dynamic allocators can be complicated, and lock-free dynamic heap allocators can be incredibly complex with subtle bugs. This implementation needed to be as simple as possible to allow ease of future maintenance.
  • Test & Debug Framework: A broken dynamic heap allocator has the potential to introduce numerous vexing bugs. As tricky as this can be to get right, it warranted the most thorough testing possible. Code coverage is tracked through calls at various points of interest made to an internal code coverage tracker when ODIN_DEBUG_HEAP is enabled.
  • Diagnostics offer programs the ability to view specific information about heap allocation statistics through the runtime.get_local_heap_info() API.
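
To illustrate the superpage and slab masking mentioned above, here is a minimal sketch, assuming 2MiB superpages and 64KiB slabs that are naturally aligned to their own sizes (the constant and proc names are illustrative, not the ones in this PR):

	SUPERPAGE_SIZE :: 2 * 1024 * 1024 // 2MiB, naturally aligned
	SLAB_SIZE      :: 64 * 1024       // 64KiB, naturally aligned

	// Recover the owning structures of any allocation in constant time.
	find_superpage_address :: proc "contextless" (ptr: rawptr) -> uintptr {
		return uintptr(ptr) &~ uintptr(SUPERPAGE_SIZE - 1)
	}
	find_slab_address :: proc "contextless" (ptr: rawptr) -> uintptr {
		return uintptr(ptr) &~ uintptr(SLAB_SIZE - 1)
	}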

Benchmark Results

These benchmarks should be taken with a grain of salt in general, since I had
to search pretty far and wide to find repeatable tests that showed
significant run time differences. The run times of most real programs simply
do not depend significantly on malloc performance. (Evans, 2006, p. 9)
[...]
Exhaustive benchmarking of allocators is not feasible, and the benchmark
results should not be interpreted as definitive in any sense. Allocator
performance is highly sensitive to application allocation patterns, and it is
possible to construct microbenchmarks that show any of the three allocators
tested here in either a favorable or unfavorable light, at the benchmarker’s
whim. (p. 11)

Because of the wisdom from the quote above, I won't spend much time here except to say that the included test bench has microbenchmarks written by me for the purpose of making sure that the allocator is at least not egregiously slow in certain made-up scenarios.

If you believe these benchmarks can align with realistic situations, then this allocator is 2-3 times faster than libc malloc in general use case scenarios (that is, any allocation less than ~63KiB) on my AMD64 Linux-based system, compiled with -o:aggressive -disable-assert -no-bounds-check -microarch:alderlake.

Any speed gain drops off for allocations above 32KiB in size, because that is where bin allocations are no longer possible with the default configuration, and the allocator has to resort to coalescing entire slabs to fit the requests. I decided to accept this consequence of the design: it's not that much slower than malloc, and I believe that rapid allocation of >=64KiB blocks is a special case rather than the usual case for most programs.

The full test suite can be run with:

	odin run tests/heap_allocator/ -debug -define:ODIN_DEBUG_HEAP=true -sanitize-thread -- -vmem-tests -serial-tests -parallel-tests -allocator:feoramalloc

The benchmarks can be run with:

	odin run tests/heap_allocator/ -o:aggressive -disable-assert -no-bounds-check -- -serial-benchmarks -parallel-benchmarks -allocator:feoramalloc

The -allocator command line option can be switched to libc to use the old behavior.

Memory Usage

Speed aside, there are points to be aware of with this allocator, particularly in how it uses memory; these are clear-cut and not as susceptible to application patterns as benchmarks are.

For one, due to the nature of slab allocation, any allocation always consumes the full size of its bin rank, so if you request 9 bytes, you will in fact consume 16, as that is the next power of two available. This continues for every power of two up to the maximum bin size of 32KiB.

This shouldn't be too surprising at lower sizes, as with a non-slab general purpose allocator, you're almost guaranteed to have some book-keeping somewhere, which would result in an allocation of 8 bytes actually using 16 or 24 bytes, depending on the header.

This begins to break down at higher sizes, however. If you allocate 257 bytes instead of 256, you're going to be placed into a bin of 512 bytes. This may seem wasteful, but there is a consideration behind it: every allocation of a particular size rank is tightly packed next to the others, which increases cache locality. It's a memory-for-speed tradeoff, in the end.

If the requested alignment is larger than the requested size, the alignment is used as the size instead, up to a maximum of 64 bytes by default. This was one of the design choices made to help eliminate any need for headers. Beyond a size of 64 bytes, all allocations are aligned to at least 64 bytes. Alignment beyond 64 bytes is not supported.
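
As a concrete sketch of this sizing rule (the proc name is hypothetical), the effective bin size is the requested size rounded up to the next power of two, or the alignment, whichever is larger:

	// effective_bin_size(9, 8)   == 16
	// effective_bin_size(257, 8) == 512
	// effective_bin_size(32, 64) == 64   (alignment used as the size)
	effective_bin_size :: proc "contextless" (size, alignment: int) -> int {
		n := max(max(size, alignment), 1)
		p := 1
		for p < n {
			p <<= 1
		}
		return p
	}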

There is also no convoluted coalescing logic for any allocation below ~63KiB. This was done for the sake of simplicity. Beyond 64KiB, the allocator has to make decisions about which slabs to merge together, which is where memory usage and speed both take a hit.

To allocate 64KiB is to block out up to 128KiB, due to the nature of book-keeping on slab-wide allocations. That may be the weakest point of this allocator, and I'm open to feedback on possible workarounds.

The one upside of over-allocating like this is that if you resize within the same frame of memory that's already been allotted to you, it's virtually a no-op. The allocator has to do a few calculations, and it returns without touching any memory: simple and fast.

Beyond the HUGE_ALLOCATION_THRESHOLD, which is 3/4ths of a Superpage by default (1.5MiB), the allocator distributes chunks of at least a superpage in size directly through virtual memory. This is where memory waste becomes less noticeable, as we're no longer dealing with bins or slabs but whole chunks from virtual memory.

Superpages also may waste up to one slab size of memory (64KiB) for the purposes of maintaining alignment, but this space is optionally used if a heap needs more space for its cache. With the current default values, one of these 64KiB blocks is used per 20 superpages allocated to a single thread. So the overhead is about 3% of all virtual memory allocated this way (64KiB out of each 2MiB superpage).

The values dictating the sizes of slabs and maximum bins are all configurable through the ODIN_HEAP_* defines, so if your application really does need to make binned allocations of 64KiB, or if you find speed improvements by using smaller slabs, it's easy to change.

I chose the default values of 64KiB slabs with a 32KiB max bin size after some microbenchmarking, but it's possible that different values could result in better performance for different scenarios.

To summarize: this allocator does not try to squeeze out every possible byte at every possible juncture, but it does try to be as fast as possible.

There may be a case to be made for the reduction of fragmentation through slab allocation resulting in less actual memory usage at the end of the day versus a coalescing allocator, but that is probably an application-specific benefit and one I have not thoroughly investigated.

Credits

I hope to demonstrate that the design used in this allocator is not exceedingly novel (and thus, not untested) by pointing out the inspirations for each major feature based upon the literature reviewed. Each feature has been documented and in use in various implementations for over two decades now.

  • The design of the Slab allocator was originally detailed by Bonwick (1994) in The Slab Allocator: An Object-Caching Kernel Memory Allocator.
  • An observation made by Wilson et al. (1995) in Dynamic Storage Allocation: A Survey and Critical Review inspired runtime configurable size classes for slabs.
  • Kamp (1998) describes how individual objects can be so aligned as to allow bitmasking their pointers to find their owning structure in Malloc(3) revisited, otherwise known as the PHKmalloc paper. This strategy would also later be used in mimalloc.
  • Using a bitmap to track freed objects also comes from PHKmalloc. Usage of this structure predates any notion of wait-free allocator design, but it works quite well for that purpose. This strategy is contrasted mainly with free lists used in many other designs.
  • Berger et al. (2000) describe a global heap for storing unused superblocks in Hoard: A Scalable Memory Allocator for Multithreaded Applications. This was the earliest documentation of a solution to the problem of heap blowup according to the authors. This idea is the direct inspiration for the superpage orphanage. However, Hoard's specific implementation was not lock-free.
  • Using per-thread heaps also comes from the Hoard paper, where it was originally outlined as the only known solution in the literature to date for the problem of allocator-induced false sharing. Mimalloc would later use thread-local storage to accomplish the very same goal.
  • The benefits of a lock-free allocator were first described by Michael (2004) in Scalable Lock-Free Dynamic Memory Allocation.
  • This allocator may most resemble the Streamflow design described by Schneider et al. (2006), with its segregated heaps, its distinction between local and remote deallocation, and its usage of superpages to improve TLB performance.
  • The idea of using per-slab free bitmaps is partly inspired by free list sharding detailed by Leijen et al. (2019) in Mimalloc: Free List Sharding in Action where per-page free lists are used instead of a large free list per size class. It would seem to be the natural development when using bitmaps over free lists.

The following points are original ideas; original in the sense that they were realized during development and not drawn from any specific paper, not that they are wholly novel and have never been seen before.

  • The strategy of forbidding the freeing of a slab until it has been fully used at least once came about after running a benchmark where a single object was allocated and freed repeatedly. This test demonstrated a significant weak point in an earlier version of the allocator, and this strategy was the solution to keep it from becoming a sore spot for performance.
  • The usage of a single integer counter (dirty_bins) to track which bins are dirty and need zeroing upon re-allocation was an iterative optimization realized after noticing that the allocator naturally uses the bin with the lowest address possible to keep cache locality, by virtue of next_free_sector always being set to the minimum value. An earlier version of the allocator used a bitmap with the same layout of local_free to track dirty bins.
  • I do not recall any of the papers reviewed mentioning the use of allocator space that would otherwise be wasted on alignment for keeping metadata, but it should hopefully be an obvious enough usage to not qualify as groundbreaking.
  • Related to the previous point, the usage of a size class-keyed hash map to speed up finding available slabs should also be an obvious optimization.
  • I could not find any mention in the literature of the usage of a bitmap to achieve wait-free synchronization through bitwise merging. This may actually be the only novel idea in this design. Most parallel-aware allocators use free lists, and some are lock-free. Of note, StarMalloc, whose paper was published just last year, uses bitmaps instead of free lists, but it retains locks for the purpose of security according to the authors.
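
To make the bitwise-merging point concrete, here is a minimal sketch of the idea, assuming one word of free bitmap per slab (the struct and proc names are illustrative, not the exact ones in this PR):

	import "base:intrinsics"

	Example_Slab :: struct {
		remote_free: u64, // bitmap written by non-owning threads
		local_free:  u64, // bitmap touched only by the owning thread
	}

	// Any non-owning thread frees with one atomic read-modify-write: wait-free.
	remote_free_bin :: proc "contextless" (slab: ^Example_Slab, bin_index: uint) {
		intrinsics.atomic_or_explicit(&slab.remote_free, u64(1) << bin_index, .Release)
	}

	// The owning thread collects the entire batch with one exchange.
	merge_remote_frees :: proc "contextless" (slab: ^Example_Slab) {
		slab.local_free |= intrinsics.atomic_exchange_explicit(&slab.remote_free, 0, .Acquire)
	}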

Quotes

The following passage inspired runtime configurable slab size classes.

A crude but possibly effective form of coalescing for simple segregated
storage (used by Mike Haertel in a fast allocator [GZH93, Vo95], and in
several garbage collectors [Wil95]) is to maintain a count of live objects
for each page, and notice when a page is entirely empty. If a page is empty,
it can be made available for allocating objects in a different size class,
preserving the invariant that all objects in a page are of a single size
class. (Wilson et al., 1995, p. 37)

This passage encouraged attention to optimizing the heuristics used for the bitmaps used to track free bins.

It may appear that bitmapped allocators are slow, because search times are
linear, and to a first approximation this may be true. But notice that if a
good heuristic is available to decide which area of the bitmap to search,
searching is linear in the size of the area searched, rather than the number
of free blocks. The cost of bitmapped allocation may then be proportional to
the rate of allocation, rather than the number of free blocks, and may scale
better than other indexing schemes. If the associated constants are low
enough, bitmapped allocation may do quite well. It may also be valuable in
conjunction with other indexing schemes. (Wilson et al., 1995, p. 42)

jemalloc's author, on the rich history of memory allocator design:

On the surface, memory allocation and deallocation appears to be a simple
problem that merely requires a bit of bookkeeping in order to keep track of
in-use versus available memory. However, decades of research and scores of
allocator implementations have failed to produce a clearly superior
allocator. (Evans, 2006, p. 1)

References

  1. Jeff Bonwick. (1994). The Slab Allocator: An Object-Caching Kernel Memory Allocator.
  2. Paul R. Wilson, Mark S. Johnstone, Michael Neely, & David Boles. (1995). Dynamic Storage Allocation: A Survey and Critical Review.
  3. Maged M. Michael, & Michael L. Scott. (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms.
  4. Poul-Henning Kamp. (1998). Malloc(3) revisited.
  5. Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, & Paul R. Wilson. (2000). Hoard: A Scalable Memory Allocator for Multithreaded Applications.
  6. Maged M. Michael. (2004). Scalable Lock-Free Dynamic Memory Allocation.
  7. Thomas Edward Hart. (2005). Comparative Performance of Memory Reclamation Strategies for Lock-Free and Concurrently-Readable Data Structures.
  8. Jason Evans. (2006). A Scalable Concurrent malloc(3) Implementation for FreeBSD.
  9. Scott Schneider, Christos D. Antonopoulos, & Dimitrios S. Nikolopoulos. (2006). Scalable Locality-Conscious Multithreaded Memory Allocation.
  10. Martin Thompson, Dave Farley, Michael Barker, Patricia Gee, & Andrew Stewart. (2011). Disruptor: High Performance Alternative to Bounded Queues for Exchanging Data Between Concurrent Threads.
  11. Daan Leijen, Benjamin Zorn, & Leonardo de Moura. (2019). Mimalloc: Free List Sharding in Action.
  12. Antonin Reitz, Aymeric Fromherz, & Jonathan Protzenko. (2024). StarMalloc: A Formally Verified, Concurrent, Performant and Security-Oriented Memory Allocator.

Lectures

  1. QCon SF 2010: Martin Thompson & Michael Barker, LMAX "How to Do 100K TPS at Less than 1ms Latency"
  2. C++ and Beyond 2012: Herb Sutter "atomic<> Weapons, the C++11 Memory Model and Modern Hardware"
  3. CppCon 2014: Herb Sutter "Lock-Free Programming (or, Juggling Razor Blades)"
  4. CppCon 2015: Fedor Pikus "Live Lock-Free or Deadlock (Practical Lock-free Programming)"

Design Differences

  1. PHKmalloc is mentioned a few times as an inspiration. jemalloc was developed as a more parallel-aware replacement for phkmalloc. How does this allocator differ from jemalloc?

Of note, jemalloc uses multiple arenas to reduce the issue of allocator-induced false sharing. However, those arenas are shared between active threads. The strategy of giving exclusive access to an arena on a per-thread basis is more similar to Hoard than jemalloc.

  2. How does this allocator differ from Hoard?

With regard to what is called the global heap in the Hoard paper, there is the superpage orphanage in this allocator. They both fulfill similar duties as far as memory reuse. However, in Hoard, superblocks may be moved from per-processor heaps to the global heap, if they cross an emptiness threshold.

In my design, this ownership transfer mechanism is forgone in favor of an overall simplified synchronization process. Superpages do not change ownership until they are either completely empty and ready to be freed or the thread cleanly exits. For a remote thread to be able to decouple a superpage belonging to another thread would require more complicated logic behind the scenes and likely slow down regular single-threaded usage with atomic operations.

This design can result in an apparent memory leak: if thread A allocates some number of bytes, and thread B frees all of those allocations, but thread A never allocates anything again and does not exit, the memory is not reclaimed, as either event would trigger the merging of its remote frees and the subsequent freeing of its empty superpages.

This is one behavior to be aware of when writing concurrent programs that use this allocator in producer/consumer relationships. In practice, however, it should be unusual for a thread to accumulate a significant amount of memory that it hands off to another thread to free, and then never revisit its heap for the duration of the program.

The Name

Most allocators are either named after the author or have a fancy title. PHKmalloc represents the author's initials. Mimalloc presumably means Microsoft Malloc.

If I had to give this allocator design a name, I might call it "the lock-free bitmap slab allocator" after its key features. For the purpose of differentiating this specific implementation of a heap allocator from any others, I think "Feoramalloc" is suitable.

I've used feoramalloc in the test bench to differentiate it from libc.

Final Thoughts

In closing, I want to say that I hope this allocator can improve the efficiency of programs written in Odin while standing as an example for learning about low-level concepts such as lock-free programming and heap allocators.

Obviously, it won't make all programs magically faster, and if you're already using a custom allocator, then you know your problem space better than a general-purpose allocator could ever guess.

I think this is a significant step towards having an independent runtime. We also get consistent behavior across all platforms, as well as the ability to learn very specific information about the heap through the included diagnostics.

This PR is a draft for now, while I hammer out the final details and receive feedback.

Help Requests

I mainly need help with non-Linux/POSIX virtual memory access. I can test this allocator against FreeBSD and NetBSD, but I do not have a Windows or Darwin machine to verify the system-specific code there.

Windows passed the CI tests, so I'm hopeful that it works there. The Darwin tests pass on Intel, but the core test suite stalls afterwards, so something strange is going on there. Linux and FreeBSD are working.

While testing, I hit an interesting snag with NetBSD. Its mmap syscall requires 7 arguments, but we only support 6, and I haven't been able to figure out what the calling convention for it is. That is to say: is it another register, is it pushed to the stack, or is it something else? I could use some help there.

I don't have a plan for what to do for systems that do not expose virtual memory access, since I don't have any experience with those systems. I'm assuming Orca and WASM do not expose a virtual memory subsystem akin to mmap or VirtualAlloc. I only recently started tinkering with WASM after finding the official example. These are otherwise foreign fields to me, and I'm open to feedback. We could perhaps have a malloc-based fallback allocator in that case.

The only strong requirement the allocator has, regarding backing allocation, is some ability to request alignment or dispose of unnecessary pages. If we can do either of those, we're solid.

It would be great to hear about how this allocator impacts real-world application usage, too.

I'm also interested to hear how this could impact -no-crt. I noticed a commit recently about how Linux requires the C runtime to initialize thread-local storage. I wasn't aware of that.

API

I'm also looking to hear if anyone has any better ideas about organization or API. This allocator used to live in a package of its own during all of my testing, but I had to merge it into base in order to avoid cyclic import issues while making it the default allocator. This resulted in a lot of heap_ and HEAP_ prefixing.

The same goes for the virtual_memory_* and get_current_thread_id procs added to base:runtime. If anyone has a feel for how those could be improved, or thinks they're good as-is, I'd like to hear it.

Memory Order Consume

I'm uncertain whether the Consume memory ordering really means consume. If you check Atomic_Memory_Order in base:intrinsics, it has a comment beside Consume that says Monotonic, which I presume corresponds to LLVM's Monotonic ordering.

Based on the documentation for LLVM's Acquire memory ordering, this is the one that actually corresponds to memory_order_consume.

I'm leaning towards thinking my usage of Consume should actually be replaced with Acquire, based on this, but I've left the memory order as-is for now until someone else can review and comment about it. It's no problem to use a stronger order, but if we can get away with a weaker one and preserve the semantics, all the better.
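
Concretely, the change under consideration is just the ordering argument on loads like the following (the field here is illustrative):

	// current:  next := intrinsics.atomic_load_explicit(&superpage.next, .Consume)
	// proposed: next := intrinsics.atomic_load_explicit(&superpage.next, .Acquire)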

I base most of my understanding of memory ordering on Herb Sutter's talks, referenced above, which I highly recommend to anyone interested in this subject.

@graphitemaster (Contributor) left a comment:


First of all, this is absolutely amazing work. You've done an incredible job here.

I've not done a full review, since there's certainly more to check with real-world benches and because I'd have to actually run the code, but from what I've gleaned reading the changes on GH, I have some comments.

I also want to ask for some real-world graphs from running the myriad heap bench tools out there, plus some of the harder stress tests for memory allocators that stress their threading characteristics.

Of note, I think it should be added that while this allocator is robust in the nominal sense, it's not at all "security hardened": nothing stops a buffer overflow from giving an attacker access to important heap data structures to cause havoc. It's actually quite unsafe from a hardening standpoint, because the heap data structure is still stored and maintained inline with regular allocations (header/footer, what have you) and is not using side-band metadata.

I'd be curious how binning performs in practice too. This allocator lacks a useful optimization most allocators implement, which is a "debouncing" filter on size classes. Essentially, some workloads have a lot of allocations that alternate like "big size, small size, big size, small size, ..." across classes (but land in the same bins after overflow), jumping back and forth, and eventually you end up with excessive internal fragmentation and scanning loops that degrade performance.

		heap_slab_clear_data(slab)
	}
	slab.bin_size = size
	slab.is_full = true
Contributor:

This does require an atomic store for coherency reasons; it's valid for this write not to be flushed from cache to backing memory until after ptr is returned and used by other threads on E X O T I C A R C H S
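
In other words, the suggested change is roughly the following (release ordering assumed here):

	intrinsics.atomic_store_explicit(&slab.is_full, true, .Release)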

Contributor Author (Feoramund):

Do you have a reference for this? I'm always interested in more documentation.

	if superpage == local_heap_tail {
		assert_contextless(superpage.next == nil, "The heap allocator's tail superpage has a next link.")
		assert_contextless(superpage.prev != nil, "The heap allocator's tail superpage has no previous link.")
		local_heap_tail = superpage.prev
Contributor:

Likely need an atomic load of superpage.prev here as well

Contributor Author (Feoramund):

It should be fine without, as only the owning thread ever interacts with prev.

	assert_contextless(superpage.prev != nil, "The heap allocator's tail superpage has no previous link.")
	local_heap_tail = superpage.prev
	// We never unlink all superpages, so no need to check validity here.
	superpage.prev.next = nil
Contributor:

LLVM might optimize this code to local_heap_tail.next = nil because of the non-atomic load of superpage.prev above (and the lack of a barrier), even though another thread could replace superpage.prev, so this code would actually be assigning nil to that other thread's superpage.prev and not the one loaded into local_heap_tail. This area seems a bit problematic, actually; do both of these operations (reading prev and assigning next to nil) need to happen atomically?

Contributor Author (Feoramund):

It's fine to change superpage.prev.next = nil to local_heap_tail.next = nil; they're semantically the same. superpage.prev is only accessed by the thread which owns it, so atomic access shouldn't be needed.

@gingerBill (Member):

Can I just say, I fucking love your work! It's always a surprise to see it and always a pleasure to read.

	}

	_resize_virtual_memory :: proc "contextless" (ptr: rawptr, old_size: int, new_size: int, alignment: int) -> rawptr {
		// NOTE(Feoramund): mach_vm_remap does not permit resizing, as far as I understand it.
Collaborator:

While this is true, you can do this more efficiently with these steps:

  1. mach_vm_allocate a bigger region
  2. mach_vm_remap from the smaller region into the new bigger region (which does the copying, and also maps the previous step's allocation)
  3. mach_vm_deallocate the old region

@laytan (Collaborator), Jan 24, 2025:

Also, mach_task_self() is implemented as #define mach_task_self() mach_task_self_.

Maybe you can use that global symbol to see a minor improvement as Odin does not optimise that function call.

EDIT: it is weird that mach_task_self is also a function symbol though, as a macro that wouldn't work 🤔

Contributor Author (Feoramund):

I tried optimizing this as you directed, but I don't think I'm using the API correctly. I've only found a little documentation for it.

If you could take a look at it again, that'd be great. I added one commit to revert to the old behavior that worked on ARM Mac, and my attempt at following the optimization suggested is intact in the commit behind that. I even tried to explicitly set the memory protections, and it's still segfaulting halfway through the virtual memory tests.

Collaborator:

I've sat with it for a bit, trying different incantations, but didn't actually get it to work either; the lack of documentation is really something here. I think when I tried it before, it appeared to work but wasn't actually doing what I expected.

Contributor Author (Feoramund):

That's unfortunate. Thank you for trying anyway.

@Feoramund (Contributor, Author):

Of note, I think it should be added that while this allocator is robust in the nominal sense, it's not at all "security hardened": nothing stops a buffer overflow from giving an attacker access to important heap data structures to cause havoc. It's actually quite unsafe from a hardening standpoint, because the heap data structure is still stored and maintained inline with regular allocations (header/footer, what have you) and is not using side-band metadata.

If preventing buffer overflows is a priority, then the StarMalloc paper is an excellent work on making a heap allocator resilient to this class of bugs, full of ideas I had not encountered in any other paper. They use out-of-band metadata as you suggest, as well as canaries to detect overflows, and guard pages. When setting out on implementing this allocator initially, I presumed that among the audience Odin targets, speed would be the higher priority.

I put my focus on making a fast lock-free parallel allocator because I figured that this was the most complex design encompassing the most general-purpose use, and anything else would be of lesser complexity and thus easy for someone with a specific problem to implement a solution for. That is, it's easy for someone to write a single-threaded bump allocator for a specific size class if that affords them greater performance, as it's very specific, has strong design constraints, and is simple.

I think if a StarMalloc-like allocator is preferred, then that might be easier to implement, since it uses mutexes for synchronization.

I'd be curious how binning performs in practice too. This allocator lacks a useful optimization most allocators implement, which is a "debouncing" filter on size classes. Essentially, some workloads have a lot of allocations that alternate like "big size, small size, big size, small size, ..." across classes (but land in the same bins after overflow), jumping back and forth, and eventually you end up with excessive internal fragmentation and scanning loops that degrade performance.

I can't recall encountering the term "debouncing" filter, and a quick search through a few of my PDFs turned up nothing, but I can say that if you allocate, say, 4KiB, then 8 bytes, and do that repeatedly back and forth, all of the 4KiB allocations will be adjacent, and all of the 8-byte allocations will be adjacent, up to a certain number.

So with 8-byte allocations, with the current default config, you get 7907 bins to play with. All of the 0-8 byte allocations will be placed into one of those bins, linearly from left to right. When the slab runs out, the allocator will try to find a new slab (which will also be subdivided into 7907 bins) to place future allocations of that same size.

Then with the 4KiB allocations, you'll get a slab that has only 15 bins (because the slab is 64KiB and we still need to keep some book-keeping data, so it's subdivided into 15 slots), and each 4KiB allocation will go into that slab until it runs out of space and a new slab is needed.

The allocator doesn't try to get a new slab for a bin size rank if it already has one, so it shouldn't be fragmenting in that way either.

All of the available slabs with open bins are kept in the heap's cache for quick access, each slab saves an index (next_free_sector) of where it can find the first free bin, and every superpage that has free slabs is also cached, so there shouldn't be much looping going on to find an open spot.

Hopefully this explanation clears up the allocation pattern for how this works.
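
For reference, the arithmetic those bin counts imply (the book-keeping figures are simply what is left over in a 64KiB slab):

	SLAB_SIZE :: 64 * 1024 // 65,536 bytes

	#assert(7907 * 8    == 63256) // 8B bins:   65,536 - 63,256 = 2,280 bytes left for book-keeping
	#assert(15   * 4096 == 61440) // 4KiB bins: 65,536 - 61,440 = 4,096 bytes left for book-keeping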

@laytan (Collaborator) commented Jan 24, 2025:

To give some insight about wasm:

WASM's memory model is very simple: a page is 64KiB, and there is only an API to grow (request more pages); there is no freeing or anything like that. So if you want 128KiB, you call it with 2, for 2 pages, and you get those back. AFAIK every page you request is next to the previously requested page. You can look at wasm_allocator.odin in runtime; it's a pretty small and basic allocator I put in based on the emscripten allocator. I am not sure if you can adapt this into this allocator, probably not. But because we already have a simple native allocator for wasm, it is not entirely necessary either.
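
To put numbers on that, the page arithmetic for a byte-sized request is just a round-up (a minimal sketch; the proc name is illustrative):

	WASM_PAGE_SIZE :: 64 * 1024

	// Number of 64KiB pages needed to satisfy `size` bytes, rounding up.
	// e.g. pages_for(128 * 1024) == 2, matching the example above.
	pages_for :: proc "contextless" (size: uint) -> uint {
		return (size + WASM_PAGE_SIZE - 1) / WASM_PAGE_SIZE
	}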

For Orca we do want to keep calling malloc: that calls into the Orca runtime, which is always bundled in, and otherwise we would have an allocator on top of their allocator, which doesn't make much sense. There's also the interest of bundle size, which is often a bigger factor on wasm.

@laytan (Collaborator) commented Jan 24, 2025:

Also, I can debug the macos failures this weekend if nobody else got to it.

@Feoramund (Contributor, Author):

You can look at wasm_allocator.odin in runtime; it's a pretty small and basic allocator I put in based on the emscripten allocator. I am not sure if you can adapt this into this allocator, probably not. But because we already have a simple native allocator for wasm, it is not entirely necessary either.

I see no reason to displace a perfectly good allocator that's been tuned for the platform.

For Orca we do want to keep calling malloc: that calls into the Orca runtime, which is always bundled in, and otherwise we would have an allocator on top of their allocator, which doesn't make much sense. There's also the interest of bundle size, which is often a bigger factor on wasm.

I can see about making an exception for Orca (and possibly other platforms), so that it'll be easy to have a malloc fallback.

@laytan (Collaborator) commented Jan 26, 2025:

Also, I can debug the macos failures this weekend if nobody else got to it.

So, the segfault on ARM macOS is because _allocate_virtual_memory_superpage is returning nil; the mach_vm_map call inside it is returning 4, for invalid argument. I think superpages aren't supported on ARM macOS.

	MEMORY_OBJECT_NULL :: 0
	VM_PROT_READ       :: 0x01
	VM_PROT_WRITE      :: 0x02
	VM_INHERIT_SHARE   :: 0
@laytan (Collaborator), Jan 26, 2025:

I see you use MAP_PRIVATE on other targets, but use VM_INHERIT_SHARE here. The equivalent to MAP_PRIVATE would be VM_INHERIT_COPY afaict.

Collaborator:

This could be why the Intel CI is failing once threading is involved but I can't confirm because I don't have an Intel machine.

Contributor Author (Feoramund):

This has been changed as suggested; however, Intel macOS is still stalling midway through the normal core test suite (but after all the allocator tests have passed).

Member:

As silly as this sounds, why not just have different flags for Intel and Arm?

Contributor Author (Feoramund):

Which flags are we talking about here? The inheritance flags? I double-checked the documentation, and VM_INHERIT_COPY looks to be the right choice. I just don't know why Intel Mac is stalling after a certain point (I do not have an Intel Mac machine either), and it may have nothing to do with this part of the code, since it was stalling before the change too.

@Feoramund (Contributor, Author):

So, the segfault on ARM macOS is because _allocate_virtual_memory_superpage is returning nil; the mach_vm_map call inside it is returning 4, for invalid argument. I think superpages aren't supported on ARM macOS.

That may very well be the case. I had assumed ARM MacOS at least supported some form of superpage flag, based on the following definitions, but it might just not support any, if 2MB is the only actual option:

https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/mach/vm_statistics.h#L353-L361

@Feoramund force-pushed the feoramalloc branch 7 times, most recently from 219755c to 6e0f518, February 11, 2025 23:37
@Feoramund (Contributor, Author):

I dug into the NetBSD issue. It looks like the 7th argument to the NetBSD mmap syscall must be placed on the stack, but it's not seen unless two words are pushed onto the stack (so the 7th argument is expected to be at $rsp+8 at the time of the syscall), and it's caller-pop. To get this far, I wrote a call to mmap in assembly and checked it with picotrace, NetBSD's tracer for analyzing syscalls, while also using lldb to verify the contents of the stack.

I'm not sure what $rsp+0 is supposed to be, if anything, at the time of syscall for this.

; https://mail-index.netbsd.org/netbsd-users/2023/04/14/msg029654.html
section .note.netbsd.ident note alloc noexec nowrite align=4
        dd 7          ; ELF_NOTE_NETBSD_NAMESZ
        dd 4          ; ELF_NOTE_NETBSD_DESCSZ
        dd 1          ; ELF_NOTE_TYPE_NETBSD_TAG
        dd 'NetBSD'   ; NetBSD string (8 bytes)
        dd 1000000000 ; __NetBSD_Version__ (please see <sys/param.h>)

%define SYS_exit 1
%define SYS_mmap 197

section .text
        global _start
_start:
        int 3
        int3

        mov rax, SYS_mmap
        mov rdi, 0             ; addr: void*
        mov rsi, 4096          ; len: size_t
        mov rdx, 0x1|0x2       ; prot: int  = PROT_READ|PROT_WRITE
        mov r10, 0x0002|0x1000 ; flags: int = MAP_PRIVATE|MAP_ANONYMOUS
        mov r8, -1             ; fd: int
        mov r9, 0x0            ; PAD: long (unused)
        push 0x0               ; pos: off_t
        push 0x0               ; ?
        syscall
        add rsp, 16            ; reset stack pointer (caller pop)

        mov rax, SYS_exit
        syscall

I might be able to alter syscall_bsd to push the extra arguments for NetBSD, but it concerns me that it needs an extra pushed value on the stack, and I don't know why that is. There is this documentation, which makes me think it's just another unused argument that doesn't appear in the syscall trace and has been lined up with the API in an out-of-order manner.

	ODIN_VIRTUAL_MEMORY_SUPPORTED :: VIRTUAL_MEMORY_SUPPORTED

	// Virtually all MMUs supported by Odin should have a 4KiB page size.
	PAGE_SIZE :: 4 * Kilobyte
Collaborator:

This is actually wrong on ARM macOS (it's 16KiB, even though libc's PAGE_SIZE is 4KiB for backwards compatibility or something), and possibly on others too. Did you think about doing what's done in core:mem/virtual, where the page size is queried in an @(init)? There is the vm_page_size global on Darwin too, but maybe just setting it to 16KiB on ARM macOS is fine as well; I just don't know if it is also different on other targets.

Contributor Author (Feoramund):

For the sake of simplicity, I had hoped to avoid runtime lookup of page and superpage sizes, in the hope that drawing on the MMU sizes would still be sane even when the virtual page size differed.

Changing the base:runtime virtual memory procedures to rely on a variable read at initialization wouldn't be difficult.

Considering the prevalence of kernel-configurable (super)page sizes, it might be worth checking them both at init for long-term stability, though it would complicate the heap allocator implementation if many of the constants had to depend on runtime variables. It's certainly doable, just complicated, because right now all the struct sizes are known at compile time, and that makes certain definitions easier.

I could look into this as the next significant change.
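
A minimal sketch of what that @(init) approach could look like (the query proc is hypothetical; the real one would wrap sysconf, vm_page_size, GetSystemInfo, or whatever each platform provides):

	page_size: uint = 4096 // safe default until initialization runs

	@(init)
	_init_page_size :: proc() {
		page_size = _query_page_size() // hypothetical per-platform proc
	}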

@Feoramund (Contributor, Author):

I have good news. I've since learned how to write a dynamic library in Odin that interfaces with this allocator and provides a complete malloc replacement API suitable for patching any executable on Linux with the LD_PRELOAD environment variable. I've been using this allocator to run a variety of C++ programs on my machine to stress test it and root out some tricky bugs.

I've reached the point where I can successfully open my web browser, open 40 tabs, run browser-based benchmarks, and close everything without a problem. It's actually powering real programs now.

@graphitemaster (Contributor) commented Mar 31, 2025:

In the case of over-aligned allocations > 64b there might be a trick we can do here, but I'm not sure what the worst-case performance or memory characteristics are for it.

The general idea is that most allocations are not naturally going to have addresses that are divisible by large alignment values. In the case of 128b alignment (the first alignment that is larger than 64) you'll have a memory address where the low seven bits are all zero. I posit that for most allocations that are not over-aligned, this will rarely occur.

So here's the idea: for alignments of 128b or larger, allocate N+A-1 (where A is the alignment), and shove an allocation header at the start of that. Since A will always be >= 128, this gives you 127 bytes of space to store whatever you want right before the allocation, so you could also just store the original pointer. You leave all other allocations the same, though; they do not require any over-alignment or header trickery (except in one case mentioned below).

When freeing memory, you simply count the number of trailing zeros of the address; if it's >= 7, you know it must be an over-aligned allocation, meaning it's safe to read the original pointer stored just before it.

The only time this method can fail is when the allocation is not over-aligned and happens to have an address that is a multiple of some power of two that is >= 128. In these cases I propose a change to the existing allocator:

When allocating anything, if the address returned happens to be aligned to some power of two that is >= 128, you try an in-place resize by 128 bytes; if the in-place resize works, you shove the original pointer before the returned memory, promoting what would be a regular allocation into a 128b-aligned one. This shouldn't happen often, though. If an in-place resize is not possible, you'll have to attempt an over-aligned allocation (128b) and copy over, freeing the old one. This should happen significantly less, to the point that statistically it should have a 1 in 2^41 chance.

Overall, this has the following characteristics:

  • Allocations with A >= 128 need to allocate 128 bytes more than usual
  • Allocations that unluckily return addresses that are a multiple of any power of 2 >= 128 need to in-place resize by up to 128b extra, or reallocate completely with a copy (statistically unlikely)
  • Frees need to perform one count-trailing-zeros on the memory address in all cases; when the count is >= 7 (one branch), they need to pointer-arithmetic their way to the real address, but in all other cases it's a regular free
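
A rough sketch of that free path (proc names hypothetical; this assumes the original pointer is stashed in the word immediately before the aligned address):

	import "base:intrinsics"

	free_maybe_overaligned :: proc "contextless" (ptr: rawptr) {
		// Low 7 bits all zero => the address is a multiple of 128 and,
		// under this scheme, is guaranteed to carry a hidden header.
		if intrinsics.count_trailing_zeros(uintptr(ptr)) >= 7 {
			original := (cast(^rawptr)(uintptr(ptr) - size_of(rawptr)))^
			heap_free(original) // hypothetical underlying free
		} else {
			heap_free(ptr)
		}
	}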

@Feoramund (Contributor, Author):

In the case of over-aligned allocations > 64b there might be a trick we can do here, but I'm not sure what the worst-case performance or memory characteristics are for it.

For this, I'm more concerned about complicating the allocator with extra special cases than about performance.

I think it comes down to the question of "what is the maximum alignment we want to support?" If we only need to go to 128 bytes, we can increase HEAP_MAX_ALIGNMENT for a simple increase in book-keeping costs on slabs and huge allocations, nothing more, or even make it a #config.

However, if we need to support completely arbitrary alignments, then something like this can be considered. I would really hope to keep the number of special cases as minimal as possible to prevent complicating the allocator and increasing the potential for bugs.

Are there any practical cases for N>128 byte alignment in a default scenario?

@Kelimion (Member):

There may be use cases for page-aligned allocations, but then you could just reach out to the virtual memory system directly.

@Feoramund (Contributor, Author):

I thought about this a bit more and realized that there's a simpler solution, if arbitrary alignment support is the way we want to go. Once again, we won't need any headers.

Take a slab of any bin size. Every bin is self-aligned to its own size. Based on the address of the slab data starting pointer (which is also bin slot 0) and the bin size, we can calculate precisely which bins will be aligned to numbers greater than the bin size by allocating only on the aligned interval. For example, this would allow for allocations of size 32 and alignment 64 or size 64 and alignment 128 without wasting any memory at all. All sizes would live together, regardless of alignment request.

The bins not used by "hopping" along this aligned interval will still be available for standard alignment requests. Of course, you do subdivide a slab's available memory by doing this if all you're doing is making allocations with unusual alignments, but the logic to handle it all should be simpler than needing headers and having to check for them upon freeing.

However, this means that dirty_bins will be marking some number of bins outside the interval as dirty when they never were used, causing already-zeroed memory to be zeroed again upon a new allocation. This may not be that much of a cost in practice.

There's also the possibility of performance degradation with regard to the CPU cache if these non-size alignments are allocated interspersed with standard alignments. A series of size-32/align-64 allocations will inevitably interlock with a series of size-32/align-32 allocations, and the allocation pattern might not be as preferable for some applications as if the two were segregated, with the size-32 allocations in one slab and the align-64 allocations in the size-64 slab. I think that might be application-specific and probably not a big deal, though, but it did come to mind.

I'll have to think on this more.
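
For illustration, the interval calculation itself is trivial (names illustrative; this assumes the slab data base is itself aligned to at least the requested alignment):

	// For bin_size 32 and alignment 64: stride 2, so bins 0, 2, 4, ...
	// are the 64-byte-aligned ones when the base is 64-byte aligned.
	aligned_bin_stride :: proc "contextless" (bin_size, alignment: uintptr) -> uintptr {
		return alignment / bin_size
	}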


Now that I have a working malloc shim, I'm going to start putting together benchmarks and test suites that others have made and plug the allocator into those to see what the results look like.

Feoramund added 3 commits May 8, 2025 11:06
- Add the test bench for the allocator
- Move old allocator code to the test bench
- Fix `heap_resize` usage in `os2/env_linux.odin` to fit new API
  requiring `old_size`
Feoramund added 18 commits May 8, 2025 11:06
Increment of `i` is handled near the start of this loop.
At this point, we know we're expanding the wide slab, so `old_size` will
always be the right size.
Previously, a wide slab could be freed during this phase which could
trigger the superpage itself to be freed, resulting in iteration across
potentially invalid memory if the superpage was returned to the host and
not placed in the orphanage.

`heap_free_wide_slab` has had its responsibilities reduced to just
freeing the wide slab, and `heap_free_superpage_if_empty_and_unused` has
been added to better clarify what is happening where.
This is a total rewrite after having experimented with a variety of
techniques including a dual-allocator strategy for small and large size
classes where the large class allocator had a coalescing mechanism.

This allocator design is much closer to `mimalloc` in spirit and
performs almost twice as fast as the original bitmap-based design.

Additionally, several bugs have been fixed. The most important one being
the lock-free synchronization method used for remote frees. This design
uses a single atomic pointer that is doubly-tagged to pass remote frees
either to the heap or to the slab, depending on the ownership status.
This is a workaround for a lack of an ability to tell TSan that it
should clear any state it has about certain memory ranges to prevent
false positives.
@Feoramund force-pushed the feoramalloc branch 2 times, most recently from 2859da1 to 674853f, May 9, 2025 22:15
Feoramund added 2 commits May 9, 2025 18:15
The new address sanitizer code was committed to a file that I had later
moved and renamed with the old commits.

Having rebased onto master, those commits were carried on to the present
point without generating a conflict and have since become invalid.
@Feoramund (Contributor, Author) commented May 9, 2025:

The latest heap allocator rewrite has been pushed. It's been a lot of debugging work, but we're near the end.

This rewrite was done for performance, simplicity, and correctness. There was a subtle synchronization bug that had kept cropping up in rare instances which led me to go digging through papers and source code again for an idea. I replaced the previously complicated synchronization method that used 3 variables in favor of a doubly-tagged pointer that can redirect the flow of remote frees from heap to slab depending on the ownership status.

I've since taken the allocator to a design much closer to a simplified version of Mimalloc. The actual implementation is under 1kLoC now. heap*.odin comes up as 1,110 lines of code using tokei.

I've integrated the AddressSanitizer by using @Lperlind's great new API which should address all of @graphitemaster's hardening concerns. We can guard against invalid frees, double frees, use-after-frees, and buffer overflows when -sanitize:address is on.

Of note, we no longer free virtual memory if TSan is enabled in order to bypass any false positives. If there's a TSan API that can wipe state for memory regions, that'd be great, but I couldn't find one when I went digging into the header files I could find on a web search.

Final TODO list:

  1. Fix Intel macOS issues. I have access to a real machine I can test this on now.
  2. Fix NetBSD vmem: I should be able to come up with a fix for this. I had one in mind before I set out on this rewrite.
  3. Get the page and superpage sizes from the OS at runtime (which is going to be far easier now that the allocator dynamically handles capacities it's given for metadata).
  4. Give everything a final review.

	OS:      Arch Linux, Linux 6.14.5-arch1-1
	CPU:     12th Gen Intel(R) Core(TM) i7-12700K
	RAM:     31906 MiB
	Backend: LLVM 19.1.7

[Benchmark charts: single-threaded, multi-threaded, single-threaded random, and peak memory results]

@graphitemaster (Contributor):

Do you think we can get some benches that compare this allocator to rpmalloc, tlsf, and scudo? These three are essentially considered world-class when it comes to lock-free allocators. The others in your bench do use internal locking, so you're comparing a lock-free allocator to non-lock-free ones, which isn't a fair comparison. Otherwise this looks absolutely amazing, and the simplification is always a good change. Curious about the TSAN thing; will have to do some googling on that.

@Feoramund (Contributor, Author):

The others in your bench do use internal locking, so you're comparing a lock-free allocator to non-lock-free ones, which isn't a fair comparison.

Mimalloc is lock-free, so that one falls into the category you want to see. I did a quick test of rpmalloc a day ago, without doing the full range of tests, but it seemed to be on par with Mimalloc, if not faster. I can do a full range of tests and include it in the next benchmark chart. As for tlsf and scudo, if I can compile them without issue, I can add them to the benchmarks too.

@Feoramund (Contributor, Author) commented May 11, 2025:

I've pushed the code for runtime page_size fetching from the OS, but there's an issue with initialization order that causes the code in os to initialize before runtime (sometimes, and non-deterministically), which is causing the floating-point errors (division by zero, of course).

I have reported this in #5146.

The code works fine when not compiled with os, such as in a barebones DLL, which is how I've been running the benchmarks.
