Add native lock-free dynamic heap allocator #4749
Conversation
First of all, this is absolutely amazing work. You've done an incredible job here.
I've not done a full review, since there's certainly more to check with real-world benches and because I'd have to actually run the code, but from what I've gleaned reading the changes on GitHub, there are some comments.

I also want to ask for some real-world graphs from running the myriad of heap bench tools out there, and some of the harder stress tests for memory allocators that stress their threading characteristics.

Of note, I think it should be added that while this allocator is robust in the nominal sense, it's not at all "security hardened" in the sense that a buffer overflow can't give an attacker access to important heap data structures to cause havoc. It's actually quite unsafe from a hardening standpoint, because the heap data structure is still stored and maintained inline with regular allocations (header/footer, what have you) and is not using side-band metadata.

I'd be curious how binning performs in practice too. This allocator lacks a useful optimization most allocators implement, which is a "debouncing" filter on size classes. Essentially, some workloads have a lot of allocations that alternate like "big size, small size, big size, small size, ..." (different size classes, but the same bins after overflow), jumping back and forth, and eventually you end up with excessive internal fragmentation and scanning loops that degrade performance.
```odin
	heap_slab_clear_data(slab)
}
slab.bin_size = size
slab.is_full = true
```
This does require an atomic store for coherency reasons; it's valid for this write not to be flushed from cache to backing memory until after `ptr` is returned and used by other threads on exotic architectures.
Do you have a reference for this? I'm always interested in more documentation.
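For illustration, the change being discussed would be something along these lines (a minimal sketch using Odin's atomic intrinsics; the `Heap_Slab` fields are taken from the snippet above, and the `.Release` ordering is my reading of the coherency concern, not a confirmed choice):

```odin
package example

import "base:intrinsics"

Heap_Slab :: struct {
	bin_size: int,
	is_full:  bool,
}

// Publish the slab metadata with release semantics so that other threads which
// later observe the returned pointer also observe these fields.
publish_slab :: proc(slab: ^Heap_Slab, size: int) {
	intrinsics.atomic_store_explicit(&slab.bin_size, size, .Release)
	intrinsics.atomic_store_explicit(&slab.is_full, true, .Release)
}
```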
```odin
if superpage == local_heap_tail {
	assert_contextless(superpage.next == nil, "The heap allocator's tail superpage has a next link.")
	assert_contextless(superpage.prev != nil, "The heap allocator's tail superpage has no previous link.")
	local_heap_tail = superpage.prev
```
Likely need an atomic load of `superpage.prev` here as well.
It should be fine without, as only the owning thread ever interacts with `prev`.
```odin
assert_contextless(superpage.prev != nil, "The heap allocator's tail superpage has no previous link.")
local_heap_tail = superpage.prev
// We never unlink all superpages, so no need to check validity here.
superpage.prev.next = nil
```
LLVM might optimize this code to `local_heap_tail.next = nil` because of the non-atomic load of `superpage.prev` above (and the lack of a barrier) here, even though another thread can replace `superpage.prev`, and so this code is actually assigning `nil` to that other thread's `superpage.prev` and not the one loaded into `local_heap_tail`. This area seems a bit problematic actually; do both of these operations (reading `prev` and assigning `next` to `nil`) need to happen atomically?
It's fine to change `superpage.prev.next = nil` to `local_heap_tail.next = nil`; they're semantically the same. `superpage.prev` is only accessed by the thread which owns it, so atomic access shouldn't be needed.
Can I just say, I fucking love your work! It's always a surprise to see it and always a pleasure to read.
```odin
}

_resize_virtual_memory :: proc "contextless" (ptr: rawptr, old_size: int, new_size: int, alignment: int) -> rawptr {
	// NOTE(Feoramund): mach_vm_remap does not permit resizing, as far as I understand it.
```
While this is true, you can do this more efficiently with these steps:

1. `mach_vm_allocate` a bigger region
2. `mach_vm_remap` from the smaller region into the new bigger region (which does the copying, and also maps the previous step's allocation)
3. `mach_vm_deallocate` the old region
Also, `mach_task_self()` is implemented as `#define mach_task_self() mach_task_self_`.

Maybe you can use that global symbol to see a minor improvement, as Odin does not optimise that function call.

EDIT: it is weird that `mach_task_self` is also a function symbol though, as a macro that wouldn't work 🤔
I tried optimizing this as you directed, but I don't think I'm using the API correctly. I've only found a little documentation for it.
If you could take a look at it again, that'd be great. I added one commit to revert to the old behavior that worked on ARM Mac, and my attempt at following the optimization suggested is intact in the commit behind that. I even tried to explicitly set the memory protections, and it's still segfaulting halfway through the virtual memory tests.
I've sat with it for a bit, trying different incantations, but didn't actually get it to work either; the lack of documentation is really something here. I think when I tried it before, it appeared to work but wasn't actually doing it how I expected.
That's unfortunate. Thank you for trying anyway.
If preventing buffer overflows is a priority, then the StarMalloc paper is excellent work on making a heap allocator resilient to this class of bugs, full of ideas which I had not encountered in any other paper. They use out-of-band metadata as you suggest, as well as canaries to detect overflows and guard pages.

When setting out on implementing this allocator initially, I presumed that among the audience that Odin targets, speed would be the higher priority. I put my focus on making a fast lock-free parallel allocator, because I figured that this was the most complex design encompassing the most general-purpose use; anything else would be of lesser complexity and thus easier for someone with a specific problem to implement a solution for. That is, it's easy for someone to write a single-threaded bump allocator for a specific size class if that affords them greater performance, as it's very specific, has strong design constraints, and is simple.

I think if a StarMalloc-like allocator is preferred, then that might be easier to implement, since it uses mutexes for synchronization.
I can't recall encountering the term "debouncing" filter, and a quick search through a few of my PDFs turned up nothing, but I can say that if you allocate, say, 4KiB, then 8 bytes, and do that repeatedly back and forth, all of the 4KiB allocations will be adjacent, and all of the 8-byte allocations will be adjacent, up to a certain number.

So with 8-byte allocations, with the current default config, you get 7907 bins to play with. All of the 0-8 byte allocations will be placed into one of those bins, linearly from left to right. When the slab runs out, the allocator will try to find a new slab (which will also be subdivided into 7907 bins) to place future allocations of that same size into. Then with the 4KiB allocations, you'll get a slab that has only 15 bins (because the slab is 64KiB and we still need to keep some book-keeping data, so it's subdivided down to 15 slots), and each 4KiB allocation will go into that, until it runs out of space and the allocator needs to find a new slab.

The allocator doesn't try to get a new slab for a bin size rank if it already has one, so it shouldn't be fragmenting in that way either. All of the available slabs with open bins are kept in the heap's cache for quick access, and each slab has an index saved (…).

Hopefully this explanation clears up the allocation pattern for how this works.
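For anyone following along, the back-of-the-envelope arithmetic behind those counts looks like this; the per-slab book-keeping figures below are back-calculated from the numbers quoted above rather than read out of the implementation:

```odin
package example

import "core:fmt"

SLAB_SIZE :: 64 * 1024 // default slab size

// How many bins of a given size fit in a slab once book-keeping space is carved out.
bins_per_slab :: proc(bin_size, bookkeeping: int) -> int {
	return (SLAB_SIZE - bookkeeping) / bin_size
}

main :: proc() {
	fmt.println(bins_per_slab(4096, 4096)) // 15 slots for 4KiB bins
	fmt.println(bins_per_slab(8, 2280))    // 7907 slots for 8-byte bins
}
```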
To give some insight about WASM: WASM's memory model is very simple. A page is 64KiB, and there is only an API to grow (request more pages); there is no freeing or anything like that. So if you want 128KiB, you call it with 2, for 2 pages, and you get those back. AFAIK every page you request is next to the previously requested page. You can look at the …

For Orca we do want to keep calling …
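To make the page math concrete, requesting memory on wasm from Odin is roughly the following; the `wasm_memory_grow` intrinsic and its page-count semantics are assumed here, so treat it as a sketch:

```odin
package example

import "base:intrinsics"

WASM_PAGE_SIZE :: 64 * 1024

// Grow linear memory by enough 64KiB pages to cover `bytes` and return a
// pointer to the start of the newly added region.
grow_by_bytes :: proc "contextless" (bytes: int) -> rawptr {
	pages := uintptr((bytes + WASM_PAGE_SIZE - 1) / WASM_PAGE_SIZE)
	prev_pages := intrinsics.wasm_memory_grow(0, pages) // old size in pages, or -1 on failure
	if prev_pages < 0 {
		return nil
	}
	return rawptr(uintptr(prev_pages) * WASM_PAGE_SIZE)
}
```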
Also, I can debug the macOS failures this weekend if nobody else gets to it.
I see no reason to displace a perfectly good allocator that's been tuned for the platform.
I can see to making an exception for Orca (and possibly other platforms), so that it'll be easy to have a …
So, the segfault on ARM macOS is because …
```odin
MEMORY_OBJECT_NULL :: 0
VM_PROT_READ :: 0x01
VM_PROT_WRITE :: 0x02
VM_INHERIT_SHARE :: 0
```
I see you use `MAP_PRIVATE` on other targets, but use `VM_INHERIT_SHARE` here. The equivalent to `MAP_PRIVATE` would be `VM_INHERIT_COPY`, afaict.
This could be why the Intel CI is failing once threading is involved but I can't confirm because I don't have an Intel machine.
This has been changed as suggested; however, Intel macOS is still stalling midway through the normal core test suite (but after all the allocator tests have passed).
As silly as this sounds, why not just have different flags for Intel and Arm?
Which flags are we talking about here? The inheritance flags? I double-checked the documentation, and `VM_INHERIT_COPY` looks to be the right choice. I just don't know why Intel Mac is stalling after a certain point (I do not have an Intel Mac machine either), and it may have nothing to do with this part of the code, since it was stalling before the change too.
That may very well be the case. I had assumed ARM macOS at least supported some form of superpage flag, based on the following definitions, but it might just not support any, if 2MB is the only actual option: …
I dug into the NetBSD issue. It's looking like the 7th argument to the NetBSD … I'm not sure what …

```nasm
; https://mail-index.netbsd.org/netbsd-users/2023/04/14/msg029654.html
section .note.netbsd.ident note alloc noexec nowrite align=4
	dd 7          ; ELF_NOTE_NETBSD_NAMESZ
	dd 4          ; ELF_NOTE_NETBSD_DESCSZ
	dd 1          ; ELF_NOTE_TYPE_NETBSD_TAG
	dd 'NetBSD'   ; NetBSD string (8 bytes)
	dd 1000000000 ; __NetBSD_Version__ (please see <sys/param.h>)

%define SYS_exit 1
%define SYS_mmap 197

section .text
global _start
_start:
	int 3
	int3

	mov rax, SYS_mmap
	mov rdi, 0             ; addr: void*
	mov rsi, 4096          ; len: size_t
	mov rdx, 0x1|0x2       ; prot: int = PROT_READ|PROT_WRITE
	mov r10, 0x0002|0x1000 ; flags: int = MAP_PRIVATE|MAP_ANONYMOUS
	mov r8, -1             ; fd: int
	mov r9, 0x0            ; PAD: long (unused)
	push 0x0               ; pos: off_t
	push 0x0               ; ?
	syscall
	add rsp, 16            ; reset stack pointer (caller pop)

	mov rax, SYS_exit
	syscall
```

I might be able to alter …
base/runtime/virtual_memory.odin
```odin
ODIN_VIRTUAL_MEMORY_SUPPORTED :: VIRTUAL_MEMORY_SUPPORTED

// Virtually all MMUs supported by Odin should have a 4KiB page size.
PAGE_SIZE :: 4 * Kilobyte
```
This is actually wrong on ARM macOS, where it's 16KiB (even though libc's `PAGE_SIZE` is 4KiB for backwards compatibility or something), and possibly on others too. Did you think about doing what's done in `core:mem/virtual`, where the page size is queried in an `@(init)`? There is the `vm_page_size` global on Darwin too, but maybe just setting it to 16KiB on ARM macOS is fine; I just don't know if it is also different on other targets.
For the sake of simplicity, I had hoped to avoid runtime lookup of page sizes and superpage sizes, in hopes that drawing on the MMU sizes would still be sane even when the virtual page size differed.
Changing the base:runtime
virtual memory procedures to rely on a variable read at initialization wouldn't be difficult.
Considering the prevalence of kernel-configurable (super)page sizes, it might be worth doing for long-term stability to check them both at init, though it would complicate the heap allocator implementation if many of the constants had to be changed to depend on runtime variables. It's certainly doable, just complicated because right now all the struct sizes are known at compile-time and that makes certain definitions easier.
I could look into this as the next significant change.
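If the runtime-lookup route is taken, a minimal sketch of the `@(init)` approach on a POSIX-like target might look like this; the hand-written `sysconf` binding and the `_SC_PAGESIZE` value are illustrative assumptions and would differ per platform:

```odin
package example

foreign import libc "system:c"

foreign libc {
	sysconf :: proc "c" (name: i32) -> i64 ---
}

_SC_PAGESIZE :: 30 // Linux value; other platforms use different numbers

// Falls back to the common 4KiB if the query fails.
page_size: uint = 4 * 1024

@(init)
determine_page_size :: proc() {
	if n := sysconf(_SC_PAGESIZE); n > 0 {
		page_size = uint(n)
	}
}
```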
I have good news. I've since learned how to write a dynamic library in Odin that interfaces with this allocator and provides a complete …

I've reached the point where I can successfully open my web browser, open 40 tabs, run browser-based benchmarks, and close everything without a problem. It's actually powering real programs now.
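For readers curious what such a shim looks like, the rough shape is below; `heap_alloc` and `heap_free` are hypothetical stand-ins for the runtime's real entry points, and the exported symbol set here is deliberately minimal:

```odin
package heap_shim

// Hypothetical internal entry points; the real names in the runtime may differ.
heap_alloc :: proc "contextless" (size: int) -> rawptr { return nil }
heap_free  :: proc "contextless" (ptr: rawptr)         { }

@(export, link_name="malloc")
shim_malloc :: proc "c" (size: uint) -> rawptr {
	return heap_alloc(int(size))
}

@(export, link_name="free")
shim_free :: proc "c" (ptr: rawptr) {
	heap_free(ptr)
}
```

A library along these lines would be built as a shared object and injected into existing programs with `LD_PRELOAD`, which is presumably how the browser experiment above was run.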
In the case of over-aligned allocations > 64b there might be a trick we can do here, but I'm not sure what the worst-case performance or memory characteristics are for it.

The general idea is that most allocations are not naturally going to have addresses that are divisible by large alignment values. In the case of 128b alignment (the first alignment that is larger than 64), you'll have a memory address where the low seven bits are all zero. I posit that for most allocations that are not over-aligned, this will rarely occur.

So here's the idea: for alignments 128b or larger, allocate N+A-1 (where A is the alignment), and shove an allocation header at the start of that. Since A will always be >= 128, this gives you 127 bytes of space to store whatever you want right before the allocation, so you could also just store the original pointer. You leave all other allocations the same though. They do not require any over-alignment or header trickery (except in one case mentioned below).

When freeing memory, you simply count the number of trailing zeros of the address; if it's >= 7, you know it must be an over-aligned allocation, meaning it's safe to read just before it to find the original allocation to free.

The only time this method can fail is when the allocation is not over-aligned and happens to have an address that is a multiple of some power of two that is >= 128. In these cases I propose a change to the existing allocator: when allocating anything, if the address returned happens to be aligned to some power of two that is >= 128, you try an in-place resize by 128 bytes; if the in-place resize works, you shove the original pointer before the returned memory, promoting what would be a regular allocation into a 128b-aligned one. This shouldn't happen often though. If an in-place resize is not possible, you'll have to attempt an over-aligned allocation (128b) and copy over, freeing the old one. This should happen significantly less, to the point that statistically it should have a 1 in 2^41 chance.

Overall, this has the following characteristics: …
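A rough sketch of the trick as described above, with `heap_alloc`/`heap_free` as stand-ins for the allocator's real entry points; it covers only the over-aligned path and the trailing-zero check on free, not the in-place-resize promotion for ordinary allocations that happen to land on a 128-byte boundary:

```odin
package example

import "base:intrinsics"

// Hypothetical stand-ins for the allocator's entry points.
heap_alloc :: proc(size: int) -> rawptr { return nil }
heap_free  :: proc(ptr: rawptr)         { }

// For alignments >= 128, over-allocate and stash the original pointer in the
// slack just before the aligned address handed back to the caller.
aligned_alloc_sketch :: proc(size, align: int) -> rawptr {
	assert(align >= 128)
	raw := heap_alloc(size + align - 1 + size_of(rawptr))
	if raw == nil { return nil }
	base    := uintptr(raw) + size_of(rawptr)
	aligned := (base + uintptr(align) - 1) &~ (uintptr(align) - 1)
	stash   := (^rawptr)(rawptr(aligned - size_of(rawptr)))
	stash^   = raw
	return rawptr(aligned)
}

aligned_free_sketch :: proc(ptr: rawptr) {
	// Addresses with >= 7 low zero bits are treated as over-aligned; this is
	// exactly the false-positive case the promotion scheme above has to handle
	// for ordinary allocations that are accidentally 128-byte aligned.
	if intrinsics.count_trailing_zeros(uintptr(ptr)) >= 7 {
		heap_free((^rawptr)(rawptr(uintptr(ptr) - size_of(rawptr)))^)
	} else {
		heap_free(ptr)
	}
}
```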
For this, I'm more concerned about complicating the allocator with extra special cases than about performance. I think it comes down to the question of "what is the maximum alignment we want to support?" If we need to just do 128 bytes, we can increase …

However, if we need to support completely arbitrary alignments, then something like this can be considered. I would really hope to keep the number of special cases as minimal as possible, to prevent complicating the allocator and increasing the potential for bugs.

Are there any practical cases for N>128 byte alignment in a default scenario?
There may be use cases for page-aligned allocations, but then you could just reach out to the virtual memory system directly.
I thought about this a bit more and realized that there's a simpler solution, if arbitrary alignment support is the way we want to go. Once again, we won't need any headers.

Take a slab of any bin size. Every bin is self-aligned to its own size. Based on the address of the slab data starting pointer (which is also bin slot 0) and the bin size, we can calculate precisely which bins will be aligned to numbers greater than the bin size, by allocating only on the aligned interval. For example, this would allow for allocations of size 32 and alignment 64, or size 64 and alignment 128, without wasting any memory at all. All sizes would live together, regardless of alignment request. The bins not used by "hopping" along this aligned interval will still be available for standard alignment requests.

Of course, you do subdivide a slab's available memory by doing this if all you're doing is making allocations of unusual alignments, but the logic that would handle it all should be simpler than needing headers and having to check for those upon freeing. However, this means that …

There's also the possibility of performance degradation with regards to the CPU cache if these non-size alignments are allocated interspersed with standard alignments. A series of size 32 / align 64 allocations will inevitably interlock with a series of size 32 / align 32 allocations, and the allocation pattern might not be as preferable for some applications as if the two were segregated, with the size 32 allocations in one slab and the align 64 allocations in the size 64 slab. I think that might be application-specific and probably not a big deal though, but it did come to mind.

I'll have to think on this more. Now that I have a working …
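To illustrate the interval idea (a sketch only; `slab_data` and the power-of-two assumptions are mine): given the slab's data pointer and a bin size, the bins that satisfy a larger alignment fall on a fixed stride.

```odin
package example

// Returns the first bin index whose address satisfies `alignment`, and the
// stride between such bins. Assumes bin_size and alignment are powers of two
// with alignment > bin_size, that bin i lives at slab_data + i*bin_size, and
// that slab_data itself is at least bin_size-aligned.
aligned_bins :: proc(slab_data: uintptr, bin_size, alignment: int) -> (first, stride: int) {
	stride = alignment / bin_size
	misalignment := int(slab_data & uintptr(alignment - 1))
	if misalignment != 0 {
		first = (alignment - misalignment) / bin_size
	}
	return
}
```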
- Add the test bench for the allocator
- Move old allocator code to the test bench
- Fix `heap_resize` usage in `os2/env_linux.odin` to fit new API requiring `old_size`
Increment of `i` is handled near the start of this loop.
At this point, we know we're expanding the wide slab, so `old_size` will always be the right size.
Previously, a wide slab could be freed during this phase which could trigger the superpage itself to be freed, resulting in iteration across potentially invalid memory if the superpage was returned to the host and not placed in the orphanage. `heap_free_wide_slab` has had its responsibilities reduced to just freeing the wide slab, and `heap_free_superpage_if_empty_and_unused` has been added to better clarify what is happening where.
This is a total rewrite after having experimented with a variety of techniques including a dual-allocator strategy for small and large size classes where the large class allocator had a coalescing mechanism. This allocator design is much closer to `mimalloc` in spirit and performs almost twice as fast as the original bitmap-based design. Additionally, several bugs have been fixed. The most important one being the lock-free synchronization method used for remote frees. This design uses a single atomic pointer that is doubly-tagged to pass remote frees either to the heap or to the slab, depending on the ownership status.
This is a workaround for the lack of a way to tell TSan that it should clear any state it has about certain memory ranges, to prevent false positives.
The new address sanitizer code was committed to a file that I had later moved and renamed with the old commits. Having rebased onto master, those commits were carried on to the present point without generating a conflict and have since become invalid.
The latest heap allocator rewrite has been pushed. It's been a lot of debugging work, but we're near the end.

This rewrite was done for performance, simplicity, and correctness. There was a subtle synchronization bug that had kept cropping up in rare instances, which led me to go digging through papers and source code again for an idea. I replaced the previously complicated synchronization method that used 3 variables in favor of a doubly-tagged pointer that can redirect the flow of remote frees from heap to slab depending on the ownership status. I've since taken the allocator to a design much closer to a simplified version of Mimalloc. The actual implementation is under 1kLoC now.

I've integrated the AddressSanitizer by using @Lperlind's great new API, which should address all of @graphitemaster's hardening concerns. We can guard against invalid frees, double frees, use-after-frees, and buffer overflows when …

Of note, we no longer free virtual memory if TSan is enabled, in order to bypass any false positives. If there's a TSan API that can wipe state for memory regions, that'd be great, but I couldn't find one when I went digging into the header files I could find on a web search.

Final TODO list: …
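For readers unfamiliar with the technique, the general shape of a tagged atomic pointer is sketched below; the tag bits and the operation shown are purely illustrative, not the allocator's actual encoding:

```odin
package example

import "base:intrinsics"

// Allocations are at least 8-byte aligned, so the low bits of a pointer are
// free to carry tags. Two bits are packed here purely for illustration.
TAG_MASK :: uintptr(0b11)

pack :: proc "contextless" (ptr: rawptr, tag: uintptr) -> uintptr {
	return uintptr(ptr) | (tag & TAG_MASK)
}

unpack :: proc "contextless" (word: uintptr) -> (ptr: rawptr, tag: uintptr) {
	return rawptr(word &~ TAG_MASK), word & TAG_MASK
}

// Atomically retag the head word while preserving the pointer part, e.g. to
// redirect where future remote frees should be routed.
retag :: proc "contextless" (head: ^uintptr, new_tag: uintptr) {
	for {
		old := intrinsics.atomic_load_explicit(head, .Relaxed)
		ptr, _ := unpack(old)
		if _, ok := intrinsics.atomic_compare_exchange_weak_explicit(head, old, pack(ptr, new_tag), .Acq_Rel, .Relaxed); ok {
			return
		}
	}
}
```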
Do you think we can get some benches that compare this allocator to rpmalloc, TLSF, and Scudo? These three are essentially considered world class when it comes to lock-free allocators. The others in your bench use internal locking, so you're comparing a lock-free allocator to non-lock-free ones, which isn't a fair comparison.

Otherwise this looks absolutely amazing, and the simplification is always a good change. Curious about the TSAN thing, will have to do some googling of that.
Mimalloc is lock-free, so that one falls into the category you want to see. I did a quick test of rpmalloc a day ago without doing the full range of tests but it seemed to be on par with Mimalloc, if not faster. I can do a full range of tests and include it in the next benchmark chart. As for …
This is to support getting `auxv` on SysV platforms.
I've pushed the code for runtime …

I have reported this in #5146. The code works fine when not compiled with …
Native Odin-based Heap Allocator
After an intense development and rigorous testing process over the past three months, I am finally ready to share the results of my latest project.
In short, this is a lock-free dynamic heap allocator written solely in Odin, utilizing direct virtual memory access to the operating system where possible. Only system calls are used to ask the OS for virtual memory (except on operating systems where this is verboten, such as Darwin, where we use their libc API to get virtual memory), and the allocator handles everything else from
there.
Rationale
Originally, I was working on porting all of `core` to use `os2`, when I found the experimental heap allocator in there. Having hooked my unreleased `os2` test framework up to it, I found that it suffered from more race conditions than had already been found, as well as other synchronization issues such as apparent misunderstandings of how atomic operations work.

The most confusing code that stood out to me was the following block: …

All three of those fields exist in the same `u64`. It would make more sense to have atomically loaded `alloc`, then read each field individually. I spent a few days trying to make sense of `heap_linux.odin`, but it was a bit much for me at the time. The previous block and the warnings listed by TSan didn't give me great hope that it would be simple to fix.

So, I did what I think most programmers do in this situation: I decided to try writing my own from nothing and hopefully come to a better appreciation of the problem as a whole.
I combed through the last 30 years of the literature on allocators, with some reading of papers on parallelism.
For dynamic heap allocators, I found that contention, false sharing, and heap blowup were issues often mentioned. The Hoard paper (Berger et al., 2000) was particularly helpful in figuring out an overall design for solving those issues.
There is hopefully nothing too novel about the design I've put together here. We're all on well-trodden ground. I think the most exciting feature is the free bitmap, where most allocators appear to use free lists instead.
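As a toy illustration of the bitmap approach (the layout is assumed here, not taken from the allocator's actual structures): finding a free bin is a scan for the first set bit, which also naturally favors the lowest address.

```odin
package example

import "base:intrinsics"

// Each set bit marks a free bin; the lowest set bit is the lowest address.
find_free_bin :: proc(free_bitmap: []u64) -> (index: int, ok: bool) {
	for word, i in free_bitmap {
		if word != 0 {
			return i * 64 + int(intrinsics.count_trailing_zeros(word)), true
		}
	}
	return 0, false
}
```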
Goals
The design of this allocator was guided by three principles.
Features
- … `AND` operations on the addresses distributed by the allocator.
- … `malloc` and `free` an opaque barrier. Given that the code is available right here in the runtime itself, it allows configuration to any programmer's needs. It provides uniform behavior across all platforms, as well; a programmer need not contemplate how heap allocation may impact performance on one system versus another.
- … when `ODIN_DEBUG_HEAP` is enabled.
- … `runtime.get_local_heap_info()` API.

Benchmark Results
Because of the wisdom from the quote above, I won't spend much time here except to say that the included test bench has microbenchmarks written by me for the purpose of making sure that the allocator is at least not egregiously slow in certain made-up scenarios.
If you believe these benchmarks can align with realistic situations, then this allocator is 2-3 times faster than libc malloc in general use case scenarios (so any allocation less than ~63KiB), on my AMD64 Linux-based system, compiled with `-o:aggressive -disable-assert -no-bounds-check -microarch:alderlake`.

Any speed gain drops off above allocations of 32KiB in size, because this is where bin allocations are no longer possible with the default configuration, and the allocator has to resort to coalescing entire slabs to fit the requests. I decided to accept this result of the design, as it's not that much slower than malloc, and I believe that rapid allocation of >=64KiB blocks is a special case and not the usual case for most programs.
The full test suite can be run with:
The benchmarks can be run with:
The `allocator` command line option can be switched to `libc` to use the old behavior.

Memory Usage
Speed aside, I can say that there are points to be aware of with this allocator, particularly in how it uses memory, which are clear and not as susceptible to application patterns like benchmarking may be.
For one, due to the nature of slab allocation, any allocation will always use the most amount of space possible within a bin rank, so if you request 9 bytes, you will in actual fact consume 16, as that is the next power of two available. This continues for every power of two up to the maximum bin size of 32KiB.
This shouldn't be too surprising at lower sizes, as with a non-slab general purpose allocator, you're almost guaranteed to have some book-keeping somewhere, which would result in an allocation of 8 bytes actually using 16 or 24 bytes, depending on the header.
This begins to break down at higher sizes, however. If you allocate 257 bytes instead of 256, you're going to be placed into a bin of 512 bytes. This may seem wasteful, but there is a consideration for this: every allocation of a particular size rank is tightly packed next to each other, which increases cache locality. It's a memory for speed tradeoff, in the end.
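A small illustration of that rounding (not the allocator's actual code, and the minimum bin size here is assumed):

```odin
package example

import "core:fmt"
import "base:intrinsics"

// Round a requested size up to the next power-of-two bin rank.
round_to_bin_size :: proc(size: int) -> int {
	if size <= 8 { return 8 } // assumed minimum bin size, for illustration only
	return 1 << uint(64 - intrinsics.count_leading_zeros(u64(size - 1)))
}

main :: proc() {
	fmt.println(round_to_bin_size(9))   // 16
	fmt.println(round_to_bin_size(257)) // 512
}
```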
The alignment is also used as the size if it is the larger of the two, up to a maximum of 64 bytes by default. This was one of the design choices made to help eliminate any need for headers. Beyond a size of 64 bytes, all allocations are aligned to at least 64 bytes. Alignment beyond 64 bytes is not supported.
There is also no convoluted coalescing logic to be had for any allocation below ~63KiB. This was done for the sake of simplicity. Beyond 64KiB, the allocator has to make decisions on which slabs to merge together, which is where memory
usage and speed both take a hit.
To allocate 64KiB is to block out up to 128KiB, due to the nature of book-keeping on slab-wide allocations. That may be the weakest point of this allocator, and I'm open to feedback on possible workarounds.
The one upside of over-allocating like this is that if you resize within the same frame of memory that's already been allotted to you, it's virtually a no-op. The allocator has to do a few calculations, and it returns without touching any memory: simple and fast.
Beyond the `HUGE_ALLOCATION_THRESHOLD`, which is 3/4ths of a Superpage by default (1.5MiB), the allocator distributes chunks of at least a superpage in size directly through virtual memory. This is where memory waste becomes less noticeable, as we're no longer dealing with bins or slabs but whole chunks from virtual memory.

Superpages also may waste up to one slab size of memory (64KiB) for the purposes of maintaining alignment, but this space is optionally used if a heap needs more space for its cache. With the current default values, one of these 64KiB blocks is used per 20 superpages allocated to a single thread. So it's about 3% of all virtual memory allocated this way.
The values dictating the sizes of slabs and maximum bins are all configurable through the `ODIN_HEAP_*` defines, so if your application really does need to make binned allocations of 64KiB, or if you find speed improvements by using smaller slabs, it's easy to change.

I chose the default values of 64KiB slabs with a 32KiB max bin size after some microbenchmarking, but it's possible that different values could result in better performance for different scenarios.
To summarize: this allocator does not try to squeeze out every possible byte at every possible juncture, but it does try to be fast as much as possible.
There may be a case to be made for the reduction of fragmentation through slab allocation resulting in less actual memory usage at the end of the day versus a coalescing allocator, but that is probably an application-specific benefit and one I have not thoroughly investigated.
Credits
I hope to demonstrate that the design used in this allocator is not exceedingly novel (and thus, not untested) by pointing out the inspirations for each major feature based upon the literature reviewed. Each feature has been documented and in use in various implementations for over two decades now.
The following points are original ideas; original in the sense that they were realized during development and not drawn from any specific paper, not that they are wholly novel and have never been seen before.
- … (`dirty_bins`) to track which bins are dirty and need zeroing upon re-allocation was an iterative optimization, realized after noticing that the allocator naturally uses the bin with the lowest address possible to keep cache locality, by virtue of `next_free_sector` always being set to the minimum value. An earlier version of the allocator used a bitmap with the same layout of `local_free`.
to track dirty bins.Quotes
The following passage inspired runtime configurable slab size classes.
This passage encouraged attention to optimizing the heuristics for the bitmaps used to track free bins.
jemalloc's author, on the rich history of memory allocator design:
References
Lectures
Design Differences
Of note, jemalloc uses multiple arenas to reduce the issue of allocator-induced false sharing. However, those arenas are shared between active threads. The strategy of giving exclusive access to an arena on a per-thread basis is more similar to Hoard than jemalloc.
With regard to what is called the global heap in the Hoard paper, there is the superpage orphanage in this allocator. They both fulfill similar duties as far as memory reuse. However, in Hoard, superblocks may be moved from per-processor heaps to the global heap, if they cross an emptiness threshold.
In my design, this ownership transfer mechanism is forgone in favor of an overall simplified synchronization process. Superpages do not change ownership until they are either completely empty and ready to be freed or the thread cleanly exits. For a remote thread to be able to decouple a superpage belonging to another thread would require more complicated logic behind the scenes and likely slow down regular single-threaded usage with atomic operations.
This design can result in an apparent memory leak if thread A allocates some number of bytes and thread B frees all of the allocations, but thread A never allocates anything again and never exits, as either event would trigger the merging of its remote frees and the subsequent freeing of its empty superpages.
This is one behavior to be aware of when writing concurrent programs that use this allocator in producer/consumer relationships. In practice however, it should be unusual that a thread accumulates a significant amount of memory that it hands off to another thread to free and never revisits its heap for the duration of the program.
The Name
Most allocators are either named after the author or have a fancy title. PHKmalloc represents the author's initials. Mimalloc presumably means Microsoft Malloc.
If I had to give this allocator design a name, I might call it "the lock-free bitmap slab allocator" after its key features. For the purpose of differentiating this specific implementation of a heap allocator from any others, I think "Feoramalloc" is suitable.
I've used `feoramalloc` in the test bench to differentiate it from `libc`.

Final Thoughts
In closing, I want to say that I hope this allocator can improve the efficiency of programs written in Odin while standing as an example of how to learn about these low-level concepts such as lock-free programming and heap allocators.
Obviously, it won't make all programs magically faster, and if you're already using a custom allocator, then you know your problem space better than a general-purpose allocator could ever guess.

I think this is a significant step towards having an independent runtime. We get consistent behavior across all platforms, as well as the ability to learn very specific information about the heap through the included diagnostics.
This PR is a draft for now, while I hammer out the final details and receive feedback.
Help Requests
I mainly need help with non-Linux/POSIX virtual memory access. I can test this allocator against FreeBSD and NetBSD, but I do not have a Windows or Darwin machine to verify the system-specific code there.
Windows passed the CI tests, so I'm hopeful that it works there. The Darwin tests pass for Intel, but it stalls on the `core` test suite afterwards, so there's something strange going on there. Linux and FreeBSD are working.

While testing, I hit an interesting snag with NetBSD. Its `mmap` syscall requires 7 arguments, but we only support 6, and I haven't been able to figure out what the calling convention for it is. That is to say, is it another register, or is it pushed to the stack, or something else? Could use some help there.

I don't have a plan for what to do for systems that do not expose virtual memory access, since I don't have any experience with those systems. I'm assuming Orca and WASM do not expose a virtual memory subsystem akin to `mmap` or `VirtualAlloc`. I only recently started tinkering with wasm after finding the official example. These are otherwise foreign fields to me, and I'm open to feedback. We could perhaps have a malloc-based fallback allocator in that case.
It would be great to hear about how this allocator impacts real-world application usage, too.
Also interested to hear how this could impact `-no-crt`. I noticed a commit recently about how Linux requires the C runtime to initialize thread local storage. I wasn't aware of that.

API
I'm also looking to hear if anyone has any better ideas about organization or API. This allocator used to live in a package of its own during all of my testing, but I had to merge it into `base` in order to avoid cyclic import issues while making it the default allocator. This resulted in a lot of `heap_` and `HEAP_` prefixing.

The same goes for the `virtual_memory_*` and `get_current_thread_id` procs added to `base:runtime`. If anyone has a feel for how that could be improved, or if they're good as-is, I'd like to hear.

Memory Order Consume
I'm uncertain if the `Consume` memory ordering really means consume. If you check under `Atomic_Memory_Order` in `base:intrinsics`, it has a comment beside `Consume` that says `Monotonic`, which I presume corresponds to this.

Based on the documentation for LLVM's Acquire memory ordering, this is the one that actually corresponds to `memory_order_consume`.

I'm leaning towards thinking my usage of `Consume` should actually be replaced with `Acquire`, based on this, but I've left the memory order as-is for now until someone else can review and comment about it. It's no problem to use a stronger order, but if we can get away with a weaker one and preserve the semantics, all the better.

I base most of my understanding of memory ordering on Herb Sutter's talks, referenced above, which I highly recommend to anyone interested in this subject.