Describe the project you are working on
Godot core.
Describe the problem or limitation you are having in your project
Godot currently has 5 hashing containers:

- `HashMap`
- `HashSet` [1]
- `AHashMap` ("Array hash map")
- `OAHashMap` ("Open Addressing hash map")
- `StringName` (singleton)
Maintaining multiple containers for similar purposes is a maintenance burden: bug fixes and optimizations have to be applied to each one separately, and using different containers reduces consistency across the code base.
Obviously, we can't just blindly remove any of them. We need to understand the relevant problems and weigh the trade-offs, or make practical decisions. That way, we should be able to consolidate implementations, and may even be able to confidently merge or remove container classes.
Let's get into each problem individually.
Probing: Most implementations use open addressing with linear probing, which is a popular strategy in the industry as well [3]. But this also means we have 4 separate, mostly identical implementations of it. `StringName` is different because its strategy relies on keeping occupancy low, which works out well for its use-case as a singleton hash map.
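For reference, the probing loop that these implementations repeat in some form looks roughly like this. This is an illustrative sketch, not Godot's actual code, and it assumes the common convention of reserving hash 0 as the empty-bucket marker:

```cpp
#include <cstdint>

// Hypothetical helper. Assumes hash 0 marks an empty bucket and the table
// is never completely full (guaranteed by a load factor below 1).
uint32_t find_pos(const uint32_t *hashes, uint32_t capacity, uint32_t hash) {
	uint32_t pos = hash % capacity; // Map the hash to a starting bucket.
	while (hashes[pos] != 0) { // Probe until we hit an empty bucket.
		if (hashes[pos] == hash) {
			return pos; // Probable hit; the caller still compares keys.
		}
		pos = (pos + 1) % capacity; // Linear probing: try the next bucket.
	}
	return pos; // First empty bucket; the key is not present.
}
```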
Growth: There are two contenders for growth: primes and powers of 2. Primes provide an intrinsic benefit because they can compensate for bad hash functions [4]. This is not as important for linear probing hash maps, because buckets naturally spread out. On the other hand, power-of-2 growth allows you to use binary masking to map hashes to indices. This is much faster than `modulo` and `fastmod`, which are bottlenecks in hash maps. If the overhead from modulo can be avoided, prime growth is the best candidate, but power-of-2 seems to be more popular in the industry.
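The difference between the two index-mapping strategies, as a sketch (function names are mine, for illustration):

```cpp
#include <cstdint>

// Prime capacity: every probe pays for a division-based modulo
// (or a fastmod with precomputed magic constants).
uint32_t index_prime(uint32_t hash, uint32_t prime_capacity) {
	return hash % prime_capacity;
}

// Power-of-2 capacity: a single bitwise AND against capacity - 1.
uint32_t index_pow2(uint32_t hash, uint32_t capacity) {
	return hash & (capacity - 1); // Valid only when capacity is a power of 2.
}
```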
Data Storage: There is one primary trade-off here. Dense `KeyValue` arrays are the best choice, allocation- and iteration-wise, when all elements are the same size and pointer stability need not be guaranteed. Otherwise, it is advisable to store hashes linearly, but store insertion order as a linked list inside the object to be stored. `AHashMap` and `HashSet` use dense arrays and are therefore optimal for owned data. `HashMap` uses `KeyValue` allocations and is therefore not optimal. `OAHashMap` uses sparse arrays and is therefore not optimal. There is no implementation yet for unowned storage (which would be ideal for `Node` children).
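To make the trade-off concrete, here is a sketch of the two layouts (type names are hypothetical, not Godot's actual declarations):

```cpp
#include <cstdint>

// Owned, dense storage (the AHashMap/HashSet approach): hashes and elements
// live in flat arrays; element pointers are invalidated when the map grows.
template <typename TElement>
struct DenseStorage {
	uint32_t *hashes = nullptr;   // Stored linearly, one per bucket.
	TElement *elements = nullptr; // Packed array, cheap to iterate.
};

// Unowned storage: insertion order lives inside the stored object itself,
// so element addresses stay stable and the map only tracks hashes/pointers.
struct SelfListedElement {
	SelfListedElement *next = nullptr;
	SelfListedElement *prev = nullptr;
};
```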
Iteration Order: This is not an intrinsic property of hash maps, and should not be part of their implementation. `HashMap` implementing it complicates the data type, and makes it unsuitable for use-cases where insertion order need not be preserved. `AHashMap` cannot be used when insertion order should be preserved (although it would be trivial to change this with a template parameter). `OAHashMap` iterates the sparse array in a brute-force way. This is not optimal compared to the dense strategy.
Optimal Implementation [2]: All container implementations show some degree of optimizability. I've recently optimized `HashMap` without changing semantics, so it is now again competitive against `AHashMap` and `OAHashMap`. Previous implementers have, at best, measured performance in benchmarks against existing implementations. If they found their version to be faster, they added it, rather than examining why it was more performant than the others. This has led to the confusing mess we're in right now. New collections should not be added without comparing optimal implementations, or at least fully replacing the old ones, as measurements are often more sensitive to suboptimal implementations than to well-chosen strategies.
Describe the feature / enhancement and how it helps to overcome the problem or limitation
I propose to consolidate decisions made for hashing implementations.
In particular, all problems that have an optimal solution should be decided the same for all containers. Problems that have practical trade-offs warrant more than one implementation.
Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams
From my examinations, I propose to make the following decisions:
Probing: We can keep using linear probing. We should make a single 'linear probing' implementation and use it in all containers.
Growth: It should be examined whether it is possible to reduce modulo-like overhead further. Currently, we use `fastmod` and `if`-based mod, which are substantially faster than a traditional modulo. Caching modulo values next to hashes may lead to better performance, at the cost of 4 bytes per entry. If the overhead can be reduced sufficiently, prime-based capacity is warranted. On the other hand, we own all hashing functions and know them to be of high quality. Plus, we use linear probing. Based on these factors, a simple power-of-2 growth strategy may be better. I cannot make a final recommendation as of now.
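For context, the `if`-based mod amounts to something like this (illustrative, not Godot's exact code): after a +1 probing step, the index can exceed the capacity by at most one, so a compare-and-reset replaces the division.

```cpp
#include <cstdint>

uint32_t next_pos(uint32_t pos, uint32_t capacity) {
	pos += 1;
	if (pos == capacity) { // Cheaper than `pos % capacity` on most CPUs.
		pos = 0; // Wrap around to the first bucket.
	}
	return pos;
}
```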
Data Storage: This is a true trade-off. On the one hand, we have hash maps like `HashMap<Variant, Variant>` (e.g. for `Dictionary`), i.e. small key-values, which are good contenders for array-based storage, because we don't want to make allocations for every `Variant` entry. On the other hand, we have `HashMap<StringName, Node *>` (for `Node` children), whose entries are large, allocated data. For these, an unowned data strategy is preferable. Still, it's possible that this decision is best made by the user of the `HashMap`, so we may not need two implementations.
Iteration Order: Insertion-order-preservation logic should be moved outside `HashMap`. Self-owning data types (like `Node`) should keep their own records. Owned data types (like `int`) should use a `HashMap` wrapper type.
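A rough sketch of such a wrapper, built on existing Godot types (`OrderedHashMap` is a hypothetical name; a real version would also need `erase`, which has to fix up the stored indices):

```cpp
#include "core/templates/a_hash_map.h"
#include "core/templates/local_vector.h"
#include "core/templates/pair.h"

// Lookups go through the unordered map; iteration walks `entries`,
// which is dense and insertion-ordered.
template <typename TKey, typename TValue>
class OrderedHashMap {
	AHashMap<TKey, uint32_t> index_map;          // Key -> index into entries.
	LocalVector<KeyValue<TKey, TValue>> entries; // Dense, insertion-ordered.

public:
	void insert(const TKey &p_key, const TValue &p_value) {
		// Sketch only: assumes p_key is not already present.
		index_map.insert(p_key, entries.size());
		entries.push_back(KeyValue<TKey, TValue>(p_key, p_value));
	}

	const KeyValue<TKey, TValue> *begin() const { return entries.ptr(); }
	const KeyValue<TKey, TValue> *end() const { return entries.ptr() + entries.size(); }
};
```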
Optimal Implementation: `HashMap` and `HashSet` are important in the codebase, so short-term it's important to optimize these. Still, we won't be able to avoid refactoring all containers mid-term to improve and consolidate implementations. I believe it's possible to do this in small steps. I will propose these steps as we move forward with this proposal.
Summing Up
I estimate that the most important optimization to make is to get rid of `HashMap`'s `KeyValue` allocations. It is unlikely we can modify `HashMap`'s interfaces in place, because there are over 2500 users of it. For that reason, I believe we need to phase it out slowly.
Users that want the `HashMap` to own the data (with no pointer stability required) should use `AHashMap` instead (but we'll need another round of benchmarks). Before this can happen though, `AHashMap`'s interfaces should be cleaned up. In particular, it is important to stop it from storing `KeyValue` explicitly and let it store a given type instead, as some users may want to store the keys in custom formats. In addition, I'd like to propose changing the interface to be `Entry`-lookup based (like in Rust). This allows for easier use and better optimization. I will likely work towards this soon. I'll describe this interface in more detail when it's relevant.
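To show the pattern in isolation, here is a minimal Entry-style sketch over `std::unordered_map` (the names `Entry`, `entry`, and `or_insert` are mine, echoing Rust's API; the eventual `AHashMap` interface would look different). The key is hashed once, and the same entry handle serves both the existence check and the write:

```cpp
#include <string>
#include <unordered_map>

// Illustrative Entry handle: refers to a slot that is either occupied or
// freshly created, so no second lookup is needed to write to it.
template <typename K, typename V>
struct Entry {
	std::pair<const K, V> &kv;
	bool was_vacant;

	V &or_insert(const V &p_default) {
		if (was_vacant) {
			kv.second = p_default; // Fill the freshly created slot.
		}
		return kv.second;
	}
};

template <typename K, typename V>
Entry<K, V> entry(std::unordered_map<K, V> &map, const K &key) {
	auto [it, inserted] = map.try_emplace(key); // Single hash lookup.
	return Entry<K, V>{ *it, inserted };
}

int main() {
	std::unordered_map<std::string, int> counts;
	// Count a word without the usual find-then-insert double lookup.
	entry(counts, std::string("hello")).or_insert(0) += 1;
}
```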
We could further experiment with an "interspersed data" hash table type, which puts small data (like pointers) next to the hashes in order to reduce cache misses on lookup (`AHashMap` puts them into a separate array instead). This type of structure may be ideal for data not owned by the hash map, like `Node *`.
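Sketched with a hypothetical bucket type:

```cpp
#include <cstdint>

// Interspersed layout: the payload sits right next to its hash, so a probe
// and the subsequent read touch the same cache line. Best for small payloads.
template <typename TValue> // Intended for small TValue, e.g. Node *.
struct Bucket {
	uint32_t hash = 0; // 0 marks an empty bucket.
	TValue value = {}; // Stored inline, not in a separate array.
};
// The whole table is then a single flat allocation: Bucket<Node *> *buckets.
```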
`OAHashMap` is now obsolete, since all other hash table implementations also use open addressing. In particular, `AHashMap` makes the same decisions, but is a better implementation. It should be possible to remove `OAHashMap` in favour of `AHashMap` [5].
It is unlikely that preserving insertion order is needed by most hash map users. It can be extracted into an "insertion order preserving" wrapper type. In the case of `Node` children, `next` and `previous` pointers are probably best stored in the `Node` itself, in order to avoid hash table allocations when nodes are moved, and to optimize iteration.
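Sketched as intrusive sibling links (hypothetical field names; `Node`'s real internals differ):

```cpp
// Reordering or re-parenting a child then only relinks these pointers;
// the name-lookup hash table itself never reallocates.
class Node {
	// ... existing Node members ...
	Node *prev_sibling = nullptr; // Intrusive, insertion-ordered sibling list.
	Node *next_sibling = nullptr;
};
```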
If this enhancement will not be used often, can it be worked around with a few lines of script?
No.
Is there a reason why this should be core and not an add-on in the asset library?
It's core.
Footnotes
1. `HashSet` does not feel like a hash map, but most implementations boil down to hash maps with empty values (or rather, most optimal `HashMap` implementations boil down to `HashSet` implementations with in-place keys). So we should be able to consolidate decisions here, too.
2. By "Optimal Implementation" I mean that the chosen strategy is optimally implemented, not that the strategy is optimal.
3. Notably, for simple implementations. Extremely well-optimized implementations, such as abseil's `flat_hash_map`, make use of SIMD to compare multiple keys at once. Looking at its implementation, it's unlikely we'll want to adopt it (or re-implement it), given the potential maintenance burden. Notably, user nazarwadim has attempted this, but the implementation is complex, as expected. We may want to revisit this idea in the future, when more pertinent improvements have been taken care of.
4. Apparently discussed in https://www.taylorfrancis.com/books/mono/10.1201/9781420035179/handbook-data-structures-applications-sartaj-sahni-dinesh-mehta-dinesh-mehta, though I did not read it. But the explanation is simple: if your hash function always produces, say, hashes divisible by 2, using a prime capacity ensures that all buckets are used (because of the wraparound).
5. `AHashMap` implementer nazarwadim claimed that `OAHashMap` is preferable to `AHashMap` for small, trivial key-values. From a glance at the implementations, I don't think this is true. We'll need extensive benchmarks either way to warrant removing `OAHashMap` altogether.