
StringZilla 4.0! #201


Open · wants to merge 444 commits into main

Conversation

@ashvardanian (Owner) commented Dec 7, 2024

This PR entirely refactors the codebase and splits the single-header implementation into multiple headers. Moreover, it brings faster kernels for:

  • Sorting of string sequences and pointer-sized integers,
  • Levenshtein edit distances for DNA alignment and UTF-8 fuzzy matching,
  • Needleman-Wunsch pairwise global alignment for proteins,
  • Multi-pattern search & feature-extraction on GPUs,
  • AES-based portable general-purpose hashing functions,

The release also includes many more community contributions.

Why Split the Files? Matching SimSIMD Design

Sadly, most modern software-development tooling stinks. VS Code is just as slow and unresponsive as the older Atom and other web-based editors, while LSP implementations for C++ are equally slow and completely mess up code highlighting for files over 5,000 lines of code (LOC). So, I've unbundled the single-header solution into multiple headers, similar to SimSIMD.

Also, similar to SimSIMD, CPU feature detection has been reworked to dispatch between separate backends: serial, Haswell, Skylake, Ice Lake, NEON, and SVE.
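
Conceptually, the dispatch looks something like the sketch below. This is only an illustration, not StringZilla's actual code: the backend names and stand-in kernels are hypothetical, and `__builtin_cpu_supports` is a GCC/Clang extension used here just to show how a backend could be resolved at runtime.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical signature of a byte-level substring-search kernel.
using find_t = void const *(*)(char const *haystack, std::size_t haystack_length,
                               char const *needle, std::size_t needle_length);

// Stand-in backends: in the real library each ISA-specific implementation
// (serial, Haswell, Skylake, Ice Lake, NEON, SVE) lives in its own header.
static void const *find_serial(char const *h, std::size_t h_len, char const *n, std::size_t n_len) {
    for (std::size_t i = 0; n_len && i + n_len <= h_len; ++i)
        if (std::memcmp(h + i, n, n_len) == 0) return h + i;
    return nullptr;
}
static void const *find_haswell(char const *h, std::size_t h_len, char const *n, std::size_t n_len) {
    return find_serial(h, h_len, n, n_len); // a real backend would use AVX2 here
}
static void const *find_skylake(char const *h, std::size_t h_len, char const *n, std::size_t n_len) {
    return find_serial(h, h_len, n, n_len); // a real backend would use AVX-512 here
}

// Resolve the best available backend once, at startup or on first use.
static find_t resolve_find() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512bw")) return &find_skylake;
    if (__builtin_cpu_supports("avx2")) return &find_haswell;
#endif
    return &find_serial; // portable fallback
}

int main() {
    find_t find = resolve_find();
    char const haystack[] = "needle in a haystack", needle[] = "haystack";
    void const *match = find(haystack, std::strlen(haystack), needle, std::strlen(needle));
    std::printf("found at offset %td\n", (char const *)match - haystack);
}
```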

Faster Sequence Alignment & Scoring on GPUs

Faster Sorting

Our old algorithm didn't perform any memory allocations and tried to fit too much into the provided buffers. A new, breaking API change allows passing a memory allocator, making the implementation more flexible. It now works fine on 32-bit systems as well.

The new serial algorithm is often 5x faster than the C++ Standard Template Library's std::sort for a vector of strings. It's also often 10x faster than qsort_r in the GNU C library. There are even faster versions available for Ice Lake CPUs with AVX-512 and Arm CPUs with SVE.
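
For context, these are the two baselines the comparison refers to. StringZilla's own entry point, which now accepts a user-provided allocator, is not reproduced here; the snippet only sketches the STL and glibc sides of the comparison.

```cpp
#include <algorithm>
#include <cstdlib> // qsort_r is a GNU extension; g++ defines _GNU_SOURCE by default on glibc
#include <cstring>
#include <string>
#include <vector>

// 1. C++ STL baseline: lexicographic sort of a vector of strings.
void sort_with_stl(std::vector<std::string> &strings) {
    std::sort(strings.begin(), strings.end());
}

// 2. glibc baseline: qsort_r over an array of C-string pointers.
//    Note glibc's comparator signature; BSD's qsort_r differs.
static int compare_cstrings(void const *a, void const *b, void *) {
    return std::strcmp(*(char const *const *)a, *(char const *const *)b);
}
void sort_with_qsort_r(std::vector<char const *> &strings) {
    qsort_r(strings.data(), strings.size(), sizeof(char const *), &compare_cstrings, nullptr);
}
```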

Faster Hashing Algorithms

Our old algorithm was a variation of the Karp-Rabin hash and was designed more for rolling-hash workloads. Sadly, such hashing schemes don't pass SMHasher and similar hash-testing suites, and a better solution was needed. For years I've been contemplating designing a good general-purpose hash function based on AES instructions, which have been implemented in hardware for several CPU generations now. As discussed with @jandrewrogers, and as can be seen in his AquaHash project, those instructions provide an almost unmatched amount of mixing logic per CPU cycle of latency.

Many popular hash libraries, like AHash in the Rust ecosystem, cleverly combine such AES instructions with 8-bit shuffles and 64-bit additions, but rarely harness the full power of the CPU due to the constraints of Rust tooling and the complexity of using masked x86 AVX-512 and predicated Arm SVE2 instructions.
StringZilla does that and ticks a few more boxes:

  • Outputs 64-bit hashes and passes the SMHasher --extra tests.
  • Is fast for both short strings (velocity) and long strings (throughput).
  • Supports incremental (streaming) hashing, when the data arrives in chunks.
  • Supports custom seeds and has them affect every bit of the output.
  • Provides dynamic dispatch for different architectures to simplify deployment.
  • Documents its logic and guarantees the same output across different platforms.

Implementing this logic, which provides fast, high-quality hashes and can often compute 4 hashes simultaneously, made these kernels handy not only for hashing itself, but also for higher-level operations like database-style hash joins and set intersections, as well as advanced sequence-alignment algorithms for bioinformatics.
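
To make the "mixing logic per cycle" argument concrete, here is a toy sketch of AES-based mixing, assuming x86 AES-NI intrinsics (compiled with -maes). It is not StringZilla's actual construction: a single `aesenc` round applies SubBytes, ShiftRows, MixColumns, and AddRoundKey to a 128-bit lane, which is why one instruction buys so much diffusion.

```cpp
#include <immintrin.h> // AES-NI intrinsics; x86-64 only, compile with -maes
#include <cstdint>
#include <cstring>

// One illustrative mixing step: absorb 16 input bytes, then run one AES round.
inline __m128i mix_block(__m128i state, __m128i block) {
    state = _mm_xor_si128(state, block);    // absorb the input block
    state = _mm_aesenc_si128(state, state); // one AES round as the mixing step
    return state;
}

// Toy streaming loop over whole 16-byte blocks of the input.
inline std::uint64_t toy_aes_hash(char const *data, std::size_t length, std::uint64_t seed) {
    __m128i state = _mm_set1_epi64x((long long)seed);
    for (; length >= 16; data += 16, length -= 16) {
        __m128i block;
        std::memcpy(&block, data, 16); // unaligned-safe load
        state = mix_block(state, block);
    }
    // A real hash would also absorb the tail bytes and the total length before finalizing.
    state = _mm_aesenc_si128(state, _mm_set1_epi64x((long long)length));
    return (std::uint64_t)_mm_cvtsi128_si64(state);
}
```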

Multi-Pattern Search

Next Steps

  • Ditch constant memory in CUDA. The original design loaded the substitution tables for NW and SW into shared memory, until a simpler design with constant memory was accepted. That proved to be a mistake: the performance numbers are way lower than expected.
  • Use Distributed Shared Memory for 10x larger NW and SW inputs. That would require additional variants of _linear_score_on_each_cuda_warp and _affine_score_on_each_cuda_warp that scale to 16 blocks with (228 - 1) KB of shared memory each, totaling ~3.5 MB of shared memory usable for alignments. With u32 scores and 3 DP diagonals (in the case of non-affine gaps), that should significantly accelerate the alignment of ~300K-long strings. But keep in mind that cluster.sync() is still very expensive - around 1300 cycles - only 40% less than the 2200 cycles of grid.sync().
  • Add asynchronous host callbacks for updating the addressable memory region in NW and SW.

@ashvardanian changed the title from "4.0!" to "StringZilla 4.0!" on Dec 7, 2024
In the past, token benchmarks weren't balanced. For equality comparisons and ordering, they would take random strings, which almost always differ in the very first character and in length, making branch prediction trivial and performance identical between backends.

The new benchmarks include self-comparisons, which are closer to hash-table probing or string-sorting workloads. This leads to doubling the performance on mixed workloads that include self-comparisons, where both comparison arguments are the same. The initial version only reimplements the substring and byteset search benchmarks.
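
A hedged sketch of what such a pairing could look like: the helper below is hypothetical and only illustrates mixing random-token pairs with a configurable share of self-comparisons.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Build index pairs for a comparison benchmark. Alongside pairs of distinct random
// tokens (which almost always differ in the first byte or in length), a fraction of
// pairs compares a token to itself, like a hash-table probe that hits.
std::vector<std::pair<std::size_t, std::size_t>> make_comparison_pairs(
    std::vector<std::string> const &tokens, std::size_t count, double self_fraction) {
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, tokens.size() - 1);
    std::bernoulli_distribution is_self(self_fraction);
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    pairs.reserve(count);
    for (std::size_t i = 0; i != count; ++i) {
        std::size_t first = pick(rng);
        std::size_t second = is_self(rng) ? first : pick(rng); // self-comparison branch
        pairs.emplace_back(first, second);
    }
    return pairs;
}
```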