
StringZilla 4.0! #201


Open · wants to merge 444 commits into main

Conversation

@ashvardanian (Owner) commented Dec 7, 2024

This PR entirely refactors the codebase and splits the single-header implementation into multiple headers. Moreover, it brings faster kernels for:

  • Sorting of string sequences and pointer-sized integers,
  • Levenshtein edit distances for DNA alignment and UTF-8 fuzzy matching,
  • Needleman-Wunsch pairwise global alignment for proteins,
  • Multi-pattern search & feature-extraction on GPUs,
  • AES-based portable general-purpose hashing functions,

The release also includes many more community contributions.

Why Split the Files? Matching SimSIMD Design

Sadly, most modern software-development tooling stinks. VS Code is just as slow and unresponsive as the older Atom and other web-based editors, while LSP implementations for C++ are equally slow and completely mess up code highlighting for files over 5,000 lines of code (LOC). So, I've unbundled the single-header solution into multiple headers, similar to SimSIMD.

Also, similar to SimSIMD, CPU feature detection has been reworked to dispatch between separate backends: serial, Haswell, Skylake, Ice Lake, NEON, and SVE.
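
Conceptually, the dispatch looks something like the sketch below. This is only an illustration, not StringZilla's actual code: the backend names and stand-in kernels are hypothetical, and `__builtin_cpu_supports` is a GCC/Clang extension used here just to show how a backend could be resolved at runtime.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical signature of a byte-level substring-search kernel.
using find_t = void const *(*)(char const *haystack, std::size_t haystack_length,
                               char const *needle, std::size_t needle_length);

// Stand-in backends: in the real library each ISA-specific implementation
// (serial, Haswell, Skylake, Ice Lake, NEON, SVE) lives in its own header.
static void const *find_serial(char const *h, std::size_t h_len, char const *n, std::size_t n_len) {
    for (std::size_t i = 0; n_len && i + n_len <= h_len; ++i)
        if (std::memcmp(h + i, n, n_len) == 0) return h + i;
    return nullptr;
}
static void const *find_haswell(char const *h, std::size_t h_len, char const *n, std::size_t n_len) {
    return find_serial(h, h_len, n, n_len); // a real backend would use AVX2 here
}
static void const *find_skylake(char const *h, std::size_t h_len, char const *n, std::size_t n_len) {
    return find_serial(h, h_len, n, n_len); // a real backend would use AVX-512 here
}

// Resolve the best available backend once, at startup or on first use.
static find_t resolve_find() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512bw")) return &find_skylake;
    if (__builtin_cpu_supports("avx2")) return &find_haswell;
#endif
    return &find_serial; // portable fallback
}

int main() {
    find_t find = resolve_find();
    char const haystack[] = "needle in a haystack", needle[] = "haystack";
    void const *match = find(haystack, std::strlen(haystack), needle, std::strlen(needle));
    std::printf("found at offset %td\n", (char const *)match - haystack);
}
```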

Faster Sequence Alignment & Scoring on GPUs

Faster Sorting

Our old algorithm didn't perform any memory allocations and tried to fit too much into the provided buffers. A new, breaking API change allows passing a memory allocator, making the implementation more flexible. It now works fine on 32-bit systems as well.

The new serial algorithm is often 5x faster than the C++ Standard Template Library's std::sort for a vector of strings. It's also often 10x faster than qsort_r in the GNU C library. There are even faster versions available for Ice Lake CPUs with AVX-512 and Arm CPUs with SVE.
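
For context, these are the two baselines the comparison refers to. StringZilla's own entry point, which now accepts a user-provided allocator, is not reproduced here; the snippet only sketches the STL and glibc sides of the comparison.

```cpp
#include <algorithm>
#include <cstdlib> // qsort_r is a GNU extension; g++ defines _GNU_SOURCE by default on glibc
#include <cstring>
#include <string>
#include <vector>

// 1. C++ STL baseline: lexicographic sort of a vector of strings.
void sort_with_stl(std::vector<std::string> &strings) {
    std::sort(strings.begin(), strings.end());
}

// 2. glibc baseline: qsort_r over an array of C-string pointers.
//    Note glibc's comparator signature; BSD's qsort_r differs.
static int compare_cstrings(void const *a, void const *b, void *) {
    return std::strcmp(*(char const *const *)a, *(char const *const *)b);
}
void sort_with_qsort_r(std::vector<char const *> &strings) {
    qsort_r(strings.data(), strings.size(), sizeof(char const *), &compare_cstrings, nullptr);
}
```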

Faster Hashing Algorithms

Our old algorithm was a variation of the Karp-Rabin hash and was designed more for rolling-hash workloads. Sadly, such hashing schemes don't pass SMHasher and similar hash-testing suites, and a better solution was needed. For years I've been contemplating designing a good general-purpose hash function based on AES instructions, which have been implemented in hardware for several CPU generations now. As discussed with @jandrewrogers, and as can be seen in his AquaHash project, those instructions provide an almost unmatched amount of mixing logic per CPU cycle of latency.

Many popular hash libraries, like AHash in the Rust ecosystem, cleverly combine such AES instructions with 8-bit shuffles and 64-bit additions, but rarely harness the full power of the CPU due to the constraints of Rust tooling and the complexity of using masked x86 AVX-512 and predicated Arm SVE2 instructions.
StringZilla does that and ticks a few more boxes:

  • Outputs 64-bit hashes and passes the SMHasher --extra tests.
  • Is fast for both short strings (velocity) and long strings (throughput).
  • Supports incremental (streaming) hashing, when the data arrives in chunks.
  • Supports custom seeds and has them affect every bit of the output.
  • Provides dynamic dispatch for different architectures to simplify deployment.
  • Documents its logic and guarantees the same output across different platforms.

Implementing this logic, which provides fast, high-quality hashes and can often compute 4 hashes simultaneously, made these kernels handy not only for hashing itself, but also for higher-level operations like database-style hash joins and set intersections, as well as advanced sequence-alignment algorithms for bioinformatics.
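
To make the "mixing logic per cycle" argument concrete, here is a toy sketch of AES-based mixing, assuming x86 AES-NI intrinsics (compiled with -maes). It is not StringZilla's actual construction: a single `aesenc` round applies SubBytes, ShiftRows, MixColumns, and AddRoundKey to a 128-bit lane, which is why one instruction buys so much diffusion.

```cpp
#include <immintrin.h> // AES-NI intrinsics; x86-64 only, compile with -maes
#include <cstdint>
#include <cstring>

// One illustrative mixing step: absorb 16 input bytes, then run one AES round.
inline __m128i mix_block(__m128i state, __m128i block) {
    state = _mm_xor_si128(state, block);    // absorb the input block
    state = _mm_aesenc_si128(state, state); // one AES round as the mixing step
    return state;
}

// Toy streaming loop over whole 16-byte blocks of the input.
inline std::uint64_t toy_aes_hash(char const *data, std::size_t length, std::uint64_t seed) {
    __m128i state = _mm_set1_epi64x((long long)seed);
    for (; length >= 16; data += 16, length -= 16) {
        __m128i block;
        std::memcpy(&block, data, 16); // unaligned-safe load
        state = mix_block(state, block);
    }
    // A real hash would also absorb the tail bytes and the total length before finalizing.
    state = _mm_aesenc_si128(state, _mm_set1_epi64x((long long)length));
    return (std::uint64_t)_mm_cvtsi128_si64(state);
}
```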

Multi-Pattern Search

Next Steps

  • Ditch constant memory in CUDA. The original design loaded the substitution tables for NW and SW into shared memory, until a simpler design with constant memory was accepted. That proved to be a mistake: the performance numbers are way lower than expected.
  • Use Distributed Shared Memory for 10x larger NW and SW inputs. That would require additional variants of _linear_score_on_each_cuda_warp and _affine_score_on_each_cuda_warp that scale to 16 blocks with (228 - 1) KB of shared memory each, totaling ~3.5 MB of shared memory usable for alignments. With u32 scores and 3 DP diagonals (in the case of non-affine gaps), that should significantly accelerate the alignment of ~300K-long strings. But keep in mind that cluster.sync() is still very expensive - around 1300 cycles - only 40% less than the 2200 cycles of grid.sync().
  • Add asynchronous host callbacks for updating the addressable memory region in NW and SW.

@ashvardanian changed the title from "4.0!" to "StringZilla 4.0!" on Dec 7, 2024
In the past, token benchmarks weren't balanced. For equality comparisons and ordering, they would take random strings, which almost always differ in the very first character and in length, making branch prediction trivial and performance identical between backends.

The new benchmarks include self-comparisons, which are closer to hash-table probing or string-sorting workloads. This leads to doubling the performance on mixed workloads that include self-comparisons, where both comparison arguments are the same. The initial version only reimplements the substring and byteset search benchmarks.
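
A hedged sketch of what such a pairing could look like: the helper below is hypothetical and only illustrates mixing random-token pairs with a configurable share of self-comparisons.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Build index pairs for a comparison benchmark. Alongside pairs of distinct random
// tokens (which almost always differ in the first byte or in length), a fraction of
// pairs compares a token to itself, like a hash-table probe that hits.
std::vector<std::pair<std::size_t, std::size_t>> make_comparison_pairs(
    std::vector<std::string> const &tokens, std::size_t count, double self_fraction) {
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, tokens.size() - 1);
    std::bernoulli_distribution is_self(self_fraction);
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    pairs.reserve(count);
    for (std::size_t i = 0; i != count; ++i) {
        std::size_t first = pick(rng);
        std::size_t second = is_self(rng) ? first : pick(rng); // self-comparison branch
        pairs.emplace_back(first, second);
    }
    return pairs;
}
```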