Add experimental SIMD tokenizer #1934

benfdking · 2025-09-05T18:30:17Z

Summary

- Add a `simd-tokenizer` feature to `sqruff-lib-core` and gate `portable_simd` usage
- Wire `Lexer` to tokenize with `SimdTokenizer` when the feature is enabled
- Expose the SIMD tokenizer module conditionally
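
For readers unfamiliar with Cargo feature gating, the shape of such a change can be sketched roughly as below. This is only an illustration, not the code in this PR: the module bodies, the `lex` function, and the `scalar_tokenizer` name are hypothetical stand-ins, and the sketch assumes the feature is declared as `simd-tokenizer` in `Cargo.toml`.

```rust
// Assumption: in a real crate the nightly API would be enabled at the crate
// root only when the feature is on, for example:
// #![cfg_attr(feature = "simd-tokenizer", feature(portable_simd))]

/// Scalar path, always compiled (hypothetical stand-in for the existing lexer).
mod scalar_tokenizer {
    pub fn tokenize(src: &str) -> Vec<&str> {
        src.split_whitespace().collect()
    }
}

/// Only compiled when the `simd-tokenizer` feature is enabled.
#[cfg(feature = "simd-tokenizer")]
mod simd_tokenizer {
    pub fn tokenize(src: &str) -> Vec<&str> {
        // Placeholder body: a real implementation would use std::simd here.
        src.split_whitespace().collect()
    }
}

/// Entry point that dispatches to the SIMD path when the feature is enabled.
#[cfg(feature = "simd-tokenizer")]
pub fn lex(src: &str) -> Vec<&str> {
    simd_tokenizer::tokenize(src)
}

/// Entry point used when the feature is disabled.
#[cfg(not(feature = "simd-tokenizer"))]
pub fn lex(src: &str) -> Vec<&str> {
    scalar_tokenizer::tokenize(src)
}
```

Keeping both entry points behind the same signature means callers do not need their own `cfg` checks, and the two test commands under Testing exercise both paths.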

Testing

cargo test -p sqruff-lib-core
cargo test -p sqruff-lib-core --features simd-tokenizer

https://chatgpt.com/codex/tasks/task_e_68bb26bf18688330a6be5fbb04e56cc5

chatgpt-codex-connector

Codex Review: Here are some suggestions.

Reply with @codex fix comments to fix any unresolved comments.

chatgpt-codex-connector · 2025-09-05T18:32:55Z

crates/lib-core/src/parser/simd_tokenizer.rs

+}
+
+const CHUNK: usize = 64;
+
+fn is_ident_start(b: u8) -> bool {
+    b.is_ascii_alphabetic() || b == b'_' || b >= 0x80
+}
+
+fn skip_whitespace(bytes: &[u8]) -> usize {
+    let mut i = 0;
+    while i + CHUNK <= bytes.len() {


[P1] Avoid infinite loop on non‑ASCII identifiers

When the SIMD tokenizer encounters a byte ≥ 0x80 it treats it as a valid identifier start, but take_identifier only advances over ASCII letters/digits/underscores. For non‑ASCII input len becomes 0 and pos += len never advances, so while pos < bytes.len() spins forever as soon as the source contains any UTF‑8 characters beyond ASCII. Either reject non‑ASCII in is_ident_start or teach take_identifier to advance over those bytes.
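
One way to apply the second option (teaching `take_identifier` to advance over those bytes) is to derive the continuation predicate from the start predicate, so that a successful start check always consumes at least one byte. The sketch below is an assumption about how that could look; `is_ident_continue` is a hypothetical helper and this `take_identifier` is a stand-in, not the function from the PR.

```rust
/// A byte that may begin an identifier: ASCII letter, underscore, or any
/// non-ASCII byte (>= 0x80), mirroring the predicate shown in the diff.
fn is_ident_start(b: u8) -> bool {
    b.is_ascii_alphabetic() || b == b'_' || b >= 0x80
}

/// Hypothetical helper: accepts everything `is_ident_start` accepts plus
/// ASCII digits, so the two predicates can never disagree on a first byte.
fn is_ident_continue(b: u8) -> bool {
    is_ident_start(b) || b.is_ascii_digit()
}

/// Stand-in for `take_identifier`: returns the byte length of the identifier
/// at the start of `bytes`, or 0 if the first byte cannot start one.
fn take_identifier(bytes: &[u8]) -> usize {
    match bytes.first() {
        Some(&b) if is_ident_start(b) => {}
        _ => return 0,
    }
    // Every byte of a multi-byte UTF-8 sequence is >= 0x80, so the sequence
    // is consumed whole and the result ends on a character boundary.
    bytes.iter().take_while(|&&b| is_ident_continue(b)).count()
}
```

With this shape, `take_identifier` returns a non-zero length whenever `is_ident_start` is true for the first byte, which is the property the outer `pos += len` loop relies on.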

cursor · 2025-09-05T18:34:13Z

crates/lib-core/src/parser/lexer.rs

+
+                str_buff = resort_res.forward_string;
+                element_buffer.append(&mut resort_res.elements);
+            }


Bug: Duplicate EndOfFile Segments in SIMD Path

The SIMD tokenizer adds an EndOfFile token, leading to duplicate EndOfFile segments. The elements_to_segments method also unconditionally adds an EndOfFile segment, resulting in two when the SIMD tokenizer is enabled, unlike the single segment from the non-SIMD path.

Additional Locations (1)

crates/lib-core/src/parser/simd_tokenizer.rs#L88-L93
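
A common way to avoid this kind of duplication is to give exactly one stage ownership of the end-of-file marker. The sketch below assumes the later stage keeps that responsibility; the `Token` enum and both functions are hypothetical simplifications, not sqruff's actual types.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    EndOfFile,
}

/// Hypothetical tokenizer stage: emits content tokens only and deliberately
/// leaves the end-of-file marker to the later stage.
fn tokenize(src: &str) -> Vec<Token> {
    src.split_whitespace()
        .map(|w| Token::Word(w.to_string()))
        .collect()
}

/// Hypothetical stand-in for `elements_to_segments`: the single place where
/// the end-of-file marker is appended.
fn elements_to_segments(mut tokens: Vec<Token>) -> Vec<Token> {
    // Guard against an upstream stage that already added the marker.
    debug_assert!(!tokens.contains(&Token::EndOfFile));
    tokens.push(Token::EndOfFile);
    tokens
}
```

The alternative, having the tokenizer emit the marker and making the later stage skip it, would work equally well; what matters is that both tokenizer paths follow the same convention.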

cursor · 2025-09-05T18:34:13Z

crates/lib-core/src/parser/simd_tokenizer.rs

+        i += 1;
+    }
+    i
+}


Bug: Identifier Parsing Mismatch Causes Infinite Loop

The is_ident_start function incorrectly validates non-ASCII bytes, including UTF-8 continuation bytes, as valid identifier starts. This clashes with take_identifier, which only consumes ASCII characters. This inconsistency causes incorrect tokenization of non-ASCII identifiers and, more critically, an infinite loop when take_identifier fails to advance the position.

Additional Locations (1)

crates/lib-core/src/parser/simd_tokenizer.rs#L98-L101
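
Independently of how the two predicates are reconciled, the outer loop can be made robust by guaranteeing forward progress. The following is a minimal sketch under the assumption that an ASCII-only `take_identifier` is kept; `token_spans` and its fallback of consuming one whole character are hypothetical and only illustrate the guard.

```rust
/// ASCII-only identifier scan, as described in the review comment.
fn take_identifier(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .take_while(|b| b.is_ascii_alphanumeric() || **b == b'_')
        .count()
}

/// Hypothetical driver loop: returns byte spans of tokens and always
/// advances `pos`, even when `take_identifier` consumes nothing.
fn token_spans(src: &str) -> Vec<(usize, usize)> {
    let bytes = src.as_bytes();
    let mut pos = 0;
    let mut spans = Vec::new();
    while pos < bytes.len() {
        if bytes[pos].is_ascii_whitespace() {
            pos += 1;
            continue;
        }
        let mut len = take_identifier(&bytes[pos..]);
        if len == 0 {
            // Fallback: consume exactly one character (1 to 4 bytes) so the
            // loop cannot spin and `pos` stays on a char boundary.
            len = src[pos..].chars().next().map_or(1, char::len_utf8);
        }
        spans.push((pos, pos + len));
        pos += len;
    }
    spans
}
```

Because every branch adds at least one byte to `pos`, the loop terminates on any input, including text made up entirely of non-ASCII characters.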

github-actions · 2025-09-05T18:41:26Z

Benchmark for `96d48ae`

| Test | Base | PR | Change |
| --- | --- | --- | --- |
| DepthMap::from_parent | 51.0±0.51µs | 55.1±0.65µs | +8.04% |
| fix_complex_query | 11.7±0.08ms | 12.1±0.05ms | +3.42% |
| fix_superlong | 138.1±17.37ms | 140.8±16.27ms | +1.96% |
| parse_complex_query | 4.1±0.04µs | 4.5±0.05µs | +9.76% |
| parse_expression_recursion | 7.2±0.06µs | 7.7±0.10µs | +6.94% |
| parse_simple_query | 1051.4±21.49ns | 1237.8±12.87ns | +17.73% |

github-actions · 2025-09-10T21:36:13Z

Benchmark for `6cb3df5`

| Test | Base | PR | Change |
| --- | --- | --- | --- |
| DepthMap::from_parent | 50.9±0.51µs | 54.9±1.83µs | +7.86% |
| fix_complex_query | 11.8±0.09ms | 12.0±0.38ms | +1.69% |
| fix_superlong | 145.9±14.97ms | 146.7±14.51ms | +0.55% |
| parse_complex_query | 4.2±0.07µs | 4.2±0.04µs | 0.00% |
| parse_expression_recursion | 7.3±0.10µs | 7.4±0.10µs | +1.37% |
| parse_simple_query | 1047.7±13.97ns | 1058.0±12.14ns | +0.98% |
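
The percentage column in both benchmark tables appears consistent with the relative change of the displayed means, (PR − Base) / Base. The small sketch below recomputes a few rows under that assumption; `pct_change` is a hypothetical helper and not part of the benchmark tooling.

```rust
/// Relative change of `pr` versus `base`, in percent.
/// Assumes both values are in the same unit and `base` is non-zero.
fn pct_change(base: f64, pr: f64) -> f64 {
    (pr - base) / base * 100.0
}

/// Formats the change with an explicit sign and two decimals.
fn fmt_change(base: f64, pr: f64) -> String {
    format!("{:+.2}%", pct_change(base, pr))
}
```

For example, `fmt_change(1051.4, 1237.8)` corresponds to the `parse_simple_query` row for `96d48ae`, and `fmt_change(1047.7, 1058.0)` to the same row for `6cb3df5`.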

openhands-ai · 2025-09-10T21:38:53Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- PR Checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1934 at branch `codex/implement-simd-tokenizer-in-sqruff`

Feel free to include any additional details that might help me get this PR into a better state.

Use experimental SIMD tokenizer behind feature flag

cd827d8

benfdking added the codex label Sep 5, 2025 — with ChatGPT Codex Connector

chatgpt-codex-connector bot reviewed Sep 5, 2025

cursor bot reviewed Sep 5, 2025

progress on testing

b2c2dba
