-
Notifications
You must be signed in to change notification settings - Fork 44
Add experimental SIMD tokenizer #1934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codex Review: Here are some suggestions.
Reply with @codex fix comments
to fix any unresolved comments.
About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".
} | ||
|
||
const CHUNK: usize = 64; | ||
|
||
fn is_ident_start(b: u8) -> bool { | ||
b.is_ascii_alphabetic() || b == b'_' || b >= 0x80 | ||
} | ||
|
||
fn skip_whitespace(bytes: &[u8]) -> usize { | ||
let mut i = 0; | ||
while i + CHUNK <= bytes.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P1] Avoid infinite loop on non‑ASCII identifiers
When the SIMD tokenizer encounters a byte ≥ 0x80 it treats it as a valid identifier start, but take_identifier
only advances over ASCII letters/digits/underscores. For non‑ASCII input len
becomes 0 and pos += len
never advances, so while pos < bytes.len()
spins forever as soon as the source contains any UTF‑8 characters beyond ASCII. Either reject non‑ASCII in is_ident_start
or teach take_identifier
to advance over those bytes.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the final PR Bugbot will review for you during this billing cycle.
Your free Bugbot reviews will reset on October 13
Details
You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.
To receive Bugbot reviews on all of your PRs, please upgrade to Bugbot Pro by visiting the Cursor dashboard. Your first 14 days will be free!
|
||
str_buff = resort_res.forward_string; | ||
element_buffer.append(&mut resort_res.elements); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Duplicate EndOfFile Segments in SIMD Path
The SIMD tokenizer adds an EndOfFile
token, leading to duplicate EndOfFile
segments. The elements_to_segments
method also unconditionally adds an EndOfFile
segment, resulting in two when the SIMD tokenizer is enabled, unlike the single segment from the non-SIMD path.
Additional Locations (1)
i += 1; | ||
} | ||
i | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Identifier Parsing Mismatch Causes Infinite Loop
The is_ident_start
function incorrectly validates non-ASCII bytes, including UTF-8 continuation bytes, as valid identifier starts. This clashes with take_identifier
, which only consumes ASCII characters. This inconsistency causes incorrect tokenization of non-ASCII identifiers and, more critically, an infinite loop when take_identifier
fails to advance the position.
Additional Locations (1)
Benchmark for 96d48aeClick to view benchmark
|
Benchmark for 6cb3df5Click to view benchmark
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like
Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
Summary
simd-tokenizer
feature tosqruff-lib-core
and gateportable_simd
usageLexer
to tokenize withSimdTokenizer
when the feature is enabledTesting
cargo test -p sqruff-lib-core
cargo test -p sqruff-lib-core --features simd-tokenizer
https://chatgpt.com/codex/tasks/task_e_68bb26bf18688330a6be5fbb04e56cc5