Skip to content

Conversation

benfdking
Copy link
Collaborator

Summary

  • add simd-tokenizer feature to sqruff-lib-core and gate portable_simd usage
  • wire Lexer to tokenize with SimdTokenizer when the feature is enabled
  • expose the SIMD tokenizer module conditionally

Testing

  • cargo test -p sqruff-lib-core
  • cargo test -p sqruff-lib-core --features simd-tokenizer

https://chatgpt.com/codex/tasks/task_e_68bb26bf18688330a6be5fbb04e56cc5

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex Review: Here are some suggestions.

Reply with @codex fix comments to fix any unresolved comments.

About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".

Comment on lines +95 to +105
}

const CHUNK: usize = 64;

fn is_ident_start(b: u8) -> bool {
b.is_ascii_alphabetic() || b == b'_' || b >= 0x80
}

fn skip_whitespace(bytes: &[u8]) -> usize {
let mut i = 0;
while i + CHUNK <= bytes.len() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Avoid infinite loop on non‑ASCII identifiers

When the SIMD tokenizer encounters a byte ≥ 0x80 it treats it as a valid identifier start, but take_identifier only advances over ASCII letters/digits/underscores. For non‑ASCII input len becomes 0 and pos += len never advances, so while pos < bytes.len() spins forever as soon as the source contains any UTF‑8 characters beyond ASCII. Either reject non‑ASCII in is_ident_start or teach take_identifier to advance over those bytes.

Useful? React with 👍 / 👎.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the final PR Bugbot will review for you during this billing cycle.

Your free Bugbot reviews will reset on October 13

Details

You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.

To receive Bugbot reviews on all of your PRs, please upgrade to Bugbot Pro by visiting the Cursor dashboard. Your first 14 days will be free!


str_buff = resort_res.forward_string;
element_buffer.append(&mut resort_res.elements);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Duplicate EndOfFile Segments in SIMD Path

The SIMD tokenizer adds an EndOfFile token, leading to duplicate EndOfFile segments. The elements_to_segments method also unconditionally adds an EndOfFile segment, resulting in two when the SIMD tokenizer is enabled, unlike the single segment from the non-SIMD path.

Additional Locations (1)

Fix in Cursor Fix in Web

i += 1;
}
i
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Identifier Parsing Mismatch Causes Infinite Loop

The is_ident_start function incorrectly validates non-ASCII bytes, including UTF-8 continuation bytes, as valid identifier starts. This clashes with take_identifier, which only consumes ASCII characters. This inconsistency causes incorrect tokenization of non-ASCII identifiers and, more critically, an infinite loop when take_identifier fails to advance the position.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link

github-actions bot commented Sep 5, 2025

Benchmark for 96d48ae

Click to view benchmark
Test Base PR %
DepthMap::from_parent 51.0±0.51µs 55.1±0.65µs +8.04%
fix_complex_query 11.7±0.08ms 12.1±0.05ms +3.42%
fix_superlong 138.1±17.37ms 140.8±16.27ms +1.96%
parse_complex_query 4.1±0.04µs 4.5±0.05µs +9.76%
parse_expression_recursion 7.2±0.06µs 7.7±0.10µs +6.94%
parse_simple_query 1051.4±21.49ns 1237.8±12.87ns +17.73%

Copy link

Benchmark for 6cb3df5

Click to view benchmark
Test Base PR %
DepthMap::from_parent 50.9±0.51µs 54.9±1.83µs +7.86%
fix_complex_query 11.8±0.09ms 12.0±0.38ms +1.69%
fix_superlong 145.9±14.97ms 146.7±14.51ms +0.55%
parse_complex_query 4.2±0.07µs 4.2±0.04µs 0.00%
parse_expression_recursion 7.3±0.10µs 7.4±0.10µs +1.37%
parse_simple_query 1047.7±13.97ns 1058.0±12.14ns +0.98%

Copy link

openhands-ai bot commented Sep 10, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • PR Checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1934 at branch `codex/implement-simd-tokenizer-in-sqruff`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant