Skip to content

[Lua] Highlight strings used by string.{format,gsub,pack,etc} #4194

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

mbartlett21
Copy link
Contributor

@mbartlett21 mbartlett21 commented Mar 20, 2025

This PR adds highlighting to the strings that are passed to string.{format,find,match,gsub,gmatch,pack,unpack,packsize}.

They are split into three categories:

  • Format strings
    https://www.lua.org/manual/5.4/manual.html#pdf-string.format
    These are done similarly to C's sprintf and friends, except that Lua
    only supports some specifiers and adds q as an option.

  • Patterns
    https://www.lua.org/manual/5.4/manual.html#6.4.1
    They are passed as the second argument to the four functions that take
    patterns, meaning that I can also match on
    (whatever_value):match 'this%s*that'. For scopes, I have roughly
    followed what the builtin regular expressions package does.
    I haven't worked on nested groups as that got quite hard with handling
    it inside the three different types of strings that Lua provides.
    The special capture () returns a position. I'm not sure what scope
    this should be. I've set it to punctuation.section.brackets.lua for
    the moment.

  • Pack format strings
    https://www.lua.org/manual/5.4/manual.html#6.4.2
    The alignment and endian specifiers I have scoped as
    storage.modifier and I have scoped the actual type designations as
    storage.type. For the padding that can be specified with x or X,
    I've scoped it to punctuation.separator.padding as I didn't see any
    better one, and that helped differentiate it from ones that have
    values.

Since format strings are always the first argument, I did not find a way
of handling them when the function is called as a method.
(e.g. ('%3.5f'):format(3)).

@jrappen
Copy link
Collaborator

jrappen commented Mar 20, 2025

From a quick glance over this PR I noticed:

  • Changes to the syntax file are not completely covered by the added tests in the test file.
  • Changes suggest anchor keywords only match at beginning and end of single-line strings, but anywhere in multi-line strings. Maybe double-check that, I don't know Lua.

@mbartlett21
Copy link
Contributor Author

mbartlett21 commented Mar 20, 2025

From a quick glance over this PR I noticed:

  • Changes to the syntax file are not completely covered by the added tests in the test file.

Noted. I'll probably look at the tests from eregex etc.

  • Changes suggest anchor keywords only match at beginning and end of single-line strings, but anywhere in multi-line strings. Maybe double-check that, I don't know Lua.

That is correct. Anchors are only matched at the ends of strings. elsewhere they match the character.
Edit: actually you're right. I'll need to check that

@michaelblyons
Copy link
Collaborator

It has become the practice more and more to use named contexts to push and set into, rather than successively indenting anonymous contexts. This lets inheriting syntaxes add extra rules in more locations without replicating big chunks of the base syntax.

@mbartlett21
Copy link
Contributor Author

@jrappen

I've added (roughly) the tests from the Regular Expression package and worked on getting the syntax to work with character classes and captures now.

Changes suggest anchor keywords only match at beginning and end of single-line strings, but anywhere in multi-line strings. Maybe double-check that, I don't know Lua.

This is fixed now so that the anchor only matches at the very start and very end. (Because of how multiline strings work, it has to account for a possible preceding newline)


@michaelblyons
Were you meaning the sections under pattern-{single-quoted,double-quoted,multiline}-string-body? I don't think there is any nested ones left in what's been added now.

@mbartlett21
Copy link
Contributor Author

I've added some lookahead matching on the one line for immediate calls so that it can work with ("string"):{pack,format}.

push:
- pattern-unquoted-balanced-char
- pattern-unquoted-balanced-char
- match: \%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume:

- match: '%.'
  scope: constant.character.escape.lua

should work just fine? No?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. With the quoted ones, it needs to be in a separate pattern so that the additional prototype can end the string if it is %', but I'll test what happens with an embedded syntax

@jrappen
Copy link
Collaborator

jrappen commented Mar 25, 2025

  • remove the todo list from the opening comment when done
  • find a way to get rid of the anonymous contexts five (?!) levels deep

@mbartlett21
Copy link
Contributor Author

find a way to get rid of the anonymous contexts five (?!) levels deep

This seems to be to do with with_prototype, which I've used in order to deduplicate code between double-quoted and single-quoted strings.

Or are you meaning in the builtin-modules pattern structure?

@michaelblyons
Copy link
Collaborator

Instead of something like

strings:
  - match: '"'
    scope: punctuation.definition.string.begin
    push:
      - meta_scope: string.quoted.double
      - match: '"'
        scope: punctuation.definition.string.end
        pop: 1

Use

strings:
  - match: '"'
    scope: punctuation.definition.string.begin
    push: string-double-body

string-double-body:
  - meta_scope: string.quoted.double
  - match: '"'
    scope: punctuation.definition.string.end
    pop: 1

mbartlett21 and others added 2 commits April 1, 2025 14:21
- add named contexts
- add `leading_wspace` var
- add `first_line_language` var
@jrappen
Copy link
Collaborator

jrappen commented Apr 23, 2025

added some named contexts and section markers

Comment on lines +1175 to +1191
- match: \[(=*)\[
scope: punctuation.definition.string.begin.lua
set: multiline-string-body
with_prototype:
- include: fmtstr-embed

- match: \'
scope: punctuation.definition.string.begin.lua
set: single-quoted-string-body
with_prototype:
- include: fmtstr-embed

- match: \"
scope: punctuation.definition.string.begin.lua
set: double-quoted-string-body
with_prototype:
- include: fmtstr-embed
Copy link
Collaborator

@deathaxe deathaxe May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'd try to avoid with_prototype wherever possible as it is known to cause trouble every here and then by either causing complexity to explode or by just creating inclusion loops.

In this very special case it is bombs on ants, as string contexts are trivial.

Hence I'd recommend:

Suggested change
- match: \[(=*)\[
scope: punctuation.definition.string.begin.lua
set: multiline-string-body
with_prototype:
- include: fmtstr-embed
- match: \'
scope: punctuation.definition.string.begin.lua
set: single-quoted-string-body
with_prototype:
- include: fmtstr-embed
- match: \"
scope: punctuation.definition.string.begin.lua
set: double-quoted-string-body
with_prototype:
- include: fmtstr-embed
- match: \[(=*)\[
scope: punctuation.definition.string.begin.lua
set: multiline-fmtstr-body
- match: \'
scope: punctuation.definition.string.begin.lua
set: single-quoted-fmtstr-body
- match: \"
scope: punctuation.definition.string.begin.lua
set: double-quoted-fmtstr-body
single-quoted-fmtstr-body:
- meta_include_prototype: false
- meta_scope: meta.string.lua string.quoted.single.lua
- include: single-quoted-string-body
- include: fmtstr-placeholder
double-quoted-fmtstr-body:
- meta_include_prototype: false
- meta_scope: meta.string.lua string.quoted.double.lua
- include: double-quoted-string-body
- include: fmtstr-placeholder
multiline-fmtstr-body:
- meta_include_prototype: false
- meta_scope: meta.string.lua string.quoted.multiline.lua
- include: multiline-string-body
- include: fmtstr-placeholder

Maybe a little bit more verbose implementation wise, but more efficient.

Comment on lines +1193 to +1214
fmtstr-embed:
- meta_include_prototype: false
- match: '%%'
scope: constant.character.escape.lua

- match: '%'
scope: constant.other.placeholder.lua
push: fmtstr-pattern

fmtstr-pattern:
- meta_include_prototype: false
# these characters never appear in a format pattern so it can be simpler
- match: (?=[\\\"\'\]])
pop: 1
- match: '[-+#0-9. ]'
scope: constant.other.placeholder.lua
- match: '[cdiuoxXaAfeEgGpqs]'
scope: constant.other.placeholder.lua
pop: 1
- match: '.'
scope: invalid.illegal.invalid-format.lua
pop: 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format string placeholder patterns should be as restrictive as possible to avoid false positives. They are also rather unlikely to be subject of interpolation and for same reason probably don't need or even shouldn't support it.

The following suggestion is a tweaked pattern from Python syntax.

Suggested change
fmtstr-embed:
- meta_include_prototype: false
- match: '%%'
scope: constant.character.escape.lua
- match: '%'
scope: constant.other.placeholder.lua
push: fmtstr-pattern
fmtstr-pattern:
- meta_include_prototype: false
# these characters never appear in a format pattern so it can be simpler
- match: (?=[\\\"\'\]])
pop: 1
- match: '[-+#0-9. ]'
scope: constant.other.placeholder.lua
- match: '[cdiuoxXaAfeEgGpqs]'
scope: constant.other.placeholder.lua
pop: 1
- match: '.'
scope: invalid.illegal.invalid-format.lua
pop: 1
fmtstr-placeholder:
- match: '%%'
scope: constant.character.escape.lua
- match: |-
(?x)
%
\#? # alternate form
0? # pad with zeros
\-? # left-adjust
\ ? # implicit sign
[+-]? # sign
\d* # width
(\.\d*)? # precision
[cdiuoxXaAfeEgGpqs%]
scope: constant.other.placeholder.lua
- match: '%[^"''\\\]]'
scope: invalid.illegal.invalid-format.lua

Note: See also somewhat renamed context name.

Please double check, whether I missed something.

Comment on lines +1220 to +1251
packstr-string:
- match: \[(=*)\[
scope: punctuation.definition.string.begin.lua
set: multiline-string-body
with_prototype:
- include: packstr-pattern

- match: \'
scope: punctuation.definition.string.begin.lua
set: single-quoted-string-body
with_prototype:
- include: packstr-pattern

- match: \"
scope: punctuation.definition.string.begin.lua
set: double-quoted-string-body
with_prototype:
- include: packstr-pattern

packstr-pattern:
- meta_include_prototype: false
- match: '[=<>]|![0-9]*'
scope: storage.modifier.lua

- match: '[bBhHlLjJTfdnz]|[Iis][0-9]*|c[0-9]+'
scope: storage.type.lua

- match: '[xX]'
scope: punctuation.separator.padding.lua

- match: (?:[^\\\"\'\]\s])
scope: invalid.illegal.invalid-storage.lua
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous comments apply here as well.

Suggested change
packstr-string:
- match: \[(=*)\[
scope: punctuation.definition.string.begin.lua
set: multiline-string-body
with_prototype:
- include: packstr-pattern
- match: \'
scope: punctuation.definition.string.begin.lua
set: single-quoted-string-body
with_prototype:
- include: packstr-pattern
- match: \"
scope: punctuation.definition.string.begin.lua
set: double-quoted-string-body
with_prototype:
- include: packstr-pattern
packstr-pattern:
- meta_include_prototype: false
- match: '[=<>]|![0-9]*'
scope: storage.modifier.lua
- match: '[bBhHlLjJTfdnz]|[Iis][0-9]*|c[0-9]+'
scope: storage.type.lua
- match: '[xX]'
scope: punctuation.separator.padding.lua
- match: (?:[^\\\"\'\]\s])
scope: invalid.illegal.invalid-storage.lua
packstr-string:
- match: \[(=*)\[
scope: punctuation.definition.string.begin.lua
set: multiline-packstr-body
- match: \'
scope: punctuation.definition.string.begin.lua
set: single-quoted-packstr-body
- match: \"
scope: punctuation.definition.string.begin.lua
set: double-quoted-packstr-body
single-quoted-packstr-body:
- meta_include_prototype: false
- meta_scope: meta.string.lua string.quoted.single.lua
- include: single-quoted-string-body
- include: packstr-patterns
double-quoted-packstr-body:
- meta_include_prototype: false
- meta_scope: meta.string.lua string.quoted.double.lua
- include: double-quoted-string-body
- include: packstr-patterns
multiline-packstr-body:
- meta_include_prototype: false
- meta_scope: meta.string.lua string.quoted.multiline.lua
- include: multiline-string-body
- include: packstr-patterns
packstr-patterns:
- match: '[=<>]|![0-9]*'
scope: storage.modifier.lua
- match: '[bBhHlLjJTfdnz]|[Iis][0-9]*|c[0-9]+'
scope: storage.type.lua
- match: '[xX]'
scope: punctuation.separator.padding.lua
- match: '[^"''\\\]\s]'
scope: invalid.illegal.invalid-storage.lua

Comment on lines +1098 to +1102
- match: \[(=*)\[
scope: punctuation.definition.string.begin.lua
set:
- pattern-multiline-string-body
- pattern-test-first-caret-multiline
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reliably escape on proper amount of = chars, embed in combo with pop could be used here:

Suggested change
- match: \[(=*)\[
scope: punctuation.definition.string.begin.lua
set:
- pattern-multiline-string-body
- pattern-test-first-caret-multiline
- match: \[(=*)\[
scope: meta.string.lua string.quoted.multiline.lua punctuation.definition.string.begin.lua
embed: pattern-multiline-string-body
embed_scope: meta.string.lua string.quoted.multiline.lua
escape: \]\1\]
escape_captures:
0: meta.string.lua string.quoted.multiline.lua punctuation.definition.string.end.lua
pop: 1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better than the other suggestion above, I wouldn't call it a "quoted" string, though.

Comment on lines +1129 to +1134
pattern-test-first-caret-multiline:
- meta_include_prototype: false
- match: $\n?
set: pattern-test-first-caret
- include: pattern-test-first-caret

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obsolete with previous change

Suggested change
pattern-test-first-caret-multiline:
- meta_include_prototype: false
- match: $\n?
set: pattern-test-first-caret
- include: pattern-test-first-caret

Comment on lines +1159 to +1168
pattern-multiline-string-body:
- meta_include_prototype: false
- meta_scope: meta.string.lua string.quoted.multiline.lua
- include: multiline-string-body
- match: ''
embed: pattern-unquoted
# we don't seem to be able to use the capture to get the right one
# this means that patterns that contain something that matches this
# will lose their context half way through.
escape: (?=\]=*\])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to simple multi-push, optionally consuming \n directly after opening punctuation.

Suggested change
pattern-multiline-string-body:
- meta_include_prototype: false
- meta_scope: meta.string.lua string.quoted.multiline.lua
- include: multiline-string-body
- match: ''
embed: pattern-unquoted
# we don't seem to be able to use the capture to get the right one
# this means that patterns that contain something that matches this
# will lose their context half way through.
escape: (?=\]=*\])
pattern-multiline-string-body:
- meta_include_prototype: false
- match: \n?
push:
- pattern-unquoted
- pattern-test-first-caret

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants