Thoughts about the best way to allow syntax highlighting of multiline strings & comments #2220

LorneHyde · 2025-06-15T14:41:46Z

When reviewing this PR, @SquidDev suggested:

> The long-term fix here is probably to switch edit's syntax highlighter to use a proper lexer. CC:T does already have one, which should be relatively reusable. I don't know if that's something you'd be interested in looking at?

I am interested in updating edit.lua's syntax highlighter, but I'm not convinced that reusing the existing lexer is the best way. My thoughts are below.

Places where the current lexer isn't applicable to `edit.lua`

Firstly, I'm open to other opinions, but my gut thought is that we should only reuse the lexer if lexer.lua itself will require no modification. The reason for this is because bugs in the lexer (and therefore in the compiler) are much worse than bugs in the syntax highlighter. Any bugs in the syntax highlighter have a purely cosmetic impact (highlighting something the wrong colour) whereas (in the worst case) bugs in the lexer could result in the user's code getting parsed incorrectly! To avoid the risk of introducing obscure bugs into the lexer while trying to make it more reusable, I think we ought to leave the lexer untouched.

Unfortunately, I think there are a fair few places where we would need to modify the lexer's code to make it work for syntax highlighting. A fairly simple example is that the compiler and edit.lua want different behaviour for unterminated multiline strings: the former gives an ERROR token, whereas we'd like the syntax highlighter to still display this as a string!

Another example is the way that positions are reported. lexer.lua stores its position as a single integer, rather than as a (line number, position within line) pair. This is slightly inconvenient for use in edit.lua, because the display is drawn line by line.
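As a rough illustration of that gap, converting a single-integer offset into a (line, column) pair could look something like the sketch below. Both helper names (`line_starts`, `to_line_col`) are hypothetical and are not part of the lexer or edit.lua; the sketch assumes 1-based offsets and "\n" line endings.

```lua
-- Hypothetical helper: record the offset at which each line begins.
local function line_starts(text)
    local starts = { 1 }
    -- "()" captures the position of every newline character.
    for pos in text:gmatch("()\n") do
        starts[#starts + 1] = pos + 1
    end
    return starts
end

-- Hypothetical helper: map a 1-based offset to (line, column), both 1-based.
-- Binary search for the last line whose start is <= offset.
local function to_line_col(starts, offset)
    local lo, hi = 1, #starts
    while lo < hi do
        local mid = math.floor((lo + hi + 1) / 2)
        if starts[mid] <= offset then
            lo = mid
        else
            hi = mid - 1
        end
    end
    return lo, offset - starts[lo] + 1
end
```

The table of line starts would have to be rebuilt (or patched) whenever the text changes, which is part of why a single-integer position is awkward for an editor that draws line by line.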

Why I'm not convinced reusing the lexer would simplify `edit.lua`

Also, I'm not sure we'd actually reduce the amount of code needed in edit.lua that much, by using the lexer.

While the existing lexer can be started from anywhere in the document, this doesn't include starting the lexer mid-token. E.g. if I have a multiline string, and the user changes a character in the middle of this string, we'd have to pass the whole multiline string to the lexer, since there's no way of passing the context of "we're in the middle of a string" to the lexer. "Rewinding" to the beginning of the previous token before calling the lexer would fix this issue, but we'd then need to modify the internal state stored by edit.lua to contain some mapping from position to token, and make sure this is updated appropriately.

We'd also need logic in edit.lua to figure out how many times to call lex_one each time a character is changed. For instance, if the user removes a "start multiline comment" token which previously started a long comment, then all of the code that was previously commented out needs to be re-lexed, even if this is many tokens! On the flip side, we don't want to re-lex the entire "rest of the file" every time the user makes a small change at the top of the file.

What still needs to be changed, and what's my solution?

Regardless, I still think the syntax highlighter in edit.lua should be rewritten. The reason for this is because it currently only highlights one line at a time, which means it can't properly highlight multiline comments or multiline strings.

My proposed solution is as follows:

edit.lua should store a list of triples (one for each line), containing:

- Raw text for that line
- List of tokens that said line has been parsed into
- The way in which this line changes "whether we're in a multiline comment" and "whether we're in a multiline string". The possible values are:
  1. Starting a multiline comment (which continues onto the next line)
  2. In the middle of a multiline comment (which started on a previous line and continues onto the next line)
  3. Ending a multiline comment (which started on a previous line)
  4. Starting a multiline string (which continues onto the next line)
  5. In the middle of a multiline string (which started on a previous line and continues onto the next line)
  6. Ending a multiline string (which started on a previous line)
  7. Ending a multiline string and starting a multiline comment
  8. Ending a multiline comment and starting a multiline string
  9. Not in a string nor a comment (this includes "multiline" strings and comments that are all on one line)
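A minimal sketch of what one of these per-line triples might look like in Lua. The names `TRANSITION`, `classify` and `new_line_record` are made up for illustration, and the sketch assumes the state on entry to and on exit from a line is one of `nil`, `"comment"` or `"string"`. The numbers match the list above.

```lua
-- Hypothetical lookup from (state before line, state after line) to the
-- nine transition values listed above.
local TRANSITION = {
    ["none>comment"] = 1,
    ["comment>comment"] = 2,
    ["comment>none"] = 3,
    ["none>string"] = 4,
    ["string>string"] = 5,
    ["string>none"] = 6,
    ["string>comment"] = 7,
    ["comment>string"] = 8,
    ["none>none"] = 9,
}

-- before/after are nil, "comment" or "string".
local function classify(before, after)
    return TRANSITION[(before or "none") .. ">" .. (after or "none")]
end

-- Build the (text, tokens, transition) triple for one line.
local function new_line_record(text, tokens, before, after)
    return { text = text, tokens = tokens, transition = classify(before, after) }
end
```

One case this encoding can't tell apart is a line that ends one multiline comment and starts another: it looks the same as value 2.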

When the user updates a line (entering or deleting a character) then always re-lex the entire line (this is something we already do), with additional context of whether we're in a string or comment. Additionally:

If the line is starting a string or comment:
- Check whether the user's change means that this line no longer starts a string/comment. If so, start re-lexing the lines that follow, keeping track of whether re-lexing each line actually changed anything. As soon as we hit a line where re-lexing changed nothing, we can stop.
If the line is mid-string or mid-comment:
- If the user's change ends the string/comment, start re-lexing the lines that follow, keeping track of whether re-lexing each line actually changed anything. As soon as we hit a line where re-lexing changed nothing, we can stop.
If the line ends a string or comment:
- Check whether the user's change means that this line no longer ends the string/comment. If so, start re-lexing the lines that follow, keeping track of whether re-lexing each line actually changed anything. As soon as we hit a line where re-lexing changed nothing, we can stop.
If it's not in a string/comment:
- Check whether the user's change means this line now starts a string/comment. If so, re-lex every line until the end of the string/comment (or end of file, if said string/comment is unterminated)

SquidDev · 2025-06-18T13:25:08Z

Maintainer

Thank you so much for looking at this!

So my main motivation around reusing the lexer was to avoid too much code duplication (especially around the more awkward bits of Lua's syntax, like long strings and \z string escapes). But I confess I haven't put too much thought into it beyond that, so if it doesn't end up helping, then more than happy to go another route!

> The reason for this is because bugs in the lexer (and therefore in the compiler) are much worse than bugs in the syntax highlighter.

Oh, just to clarify here, the lexer is only used for error-reporting (#1298, etc...), and not by Lua's compiler (that's written in Java instead). So while no bugs are desirable, the fallout here is much smaller!

> edit.lua should store a list of triples (one for each line), containing:
>
> - Raw text for that line
> - List of tokens that said line has been parsed into
> - The way in which this line changes "whether we're in a multiline comment" and "whether we're in a multiline string". The possible values are: [...]

I think that sounds sensible. A couple of thoughts:

- I think we could skip storing the tokens for each line. Highlighting a single line is relatively fast (as long as you avoid re-highlighting subsequent lines!), so I don't think it's worth the memory¹ and implementation cost.
- I don't think you need to store all transitions at the start of the line, and can instead just store how to continue lexing/highlighting this line. So that would be:
  - Following a new-line escape (`"xyz\`).
  - Following a zap (`"xyz\z`).
  - In a multi-line comment (`--[[ xyz`), storing the length of the long-string boundary.
  - In a multi-line string (`[[ xyz`), storing the length of the long-string boundary.

Then I think the logic roughly follows what you had already:

1. When a line is changed, re-lex that line.
2. Take the lexer's continuation state (e.g., are we in a long-string), and update the next line's starting state.
3. If the next line's state has changed (and it's visible on screen), re-lex that line, continuing from step 1.

I don't know if that sounds sensible?
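To make the three steps concrete, here is a rough, self-contained sketch. It is not the real lexer: `scan_line`, `same_state` and `relex_from` are hypothetical names, the scanner only tracks long strings and long comments (it skips over short strings, but ignores the `\` and `\z` continuations mentioned above), and the "visible on screen" check is left out.

```lua
-- State is nil (ordinary code) or { kind = "comment"|"string", level = n },
-- where level is the number of "=" in the long bracket.
local function scan_line(line, state)
    local pos = 1
    while pos <= #line do
        if state then
            -- Inside a long string/comment: look for the matching close.
            local close = "]" .. ("="):rep(state.level) .. "]"
            local _, finish = line:find(close, pos, true)
            if not finish then return state end
            state, pos = nil, finish + 1
        else
            local c = line:sub(pos, pos)
            if c == '"' or c == "'" then
                -- Skip a short string so brackets inside it are ignored.
                pos = pos + 1
                while pos <= #line do
                    local d = line:sub(pos, pos)
                    if d == "\\" then
                        pos = pos + 2
                    elseif d == c then
                        pos = pos + 1
                        break
                    else
                        pos = pos + 1
                    end
                end
            else
                local is_comment = line:sub(pos, pos + 1) == "--"
                local start = is_comment and pos + 2 or pos
                local eqs = line:match("^%[(=*)%[", start)
                if eqs then
                    state = { kind = is_comment and "comment" or "string", level = #eqs }
                    pos = start + #eqs + 2
                elseif is_comment then
                    return nil -- ordinary line comment: ends with the line
                else
                    pos = pos + 1
                end
            end
        end
    end
    return state
end

local function same_state(a, b)
    if a == nil or b == nil then return a == b end
    return a.kind == b.kind and a.level == b.level
end

-- start_states[i] is the state on entry to line i (nil = ordinary code).
-- Re-scans from the changed line and stops as soon as the next line's
-- starting state is unaffected. Returns how many lines were re-scanned.
local function relex_from(lines, start_states, changed)
    local i, count = changed, 0
    while i <= #lines do
        local end_state = scan_line(lines[i], start_states[i])
        count = count + 1
        if i == #lines or same_state(start_states[i + 1], end_state) then break end
        start_states[i + 1] = end_state
        i = i + 1
    end
    return count
end
```

In this sketch, an edit that doesn't change how the line ends costs one re-scan, while opening or closing a long comment only propagates as far as the states actually differ.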

You probably could fit this into the existing lexer, by returning the continuation as a tuple of a function plus the arguments to that function. So for instance, `"xyz\` would return something like `return tokens.STRING, pos, { lex_string, 0, quote }`. Less sure there; as you say, it might not be worth it.

¹ I'm probably over-stating the memory cost a little. Some back-of-the-envelope numbers show it'd be roughly the same as the size of the file, so a 2x increase. But given how small most files in CC are, that's probably not an issue.

8 replies

LorneHyde Jun 25, 2025
Author

Seems like a good solution ¯\_(ツ)_/¯

I'd be tempted to use lambda functions rather than a table, for the continuation function. I.e. instead of returning `{ lex_string, 1, 1, quote }`, returning something more like `function(context, str) return lex_string(context, str, 1, 1, quote) end`. I think that would avoid exposing the internals of the lexer to edit.lua, especially since lex_string etc. are local functions anyway, so it's weird that edit.lua should have to know what arguments they take.

I've also realised you've already started using the lexer for syntax highlighting (albeit in a way that doesn't work for multiline) at 69353a4 just last week!

SquidDev Jun 25, 2025
Maintainer

> I'd be tempted to use lambda functions rather than a table, for the continuation function

100% agreed that would be nicer. Because we need to be able to compare states, this annoyingly needs to be a table (or something else we can inspect and use for equality).

> I've also realised you've already started using the lexer for syntax highlighting (albeit in a way that doesn't work for multiline) at 69353a4 just last week!

Yeah, that one was terrible timing on my part! Made literally hours before the discussion was opened. 😞 Sorry!

LorneHyde Jun 25, 2025
Author

> Because we need to be able to compare states, this annoyingly needs to be a table (or something else we can inspect and use for equality).

I'm confused about what you mean here: in Lua, it is possible to compare functions for equality, and this is already something we do (since we're passing the continuation into shallowEqual, where the first element of this table is a function).

LorneHyde Jun 25, 2025
Author

^ adding to the above:

I guess my suggestion wouldn't work for lambda functions, since they don't compare as equal, but would work if we instead return new wrapper functions (names like continue_lex_string) defined at the top level in lexer.lua.

Maybe that feels messy because of the number of new functions we'd need. But it still feels like it'd give slightly better module boundaries than the current solution, which feels like we're exposing implementation details of the lexer to edit.lua.

SquidDev Jun 25, 2025
Maintainer

Ahh, sorry. What I meant is that two instances of the same function with the same upvalues won't be equal. So each time we return a `function(context, str) return lex_string(context, str, 1, 1, quote) end`, we'll get a new non-equal function, even if the quote is the same.
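A small sketch of the difference, using a toy `lex_string` with a made-up signature (the real one is local to the lexer and takes different arguments). `as_closure`, `as_table`, `shallow_equal` and `resume` are likewise hypothetical names for illustration.

```lua
local unpack = table.unpack or unpack

-- Toy stand-in for the lexer's string routine (hypothetical signature):
-- finds the closing quote and returns a token kind plus its end position.
local function lex_string(text, start, quote)
    local finish = text:find(quote, start, true)
    return "STRING", finish or #text
end

-- Continuation as a closure: every call builds a fresh function value
-- capturing a new `quote`, so two results never compare equal.
local function as_closure(quote)
    return function(text) return lex_string(text, 1, quote) end
end

-- Continuation as a table of function + arguments: the table identity
-- differs per call, but the contents can be compared element by element.
local function as_table(quote)
    return { lex_string, 1, quote }
end

local function shallow_equal(a, b)
    if a == b then return true end
    if type(a) ~= "table" or type(b) ~= "table" then return false end
    if #a ~= #b then return false end
    for i = 1, #a do
        if a[i] ~= b[i] then return false end
    end
    return true
end

-- Resume lexing from a table continuation.
local function resume(cont, text)
    return cont[1](text, unpack(cont, 2))
end
```

Here `as_closure('"') == as_closure('"')` is false, whereas `shallow_equal(as_table('"'), as_table('"'))` is true, which is what the "has the next line's state changed?" check relies on.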
