Proposed SIL TeX-like lexer (SILE) #144

Draft
wants to merge 1 commit into
base: default

Omikhleia

Greetings,

This PR proposes a Scintillua lexer for the "SIL in TeX-like flavor" syntax used by the SILE typesetting system.

It's my first attempt at a (theoretically simple) lexer; I hope I did it correctly ;)

Regards.

@Omikhleia
Author

N.B. An illustration of what it does: sile-typesetter/sile#2220 (comment)

Owner

@orbitalquark left a comment

Thanks for your contribution! This looks pretty good. I've left some comments for some things to address.

local simple_value = (P(1) - S(',;]'))^1
local quoted_value = lexer.range('"', false, false)
local param = lex:tag(lexer.ATTRIBUTE, identifier) * eq * lex:tag(lexer.STRING, quoted_value + simple_value)
local param_list = param * (ws^0 * lex:tag(lexer.OPERATOR, ',') * ws^0 * param)^0
Owner

Using ws will include newlines. I don't know if that's acceptable syntax, but some editors may take issue when starting a lex at the beginning of a line that is inside a param_list. Scintillua cannot backtrack far enough to the start of the list. I'd recommend declaring local ws = lex:tag(lexer.WHITESPACE, S(' \t')) so that only horizontal space matches.

Author

Newlines are accepted there, and are not that uncommon. E.g.

\framebox[
          bordercolor=red, fillcolor=#f0f0f0,
          shadow=true, shadowcolor=black
]{Some box}

(I tend to do this myself, for separating parameters of different nature)


-- 2. Comments.
local line_comment = lexer.to_eol('%')
local env_comment = lexer.range(P('\\begin') * optparams * P('{comment}'), P('\\end{comment}'))
Owner

Typically, Scintillua lexers allow the trailing delimiter to be optional (P('\\end{comment}')^-1), so that the comment is highlighted correctly as you type. I'd recommend doing that here.
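
To illustrate the idea outside of LPeg, here is a minimal sketch in plain Lua. The helper `comment_range` is hypothetical (it is not part of Scintillua or of this PR), and it deliberately ignores the optional `[...]` parameters after `\begin`. It only shows what "optional closing delimiter" means: when `\end{comment}` has not been typed yet, the comment extends to the end of the input instead of not matching at all.

```lua
-- Hypothetical helper, for illustration only: return the start and end
-- indices of a comment environment in `text`. The closing delimiter is
-- optional: if it is missing, the range runs to the end of the input.
local function comment_range(text)
  local open = '\\begin{comment}'
  local close = '\\end{comment}'
  -- Plain (non-pattern) search for the opening delimiter.
  local s, e = text:find(open, 1, true)
  if not s then return nil end
  -- Look for the closing delimiter after the opening one.
  local _, close_end = text:find(close, e + 1, true)
  return s, close_end or #text
end

-- Terminated comment: the range stops at the end of \end{comment}.
local s1, e1 = comment_range('a \\begin{comment}x\\end{comment} b')
-- Unterminated comment (still being typed): the range runs to the end.
local s2, e2 = comment_range('a \\begin{comment}x')
```

With this behavior, a half-typed comment is already highlighted as a comment rather than as ordinary text.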

Author

Not changed: I tried, but with the ^-1 the end delimiter was never matched, even when present. I'm not sure why.

lexers/sil.lua Outdated
-- We need alt names for multiple embeddings and rules
local base_rule_id = name .. lang
local embedder = lexer.load(lang, base_rule_id .. '_cmd')
lex:embed(
Owner

This is clever!

Author

😊

Author

Not that clever, actually, and there's an edge case for commands that I have no idea how to address.
This would be valid:

\lua{SILE.process({ "text" })}

So the "end rule" for embedding cannot just be a }; it has to be a balanced one... I haven't found such a case in the sources (most embedded lexers have clear start/end rules that are not expected in the child lexer).

Owner

Sorry for the delayed response.

I don't think there is an easy way to address this. The only thing I can think of is when encountering a '{' or '}', increase or decrease a count variable, and store that in a persistent lexer line state. However, I'm not sure that would work for the one-line example you gave.

I think you'll have to accept the limitation, or not perform this kind of highlighting.
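
As a rough sketch of the counting idea, in plain Lua rather than LPeg: the helper `find_unpaired_close` below is hypothetical and not Scintillua API. It finds the first `}` with no matching `{`, given the nesting depth that was already open where the scan starts. It also shows why that depth has to be remembered somewhere: a scan that restarts in the middle of the embedded code with a depth of 0 stops at the wrong brace.

```lua
-- Hypothetical helper, for illustration only: scan `text` from index `init`
-- and return the index of the first '}' that has no matching '{', or nil if
-- there is none. `depth` is the nesting level already open at `init`.
local function find_unpaired_close(text, init, depth)
  depth = depth or 0
  for i = init, #text do
    local c = text:sub(i, i)
    if c == '{' then
      depth = depth + 1
    elseif c == '}' then
      if depth == 0 then return i end
      depth = depth - 1
    end
  end
  return nil -- still inside the embedded code at the end of the input
end

-- One-shot scan of the whole command: the closing brace is the last one.
local src = '\\lua{SILE.process({ "text" })}'
local body_start = src:find('{', 1, true) + 1
local close = find_unpaired_close(src, body_start)

-- Restarting on a later line of a multi-line body, where one '{' is still
-- open from a previous line:
local line = '    1 })'
local wrong = find_unpaired_close(line, 1)    -- depth forgotten
local right = find_unpaired_close(line, 1, 1) -- depth carried over
```

Here `close` is the index of the final `}` of the command, `wrong` points at the inner brace, and `right` is nil because the embedded code has not ended yet. The sketch counts every brace, including any inside strings or comments of the embedded language.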

Author

@Omikhleia Feb 22, 2025

Sorry for the delayed response. The only thing I can think of is when encountering a '{' or '}', increase or decrease a count variable (...)

Thanks for your feedback.
I've added, in a fixup, some LPeg-heavy machinery to keep track of the brace pairing level and exit the child lexer upon the first unpaired closing brace. The "state" is kept in the parsing patterns via a table [1].

It does work well, as far as I can tell, for syntax highlighting a whole file:
[image]

(with ugly colors here to better distinguish the items 😸 )

But I don't know how an editor (SciTE, I suppose) behaves in those cases while the user is typing.

Footnotes

  1. Logic somewhat inspired / derived from the way the lunamark library keeps track of nested divs in Markdown, with a similar technique.

Author

N.B. Of course, the pairing and child lexer hacking is fairly naive: there would probably still be an issue with unpaired braces that are part of a comment, an escape sequence, or other acceptable syntax (e.g. a string) in the embedded child language.
But I think this is such an edge case that it's probably acceptable (and several other editor solutions do not handle these cases very gracefully either, anyway...)

Owner

@orbitalquark left a comment

I've played around with your lexer, and I know you've put a lot of work into this, but it's too brittle. It'll get the parse right when opening a file, but once you start typing into it, it all falls apart.

For example, I've opened this file:

\begin[papersize=a5]{document} 
\nofolios 
\neverindent 
\use[module=packages.math] 

\chapter{Some SIL code} 

\section[numbered=false]{Some SIL code} 

\begin[rule="0.4pt"]{epigraph} 
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

\comment{A comment} 

\ftl{Some text} 

\xml{<tag>Some XML code</tag>} 

\em{Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.} 
\caption{John \font[variant=+smcp]{Doe}.} 
\end{epigraph} 

\math[mode=diplay]{ 
	\int_{0)^{1} x^2 dx = \frac{1}{3} 
}

\lua{SILE.typesetter:typeset("Hello, world!")}

\font[family=DejaVu Serif, size=12pt]{Hello, world!} 

\font[family=DejaVu Serif, size=12pt]

\begin{lua} 
SILE.typesetter:typeset("Hello, world!") 
\end{lua}

\end{document}

I put my caret on the \xml{<tag>Some XML code</tag>} line after the first { and hit enter. My editor will start lexing the line starting with <tag> (the previous line is \xml{). I get an error:

/path/to/lexers/lexer.lua:1255: back reference 'brace_level' not found

If I undo the change, pressing enter elsewhere in the file now gives me the same error; editing becomes impossible. In fact, pressing enter anywhere in the file immediately after opening it gives me the error.


I also tried a simpler case:

\framebox[bordercolor=red, fillcolor=#f0f0f0, shadow=true, shadowcolor=black]{Some box}

This highlights great, but now I want to split it into multiple lines between the [ ]. After I press enter after the first [ followed by Tab (for indentation), highlighting immediately breaks, because lexing starts at bordercolor, which loses the context from what is now the previous line:

[image]


In my experience, lexers need to be written as context-free as possible in order to highlight reliably when editing. For example, the HTML lexer is aware that tag attributes may span multiple lines, so it only highlights an attribute if an = immediately follows it:

local attribute_eq = (known_attribute + unknown_attribute) * ws^-1 * equals
lex:add_rule('attribute', attribute_eq)
Similarly, it only highlights numbers if they follow an =, indicating an attribute value. Yes, it's not perfect, but it gets it right the vast majority of the time and it's robust for editing.
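
A minimal plain-Lua sketch of the same context-free approach, applied to a parameter line like the ones discussed above. The helper `attribute_names` is hypothetical and uses Lua string patterns instead of LPeg. It recognizes a name only by what immediately follows it on the same line, so it gives the same answer whether or not the opening `[` was seen on an earlier line.

```lua
-- Hypothetical helper, for illustration only: collect every identifier that
-- is immediately followed by '=' (optional spaces or tabs in between).
-- Only the given line is inspected; no earlier context is needed.
local function attribute_names(line)
  local names = {}
  for name in line:gmatch('([%a_][%w_]*)[ \t]*=') do
    names[#names + 1] = name
  end
  return names
end

-- A continuation line inside a multi-line parameter list.
local names = attribute_names('          bordercolor=red, shadow=true,')
-- Ordinary text contains no name followed by '=', so nothing is reported.
local none = attribute_names('Lorem ipsum dolor sit amet')
```

The trade-off is the one described above: a stray `name=` in running text would also be picked up, but the result does not change when lexing restarts at the beginning of a line.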


What you have here is an impressive one-time lexer. You've done some clever things I didn't realize could be viable. However, Scintillua's lexers are designed to lex files being edited, so we need to keep that in mind.

I'm not sure what to suggest moving forward. Maybe just take a step back and see what you can do to make it a bit more context-free. I would also encourage you to try editing code with your lexer active. You said you didn't know how an editor like SciTE behaves as the user types. Experiment and see :) You can iron out bugs and issues as they arise.

@Omikhleia Omikhleia marked this pull request as draft February 22, 2025 23:06
@Omikhleia
Author

Omikhleia commented Feb 22, 2025

I've played around with your lexer, and I know you've put a lot of work into this, but it's too brittle.
(...)
Experiment and see :)

Thanks a lot for all the feedback and hints, @orbitalquark.
I just installed Textadept, and indeed reproduced the same problems... Ouch indeed.

What you have here is an impressive one-time lexer. You've done some clever things I didn't realize could be viable. However, Scintillua's lexers are designed to lex files being edited, so we need to keep that in mind.
I'm not sure what to suggest moving forward.

I don't know either (I experimented with some other ways towards a more context-free approach tonight, but without success so far, and my free time is running short, riding two horses with one behind...)

Well... it somewhat works for highlighting a "complete file", so for now:

  • I'll use that code as an "addon" in my own use case (the highlighter.sile module for SILE, which takes a complete input and so doesn't have to handle the issues of a typist editing a file on the fly; a "one-time" lexer is sufficient there...).
  • I've just switched this PR to Draft status. It's kind of a step back which I didn't foresee, but thanks again for the encouragement and ideas. I'm glad I discovered this repository, and now Textadept. The effort was certainly not in vain: I've learnt a few things in this adventure, and I hope to return to it at some later point. (I'm also considering another lexer, which I hope is even simpler... though I discovered here that the devil is sometimes in the details. At least that one would have no need for embedded child languages and might allow context-free parsing... We'll see... Once bitten, twice shy 🐱 ).

@orbitalquark
Owner

I think you have a great attitude that will carry you to success. There's no hurry, so take your time :) Cheers.
