Proposed SIL TeX-like lexer (SILE) #144
N.B. An illustration of what it does: sile-typesetter/sile#2220 (comment)
Thanks for your contribution! This looks pretty good. I've left some comments for some things to address.
local simple_value = (P(1) - S(',;]'))^1
local quoted_value = lexer.range('"', false, false)
local param = lex:tag(lexer.ATTRIBUTE, identifier) * eq * lex:tag(lexer.STRING, quoted_value + simple_value)
local param_list = param * (ws^0 * lex:tag(lexer.OPERATOR, ',') * ws^0 * param)^0
Using ws will include newlines. I don't know if that's acceptable syntax, but some editors may take issue when starting a lex at the beginning of a line that is inside a param_list. Scintillua cannot backtrack far enough to the start of the list. I'd recommend declaring local ws = lex:tag(lexer.WHITESPACE, S(' \t')) so that only horizontal space matches.
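To make the trade-off concrete, here is a minimal sketch in plain LPeg (outside Scintillua, so without tagging). The names `param_list`, `any_ws` and `horizontal_ws` are hypothetical, and the identifier and value rules are simplified stand-ins for the ones in the diff:

```lua
local lpeg = require('lpeg')
local P, R, S = lpeg.P, lpeg.R, lpeg.S

-- Simplified stand-ins for the rules in the diff (no tagging, no quoted values).
local identifier = R('az', 'AZ')^1
local value = (P(1) - S(',;] \t\r\n'))^1
local param = identifier * '=' * value

-- Builds a parameter list whose separators may be padded with the given whitespace.
local function param_list(ws)
  return param * (ws^0 * ',' * ws^0 * param)^0
end

local any_ws = param_list(S(' \t\r\n'))     -- spaces, tabs and line breaks
local horizontal_ws = param_list(S(' \t'))  -- spaces and tabs only

local multi_line = 'a=1,\n b=2'
print(lpeg.match(any_ws, multi_line))        -- index just past 'b=2'
print(lpeg.match(horizontal_ws, multi_line)) -- index just past 'a=1'
```

With horizontal-only whitespace the list never spans a line break, so a lex that restarts on the second line does not depend on anything matched on the first. The cost is that parameters on continuation lines are no longer recognized as part of the same list.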
Newlines are accepted there, and are not that uncommon. E.g.
\framebox[
bordercolor=red, fillcollor=#f0f0f0,
shadow=true, shadowcolor=black
]{Some box}
(I tend to do this myself, for separating parameters of different nature)
-- 2. Comments.
local line_comment = lexer.to_eol('%')
local env_comment = lexer.range(P('\\begin') * optparams * P('{comment}'), P('\\end{comment}'))
Typically Scintillua lexers will allow the trailing delimiter to be optional (P('\\end{comment}')^-1), so that the comment is highlighted correctly as you type. I'd recommend doing that here.
Not changed: I tried, but with the ^-1 the end delimiter was never matched (even when present)?
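One possible explanation, sketched in plain LPeg. The `range` function below is a hypothetical stand-in that assumes a range helper is composed as start, then anything that is not the end, then an optional end; this has not been checked against Scintillua's own `lexer.range`. Under that assumption, passing an end pattern that is already optional makes "not the end" unmatchable, because an optional pattern always succeeds:

```lua
local lpeg = require('lpeg')
local P = lpeg.P

-- Hypothetical stand-in for a range helper (assumed shape, not Scintillua's code):
-- start, then any character that does not begin the end delimiter, then an optional end.
local function range(s, e)
  s, e = P(s), P(e)
  return s * (P(1) - e)^0 * e^-1
end

local open, close = P('\\begin{comment}'), P('\\end{comment}')
local plain_end = range(open, close)       -- end delimiter passed as-is
local optional_end = range(open, close^-1) -- end delimiter already made optional

local text = '\\begin{comment}hidden\\end{comment}'
print(lpeg.match(plain_end, text))    -- index just past the closing delimiter
print(lpeg.match(optional_end, text)) -- index just past the opening delimiter only
```

If the real helper behaves like this sketch, the end delimiter is already optional when passed as-is, and an unterminated comment still matches to the end of the text.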
lexers/sil.lua
Outdated
-- We need alt names for multiple embeddings and rules
local base_rule_id = name .. lang
local embedder = lexer.load(lang, base_rule_id .. '_cmd')
lex:embed(
This is clever!
😊
Not that clever actually, and there's an edge case for commands that I have no idea how to address.
This would be valid:
\lua{SILE.process({ "text" })}
So the "end rule" for embedding cannot just be a }; it has to be balanced... I haven't found such a case in the sources (most embedded lexers have clear start/end rules that are not expected in the child lexer).
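For reference, this is what "balanced" means in plain LPeg terms: a small recursive grammar, following the balanced-parentheses example from the LPeg manual. It is only a sketch of the matching problem; it does not show how to plug such a rule into Scintillua's embedding, where the end rule is a separate pattern.

```lua
local lpeg = require('lpeg')
local P, S, V = lpeg.P, lpeg.S, lpeg.V

-- A brace group: '{', then any mix of non-brace characters and nested groups, then '}'.
local balanced = P{ '{' * ((P(1) - S('{}')) + V(1))^0 * '}' }

local s = '{SILE.process({ "text" })} trailing'
local stop = lpeg.match(balanced, s)
print(s:sub(1, stop - 1)) -- the outer group, including the nested table constructor
```

The grammar stops at the brace that closes the outer group rather than at the first } it sees, which is the behavior the end rule would need.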
Sorry for the delayed response.
I don't think there is an easy way to address this. The only thing I can think of is when encountering a '{' or '}', increase or decrease a count variable, and store that in a persistent lexer line state. However, I'm not sure that would work for the one-line example you gave.
I think you'll have to accept the limitation, or not perform this kind of highlighting.
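A rough sketch of the counting idea in plain Lua, with no Scintillua API involved: `scan_line` is a hypothetical helper, and the `states` table merely stands in for whatever per-line state the lexer would persist.

```lua
-- Hypothetical helper: scans one line starting from the brace depth inherited
-- from the previous line. Returns the new depth and, if the depth returned to
-- zero on this line, the column of the closing brace.
local function scan_line(line, depth)
  for i = 1, #line do
    local ch = line:sub(i, i)
    if ch == '{' then
      depth = depth + 1
    elseif ch == '}' then
      depth = depth - 1
      if depth == 0 then return depth, i end
    end
  end
  return depth, nil
end

local lines = { '\\lua{SILE.process({', '  "text"', '})}' }
local states = {} -- depth at the end of each line, i.e. the state to persist
local depth, close_col = 0, nil
for n, line in ipairs(lines) do
  depth, close_col = scan_line(line, depth)
  states[n] = depth
end
print(states[1], states[2], states[3], close_col)
```

Because each line only needs the depth stored for the line before it, a lex that restarts mid-file can resume from that stored value. Within a single line, however, the depth still has to be tracked inside the matching itself, which is the case the comment above is unsure about.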
> Sorry for the delayed response. The only thing I can think of is when encountering a '{' or '}', increase or decrease a count variable (...)
Thanks for your feedback.
I've added, in a fixup, some lpeg-heavy machinery to keep track of the brace pairing level and exit the child lexer upon the first unpaired closing brace. The "state" is kept in the parsing patterns via a table. [1]
It does work well, as far as I can tell, for syntax highlighting a whole file:
(with ugly colors here to better distinguish the items 😸 )
But I don't know how an editor (SciTE, I suppose) behaves in those cases, as the user types characters.
Footnotes
[1] Logic somewhat inspired by / derived from the way the lunamark library keeps track of nested divs in Markdown, with a similar technique.
N.B. Of course, the pairing and child lexer hacking is fairly naive: there would probably still be an issue with unpaired braces that are part of a comment, an escape sequence, or other acceptable syntax (e.g. a string) in the embedded child language.
But I think this is such an edge case that it's probably acceptable (and several other editor solutions do not handle these cases very gracefully either, anyway...)
I've played around with your lexer, and I know you've put a lot of work into this, but it's too brittle. It'll get the parse right when opening a file, but once you start typing into it, it all falls apart.
For example, I've opened this file:
\begin[papersize=a5]{document}
\nofolios
\neverindent
\use[module=packages.math]
\chapter{Some SIL code}
\section[numbered=false]{Some SIL code}
\begin[rule="0.4pt"]{epigraph}
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
\comment{A comment}
\ftl{Some text}
\xml{<tag>Some XML code</tag>}
\em{Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.}
\caption{John \font[variant=+smcp]{Doe}.}
\end{epigraph}
\math[mode=diplay]{
\int_{0)^{1} x^2 dx = \frac{1}{3}
}
\lua{SILE.typesetter:typeset("Hello, world!")}
\font[family=DejaVu Serif, size=12pt]{Hello, world!}
\font[family=DejaVu Serif, size=12pt]
\begin{lua}
SILE.typesetter:typeset("Hello, world!")
\end{lua}
\end{document}
I put my caret on the \xml{<tag>Some XML code</tag>} line after the first { and hit enter. My editor will start lexing the line starting with <tag> (the previous line is \xml{). I get an error:
/path/to/lexers/lexer.lua:1255: back reference 'brace_level' not found
If I undo the change, now pressing enter elsewhere in the file gives me the same error; editing becomes impossible. In fact, pressing enter anywhere in the file after immediately opening it gives me the error.
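The error message itself comes from LPeg: a back capture raises it when no group capture with that name was made earlier in the same match. The sketch below reproduces the message in isolation. It is not the code from this PR; it only assumes, based on the error text, that the brace level is stored in a named group capture called brace_level.

```lua
local lpeg = require('lpeg')
local P, Cg, Cb, Cc = lpeg.P, lpeg.Cg, lpeg.Cb, lpeg.Cc

-- State is recorded as a named group capture when the opening brace is seen...
local open = P('{') * Cg(Cc(1), 'brace_level')
-- ...and read back with a back capture when the closing brace is seen.
local close = Cb('brace_level') * P('}')
local body = (P(1) - '}')^0

-- Matching from the opening brace: the named group exists, so the lookup works.
local ok_whole, level = pcall(lpeg.match, open * body * close, '{x}')
print(ok_whole, level)

-- Matching from the middle (as when an editor restarts lexing on a later line):
-- the opening rule never ran, so the back capture has nothing to refer to.
local ok_mid, err = pcall(lpeg.match, body * close, 'x}')
print(ok_mid, err)
```

This matches the symptom described above: state that lives only inside one match is lost as soon as lexing starts anywhere other than where that state was set up.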
I also tried a simpler case:
\framebox[bordercolor=red, fillcollor=#f0f0f0, shadow=true, shadowcolor=black]{Some box}
This highlights great, but now I want to split it into multiple lines between the [ ]. After I press enter after the first [ followed by Tab (for indentation), highlighting immediately breaks, because lexing starts at bordercolor, which loses the context from what is now the previous line:
In my experience, lexers need to be written as context-free as possible in order to highlight reliably when editing. For example, the HTML lexer is aware that tag attributes may span multiple lines, so it only highlights an attribute if an = immediately follows it:
Lines 47 to 48 in 6665a6a
local attribute_eq = (known_attribute + unknown_attribute) * ws^-1 * equals
lex:add_rule('attribute', attribute_eq)
The rule keys on the =, which indicates an attribute value. Yes, it's not perfect, but it gets it right a vast majority of the time and it's robust for editing.
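Applied to SIL parameters, the same idea might look like the following plain-LPeg sketch. The `attribute` pattern and the identifier rule are hypothetical and simplified, and a lookahead is used here instead of consuming the = as the HTML rule does:

```lua
local lpeg = require('lpeg')
local R, S, C = lpeg.R, lpeg.S, lpeg.C

local hws = S(' \t')
local identifier = (R('az', 'AZ') + '_') * (R('az', 'AZ', '09') + S('_-'))^0

-- An identifier counts as a parameter name only if an '=' follows it,
-- regardless of whether an opening '[' was seen on this or an earlier line.
local attribute = C(identifier) * #(hws^0 * '=')

print(lpeg.match(attribute, 'bordercolor=red,')) -- parameter name at the start of a continuation line
print(lpeg.match(attribute, 'shadow = true'))    -- padding around '=' is tolerated
print(lpeg.match(attribute, 'Some box'))         -- ordinary text: no match
```

The rule needs no knowledge of the enclosing brackets, so it gives the same answer whether lexing starts at the top of the file or on the line being edited. It will also highlight word= in running text, which is the "not perfect" part of the trade-off.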
What you have here is an impressive one-time lexer. You've done some clever things I didn't realize could be viable. However, Scintillua's lexers are designed to lex files being edited, so we need to keep that in mind.
I'm not sure what to suggest moving forward. Maybe just take a step back and see what you can do to make it a bit more context-free. I would also encourage you to try editing code with your lexer active. You said you didn't know how an editor like SciTE behaves as the user types. Experiment and see :) You can iron out bugs and issues as they arise.
bb70705 to 0c950a0
Thanks a lot for all the feedback and hints, @orbitalquark
I don't know either. (I experimented with some other ways toward a more context-free approach tonight, but without success so far, and my free time is running short, riding two horses with one behind...) Well... it somewhat works for highlighting a "complete file", so for now:
I think you have a great attitude that will carry you to success. There's no hurry, so take your time :) Cheers.
Greetings,
This PR proposes a Scintillua lexer for the "SIL in TeX-like flavor" syntax used by the SILE typesetting system.
It's my first attempt at a (theoretically simple) lexer, I hope I did it correctly ;)
Regards.