Description
In my experimental environment, I found that converting JSON to g4 using only parser rules causes syntax errors. These parsing errors can cause a large amount of mutated data to be lost.
I made a minimal test case, lex.json:
{
"<A>": [["<NUMBER>", "<STRING>", "\n"]],
"<NUMBER>": [["10"], ["99"]],
"<STRING>": [["(", "<HEXSTRING>", ")"]],
"<HEXSTRING>": [["<CHAR>", "<HEXSTRING>"], []],
"<CHAR>": [
["0"], ["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"],
["8"], ["9"], ["a"], ["b"], ["c"], ["d"], ["e"], ["f"]
]
}
Building Grammar-Mutator (make) with it generates the following Grammar.g4:
grammar Grammar;
entry
: node_A EOF
;
node_A
: node_NUMBER node_STRING '\n'
;
node_NUMBER
: '10'
| '99'
;
node_STRING
: '(' node_HEXSTRING ')'
;
node_HEXSTRING
:
| node_CHAR node_HEXSTRING
;
node_CHAR
: '0'
| '1'
| '2'
| '3'
| '4'
| '5'
| '6'
| '7'
| '8'
| '9'
| 'a'
| 'b'
| 'c'
| 'd'
| 'e'
| 'f'
;
We prepared the input seeds seed1 / seed2 and tested them with antlr4-parse:
Why is 10(10) parsed incorrectly? Because ANTLR4 works in two stages: lexer and parser. During the lexer stage, every 10 is recognized as the token from node_NUMBER, including the one inside the parentheses. In the parser stage the input therefore looks like node_NUMBER (node_NUMBER) rather than node_NUMBER node_STRING, so an error occurs.
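The lexer behavior described above can be illustrated with a small sketch. This is not ANTLR4 itself, only a simplified longest-match tokenizer written for illustration; the `LITERALS` list and the `tokenize` helper are hypothetical names, and the literals are copied from the quoted strings in the generated Grammar.g4.

```python
# Literal tokens taken from the quoted strings in the generated Grammar.g4.
# This is a simplified model of a lexer, not ANTLR4's implementation.
LITERALS = ["10", "99", "(", ")", "\n"] + list("0123456789abcdef")


def tokenize(text, literals=LITERALS):
    """Split text using a longest-match rule, with no parser context."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Collect every literal that matches at the current position.
        candidates = [lit for lit in literals if text.startswith(lit, pos)]
        if not candidates:
            raise ValueError(f"no token matches at offset {pos}: {text[pos]!r}")
        # The longest match wins, regardless of what the parser expects here.
        best = max(candidates, key=len)
        tokens.append(best)
        pos += len(best)
    return tokens


if __name__ == "__main__":
    # The "10" inside the parentheses becomes one token, not "1" then "0".
    print(tokenize("10(10)\n"))
    # A hex string without "10" or "99" in it splits into single characters.
    print(tokenize("10(1f)\n"))
```

In this model the second `10` comes out as a single token, which is why a parser expecting a sequence of node_CHAR tokens inside the parentheses would reject the input.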
In an ANTLR4 grammar, lexer rule names begin with an uppercase letter and parser rule names begin with a lowercase letter, so we should tell ANTLR4 explicitly which rules are lexical. The patched Grammar_patch.g4:
grammar Grammar_patch;
entry
: node_A EOF
;
node_A
: node_NUMBER Node_STRING '\n'
;
node_NUMBER
: '10'
| '99'
;
Node_STRING
: '(' Node_HEXSTRING ')'
;
Node_HEXSTRING
:
| Node_CHAR Node_HEXSTRING
;
Node_CHAR
: '0'
| '1'
| '2'
| '3'
| '4'
| '5'
| '6'
| '7'
| '8'
| '9'
| 'a'
| 'b'
| 'c'
| 'd'
| 'e'
| 'f'
;
Testing again:
The warning tells us that a lexer rule can match the empty string, which may cause ANTLR4 parsing backtracking issues, but we can easily fix that by marking the rule as a fragment:
fragment Node_HEXSTRING
Maybe we can optimize the JSON-to-g4 generation code to distinguish between lexer and parser rules?
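As one possible first step toward that, the generator could detect grammars in which one literal is a prefix of another, since those are the cases where the lexer may pick a different token than the parser expects. The sketch below is only an assumption about how such a check might look, not Grammar-Mutator code; `is_nonterminal`, `collect_terminals`, and `overlapping_literals` are hypothetical helpers, and nonterminals are assumed to follow the `<NAME>` convention used in lex.json.

```python
import json

# The minimal test case from this issue (lex.json), embedded for illustration.
LEX_JSON = r'''
{
  "<A>": [["<NUMBER>", "<STRING>", "\n"]],
  "<NUMBER>": [["10"], ["99"]],
  "<STRING>": [["(", "<HEXSTRING>", ")"]],
  "<HEXSTRING>": [["<CHAR>", "<HEXSTRING>"], []],
  "<CHAR>": [
    ["0"], ["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"],
    ["8"], ["9"], ["a"], ["b"], ["c"], ["d"], ["e"], ["f"]
  ]
}
'''


def is_nonterminal(symbol):
    """Assumption: nonterminals are written as <NAME>, as in lex.json."""
    return len(symbol) > 2 and symbol.startswith("<") and symbol.endswith(">")


def collect_terminals(grammar):
    """Gather every literal string that appears in any alternative."""
    terminals = set()
    for alternatives in grammar.values():
        for alternative in alternatives:
            for symbol in alternative:
                if not is_nonterminal(symbol):
                    terminals.add(symbol)
    return terminals


def overlapping_literals(grammar):
    """Return (shorter, longer) pairs where one literal is a prefix of another."""
    terminals = sorted(collect_terminals(grammar))
    return [
        (short, long)
        for short in terminals
        for long in terminals
        if short != long and long.startswith(short)
    ]


if __name__ == "__main__":
    grammar = json.loads(LEX_JSON)
    for short, long in overlapping_literals(grammar):
        print(f"literal {short!r} is a prefix of {long!r}")
```

For lex.json this flags the pairs `1`/`10` and `9`/`99`, which are the literals involved in the 10(10) failure. A generator could use such a report to decide which rules to emit as lexer rules, or simply to warn the user.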