This is related to an earlier discussion last year, here: #3321
Since then I've done more research and started to define the grammar informally. It is closely based on the PL/I grammar (Subset G, ANSI X3.74-1987; the standard is a very high quality document but sadly isn't freely available, I have a printed copy). That last discussion left me a bit confused, because others raised issues as they explored Rexx, so I never quite got clarity on whether ANTLR can parse this or not.
Here's an example of some simple legal PL/I code that PL/I compilers can parse:
/* The language has no reserved words, that means an identifier can be the same as any keyword. */
/* The keywords below are: declare, string, if, then, else, binary, call - all are keywords but can also be used as identifiers */
declare name string(32); // identifier named 'name'
declare declare string(32); // identifier named 'declare'
declare string string(32); // identifier named 'string'
declare if binary(15); // identifier named 'if'
declare then binary(15); // identifier named 'then'
declare else binary(15); // identifier named 'else'
declare binary(15) binary(15); // a 15-element array of binary(15) integers, named 'binary'
if if = declare then
then = else;
else
if = then;
call call(if, then); // call to a procedure named 'call' passing two binary(15) args.
goto goto; // goto a label named 'goto'
return return; // return the expression 'return' - a single identifier
And so on; this gives the idea. If that sample can be parsed, that would likely show that ANTLR can do this.
Question 1 - Lexical Analysis
At the lexical level I'm interested in supporting numeric literals that can contain separator spaces (not simply underscores, as is common in many languages).
For example, the literals would look like this, starting with the trivial cases that are plainly regular:
a = 123;
a = 123.456;
a = 123:h; // Hex
a = 10101.1101:b; // binary
a = 101 110 010:b; // binary
a = DEF ABC:h; // hex
a = F0C5 770A.C4:h; // hex
a = 204 701 334:o; // octal
a = BAF4.7603p+4; // hex float
I've scraped together code (a hand-crafted FSM-based lexer) that can recognize these, but it's a little messy and I'd rather not end up with messy code.
Cases like "DEF ABC:h" present a challenge because that scans into <Identifier> <Identifier> <Colon> <Identifier>, which must eventually be converted to a single <NumericLiteral> with, say, a lexeme of "DEFABC:h".
I have added a "hacked" layer to the tokenizer that accumulates tokens while looking for the literal's pattern. If it sees the pattern, it discards the accumulated tokens and creates and returns the desired <NumericLiteral> token; otherwise it just returns the original token sequence.
If ANTLR has the power to represent this kind of thing, I will likely want to adopt it for the lexical analysis at least.
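To make the question concrete, here is a minimal Python sketch of a different approach from the token-accumulating layer: it recognizes the spaced literals directly at the character level, since each radix form is a regular pattern. The names `tokenize`, `TOKEN` and `_radix_literal` are made up for this illustration, the hex-float form (`BAF4.7603p+4`) and comments are left out, and it assumes exactly one space between digit groups. It says nothing about how ANTLR itself would behave; it only shows that one rule per literal form, tried before the identifier rule, is enough for these examples.

```python
import re

def _radix_literal(digits, suffix):
    # One or more digit groups separated by single spaces, an optional
    # fractional part, then ":" and the radix letter (assumed syntax).
    return rf"{digits}+(?: {digits}+)*(?:\.{digits}+)?:[{suffix}]"

HEX = _radix_literal("[0-9A-Fa-f]", "hH")
BIN = _radix_literal("[01]", "bB")
OCT = _radix_literal("[0-7]", "oO")
DEC = r"\d+(?:\.\d+)?"

# Literal alternatives come first so "DEF ABC:h" is tried as a literal
# before "DEF" can be taken as an identifier.
TOKEN = re.compile(
    rf"(?P<NumericLiteral>{HEX}|{BIN}|{OCT}|{DEC})"
    r"|(?P<Identifier>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<Punct>[=;:(),.+\-*/])"
    r"|(?P<Skip>\s+)"
)

def tokenize(source):
    """Return a list of (kind, lexeme) pairs; whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "NumericLiteral":
            text = text.replace(" ", "")  # separator spaces are not part of the lexeme
        if kind != "Skip":
            tokens.append((kind, text))
        pos = m.end()
    return tokens

print(tokenize("a = DEF ABC:h;"))
```

One caveat with this sketch: an identifier made up only of hex digits and immediately followed by `:h` (for example a label `abc:h`) would also be taken as a literal, so a real lexer would need a rule for deciding such cases.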
Question 2 - Grammar
A separate question is whether ANTLR can handle a grammar for a language that has no reserved words. PL/I parsers have handled this by having a parser that:
Asks: can the next statement (the token sequence up to the terminating semicolon) be parsed as a well-formed assignment, i.e. does it match the syntax <reference> = <expression>?
If yes, parses it as an assignment and generates the tree for it. (All the semantic work, name resolution etc., is done in the next phase.)
If no, checks whether the first token (which must be an identifier) is a keyword and, if it is, parses the statement as a keyword statement.
Otherwise this must be a syntax error, and it is reported as such.
I've written such a parser in the past, a hand-crafted recursive descent one, but I have no idea whether such a grammar can be represented in systems like ANTLR.
The parser doesn't see (nor does the lexer generate) "keyword tokens". All keywords are returned as <Identifier> tokens with a bool property stating whether the identifier is also a keyword. As it parses, the parser then treats each of these identifiers either as a true identifier used in expressions and so on, or as an actual keyword; that decision is contextual.
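The "assignment first, keyword second" decision described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the parser I wrote: tokens are plain strings, the statement is passed without its terminating semicolon, `<reference>` is an identifier with an optional parenthesized argument list, and the names `StatementClassifier`, `classify`, `KEYWORDS` and `BINARY_OPS` are all made up for the example.

```python
KEYWORDS = {"declare", "if", "then", "else", "call", "goto", "return"}
BINARY_OPS = {"+", "-", "*", "/", "="}  # "=" doubles as comparison inside expressions

def is_identifier(tok):
    return tok is not None and (tok[0].isalpha() or tok[0] == "_")

class StatementClassifier:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def accept(self, text):
        if self.peek() == text:
            self.pos += 1
            return True
        return False

    def reference(self):
        # <reference> ::= identifier [ "(" expression { "," expression } ")" ]
        if not is_identifier(self.peek()):
            return False
        self.pos += 1
        if self.accept("("):
            if not self.expression():
                return False
            while self.accept(","):
                if not self.expression():
                    return False
            return self.accept(")")
        return True

    def operand(self):
        tok = self.peek()
        if tok is None:
            return False
        if tok[0].isdigit():
            self.pos += 1
            return True
        if self.accept("("):
            return self.expression() and self.accept(")")
        return self.reference()

    def expression(self):
        if not self.operand():
            return False
        while self.peek() in BINARY_OPS:
            self.pos += 1
            if not self.operand():
                return False
        return True

    def is_assignment(self):
        # The whole statement must match <reference> = <expression>.
        self.pos = 0
        return (self.reference() and self.accept("=")
                and self.expression() and self.peek() is None)

def classify(tokens):
    """Classify one statement (tokens up to, not including, the semicolon)."""
    if tokens and StatementClassifier(tokens).is_assignment():
        return "assignment"
    if tokens and tokens[0].lower() in KEYWORDS:
        return "keyword:" + tokens[0].lower()
    raise SyntaxError("neither an assignment nor a keyword statement")

print(classify(["if", "=", "then"]))             # assignment
print(classify(["call", "call", "(", "if", ",", "then", ")"]))  # keyword:call
```

Requiring the assignment to consume every token matters: `if (a) = b then x = 1` starts like an assignment to `if(a)`, but the leftover `then ...` makes the assignment parse fail, so the statement falls through to the keyword branch.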