Recipe: indentation-sensitive languages #246
Really nice recipe. I have some comments about the insights into the implementation. I think we should add links or detailed information on how this token builder/lexer works.
hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md
Thanks for your comments! I will address them as soon as a decision is reached about eclipse-langium/langium#1608 to edit the last section as well
@Lotes Thanks for waiting so long, all comments should be addressed now.
Just some minor things and one big question. Most of my thoughts were already resolved. Thanks.
hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md
    return true
    ```

    the lexer will output the following sequence of tokens: `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `INDENT`, `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`.
`The` with capital T
It was intended as a continuation of the sentence before it, only interrupted by the code snippet. Not sure if it makes sense or if it counts as a separate sentence 🤔
Then I would suggest adding 3 dots at the end of the first phrase and at the beginning of the second phrase.
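For reference, the token sequence quoted in this thread can be reproduced with a small indentation stack. The sketch below is a standalone illustration and not the `IndentationAwareTokenBuilder` implementation: `tokenize` is a hypothetical helper, and the sample input is an assumption reconstructed from the token sequence, since only its last line is visible in the quoted hunk. It also assumes consistent indentation and does not report a dedent to a width that was never opened.

```typescript
// Sketch only: a standalone indentation stack, NOT Langium's token builder.
function tokenize(source: string): string[] {
    const tokens: string[] = [];
    const indentStack: number[] = [0]; // widths of the currently open blocks

    for (const line of source.split('\n')) {
        if (line.trim() === '') continue; // blank lines never change the level

        // Index of the first non-whitespace character = indentation width.
        const width = line.search(/\S/);
        if (width > indentStack[indentStack.length - 1]) {
            indentStack.push(width);
            tokens.push('INDENT');
        }
        while (width < indentStack[indentStack.length - 1]) {
            indentStack.pop();
            tokens.push('DEDENT');
        }

        // Very rough word-level lexing, enough for the sample below.
        for (const word of line.trim().split(/\s+/)) {
            tokens.push(word === 'true' || word === 'false' ? 'BOOLEAN' : word);
        }
    }

    // Close every block that is still open at the end of the input.
    while (indentStack.length > 1) {
        indentStack.pop();
        tokens.push('DEDENT');
    }
    return tokens;
}

// Assumed sample input (reconstructed, not copied from the recipe).
const sample = [
    'if true',
    '    return false',
    'else',
    '    if true',
    '        return true',
].join('\n');

const sampleTokens = tokenize(sample);
console.log(sampleTokens.join(', '));
```

For the assumed input, this prints the same 15 tokens as the sentence under review, including the two trailing `DEDENT` tokens emitted at the end of the input.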
hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md
    @@ -0,0 +1,135 @@
    ---
Question about indentation in the implementation: How do you distinguish between spaces and tabs?
This could be an interesting configuration point to show here, maybe in its own section or as an appendix.
How would this align with an editor config (see https://editorconfig.org) or other approaches?
I guess you choose only spaces or tabs for the WS token, right?
That's a good question!
I thought a lot about the best approach here, and in the end decided not to discriminate between them, which is the simpler way. Alternatives included allowing only one or the other through a config parameter, or treating a tab as `n` spaces (again, for a configurable `n`). I thought these 2 alternatives were a bit too strict (though that's how Python behaves, for example, by prohibiting mixing them), and I thought that ideally I could issue a warning, but I couldn't find a way to accept a token and still issue a warning/error.
Actually, now that I think about it, I could add some payload to the returned token and then in the lexer check for the payload and add to the errors array, but then there would still be no way of making it a warning rather than an error. Perhaps `LexerResult` should be augmented to allow warnings?
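To make the alternatives in this thread concrete, here is a minimal sketch of the "treat a tab as `n` spaces" option, combined with a flag that a caller could attach to a token as payload and later turn into a diagnostic. Everything in it is hypothetical: `measureIndent`, `IndentInfo`, and `tabSize` are illustrative names and are not Langium options or APIs.

```typescript
// Hypothetical helper for illustration; not part of Langium.
interface IndentInfo {
    width: number;  // indentation width, counting each tab as `tabSize` columns
    mixed: boolean; // true when the leading whitespace mixes tabs and spaces
}

function measureIndent(line: string, tabSize: number): IndentInfo {
    let width = 0;
    let sawTab = false;
    let sawSpace = false;

    // Walk the leading whitespace only; stop at the first other character.
    for (let i = 0; i < line.length; i++) {
        const ch = line.charAt(i);
        if (ch === '\t') {
            width += tabSize;
            sawTab = true;
        } else if (ch === ' ') {
            width += 1;
            sawSpace = true;
        } else {
            break;
        }
    }
    return { width, mixed: sawTab && sawSpace };
}

const tabsOnly = measureIndent('\t\treturn true', 4);   // two tabs
const mixedIndent = measureIndent('  \treturn true', 4); // two spaces, one tab
console.log(tabsOnly, mixedIndent);
```

With this shape, the stricter "only one kind" policy would reject any line where `mixed` is true, while the lenient policy currently chosen would ignore the flag (or surface it as a warning, if the lexer result ever supports that).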
I think resolution of my question would block this recipe. I mean we can still change something afterwards. Extending the `LexerResult` sounds like too much for this change.
Another question could be: How to write an indentation-aware formatter? Is it even applicable or doable? How is it done for Python?
We do not have to answer this now. I was just interested in some consequences or follow-up tasks.
I do not yet have experience writing formatters in Langium, but I don't see why it would be difficult to do. Generally, there are 2 approaches: formatting and pretty printing. One way to implement a formatter is to search for some (anti-)patterns in the code and issue `TextEdit`s just for them. Pretty printers normally use the AST/CST (or some other intermediate representation) and transform it back into code, regardless of how it looked before parsing (or at least that's how I understand the difference between them).
Both approaches seem possible with the indentation-aware tokens, though the second one (pretty printer) is probably easier to implement, assuming we want the formatter to ensure consistent indentation characters and sizes.
For Python, one of the most popular formatters is Black, and it uses the pretty printing approach. Not sure how other formatters handle inconsistent indentation, tbh.
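As a rough illustration of the pretty-printing approach, the sketch below re-emits a tiny tree with one consistent indentation unit, regardless of how the source was originally indented. The `Stmt` and `Block` types and the `prettyPrint` function are hypothetical stand-ins for a real AST; they are not generated by Langium and do not use its formatter API.

```typescript
// Hypothetical AST shape for illustration; not produced by Langium.
interface Block {
    header: string;   // e.g. "if true" or "else"
    children: Stmt[]; // statements nested one level deeper
}
type Stmt = Block | string; // a plain string is a single-line statement

function prettyPrint(nodes: Stmt[], indentUnit: string, depth: number = 0): string[] {
    const lines: string[] = [];
    // Build the prefix by joining `depth + 1` empty slots with the unit.
    const prefix = new Array(depth + 1).join(indentUnit);
    for (const node of nodes) {
        if (typeof node === 'string') {
            lines.push(prefix + node);
        } else {
            lines.push(prefix + node.header);
            // Children are always printed exactly one level deeper.
            for (const childLine of prettyPrint(node.children, indentUnit, depth + 1)) {
                lines.push(childLine);
            }
        }
    }
    return lines;
}

const program: Stmt[] = [
    { header: 'if true', children: ['return false'] },
    { header: 'else', children: [{ header: 'if true', children: ['return true'] }] },
];

const printed = prettyPrint(program, '    ').join('\n');
console.log(printed);
```

Because the output depends only on the tree and the chosen `indentUnit`, mixed tabs and spaces in the input disappear as a side effect, which is why this approach looks easier than issuing targeted edits.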
I have nothing more to add right now. Let's wait for a second opinion on your recipe :-) .
@Lotes Langium v3.2 has already been published with the indentation-aware token builder, and questions about its usage have already started coming in (eclipse-langium/langium#1696). Are we waiting for another review or can this recipe be merged?
Thanks for the guide. Looks good!
@aabounegm Do you think it makes sense to replace the links to code (for the classes) with the new TypeDoc links?
Sure! Replaced 👍
This is a guide on how to use the new `IndentationAwareTokenBuilder`, added to Langium in eclipse-langium/langium#1578 (published in `v3.2.0`).