Recipe: indentation-sensitive languages #246
Merged
12 commits
e7f56cc
Add a guide for indentation-sensitive language
aabounegm f002a86
Merge branch 'eclipse-langium:main' into main
aabounegm a293fc4
Add links to the TokenBuilder & Lexer
aabounegm 2395760
Add a short explanation on how the solution works
aabounegm 566f5b3
Remove playground compatibility section
aabounegm 92b72d3
Clarify why `WS` is split into 2 tokens
aabounegm 518844f
Add an example snippet
aabounegm b6cf6e2
Document the `ignoreIndentationDelimiters` option
aabounegm d51e2e0
Remove extranneous "is"
aabounegm 88817ea
Merge branch 'eclipse-langium:main' into main
aabounegm 3e18c93
Minor changes
msujew e34967d
Replace links to source with links to TypeDoc
aabounegm
116 changes: 116 additions & 0 deletions
hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
--- | ||
title: Indentation-sensitive languages | ||
weight: 300 | ||
--- | ||
|
||
Some programming languages (such as Python, Haskell, and YAML) use indentation to denote nesting, as opposed to special non-whitespace tokens (such as `{` and `}` in C++/JavaScript). | ||
This can be difficult to express in the EBNF notation that Langium uses to define a language grammar, because that notation is context-free. | |
To work around this, you can use synthetic tokens in the grammar and then redefine them using Chevrotain in a custom token builder. | |
|
||
Starting with Langium v3.2, such a token builder (and an accompanying lexer) is provided for easy plugging into your language. | |
|
||
|
||
## Configuring the token builder and lexer | ||
|
||
To be able to use the indentation tokens in your grammar, you first have to import and register the `IndentationAwareTokenBuilder` and `IndentationAwareLexer` services in your module as follows: | |
|
||
|
||
```ts | ||
import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium'; | ||
|
||
// ... | ||
export const HelloWorldModule: Module<HelloWorldServices, PartialLangiumServices & HelloWorldAddedServices> = { | ||
// ... | ||
parser: { | ||
TokenBuilder: () => new IndentationAwareTokenBuilder(), | ||
Lexer: (services) => new IndentationAwareLexer(services), | ||
}, | ||
}; | ||
// ... | ||
``` | ||
|
||
The `IndentationAwareTokenBuilder` constructor optionally accepts an object defining the names of the tokens you used to denote indentation and whitespace in your `.langium` grammar file. It defaults to: | ||
```ts | ||
{ | ||
indentTokenName: 'INDENT', | ||
dedentTokenName: 'DEDENT', | ||
whitespaceTokenName: 'WS', | ||
} | ||
``` | ||
|
||
## Writing the grammar | ||
|
||
In your `.langium` file, you have to define terminals with the same names you passed to `IndentationAwareTokenBuilder` (or the defaults shown above if you did not override them). | |
For example, let's define the grammar for a simple version of Python with support for only `if` and `return` statements, and only booleans as expressions: | ||
|
||
```langium | ||
grammar PythonIf | ||
|
||
entry Statement: If | Return; | ||
|
||
If: | ||
'if' condition=BOOLEAN ':' | ||
INDENT thenBlock+=Statement+ | ||
DEDENT | ||
('else' ':' | ||
INDENT elseBlock+=Statement+ | ||
DEDENT)?; | ||
|
||
Return: 'return' value=BOOLEAN; | ||
|
||
terminal BOOLEAN returns boolean: /true|false/; | ||
terminal INDENT: 'synthetic:indent'; | ||
terminal DEDENT: 'synthetic:dedent'; | ||
hidden terminal WS: /[\t ]+/; | ||
hidden terminal NL: /[\r\n]+/; | ||
|
||
``` | ||
|
||
The important terminals here are `INDENT`, `DEDENT`, and `WS`. | ||
`INDENT` and `DEDENT` are used to delimit a nested block, similar to `{` and `}` (respectively) in C-like languages. | ||
Note that `INDENT` indicates an **increase** in indentation, not just the existence of leading whitespace, which is why in the example above we used it only at the beginning of the block, not before every `Statement`. | ||
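
To illustrate the idea that `INDENT` and `DEDENT` mark *changes* in indentation, here is a minimal standalone sketch that compares each line's leading whitespace against a stack of open indentation widths. It is not Langium's implementation: the `Token` type and the `tokenizeIndentation` function are hypothetical names invented for this example, and the sketch assumes that tabs and spaces each count as one column.

```ts
// Hypothetical token shape for this sketch; not Langium's or Chevrotain's token type.
type Token = { kind: 'INDENT' | 'DEDENT' | 'LINE'; text?: string };

/**
 * Emits INDENT/DEDENT markers by comparing each line's leading whitespace
 * with a stack of the indentation widths that are currently open.
 */
function tokenizeIndentation(source: string): Token[] {
    const tokens: Token[] = [];
    const stack: number[] = [0]; // indentation widths of the enclosing blocks
    const top = (): number => stack[stack.length - 1] ?? 0;

    for (const line of source.split(/\r?\n/)) {
        if (line.trim() === '') continue; // blank lines never change nesting

        // Index of the first non-whitespace character = width of the leading whitespace.
        // Assumption: a tab and a space each count as one column.
        const width = line.search(/\S/);

        if (width > top()) {
            // Deeper than the current block: exactly one INDENT.
            stack.push(width);
            tokens.push({ kind: 'INDENT' });
        }
        while (width < top()) {
            // Shallower: one DEDENT per block that is closed.
            // A dedent to a width that was never opened is not reported in this sketch.
            stack.pop();
            tokens.push({ kind: 'DEDENT' });
        }
        tokens.push({ kind: 'LINE', text: line.trim() });
    }

    // Close every block that is still open at the end of the input.
    while (stack.length > 1) {
        stack.pop();
        tokens.push({ kind: 'DEDENT' });
    }
    return tokens;
}
```

For a nested `if`/`else` input, the sketch emits a single `INDENT` when a block opens, no token between statements at the same level, and one `DEDENT` per block that closes, including those still open at the end of the input.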
|
||
The content you choose for these 3 terminals doesn't matter since it will be overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground. | |
|
||
|
||
### Playground compatibility | ||
|
||
|
||
Since the Langium playground doesn't support overriding the default services, you cannot use indentation-aware grammar there. | ||
However, you can get around this by defining the indentation terminals in a way that doesn't overlap with other terminals, and then actually using them to simulate indentation. | ||
|
||
For example, for the grammar above, you can write: | ||
``` | ||
if false: | ||
synthetic:indent return true | ||
synthetic:dedent | ||
else: | ||
synthetic:indent if false: | ||
synthetic:indent return false | ||
synthetic:dedent synthetic:dedent | ||
``` | ||
|
||
instead of: | ||
``` | ||
if false: | ||
return true | ||
else: | ||
if false: | ||
return false | ||
``` | ||
|
||
since all whitespace will be ignored anyway. | ||
|
||
While this approach doesn't easily scale, it can be useful for testing when defining your grammar. | ||
|
||
## Drawbacks | ||
|
||
Using this token builder, all leading whitespace becomes significant, no matter the context. | ||
This means that it will no longer be possible for an expression to span multiple lines if one of these lines starts with whitespace and an `INDENT` token is not explicitly allowed in that position. | ||
|
||
For example, the following Python code wouldn't parse: | ||
```python | ||
x = [ | ||
1, # ERROR: Unexpected INDENT token | ||
] | ||
``` | ||
without explicitly specifying that `INDENT` is allowed after `[`. | ||
|
||
This can be worked around by using [multi-mode lexing](https://github.com/eclipse-langium/langium-website/pull/132). | ||
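
The underlying idea can be sketched without Langium: the lexer switches between a mode in which leading whitespace is significant and a mode in which it is ignored (here, while a `[` is still unclosed). This is only a standalone illustration of that idea, not Langium's or Chevrotain's multi-mode API. The names `lineKinds` and `ignoreInsideBrackets` are hypothetical, and brackets inside strings or comments are not handled.

```ts
type Kind = 'INDENT' | 'DEDENT' | 'LINE';

/**
 * Classifies each non-blank line and the indentation changes before it.
 * With `ignoreInsideBrackets` set, indentation is not compared while a '[' is open.
 */
function lineKinds(source: string, ignoreInsideBrackets: boolean): Kind[] {
    const kinds: Kind[] = [];
    const stack: number[] = [0]; // indentation widths of the enclosing blocks
    const top = (): number => stack[stack.length - 1] ?? 0;
    let depth = 0; // number of currently unclosed '[' characters

    for (const line of source.split(/\r?\n/)) {
        if (line.trim() === '') continue;

        // "Block" mode: leading whitespace only matters when no bracket is open.
        if (!ignoreInsideBrackets || depth === 0) {
            const width = line.search(/\S/);
            if (width > top()) {
                stack.push(width);
                kinds.push('INDENT');
            }
            while (width < top()) {
                stack.pop();
                kinds.push('DEDENT');
            }
        }
        kinds.push('LINE');

        // Update the bracket depth from the content of this line.
        for (const ch of line) {
            if (ch === '[') depth++;
            else if (ch === ']' && depth > 0) depth--;
        }
    }

    // Close every block that is still open at the end of the input.
    while (stack.length > 1) {
        stack.pop();
        kinds.push('DEDENT');
    }
    return kinds;
}

const listSource = 'x = [\n    1,\n]';
console.log(lineKinds(listSource, false).join(' ')); // LINE INDENT LINE DEDENT LINE
console.log(lineKinds(listSource, true).join(' '));  // LINE LINE LINE
```

With the flag disabled, the list literal produces the unexpected `INDENT` described above; with it enabled, the continuation lines are treated as ordinary lines.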
|
||
<!-- TODO: change link from PR to webpage after it's published. --> |
Question about indentation in the implementation: How do you distinguish between spaces and tabs?
This could be an interesting point of the configuration to show here. Maybe in its own section or as an appendix.
How to align this with an editor-config, see https://editorconfig.org or other approaches?
I guess you choose only spaces or tabs for the WS token, right?

That's a good question!
I thought a lot about the best approach here, and in the end decided not to discriminate between them, which is the simpler way. Alternatives included allowing only one or the other through a config parameter, or treating a tab as `n` spaces (again, for a configurable `n`). I thought these 2 alternatives were a bit too strict (though that's how Python behaves, for example, by prohibiting mixing them), and I thought that ideally I could issue a warning, but I couldn't find a way to accept a token and still issue a warning/error.

Actually, now that I think about it, I could add some payload to the returned token and then in the lexer check for the payload and add to the errors array, but then there would still be no way of making it a warning rather than an error. Perhaps `LexerResult` should be augmented to allow warnings?

I think resolution of my question would not block this recipe. I mean we can still change something afterwards. Extending the `LexerResult` sounds too much for this change.
Another question could be: How to write an indentation-aware formatter? Is it even applicable or doable? How is it done for Python?
We do not have to answer this now. I was just interested in some consequences or follow-up tasks.

I do not yet have experience writing formatters in Langium, but I don't see why it would be difficult to do. Generally, there are 2 approaches: formatting and pretty printing. One way to implement a formatter is to search for some (anti-)patterns in the code and issue `TextEdit`s just for them. Pretty printers normally use the AST/CST (or some other intermediate representation) and transform them back into code, regardless of how it initially looked before parsing (or at least that's how I understand the difference between them).
Both approaches seem possible with the indentation-aware tokens, though the second one (pretty printer) is probably easier to implement, assuming we want the formatter to ensure consistent indentation characters and sizes.
For Python, one of the most popular formatters is black, and it uses the pretty printing approach. Not sure how other formatters handle inconsistent indentation, tbh.