From e7f56cc2a77c5abeeee499a2d38ba0a3f7b3d9c5 Mon Sep 17 00:00:00 2001 From: Abdelrahman Aly Abounegm Date: Thu, 18 Jul 2024 15:16:21 +0300 Subject: [PATCH 01/10] Add a guide for indentation-sensitive language --- .../lexing/indentation-sensitive-languages.md | 116 ++++++++++++++++++ 1 file changed, 116 insertions(+) create mode 100644 hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md new file mode 100644 index 00000000..5e7fac1e --- /dev/null +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -0,0 +1,116 @@ +--- +title: Indentation-sensitive languages +weight: 300 +--- + +Some programming languages (such as Python, Haskell, and YAML) use indentation to denote nesting, as opposed to special non-whitespace tokens (such as `{` and `}` in C++/JavaScript). +This can be difficult to express in the EBNF notation used for defining a language grammar in Langium, which is context-free. +To achieve that, you can make use of synthetic tokens in the grammar which you would then redefine using Chevrotain in a custom token builder. + +Starting with Langium v3.2, such token builder (and an accompanying lexer) are provided for easy plugging into your language. + +## Configuring the token builder and lexer + +To be able to use the indendation tokens in your grammar, you first have to import and register the `IndentationAwareTokenBuilder` and `IndentationAwareLexer` services in your module as such: + +```ts +import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium'; + +// ... +export const HelloWorldModule: Module = { + // ... + parser: { + TokenBuilder: () => new IndentationAwareTokenBuilder(), + Lexer: (services) => new IndentationAwareLexer(services), + }, +}; +// ... 
+``` + +The `IndentationAwareTokenBuilder` constructor optionally accepts an object defining the names of the tokens you used to denote indentation and whitespace in your `.langium` grammar file. It defaults to: +```ts +{ + indentTokenName: 'INDENT', + dedentTokenName: 'DEDENT', + whitespaceTokenName: 'WS', +} +``` + +## Writing the grammar + +In your langium file, you have to define terminals with the same names you passed to `IndentationAwareTokenBuilder` (or the defaults shown above if you did not override them). +For example, let's define the grammar for a simple version of Python with support for only `if` and `return` statements, and only booleans as expressions: + +```langium +grammar PythonIf + +entry Statement: If | Return; + +If: + 'if' condition=BOOLEAN ':' + INDENT thenBlock+=Statement+ + DEDENT + ('else' ':' + INDENT elseBlock+=Statement+ + DEDENT)?; + +Return: 'return' value=BOOLEAN; + +terminal BOOLEAN returns boolean: /true|false/; +terminal INDENT: 'synthetic:indent'; +terminal DEDENT: 'synthetic:dedent'; +hidden terminal WS: /[\t ]+/; +hidden terminal NL: /[\r\n]+/; +``` + +The important terminals here are `INDENT`, `DEDENT`, and `WS`. +`INDENT` and `DEDENT` are used to delimit a nested block, similar to `{` and `}` (respectively) in C-like languages. +Note that `INDENT` indicates an **increase** in indentation, not just the existence of leading whitespace, which is why in the example above we used it only at the beginning of the block, not before every `Statement`. + +The content you choose for these 3 terminals doesn't matter since it will overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground. + +### Playground compatibility + +Since the Langium playground doesn't support overriding the default services, you cannot use indentation-aware grammar there. 
+However, you can get around this by defining the indentation terminals in a way that doesn't overlap with other terminals, and then actually using them to simulate indentation. + +For example, for the grammar above, you can write: +``` +if false: +synthetic:indent return true +synthetic:dedent +else: +synthetic:indent if false: +synthetic:indent return false +synthetic:dedent synthetic:dedent +``` + +instead of: +``` +if false: + return true +else: + if false: + return false +``` + +since all whitespace will be ignored anyway. + +While this approach doesn't easily scale, it can be useful for testing when defining your grammar. + +## Drawbacks + +Using this token builder, all leading whitespace becomes significant, no matter the context. +This means that it will no longer be possible for an expression to span multiple lines if one of these lines starts with whitespace and an `INDENT` token is not explicitly allowed in that position. + +For example, the following Python code wouldn't parse: +```python +x = [ + 1, # ERROR: Unexpected INDENT token +] +``` +without explicitly specifying that `INDENT` is allowed after `[`. + +This can be worked around by using [multi-mode lexing](https://github.com/eclipse-langium/langium-website/pull/132). 
+ + From a293fc4dd167cc414755e85ee5bbc729088638ae Mon Sep 17 00:00:00 2001 From: Abdelrahman Aly Abounegm Date: Thu, 22 Aug 2024 14:36:11 +0000 Subject: [PATCH 02/10] Add links to the TokenBuilder & Lexer --- .../docs/recipes/lexing/indentation-sensitive-languages.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index 5e7fac1e..40a63e6a 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -11,7 +11,9 @@ Starting with Langium v3.2, such token builder (and an accompanying lexer) are p ## Configuring the token builder and lexer -To be able to use the indendation tokens in your grammar, you first have to import and register the `IndentationAwareTokenBuilder` and `IndentationAwareLexer` services in your module as such: +To be able to use the indendation tokens in your grammar, you first have to import and register the [`IndentationAwareTokenBuilder`](https://github.com/eclipse-langium/langium/blob/bfca81f9e2411dd25a73f6b2711470e2c33788ed/packages/langium/src/parser/indentation-aware.ts#L78) +and [`IndentationAwareLexer`](https://github.com/eclipse-langium/langium/blob/bfca81f9e2411dd25a73f6b2711470e2c33788ed/packages/langium/src/parser/indentation-aware.ts#L358) +services in your module as such: ```ts import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium'; From 2395760134aabe895df799dcb111d7dbeedce3ae Mon Sep 17 00:00:00 2001 From: Abdelrahman Aly Abounegm Date: Thu, 22 Aug 2024 14:44:37 +0000 Subject: [PATCH 03/10] Add a short explanation on how the solution works --- .../docs/recipes/lexing/indentation-sensitive-languages.md | 1 + 1 file changed, 1 insertion(+) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md 
b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index 40a63e6a..c63db112 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -8,6 +8,7 @@ This can be difficult to express in the EBNF notation used for defining a langua To achieve that, you can make use of synthetic tokens in the grammar which you would then redefine using Chevrotain in a custom token builder. Starting with Langium v3.2, such token builder (and an accompanying lexer) are provided for easy plugging into your language. +They work by modifying the underlying Chevrotain token generated for your indentation terminal tokens to use a custom matcher function instead that has access to more context than simple Regular Expressions, allowing it to store state and detect _changes_ in indentation levels. This is why you should provide it with the names of the tokens you used to denote indentation: so it can override the correct tokens for your grammar. ## Configuring the token builder and lexer From 566f5b3e708d75260cd6f437f70a2308c4b5d679 Mon Sep 17 00:00:00 2001 From: Abdelrahman Aly Abounegm Date: Thu, 22 Aug 2024 14:48:39 +0000 Subject: [PATCH 04/10] Remove playground compatibility section --- .../lexing/indentation-sensitive-languages.md | 29 ------------------- 1 file changed, 29 deletions(-) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index c63db112..0e913d70 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -72,35 +72,6 @@ Note that `INDENT` indicates an **increase** in indentation, not just the existe The content you choose for these 3 terminals doesn't matter since it will overridden by `IndentationAwareTokenBuilder` anyway. 
However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground. -### Playground compatibility - -Since the Langium playground doesn't support overriding the default services, you cannot use indentation-aware grammar there. -However, you can get around this by defining the indentation terminals in a way that doesn't overlap with other terminals, and then actually using them to simulate indentation. - -For example, for the grammar above, you can write: -``` -if false: -synthetic:indent return true -synthetic:dedent -else: -synthetic:indent if false: -synthetic:indent return false -synthetic:dedent synthetic:dedent -``` - -instead of: -``` -if false: - return true -else: - if false: - return false -``` - -since all whitespace will be ignored anyway. - -While this approach doesn't easily scale, it can be useful for testing when defining your grammar. - ## Drawbacks Using this token builder, all leading whitespace becomes significant, no matter the context. From 92b72d3461e9cdb6f1327284933998bafa75c4e3 Mon Sep 17 00:00:00 2001 From: Abdelrahman Aly Abounegm Date: Thu, 22 Aug 2024 15:00:22 +0000 Subject: [PATCH 05/10] Clarify why `WS` is split into 2 tokens --- .../docs/recipes/lexing/indentation-sensitive-languages.md | 1 + 1 file changed, 1 insertion(+) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index 0e913d70..754059f4 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -69,6 +69,7 @@ hidden terminal NL: /[\r\n]+/; The important terminals here are `INDENT`, `DEDENT`, and `WS`. `INDENT` and `DEDENT` are used to delimit a nested block, similar to `{` and `}` (respectively) in C-like languages. 
Note that `INDENT` indicates an **increase** in indentation, not just the existence of leading whitespace, which is why in the example above we used it only at the beginning of the block, not before every `Statement`. +Additionally, the separation of `WS` from simply `\s+` to `[\t ]+` and `[\r\n]+` is necessary because a simple `\s+` will match the new line character, as well as any possible indentation after it. To ensure correct behavior, the token builder modifies the pattern of the `whitespaceTokenName` token to be `[\t ]+`, so a separate hidden token for new lines needs to be explicitly defined. The content you choose for these 3 terminals doesn't matter since it will overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground. From 518844f3bcc7ff19db790198cd806e340bb7f63e Mon Sep 17 00:00:00 2001 From: Abdelrahman Aly Abounegm Date: Thu, 22 Aug 2024 15:30:31 +0000 Subject: [PATCH 06/10] Add an example snippet --- .../recipes/lexing/indentation-sensitive-languages.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index 754059f4..a83e307b 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -73,6 +73,17 @@ Additionally, the separation of `WS` from simply `\s+` to `[\t ]+` and `[\r\n]+` The content you choose for these 3 terminals doesn't matter since it will overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground. 
+With the default configuration and the grammar above, for the following code sample: +``` +if true: + return false +else: + if true: + return true +``` + +the lexer will output the following sequence of tokens: `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `INDENT`, `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`. + ## Drawbacks Using this token builder, all leading whitespace becomes significant, no matter the context. From b6cf6e2ca7a2643606136b9b351e16b0e0bb6463 Mon Sep 17 00:00:00 2001 From: Abdelrahman Aly Abounegm Date: Sun, 25 Aug 2024 20:11:03 +0000 Subject: [PATCH 07/10] Document the `ignoreIndentationDelimiters` option --- .../lexing/indentation-sensitive-languages.md | 69 ++++++++++++++----- 1 file changed, 51 insertions(+), 18 deletions(-) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index a83e307b..2e8c119a 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -30,15 +30,65 @@ export const HelloWorldModule: Module = { + parser: { + TokenBuilder: () => new IndentationAwareTokenBuilder({ + ignoreIndentationDelimiters: [ + ['L_BRAC', 'R_BARC'], // <-- This typo will now cause a TypeScript error + ] + }), + Lexer: (services) => new IndentationAwareLexer(services), + }, +}; +``` + ## Writing the grammar In your langium file, you have to define terminals with the same names you passed to `IndentationAwareTokenBuilder` (or the defaults shown above if you did not override them). @@ -83,20 +133,3 @@ else: ``` the lexer will output the following sequence of tokens: `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `else`, `INDENT`, `if`, `BOOLEAN`, `INDENT`, `return`, `BOOLEAN`, `DEDENT`, `DEDENT`. - -## Drawbacks - -Using this token builder, all leading whitespace becomes significant, no matter the context. 
-This means that it will no longer be possible for an expression to span multiple lines if one of these lines starts with whitespace and an `INDENT` token is not explicitly allowed in that position. - -For example, the following Python code wouldn't parse: -```python -x = [ - 1, # ERROR: Unexpected INDENT token -] -``` -without explicitly specifying that `INDENT` is allowed after `[`. - -This can be worked around by using [multi-mode lexing](https://github.com/eclipse-langium/langium-website/pull/132). - - From d51e2e0d98c478aaf78366755e95cdf727eba866 Mon Sep 17 00:00:00 2001 From: Abdelrahman Aly Abounegm Date: Thu, 29 Aug 2024 14:44:40 +0000 Subject: [PATCH 08/10] Remove extranneous "is" --- .../docs/recipes/lexing/indentation-sensitive-languages.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index 2e8c119a..59927685 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -52,7 +52,7 @@ x = [ any indentation between `[` and `]` should be ignored. To achieve similar behavior with the `IndentationAwareTokenBuilder`, the `ignoreIndentationDelimiters` option can be used. -It accepts is a list of pairs of token names (terminal or keyword) and turns off indentation token detection between each pair. +It accepts a list of pairs of token names (terminal or keyword) and turns off indentation token detection between each pair. 
For example, if you construct the `IndentationAwareTokenBuilder` with the following options: ```ts From 3e18c9350e70cee2dff146daf378d8a1a45a5713 Mon Sep 17 00:00:00 2001 From: Mark Sujew Date: Tue, 8 Oct 2024 13:10:49 +0000 Subject: [PATCH 09/10] Minor changes --- .../lexing/indentation-sensitive-languages.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index 59927685..c98b6252 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -5,10 +5,10 @@ weight: 300 Some programming languages (such as Python, Haskell, and YAML) use indentation to denote nesting, as opposed to special non-whitespace tokens (such as `{` and `}` in C++/JavaScript). This can be difficult to express in the EBNF notation used for defining a language grammar in Langium, which is context-free. -To achieve that, you can make use of synthetic tokens in the grammar which you would then redefine using Chevrotain in a custom token builder. +To achieve that, you can make use of synthetic tokens in the grammar which you would then redefine in a custom token builder. -Starting with Langium v3.2, such token builder (and an accompanying lexer) are provided for easy plugging into your language. -They work by modifying the underlying Chevrotain token generated for your indentation terminal tokens to use a custom matcher function instead that has access to more context than simple Regular Expressions, allowing it to store state and detect _changes_ in indentation levels. This is why you should provide it with the names of the tokens you used to denote indentation: so it can override the correct tokens for your grammar. 
+Starting with Langium 3.2.0, such a token builder (and an accompanying lexer) is provided for easy plugging into your language. +They work by modifying the underlying token type generated for your indentation terminal tokens to use a custom matcher function that has access to more context than a simple regular expression, allowing it to store state and detect _changes_ in indentation levels. ## Configuring the token builder and lexer @@ -19,15 +19,14 @@ services in your module as such: ```ts import { IndentationAwareTokenBuilder, IndentationAwareLexer } from 'langium'; -// ... export const HelloWorldModule: Module = { // ... parser: { TokenBuilder: () => new IndentationAwareTokenBuilder(), Lexer: (services) => new IndentationAwareLexer(services), + // ... }, }; -// ... ``` The `IndentationAwareTokenBuilder` constructor optionally accepts an object defining the names of the tokens you used to denote indentation and whitespace in your `.langium` grammar file, as well as a list of delimiter tokens inside of which indentation should be ignored. It defaults to: @@ -43,18 +42,21 @@ The `IndentationAwareTokenBuilder` constructor optionally accepts an object defi ### Ignoring indentation between specific tokens Sometimes, it is necessary to ignore any indentation token inside some expressions, such as with tuples and lists in Python. For example, in the following statement: -```python + +```py x = [ 1, 2 ] ``` + any indentation between `[` and `]` should be ignored. To achieve similar behavior with the `IndentationAwareTokenBuilder`, the `ignoreIndentationDelimiters` option can be used. It accepts a list of pairs of token names (terminal or keyword) and turns off indentation token detection between each pair. 
For example, if you construct the `IndentationAwareTokenBuilder` with the following options: + ```ts new IndentationAwareTokenBuilder({ ignoreIndentationDelimiters: [ @@ -63,6 +65,7 @@ new IndentationAwareTokenBuilder({ ], }) ``` + then no indentation tokens will be emitted between either of those pairs of tokens. ### Configuration options type safety @@ -124,7 +127,8 @@ Additionally, the separation of `WS` from simply `\s+` to `[\t ]+` and `[\r\n]+` The content you choose for these 3 terminals doesn't matter since it will overridden by `IndentationAwareTokenBuilder` anyway. However, you might still want to choose tokens that don't overlap with other terminals for easier use in the playground. With the default configuration and the grammar above, for the following code sample: -``` + +```py if true: return false else: From e34967d2253e349b3051f8d7d2f5ba34f403a2ac Mon Sep 17 00:00:00 2001 From: Abdelrahman Abounegm Date: Tue, 8 Oct 2024 17:34:37 +0300 Subject: [PATCH 10/10] Replace links to source with links to TypeDoc --- .../docs/recipes/lexing/indentation-sensitive-languages.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md index c98b6252..132f9fd6 100644 --- a/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md +++ b/hugo/content/docs/recipes/lexing/indentation-sensitive-languages.md @@ -12,8 +12,8 @@ They work by modifying the underlying token type generated for your indentation ## Configuring the token builder and lexer -To be able to use the indendation tokens in your grammar, you first have to import and register the [`IndentationAwareTokenBuilder`](https://github.com/eclipse-langium/langium/blob/bfca81f9e2411dd25a73f6b2711470e2c33788ed/packages/langium/src/parser/indentation-aware.ts#L78) -and 
[`IndentationAwareLexer`](https://github.com/eclipse-langium/langium/blob/bfca81f9e2411dd25a73f6b2711470e2c33788ed/packages/langium/src/parser/indentation-aware.ts#L358) +To be able to use the indentation tokens in your grammar, you first have to import and register the [`IndentationAwareTokenBuilder`](https://eclipse-langium.github.io/langium/classes/langium.IndentationAwareTokenBuilder.html) +and [`IndentationAwareLexer`](https://eclipse-langium.github.io/langium/classes/langium.IndentationAwareLexer.html) services in your module as such: ```ts