Generate syntax highlighting files #4113

vanillajonathan · 2023-02-13T21:48:49Z

After you create a parser, you then have to write syntax highlighting files for the language your parser parses. Whenever you change the parser, you have to update these files manually.

It would be good if you could generate such syntax highlight files for:

Ace JavaScript editor.
Lezer for the CodeMirror JavaScript editor.
Monaco, Microsoft's JavaScript editor used by VS Code.
GtkSourceView, the GTK source code editor component used by GNOME Text Editor, GNOME Builder and others.
KDE Syntax Highlighting Engine, used by Kate, KDevelop and others.
highlight.js, syntax highlighting for the web.
Sublime Text, text editor
TextMate, text editor

ericvergnaud · 2023-02-13T21:55:04Z

Maintainer

Antlr lexer grammars don't provide the level of detail required to achieve this goal. All lexer rules are equal, whereas code editors typically expect distinct categories: control keywords (if, else, ...), built-in types (string, int, ...), punctuation pairs (parentheses, brackets, ...), comments, and so forth.
Plus, keyword highlighting is just the tip of the iceberg...

1 reply

vanillajonathan Feb 13, 2023
Author

Alright, then you may close this discussion. Thanks!

kaby76 · 2023-02-14T18:58:16Z

A grammar by itself is insufficient for syntax highlighting! You must have additional information that associates the parts of the parse tree--or the tokens--that you are interested in with some highlighting property such as color. A parser generator takes only a grammar as input. The extra information for syntax highlighting has basically nothing to do with a grammar or a parser generator. So, the correct design would be to use a parser generator to create a tool that takes the parse tree plus additional syntax highlighting tuples as input, then implements syntax highlighting.
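A minimal sketch of that separation at the token level, assuming a lexer that reports token type names. The `Token` type, the `HIGHLIGHT` table, and the token names are hypothetical and only illustrate that the highlighting association lives outside the grammar:

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    type: str  # token type name, as a lexer might report it (hypothetical names)
    text: str  # matched source text


# Hypothetical side table: the extra information a grammar does not carry.
HIGHLIGHT = {
    "IF": "keyword",
    "ELSE": "keyword",
    "INT": "type",
    "STRING_LITERAL": "string",
}


def classify(tokens: list[Token]) -> list[tuple[str, Optional[str]]]:
    """Pair each token's text with a highlight class, or None if unmapped."""
    return [(t.text, HIGHLIGHT.get(t.type)) for t in tokens]


tokens = [Token("IF", "if"), Token("IDENTIFIER", "x"), Token("STRING_LITERAL", '"hi"')]
result = classify(tokens)
```

Unmapped token types such as `IDENTIFIER` come back as `None`, which is where parse-tree context would be needed to distinguish, say, a class name from a variable.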

For example, TextMate tmLanguage files describe a set of tuples of regular expressions and types. The main problem I have with TextMate is that it's really grotesque as a specification:

Tuples are enumerated in JSON, which is a great standard for data transfer, but just about the worst standard possible for writing specs.
It's limited to regular expressions, not a context-free grammar.
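To make the "tuples of regular expressions and types" idea concrete, here is a simplified sketch. It is not the tmLanguage format or TextMate's matching algorithm; the `RULES` table and `highlight` function are hypothetical and only show the general shape of a regex-driven highlighter:

```python
import re

# Hypothetical (pattern, scope) tuples, loosely in the spirit of a TextMate grammar.
RULES = [
    (re.compile(r"//[^\n]*"), "comment"),
    (re.compile(r'"(?:\\.|[^"\\])*"'), "string"),
    (re.compile(r"\b(?:if|else|while|return)\b"), "keyword"),
]


def highlight(source: str) -> list[tuple[int, int, str]]:
    """Scan left to right; at each position the first matching rule wins."""
    spans = []
    pos = 0
    while pos < len(source):
        for pattern, scope in RULES:
            m = pattern.match(source, pos)
            if m:
                spans.append((m.start(), m.end(), scope))
                pos = m.end()
                break
        else:
            pos += 1  # no rule matched here; skip one character
    return spans


spans = highlight('if x // if "a"\n"else"')
```

A flat rule list like this can keep keywords inside comments and strings from being colored, but it has no notion of syntactic context, so it cannot tell a method name from a field name the way a parse tree can.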

Xtext implements syntax highlighting in Eclipse by assuming an association with certain names in the grammar, like "Identifier". What a horrible design! You know what they say about "assume"--"ass" "u" "me".

At one point, I prototyped a Visual Studio Code extension and C# LSP server that would take a trgen-generated parser from an Antlr grammar, and a list of tuples of XPath expressions on the parse tree and types, and perform "semantic" highlighting. I found it useful in "debugging" a grammar visually when observing the syntax highlighting of some input file. You could "see" how the parse worked via color. Since the extension is an LSP implementation (in C#, which compiles and runs anywhere), the LSP server could theoretically be reused for extensions for other text editors. https://github.com/kaby76/uni-vscode

So, for the grammars-v4/java/java grammar, there would be additional information needed (ha ha, as a JSON file!!!):

[{
 "Suffix":".java",
 "ParserLocation":"c:/Users/kenne/Documents/GitHub/i2248/java/java/Generated/bin/Debug/net5.0/Test.dll",
 "ClassesAndClassifiers":[
    {"Item1":"class","Item2":"//classDeclaration/IDENTIFIER"},
    {"Item1":"property","Item2":"//fieldDeclaration/variableDeclarators/variableDeclarator/variableDeclaratorId/IDENTIFIER"},
    {"Item1":"variable","Item2":"//variableDeclarator/variableDeclaratorId/IDENTIFIER"},
    {"Item1":"method","Item2":"//methodDeclaration/IDENTIFIER"},
    {"Item1":"keyword","Item2":"//(ABSTRACT | ASSERT | BOOLEAN | BREAK | BYTE | CASE | CHAR | CLASS | CONST | CONTINUE | DEFAULT | DO | DOUBLE | ELSE | ENUM | EXTENDS | FINAL | FINALLY | FLOAT | FOR | IF | GOTO | IMPLEMENTS | IMPORT | INSTANCEOF | INT | INTERFACE | LONG | NATIVE | NEW | PACKAGE | PRIVATE | PROTECTED | PUBLIC | SHORT | STATIC | STRICTFP | SUPER | SWITCH | SYNCHRONIZED | THIS | THROW | THROWS | TRANSIENT | TRY | VOID | VOLATILE | WHILE)"},
    {"Item1":"string","Item2":"//(DECIMAL_LITERAL | HEX_LITERAL | OCT_LITERAL | BINARY_LITERAL | HEX_FLOAT_LITERAL | BOOL_LITERAL | CHAR_LITERAL | STRING_LITERAL | NULL_LITERAL)"}
  ]
}]
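A rough sketch of how such (type, XPath) tuples could be applied. This is not the uni-vscode implementation: it uses a hand-written XML stand-in for a parse tree and the limited XPath subset of Python's `xml.etree.ElementTree`, which does not support the `//(A | B)` union syntax used above, so each token gets its own path. The tree shape is simplified and hypothetical:

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the parse tree of `class A { void f() {} }`.
# A real parse tree from the Java grammar has more intermediate nodes.
TREE = ET.fromstring(
    "<compilationUnit>"
    "<classDeclaration><CLASS>class</CLASS><IDENTIFIER>A</IDENTIFIER>"
    "<methodDeclaration><VOID>void</VOID><IDENTIFIER>f</IDENTIFIER></methodDeclaration>"
    "</classDeclaration>"
    "</compilationUnit>"
)

# (highlight type, path) tuples, mirroring the Item1/Item2 pairs.
CLASSIFIERS = [
    ("class", ".//classDeclaration/IDENTIFIER"),
    ("method", ".//methodDeclaration/IDENTIFIER"),
    ("keyword", ".//CLASS"),
    ("keyword", ".//VOID"),
]


def classify(tree: ET.Element, classifiers: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (text, highlight type) for every node selected by each path."""
    out = []
    for kind, path in classifiers:
        for node in tree.findall(path):
            out.append((node.text, kind))
    return out


result = classify(TREE, CLASSIFIERS)
```

Because the paths select nodes by their position in the tree, the same `IDENTIFIER` token type is classified as a class name in one place and a method name in another, which a token-only or regex-only scheme cannot do.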

As I vaguely recall, LSP has standard types like "variable", "method", "keyword", so I used those--which, again, is just horrible. LSP defines these types, but it should not! People need to stop assuming the basics of what a language is. It's too limited.

The main problem with such an implementation is that Antlr4 is not an incremental parser (although one could rewrite the codegen templates for Antlr4 to implement that). Worse, many grammars in the grammars-v4 repo are just terrible because people don't know how to write a grammar, and a parser generated from a bad grammar is horribly slow.

0 replies
