RFC: Make white-space handling less confusing / more consistent with the introduction of an "adjacent selector": "," #271
Comments
Added |
👍 The context-dependent meaning of
As an extension, I would consider changing |
Unfortunately, this doesn't take into consideration the other combinators like Apart from the drawbacks already mentioned, there's also the fact that there is some duplication of functionality when For the particular case of Another alternative we could take into consideration is to separate non-whitespace from the other modifiers and have it work inside of the rules.
Above, only the last |
For the purposes of this RFC, and to avoid going off on a bike-shedding tournament, we should avoid discussion of how best to change / replace the implicit This RFC is for the discussion of an "adjacent selector" (I don't know if "selector", "operator" or "combinator" is the correct term for these things, it's never stated in the documentation). If we get into discussing alternate strategies for the As Pest currently stands for this RFC, the (sorry to be nagging, but I just want a simple adjacent operator as soon as possible, and worry about the whitespace weirdness after that) |
I agree that this could lead to a really long and possibly not as productive discussion, but the issue here is that adding an "adjacent selector" would only add complexity to the grammar without removing the whitespace-dependent behaviour of atomic rules, since While I do agree that the current design may not be the most intuitive, I think that the best way forward would be to only change the grammar when absolutely needed. Maybe a good idea would be to implement this as a prototype, or simply argue a bit more about the case where adding this feature plays well with all the other features and is a good addition by itself. And don't get discouraged! The issue you brought up very much exists and I'm excited to find a solution together. |
I have to admit that I'm very excited about this RFC, because I think the discussion could potentially make In that spirit: another option would be to introduce a modifier to the operators themselves to indicate that they take optional whitespace. Something like That looks a bit ridiculous, now that I see it written out. Clearly I've been using too much J. |
@wirelyre, I'm really excited about this too. I feel like this and |
I can't find the discussion, but I'd like to record what I do remember from discussion here: Consensus was that having a "pure adjacency" available is very useful, and should definitely be added.
There are no specific plans on using
In order to allow repetitions to be strict, I believe the winning proposal was to treat it somewhat like how the repetitions are handled in Rust macros: Changes not part of this RFC's motivated changes, but on the table for consideration for a 3.0:
A very vague outline of how to go about adding this
|
I very much like introducing these rules as "experimental" for 2.1 with possibly an opt-in flag. That should make the transition to 3.0 much leaner. |
+1 from me. I'd also like to suggest an operator that says whitespace is required between two rules. I can't do this in normal mode with whitespace defined and adding ( Back to the whitespace topic: There is no option for mandatory whitespace AFAIK except to use atomics, manually add |
The current plan is to provide That said, what use cases do you have for mandatory trivia? I've personally found none (other than "longest match" lexing) that are actually necessary, due to traditional lexers stripping out trivia before passing along to the parser stage. The "longest match" lexing is easy to think that it should be achieved via mandatory trivia, e.g. |
@CAD97 There are languages where whitespace matters. The best practical example I can give is Haskell, where whitespace is a token used as the function application operator - Later update: Whitespace between arguments in Haskell is apparently not required, I was mistaken. |
The difference is that when whitespace is semantic like that, I don't see it as being trivia. I admit that greater control over the trivia is a great option to be had, but for lack of a better way of putting it, it seems like an abuse of trivia to use it semantically (beyond not allowing it in terminals, even when they're built from pieces). For example, does Haskell allow |
Sure, that absolutely makes sense. I think I'd personally be fine with defining something like The implicit whitespace matching by default is a nice feature but it can easily go unnoticed and quietly cause trouble if you're not careful about disabling it where it's not wanted. The proposals in #333 are a good step toward making that distinction clearer. |
This is great! Is there a time line for the feature? My use case is requiring a new line at a certain point in a grammar. At all other points, new lines are optional (and therefore white space). |
@ejoebstl, I have the parser working, but there's more to it:
I have been quite busy lately since I moved to a new country and I'm currently getting settled. Hopefully I'll get some stuff done this weekend. |
@dragostis |
@ejoebstl, oh, yes! I'll upload my progress here tomorrow. I have the whole parser written and would love some help. |
@ejoebstl, here's the repo: https://github.com/pest-parser/pest3 For now, work can live there. The next logical step would be to port the validator to the new version of meta. |
#118 is closed but I do not see it here. Unless there's some workaround I would like to have silent atomic rules, like Is there currently at least some way to suppress an atomic rule? |
Also I find it confusing as |
Is there another way to do this?
|
I have the same use case @ejoebstl has - for HCL files: most of the file can just treat WHITESPACE as WHITESPACE, but within an object declaration, objectelem can be separated by either WHITESPACE* ~ "," ~ WHITESPACE*, or by (!newline ~ WHITESPACE)* ~ newline ~ WHITESPACE*; and that latter case is awkward to introduce today without a needless intermediary node in the parse tree due to the lack of silent-atomics (#520 , and previously #118 ). |
Would it be possible to introduce a smaller change-set that would not require v3? |
I feel the base use case is that you can't select between |
@CAD97 any thoughts? The original proposal appeared to be about the extra adjacency selector/operator, but it was co-opted into a discussion about the grammar overhaul. While the grammar overhaul looked to be heading in a good direction, it'd take time. |
The main pitfall with just introducing a new direct adjacency operator is that repetitions also consume In current pest, whether So I think what would fit most would be introducing |
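The repetition pitfall mentioned above can be illustrated with a small Rust model. This is a hand-rolled sketch, not pest's implementation: the function names `repeat_spaced` and `repeat_adjacent` are hypothetical, the repeated rule is modelled as a literal string, and trivia is assumed to be spaces and tabs only.

```rust
/// Zero-or-more repetition that skips trivia *between* items, modelling how
/// repetition is assumed to behave in a non-atomic rule.
/// Returns (number of items matched, unconsumed remainder).
fn repeat_spaced<'a>(item: &str, input: &'a str) -> (usize, &'a str) {
    // An empty item would match forever, so treat it as "no match".
    if item.is_empty() {
        return (0, input);
    }
    let mut rest = match input.strip_prefix(item) {
        Some(r) => r,
        None => return (0, input),
    };
    let mut count = 1;
    // Only commit to skipped trivia if another item follows it.
    while let Some(next) = rest
        .trim_start_matches(|c: char| c == ' ' || c == '\t')
        .strip_prefix(item)
    {
        rest = next;
        count += 1;
    }
    (count, rest)
}

/// Strict repetition: items must touch, as a pure adjacency operator would require.
fn repeat_adjacent<'a>(item: &str, input: &'a str) -> (usize, &'a str) {
    if item.is_empty() {
        return (0, input);
    }
    let mut rest = input;
    let mut count = 0;
    while let Some(next) = rest.strip_prefix(item) {
        rest = next;
        count += 1;
    }
    (count, rest)
}
```

On the input "ab ab  ab!", the spaced version matches three items while the strict version stops after the first. A new adjacency operator would not change repetition on its own, which is why repetition needs a strict form as well.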
@Sytten would you like to take a stab at it? |
I will see if I have the time. We fixed the most immediate issue in async graphql but it would be a good change. |
This is exactly what I was looking for in pest literally a minute ago! |
@gavrilikhin-d feel free to open a PR; this functionality may need to be feature-guarded under "grammar-extras", similarly to the node/branch tags, to avoid semver breaking changes. |
Summary
Make white-space handling less confusing / more consistent with the introduction of an "adjacent selector": ","
Motivation
The rules soi, eoi, whitespace and comment look like any other rule, and with whitespace and comment being names that users are extremely likely to use themselves, the hidden behaviour can be confusing, unexpected and even impractical for what the user wants.
The implicit behaviour of the "~" selector, which applies only when a whitespace rule is defined, is sometimes not what the user wants, and they have to resort to the "@" modifier even though this affects the entire rule and not just one selector. This can make it very difficult to achieve certain effects without resorting to nested rules.
Guide-level explanation
Adjacent elements in a rule can be separated by a ",", so that rule a is followed by rule b, which is followed by rule c. No white-space is assumed between the rules, though note that any rule can itself contain a whitespace rule.
Contrast this with the "~" selector, which will check for optional white-space and comments between rules a, b & c.
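The contrast between the two selectors can be modelled with a small Rust sketch. It is a hand-rolled illustration, not pest's implementation: the names `skip_trivia`, `seq_adjacent` and `seq_spaced` are hypothetical, rules are modelled as literal strings, and trivia is assumed to be spaces and tabs only (no comments).

```rust
/// Skip implicit trivia (assumed here: spaces and tabs only) and return the rest.
fn skip_trivia(input: &str) -> &str {
    input.trim_start_matches(|c: char| c == ' ' || c == '\t')
}

/// Proposed "," behaviour: each part must start exactly where the previous one ended.
/// Returns the unconsumed remainder on success.
fn seq_adjacent<'a>(parts: &[&str], mut input: &'a str) -> Option<&'a str> {
    for part in parts {
        input = input.strip_prefix(*part)?;
    }
    Some(input)
}

/// "~" behaviour as described above: optional trivia is skipped between parts,
/// but not before the first part or after the last one.
fn seq_spaced<'a>(parts: &[&str], mut input: &'a str) -> Option<&'a str> {
    for (i, part) in parts.iter().enumerate() {
        if i > 0 {
            input = skip_trivia(input);
        }
        input = input.strip_prefix(*part)?;
    }
    Some(input)
}
```

With these definitions, `seq_adjacent(&["a", "b", "c"], "a b c")` fails while `seq_spaced` accepts the same input; both accept "abc". The only difference between the two functions is the trivia skip between parts, which is the behaviour the proposed selector would make optional per sequence step.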
The adjacent (",") and white-space ("~") selectors can be combined in a rule; this can be used to carefully control the automatic assumption of white-space or comments.
The behaviour of the adjacent selector remains consistent within nested rules making use of modifiers. For example, the behaviour of rule a is unaffected when its parent rules use different white-space modifiers.
Reference-level explanation
I can't comment on the inner workings of Pest, as I haven't even completed my first ever Rust program yet, but based on writing parsers in other languages (VB6, Go, Perl6), the adjacent selector should pose minimal difficulty. It doesn't add functionality; it expresses the natural state of one token following another.
Drawbacks
Even with the addition of such a simple feature there is still the cost of implementation, testing, documentation and compatibility
There could be unforeseen consequences with parsing behaviour with complex combinations of features; adding another feature increases the available complexity
Rationale and alternatives
This is the simplest possible design to resolve a need of adjacent selection without changing the behaviour of existing selectors (for compatibility). The choice of actual character used (","), the name, the terminology etc. can be debated.
In some languages (e.g. PHP, some functional languages) a dot is used for concatenation.
This may be desirable over a comma as it is more explicitly an "operator" between words, whereas a comma could be confused for something more general and peppered by the user in places where it shouldn't be.
Adjacency could also be communicated without the use of a character, where one rule separated from another rule by white-space implies adjacency.
Whilst this form exists in BNF and some derivatives, the use of an explicit separator avoids potential parse-errors or unintentional behaviour from the user and also provides visual balance as every rule is always separated in all cases, regardless of separator.
Alternative options include changing the functionality of existing features, such as how the "~" selector or "@" modifier operates.
The automagic behaviour of rules with specific names is a documentation / support / learning hindrance, but I believe that that can be resolved separately outside of this RFC as they will have much more drastic implications.
The proposed feature adds to the project, provides benefits, whilst also not taking away from existing features. At the cost of additional documentation, it may help users avoid issues starting out.
Prior art
The use of a comma for adjacency is present in EBNF, which Pest roughly follows.
Unresolved questions
?