Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0067-feel-split-function] The regex usage differences from the DMN specification #676

Open
saig0 opened this issue Sep 25, 2024 · 11 comments

Comments

@saig0
Copy link
Contributor

saig0 commented Sep 25, 2024

Description

The test case 0067-feel-split-function: 001 verifies the usage of the split() function with a regex argument.

The DMN evaluates the following FEEL expression:

split("John Doe", "\s")

The test case is the same as in the DMN 1.5 specification (Chapter 10.3.4.3 String functions, page 142):

Screenshot from 2024-09-25 12-50-44

However, there is a difference in the regex. Instead of "\\s" from the DMN specification, the test case uses "\s" (one backslash vs. two backslash).

We should align the TCK test case with the DMN specification.

As far as I understand the DMN specification, the string literal "\s" is invalid because it is not a listed escape character (from grammar rule 64):

Screenshot from 2024-09-25 12-55-52

@falko
Copy link
Member

falko commented Sep 27, 2024

This looks to me like a spec issue. There is no need for a double backslash since \s is not a recognized escape sequence in FEEL. Double backslash would only be needed for things that are in grammar rule 64, e.g. split("John\nDoe", "\\n"). On the other hand, the double backslash will just be translated into a single backslash by the FEEL parser leaving a \s for the regular expression parser. So not really an issue with the spec and just a slightly misleading example. A proposal for clarification could be to change the example with \s into using a single backslash and adding the above \n example in addition to show when escaping the backslash is needed.

@saig0
Copy link
Contributor Author

saig0 commented Sep 30, 2024

adding the above \n example in addition to show when escaping the backslash is needed.

👍 For adding the example in the spec and the TCK.

There is no need for a double backslash since \s is not a recognized escape sequence in FEEL.

Are you sure? I'm a bit confused about how the escape sequences and regex characters should be handled in FEEL.

Let's look at the following examples:

// splits the string literal with a new line at the new line character
split("John\nDoe", "\\n")

// splits the string literal with a whitespace at the whitespace character
split("John Doe", "\s")

In JavaScript, the example looks a bit different. \s is not allowed because it is not an escape character.

"John\nDoe".split(new RegExp("\\n"));
"John Doe".split(new RegExp("\\s"));

// or, with literal notation
"John\nDoe".split(/\n/);
"John Doe".split(/\s/);

It looks similar in Java:

"John\nDoe".split("\\n")

// "\s" produces a compile failure
"John Doe".split("\\s")

cc: @nikku @barmac

@barmac
Copy link

barmac commented Sep 30, 2024

I think we should stick to what the specification states as it conforms to both Java and JavaScript standards. I don't see a reason why FEEL should deviate from the de-facto standard of the programming languages.
So I support it that we should fix the problem within TCK.

@nikku
Copy link
Contributor

nikku commented Sep 30, 2024

The question to discuss here is if \ is a valid character in FEEL, or not. According to the spec \\ is the escaped form of \, yielding a \ character; this may or may not imply that \ itself is illegal, as it is part of an "escape sequence".

Cf. JavaScript example, again, a raw \ does not exist, outside of valid escape sequences:

image

@baldimir
Copy link
Collaborator

Based on the discussion on the meeting:

  • The parameter is defined not as a string but delimiter, which is defined as a "pattern". "pattern" has following description in the specification:

"pattern, replacement, and flags SHALL conform to the syntax and constraints specified in
clause 7.6 of XQuery 1.0 and XPath 2.0 Functions and Operators. Note that where XPath
specifies an error result, FEEL specifies a null result"

So the format should depend on the XQuery and XPath syntax and needs to be checked there, if the current form is correct in the spec, or in the test.

@opatrascoiu
Copy link
Contributor

According to the XML spec for regular expressions https://www.w3.org/TR/xmlschema-2/#regexs

split("John Doe", "\s")

is the correct syntax.

@nikku
Copy link
Contributor

nikku commented Nov 21, 2024

The parameter is defined not as a string but delimiter, which is defined as a "pattern".

I'm sorry but I cannot follow that line of thought. What is the type of pattern?

  • "\s" looks like a string to me. "foo\sbar", too.
  • If it is a string, is the literal value of "\s" the value \s, so is the literal value of "\" the character \?
  • This is relevant, because the interpreter likely verifies the pattern based on the literal value, not some static matching.
  • Looking at prior comments we can see that "\" with a literal value is something uncommon in many languages we know. To express \ you have to encode it via "\\".

If "\s" is not a string, then my argument does not apply, of course. But then let's figure what the type of "\s" or "foo\sbar" is, and consider how to properly document that special type in the specification. In Java and JavaScript, of course these flags are strings, and hence the standard encoding rules of characters in strings applies in these languages.

According to the XML spec for regular expressions https://www.w3.org/TR/xmlschema-2/#regexs

The spec refers to the literal value of a string, not the representation, hence this quote is not relevant to the discussion. Whatever string representation yields the literal value of \s is a valid pattern (matching a single space char). If we decide "\s" yields that value we have to explain how "\" is interpreted. If we decide "\\s" yields that value then all is clear, because the spec already mentions that "\\" yields \ as a character:

image

@opatrascoiu
Copy link
Contributor

opatrascoiu commented Nov 22, 2024

I think that based on

  1. string literal = """, { character – (""" | vertical space) | string escape sequence}, """ ;
  2. string escape sequence = "'" | """ | "\" | "\n" | "\r" | "\t" | code point;
  • "\s" is a valid FEEL string and the value is a sequence of 2 characters: \ and s.

  • "\s" is also a valid FEEL string and the value is the same sequence of two characters as \ is a valid string escape sequence.

The lexeme is different but the value is the same. The split() call uses the value, not the lexeme.

I suggest adding a test for "\s" as well.

@nikku
Copy link
Contributor

nikku commented Nov 22, 2024

@opatrascoiu This partly answeres my question:

  • \ is a valid character inside of a FEEL string, hence you can write "\s" and it will be interpreted as \ + s
  • \ is not a valid character when it is part of an escape sequence, i.e. "\r" is interpreted as a single character, the carriage return, same applies for "\'", which is being interpreted as ' and "\\", which is interpreted as "a \ that would otherwise be treated as part of an escape sequence".
  • If you want to write a literal \r you have to escape the \ character: "\\r", if you want to write \s you don't have to do it, as \s is not a special "escape sequence".

I hope we end up in a simpler solution: \ always denotes the start of an escape sequence, and hence must be properly escaped in all cases. This is the direction that Java and JavaScript go.

The split() call uses the value, not the lexeme.

I think this is where I disagree. Intuitively, from the user perspective what looks like a string, is used like a string, can be passed like a string should be a string, and not a "special" interpretation of it.

{
  pattern: "AB\\sC",
  splitted: split("YES MAYBEAB COH YES", pattern)
}.splitted = [ "YES MAYBE", "OH YES" ]

@opatrascoiu
Copy link
Contributor

@nikku

To understand the semantics of strings and regular expressions we need to read the DMN / FEEL carefully. AFAIK these are covered as follows

The syntax of strings is defined in 10.3.1.2 Grammar rules:

33. string literal = """, { character – (""" | vertical space) | string escape sequence}, """ ;
35. string escape sequence = "'" | """ | "\" | "\n" | "\r" | "\t" | code point;

The semantics of strings is defined in section 10.3.2.3 Semantics of literals and datatypes

The syntax and semantics of FEEL regExp are the same with the ones in XQuery and XPath (see 10.3.4.3 String functions).

According to the 10.3.2.3 Semantics of literals and datatypes, the string lexemes are evaluated and produce values in the semantic domain (e.g. escaped sequences are replaced with the value of the character).

This value is used to evaluate other constructions (e.g. function calls). For example, if the value of the string passed to split() is not a valid XML regular expression, the evaluator reports an error. This is true both for FEEL and other languages (e.g. escape sequences in Java).

According to

33 string literal = """, { character – (""" | vertical space) | string escape sequence}, """ ;

"\" is valid and has the same value as "\\". Other languages, consider "" as an error, but not DMN / FEEL.

If order to be invalid, rule 33 should look like that

33 string literal = """, { character – (""" | vertical space | backslash) | string escape sequence}, """ ;

If there is another section that states that "" is an invalid string, please let me know.

Feel free to raise an RTF issue about marking "\" as an error (e.g. backslash should always be escaped).

@nikku
Copy link
Contributor

nikku commented Nov 22, 2024

"\" is valid and has the same value as "\\". Other languages, consider "\" as an error, but not DMN / FEEL.

This clears this discussion for me. Thanks for taking your time to elaborate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants