Add an edge case test that . matches \\u2029 and \\u2028 #35

Merged (1 commit) on May 8, 2024

Collaborator

@f3ath f3ath commented Apr 26, 2023

I-Regexp follows XSD-2, which states that the character class equivalent to . is [^\r\n]. Some programming languages (e.g. JavaScript, Dart) treat . differently: in particular, their . does not match the Unicode characters \u2029 and \u2028. This PR adds a corresponding edge case test.
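
As an illustration of how a host engine's dot can diverge from the I-Regexp definition, here is a minimal Python sketch. Python is not one of the languages named in this PR and is used only as an example; the variable names are hypothetical. Python's unflagged `.` excludes only `\n`, so it accepts `\r`, while the explicit class `[^\n\r]` rejects both line terminators and still accepts `\u2028` and `\u2029`.

```python
import re

# Single-character strings relevant to the edge case.
candidates = ["\u2028", "\u2029", "\r", "\n"]

# Python's native dot (no DOTALL flag) excludes only "\n".
native_dot = re.compile(r".")
# The character class that I-Regexp assigns to the dot.
iregexp_dot = re.compile(r"[^\n\r]")

# fullmatch requires the whole string to match, like match() in the test.
native_hits = [s for s in candidates if native_dot.fullmatch(s)]
iregexp_hits = [s for s in candidates if iregexp_dot.fullmatch(s)]

print([ascii(s) for s in native_hits])
print([ascii(s) for s in iregexp_hits])
```

In this sketch the native dot also accepts `\r`, which is the kind of mismatch the new test is meant to surface.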

@@ -3454,6 +3454,36 @@
"a𐄁b"
]
},
{
"name": "functions, match, dot matcher on \\u2028",
Contributor

Would it be possible to use \u2028 in the JSON document?

Collaborator Author

Not sure I understand the question. Do you mean removing the second \? That would make the test name indistinguishable from the other one.

Contributor

No, I was talking about using \u2028 in the "document" member.

Collaborator Author

@f3ath f3ath Aug 26, 2023

In the document it is defined as \u2028; you can see it in the source file. But it gets replaced with the actual character when cts.json gets compiled. I'm not sure whether it would be possible to keep it as \uXXXX in the compiled cts.json.

Collaborator

I don't think it's terribly important to have the character escaped in the doc. The only problem I can see is that a particularly strict JSON parser might not be able to read it.
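
For readers following the escaping question, here is a small Python sketch of the two serializations being discussed. It uses only the standard `json` module and is not the build step that produces cts.json; that tooling is not shown in this thread. It suggests that the escaped form `\u2028` and the raw character decode to the same string, so the choice only matters for how the compiled file looks and for parsers that are stricter than the JSON grammar requires.

```python
import json

value = "\u2028"  # LINE SEPARATOR, the character used by the new test

# ensure_ascii=True (the default) writes the six-character escape \u2028.
escaped = json.dumps(value)
# ensure_ascii=False writes the character itself between the quotes.
raw = json.dumps(value, ensure_ascii=False)

# Both forms parse back to the same one-character string.
same = json.loads(escaped) == json.loads(raw) == value

print(len(escaped), len(raw), same)
```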

Collaborator

gregsdennis commented May 8, 2024

My implementation (which uses the .NET regex engine in an "ECMAScript" configuration) also returns \r and \n.

Name:     functions, match, dot matcher on \u2028
Selector: $[?match(@, '.')]
Document: ["\u2028","\r","\n",true,[],{}]
Result:   ["\u2028"]
Results:   null
IsValid:  True

Actual (values): ["\u2028","\r","\n"]

Actual (serialized):
{
  "Matches": [
    {
      "Value": "\u2028",
      "Location": "$[0]"
    },
    {
      "Value": "\r",
      "Location": "$[1]"
    },
    {
      "Value": "\n",
      "Location": "$[2]"
    }
  ],
  "Error": null
}

Probably related to this. I have code that does some translation, but I don't think I did the "little bit of lookahead assertion added to remove \r and \n" part.

Collaborator

@gregsdennis gregsdennis left a comment

After some googling I figured out how to add the lookahead exclusions, and the tests pass for me now.
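
The fix itself is not shown in this thread. As a rough sketch of one way to get the same effect, the Python below rewrites each unescaped dot outside a character class to `[^\n\r]` before handing the pattern to the host engine. This is a character-class substitution, not the lookahead approach mentioned above, and the helper names `translate_dots` and `iregexp_match` are hypothetical. It assumes I-Regexp syntax, where `[` and `]` inside a class must be escaped, so a simple scan is enough to track class boundaries.

```python
import re


def translate_dots(pattern: str) -> str:
    # Replace unescaped dots outside character classes with [^\n\r].
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy a two-character escape such as \. unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        elif ch == "." and not in_class:
            out.append(r"[^\n\r]")
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


def iregexp_match(value, pattern: str) -> bool:
    # match() applies to strings only and must cover the whole value.
    if not isinstance(value, str):
        return False
    return re.fullmatch(translate_dots(pattern), value) is not None


# The document and selector pattern from the test case output above.
document = ["\u2028", "\r", "\n", True, [], {}]
result = [v for v in document if iregexp_match(v, ".")]
print([ascii(v) for v in result])
```

With this translation the sketch selects only the `\u2028` element, which agrees with the expected result listed in the test output.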

@hiltontj

Was able to fix this to get things passing in serde_json_path again with hiltontj/serde_json_path#92. Thanks for surfacing this one @f3ath!

4 participants