query validation endpoint #2585

ivakegg · 2024-10-01T20:31:02Z

Currently we have a plan endpoint that syntactically validates the query, however there are many ways a user can create a valid query that is most obviously not what they had intended. We need an end point where we can configure rules that find potential query issues and return text describing what those may be.

Here is a list if issues with examples that we should be able to detect. Most all of these based on LUCENE queries:

Parens before field or completely missing parens when using ORs
FIELD:1234 OR 5678 -> should be FIELD:(1234 OR 5678)
(FIELD:1234 OR 5678) -> should be FIELD:(1243 OR 5687)
Phrases not quoted (where we would automatically AND)
anywhere that the boolean operator is missing
FIELD:term1 term2 -> should be FIELD:"term1 term2"
Index only fields used in a #INCLUDE or #EXCLUDE
Fields that do not exist in the system
A configurable set of rules:
If you see field X, then output message Y
If you see pattern X (regex), then output message Y
Missing arguments to #INCLUDE or #EXCLUDE when the first argument is a boolean:
#INCLUDE(OR,value) -> should be #INCLUDE(OR, FIELD, value)
Usage of NOT without using parens (NOT takes precidence so the resulting plan could be quite incorrect)
FIELD1:abc OR FIELD2:def NOT FIELD3:123 -> should be (FIELD1:abc OR FIELD2:def) NOT FIELD3:123
#TIME functions used with fields that are not typed as a date
Unfielded terms in general should probably be flagged as potential issues
Wildcards inside of phrases
FIELD:"abc*def" -> either the * is incorrect or they meant FIELD:/abc*def/
Using the wrong quote (` instead of ')
Proximity wher the number is smaller than the number of terms
FIELD:"term1 term2 term3"~1 -> the 1 should be 3 or greater
Not escaping special characters
FIELD:abc:def -> should be FIELD:abc\:def
Searching for a range of numbers against a non-numeric field

jstewart-shield · 2024-10-03T17:53:35Z

Hi! Julie here. Can you please create the endpoint first and return a generic message like "Query Validator coming soon." and merge that in before working on the actual validator? This will allow us to get the partners working on integration so when the capability is ready we can quickly turn it on.

lbschanno · 2024-10-03T17:59:30Z

@jstewart-shield certainly, I will try to get a quick turnaround on that for you.

Add a skeleton implementation of a query validation endpoint to allow partners to begin implementation. Part of work for #2585

lbschanno · 2024-10-04T17:43:37Z

@jstewart-shield A skeleton implementation of the endpoint is up in PR #2595. A POST to the endpoint query/{logicName}/validate will return the status 501 and the following content:

{
    "HasResults": false,
    "Messages": [
        "Query validator coming soon."
    ],
    "OperationTimeMS": 0
}

lbschanno · 2024-10-16T15:01:41Z

@ivakegg are there any cases for the rules above where we need to support query model mapping?

jstewart-shield · 2024-10-17T13:31:35Z

Yes! At least #3, #4, #5 (maybe #8?) in the list will need to check the query model. Also, another thing we want to do in the future if verifying a field is active (has had data ingested) during the date range of the query. That would need to check the query model and verify at least one (not all) of the fields is active.

lbschanno · 2024-10-18T16:16:12Z

@jstewart-shield understood, thank you for the clarification.

lbschanno · 2024-10-22T10:46:28Z

Follow-up question for @ivakegg and @jstewart-shield, should any of these validation criteria also apply to fields given via query parameters, or just the query itself?

jstewart-shield · 2024-10-22T13:18:43Z

Oh, great question. I would say long term that we would definitely want to consider that since it's part of the query and will impact the results but it's ok if the first version doesn't do it.

lbschanno · 2024-10-22T13:32:08Z

@jstewart-shield understood, thanks.

lbschanno · 2024-10-29T13:24:11Z

A couple of additional questions for @ivakegg and @jstewart-shield:

Should #13 apply to regex patterns for any non-regex reserved special characters? Specifically, should we check if any special character other than . + * ? ^ $ ( ) [ ] { } | \ are unescaped in regex patterns?
A number of the examples above result in a ParseException when attempting to parse the LUCENE query to a QueryNode. Specifically:
- The query #INCLUDE(OR,value) from #6 results in
```
datawave.query.language.parser.ParseException: java.lang.IllegalArgumentException: datawave.webservice.query.exception.BadRequestQueryException: Invalid arguments to function. include
```
- The query FIELD:abc:def from #13 results in
```
INVALID_SYNTAX_CANNOT_PARSE: Syntax Error, cannot parse FIELD:abc:def:
```
Are these cases that we still want to try to detect in the query string before parsing, or is it enough to return the exception details?

jstewart-shield · 2024-10-29T19:54:45Z

Yes please, if you can
Are you running the parser as part of this call? If so then the second error is fine but the first we would want to change as users won't understand it. If the parser doesn't run as part of this call then we'll definitely want to catch both as users will expect that if the query passes the validator that it will run successfully.

lbschanno · 2024-10-30T16:20:13Z

@jstewart-shield

No problem, will do.
The parser is indeed run as part of the call. I will add support for returning better error messages for the invalid function.

lbschanno · 2024-10-31T00:12:03Z

@ivakegg @jstewart-shield for #10, do we still want to flag cases for escaped wildcards inside of phrases, such as FIELD:"abc\*def", or is this considered valid?

ivakegg · 2024-10-31T13:29:19Z

Julie came back and said that any wildcard in the field value regardless of escaping should be flagged. What does the lucene parser do with such an expression anyway?

ivakegg · 2024-10-31T13:29:46Z

She is goint to test that and see what happens.

lbschanno · 2024-10-31T15:18:46Z

See below for examples on how the parser handles the LUCENE to JEXL conversion with wildcards:

Wildcard inside quoted phrase: FIELD:"abc*dec" -> FIELD == 'abc*dec'
Escaped wildcard inside quoted phrase: FIELD:"abc\*dec" -> FIELD == 'abc*dec'
Doubly-escaped wildcard inside quoted phrase: FIELD:"abc\\*def" -> FIELD == 'abc\\*dec'
Wildcard inside unquoted phrase: FIELD:abc*dec -> FIELD =~ 'abc.*?dec'
Escaped wildcard inside unquoted phrase: FIELD:abc\*dec -> FIELD == 'abc*dec'
Doubly-escaped wildcard inside unquoted phrase: FIELD:abc\\*dec -> FIELD =~ 'abc\\.*?dec'

It looks like backslashes are dropped when the LUCENE query is parsed by AccumuloSyntaxParser.

ivakegg · 2024-10-31T15:29:15Z

You got to it before she did... thanks. I think the answer still stands. If there is a '*' in the expression within quotes, then flag it as potentially invalid

lbschanno · 2024-10-31T15:33:32Z

Understood, will do.

jstewart-shield · 2024-10-31T20:57:37Z

Sorry, to add. If the * within the quotes is escaped then you don't need to flag it because I would assume the user was purposely trying to search the * and not intending a wildcard if they took the time to escape it.

lbschanno · 2024-11-01T12:40:05Z

@jstewart-shield understood, I'll factor that into the rule logic.

* Add a skeleton implementation of a query validation endpoint to allow external clients to begin implementation. Part of work for #2585

Implement the query/{logicName}/validate endpoint. This feature supports the ability to configure validation rules that will validate LUCENE and JEXL queries against a number of criteria and provide meaningful feedback to customers. Closes #2585

lbschanno · 2024-11-21T06:48:20Z

@jstewart-shield I have created PR #2645 for this. The PR description body contains samples of the messages for the validation rules, as well as the schema of the expected REST response. Please let me know if you want any changes made.

lbschanno assigned lbschanno and SethSmucker Oct 1, 2024

lbschanno added a commit that referenced this issue Oct 4, 2024

Add skeleton query validation endpoint

448e9c1

Add a skeleton implementation of a query validation endpoint to allow partners to begin implementation. Part of work for #2585

lbschanno added a commit that referenced this issue Oct 4, 2024

Add skeleton query validation endpoint

18be05c

Add a skeleton implementation of a query validation endpoint to allow partners to begin implementation. Part of work for #2585

lbschanno mentioned this issue Oct 4, 2024

Add skeleton query validation endpoint #2595

Merged

ivakegg pushed a commit that referenced this issue Nov 1, 2024

Add skeleton query validation endpoint (#2595)

9b5935f

* Add a skeleton implementation of a query validation endpoint to allow external clients to begin implementation. Part of work for #2585

lbschanno linked a pull request Nov 21, 2024 that will close this issue

Implement query validation endpoint #2645

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

query validation endpoint #2585

query validation endpoint #2585

ivakegg commented Oct 1, 2024 •

edited by lbschanno

Loading

jstewart-shield commented Oct 3, 2024

lbschanno commented Oct 3, 2024

lbschanno commented Oct 4, 2024 •

edited

Loading

lbschanno commented Oct 16, 2024

jstewart-shield commented Oct 17, 2024

lbschanno commented Oct 18, 2024

lbschanno commented Oct 22, 2024

jstewart-shield commented Oct 22, 2024

lbschanno commented Oct 22, 2024

lbschanno commented Oct 29, 2024 •

edited

Loading

jstewart-shield commented Oct 29, 2024

lbschanno commented Oct 30, 2024

lbschanno commented Oct 31, 2024 •

edited

Loading

ivakegg commented Oct 31, 2024

ivakegg commented Oct 31, 2024

lbschanno commented Oct 31, 2024

ivakegg commented Oct 31, 2024

lbschanno commented Oct 31, 2024

jstewart-shield commented Oct 31, 2024

lbschanno commented Nov 1, 2024

lbschanno commented Nov 21, 2024

query validation endpoint #2585

query validation endpoint #2585

Comments

ivakegg commented Oct 1, 2024 • edited by lbschanno Loading

jstewart-shield commented Oct 3, 2024

lbschanno commented Oct 3, 2024

lbschanno commented Oct 4, 2024 • edited Loading

lbschanno commented Oct 16, 2024

jstewart-shield commented Oct 17, 2024

lbschanno commented Oct 18, 2024

lbschanno commented Oct 22, 2024

jstewart-shield commented Oct 22, 2024

lbschanno commented Oct 22, 2024

lbschanno commented Oct 29, 2024 • edited Loading

jstewart-shield commented Oct 29, 2024

lbschanno commented Oct 30, 2024

lbschanno commented Oct 31, 2024 • edited Loading

ivakegg commented Oct 31, 2024

ivakegg commented Oct 31, 2024

lbschanno commented Oct 31, 2024

ivakegg commented Oct 31, 2024

lbschanno commented Oct 31, 2024

jstewart-shield commented Oct 31, 2024

lbschanno commented Nov 1, 2024

lbschanno commented Nov 21, 2024

ivakegg commented Oct 1, 2024 •

edited by lbschanno

Loading

lbschanno commented Oct 4, 2024 •

edited

Loading

lbschanno commented Oct 29, 2024 •

edited

Loading

lbschanno commented Oct 31, 2024 •

edited

Loading