Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query validation endpoint #2585

Open
ivakegg opened this issue Oct 1, 2024 · 20 comments
Open

query validation endpoint #2585

ivakegg opened this issue Oct 1, 2024 · 20 comments
Assignees

Comments

@ivakegg
Copy link
Collaborator

ivakegg commented Oct 1, 2024

Currently we have a plan endpoint that syntactically validates the query, however there are many ways a user can create a valid query that is most obviously not what they had intended. We need an end point where we can configure rules that find potential query issues and return text describing what those may be.

Here is a list if issues with examples that we should be able to detect. Most all of these based on LUCENE queries:

  1. Parens before field or completely missing parens when using ORs
    FIELD:1234 OR 5678 -> should be FIELD:(1234 OR 5678)
    (FIELD:1234 OR 5678) -> should be FIELD:(1243 OR 5687)
  2. Phrases not quoted (where we would automatically AND)
    anywhere that the boolean operator is missing
    FIELD:term1 term2 -> should be FIELD:"term1 term2"
  3. Index only fields used in a #INCLUDE or #EXCLUDE
  4. Fields that do not exist in the system
  5. A configurable set of rules:
    If you see field X, then output message Y
    If you see pattern X (regex), then output message Y
  6. Missing arguments to #INCLUDE or #EXCLUDE when the first argument is a boolean:
    #INCLUDE(OR,value) -> should be #INCLUDE(OR, FIELD, value)
  7. Usage of NOT without using parens (NOT takes precidence so the resulting plan could be quite incorrect)
    FIELD1:abc OR FIELD2:def NOT FIELD3:123 -> should be (FIELD1:abc OR FIELD2:def) NOT FIELD3:123
  8. #TIME functions used with fields that are not typed as a date
  9. Unfielded terms in general should probably be flagged as potential issues
  10. Wildcards inside of phrases
    FIELD:"abc*def" -> either the * is incorrect or they meant FIELD:/abc*def/
  11. Using the wrong quote (` instead of ')
  12. Proximity wher the number is smaller than the number of terms
    FIELD:"term1 term2 term3"~1 -> the 1 should be 3 or greater
  13. Not escaping special characters
    FIELD:abc:def -> should be FIELD:abc\:def
  14. Searching for a range of numbers against a non-numeric field
@jstewart-shield
Copy link

Hi! Julie here. Can you please create the endpoint first and return a generic message like "Query Validator coming soon." and merge that in before working on the actual validator? This will allow us to get the partners working on integration so when the capability is ready we can quickly turn it on.

@lbschanno
Copy link
Collaborator

@jstewart-shield certainly, I will try to get a quick turnaround on that for you.

lbschanno added a commit that referenced this issue Oct 4, 2024
Add a skeleton implementation of a query validation endpoint to allow
partners to begin implementation.

Part of work for #2585
lbschanno added a commit that referenced this issue Oct 4, 2024
Add a skeleton implementation of a query validation endpoint to allow
partners to begin implementation.

Part of work for #2585
@lbschanno
Copy link
Collaborator

lbschanno commented Oct 4, 2024

@jstewart-shield A skeleton implementation of the endpoint is up in PR #2595. A POST to the endpoint query/{logicName}/validate will return the status 501 and the following content:

{
    "HasResults": false,
    "Messages": [
        "Query validator coming soon."
    ],
    "OperationTimeMS": 0
}

@lbschanno
Copy link
Collaborator

@ivakegg are there any cases for the rules above where we need to support query model mapping?

@jstewart-shield
Copy link

Yes! At least #3, #4, #5 (maybe #8?) in the list will need to check the query model. Also, another thing we want to do in the future if verifying a field is active (has had data ingested) during the date range of the query. That would need to check the query model and verify at least one (not all) of the fields is active.

@lbschanno
Copy link
Collaborator

@jstewart-shield understood, thank you for the clarification.

@lbschanno
Copy link
Collaborator

Follow-up question for @ivakegg and @jstewart-shield, should any of these validation criteria also apply to fields given via query parameters, or just the query itself?

@jstewart-shield
Copy link

Oh, great question. I would say long term that we would definitely want to consider that since it's part of the query and will impact the results but it's ok if the first version doesn't do it.

@lbschanno
Copy link
Collaborator

@jstewart-shield understood, thanks.

@lbschanno
Copy link
Collaborator

lbschanno commented Oct 29, 2024

A couple of additional questions for @ivakegg and @jstewart-shield:

  1. Should #13 apply to regex patterns for any non-regex reserved special characters? Specifically, should we check if any special character other than . + * ? ^ $ ( ) [ ] { } | \ are unescaped in regex patterns?
  2. A number of the examples above result in a ParseException when attempting to parse the LUCENE query to a QueryNode. Specifically:
    • The query #INCLUDE(OR,value) from #6 results in
    datawave.query.language.parser.ParseException: java.lang.IllegalArgumentException: datawave.webservice.query.exception.BadRequestQueryException: Invalid arguments to function. include
    
    • The query FIELD:abc:def from #13 results in
    INVALID_SYNTAX_CANNOT_PARSE: Syntax Error, cannot parse FIELD:abc:def:
    
    Are these cases that we still want to try to detect in the query string before parsing, or is it enough to return the exception details?

@jstewart-shield
Copy link

  1. Yes please, if you can
  2. Are you running the parser as part of this call? If so then the second error is fine but the first we would want to change as users won't understand it. If the parser doesn't run as part of this call then we'll definitely want to catch both as users will expect that if the query passes the validator that it will run successfully.

@lbschanno
Copy link
Collaborator

@jstewart-shield

  1. No problem, will do.
  2. The parser is indeed run as part of the call. I will add support for returning better error messages for the invalid function.

@lbschanno
Copy link
Collaborator

lbschanno commented Oct 31, 2024

@ivakegg @jstewart-shield for #10, do we still want to flag cases for escaped wildcards inside of phrases, such as FIELD:"abc\*def", or is this considered valid?

@ivakegg
Copy link
Collaborator Author

ivakegg commented Oct 31, 2024

Julie came back and said that any wildcard in the field value regardless of escaping should be flagged. What does the lucene parser do with such an expression anyway?

@ivakegg
Copy link
Collaborator Author

ivakegg commented Oct 31, 2024

She is goint to test that and see what happens.

@lbschanno
Copy link
Collaborator

See below for examples on how the parser handles the LUCENE to JEXL conversion with wildcards:

  • Wildcard inside quoted phrase: FIELD:"abc*dec" -> FIELD == 'abc*dec'
  • Escaped wildcard inside quoted phrase: FIELD:"abc\*dec" -> FIELD == 'abc*dec'
  • Doubly-escaped wildcard inside quoted phrase: FIELD:"abc\\*def" -> FIELD == 'abc\\*dec'
  • Wildcard inside unquoted phrase: FIELD:abc*dec -> FIELD =~ 'abc.*?dec'
  • Escaped wildcard inside unquoted phrase: FIELD:abc\*dec -> FIELD == 'abc*dec'
  • Doubly-escaped wildcard inside unquoted phrase: FIELD:abc\\*dec -> FIELD =~ 'abc\\.*?dec'

It looks like backslashes are dropped when the LUCENE query is parsed by AccumuloSyntaxParser.

@ivakegg
Copy link
Collaborator Author

ivakegg commented Oct 31, 2024

You got to it before she did... thanks. I think the answer still stands. If there is a '*' in the expression within quotes, then flag it as potentially invalid

@lbschanno
Copy link
Collaborator

Understood, will do.

@jstewart-shield
Copy link

Sorry, to add. If the * within the quotes is escaped then you don't need to flag it because I would assume the user was purposely trying to search the * and not intending a wildcard if they took the time to escape it.

@lbschanno
Copy link
Collaborator

@jstewart-shield understood, I'll factor that into the rule logic.

ivakegg pushed a commit that referenced this issue Nov 1, 2024
* Add a skeleton implementation of a query validation endpoint to allow external clients to begin implementation.

Part of work for #2585
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants