Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Improve COUNT function for multivalued fields #3205

Open
normanj-bitquill opened this issue Dec 17, 2024 · 2 comments
Open

[FEATURE] Improve COUNT function for multivalued fields #3205

normanj-bitquill opened this issue Dec 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@normanj-bitquill
Copy link
Contributor

Is your feature request related to a problem?
When the COUNT function is applied to a multivalued field, it will return the count of all scalar values associated with that field. Consider this table:

x | y
--+----------
1 | {1, 2, 3}
2 | {4, 5}
3 | NULL

COUNT(y) will return 5. Most SQL engines would return 2 in this case, since there are 2 rows containing a value for y.

What solution would you like?
The aggregate requested from the OpenSearch server could be more like:

  "aggs": {
    "foo": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "y"
              }
            }
          ]
        }
      }
    }
  }

There are a few forms of COUNT that need to be considered:

  • COUNT(<FIELD>) - can use an aggregate like above
  • COUNT(<LITERAL>) - can use an aggregate like above, but use the field _id to count all rows
  • COUNT(*) - can use an aggregate like above, but use the filed _id to count all rows
  • COUNT(NULL) - will always return 0

What alternatives have you considered?
The alternative is to keep the existing behaviour. This is likely to produce results that are unexpected for users, especially if they are used to other SQL engines.

Do you have any additional context?
Recently a change was made to support multivalued fields in the SQL plugin. Some of the aggregates may need to be updated to handle the new multivalued fields.

@normanj-bitquill normanj-bitquill added enhancement New feature or request untriaged labels Dec 17, 2024
@Swiddis
Copy link
Collaborator

Swiddis commented Dec 17, 2024

Given we explicitly document COUNT as counting non-null rows, this is a bug.

Is it documented what the precise semantics of multivalued fields are for the SQL plugin? I imagine this isn't the only case where passing many values to a function that expects one value is going to be tricky. (count is easy enough, what about max or sum?)

@Swiddis Swiddis added bug Something isn't working and removed enhancement New feature or request untriaged labels Dec 17, 2024
@acarbonetto
Copy link
Collaborator

Is it documented what the precise semantics of multivalued fields are for the SQL plugin? I imagine this isn't the only case where passing many values to a function that expects one value is going to be tricky. (count is easy enough, what about max or sum?)

This is also a bug with max/min. Multi-value results are generally not handled well with aggregation because OpenSearch doesn't store these values as arrays like SQL does.

#3138

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants