
[BUG] Value set cannot handle high cardinality column #197

Closed

dai-chen opened this issue Dec 14, 2023 · 1 comment
Labels: 0.2, bug

@dai-chen (Collaborator)

What is the bug?

In scenarios with high cardinality columns like clientIp in HTTP logs, the resulting value set can become exceedingly large. This leads to the creation of huge array fields within a single OpenSearch document.

How can one reproduce the bug?

CREATE SKIPPING INDEX ON http_logs
(clientIp VALUE_SET)
WITH (
  auto_refresh=true
);

What is the expected behavior?

Either raise an exception beforehand or provide another data structure, such as the BloomFilter proposed in #193.

@dai-chen (Collaborator, Author)

A workaround is to accept a maximum size for the value set. For any value set that exceeds it, an empty set is stored instead. Ref: https://clickhouse.com/docs/en/optimize/skipping-indexes#set

CREATE SKIPPING INDEX ON ...
(clientIp VALUE_SET(100))

Need to figure out:

  1. How to express this with the Spark COLLECT_SET() function (currently used by the skipping index build); see the sketch after this list
  2. How to ignore an empty value set in the OpenSearch query
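
A minimal sketch of both pieces in Spark SQL, assuming a limit of 100 and hypothetical names: clientIp_set for the stored field, file_path for whatever key the index build actually groups on, skipping_index for the index table, and '127.0.0.1' for the value being filtered. NULL is used here instead of a literal empty array so the query-side check stays a simple IS NULL; the empty-array variant would test size(clientIp_set) = 0 instead.

-- Build side: collect the set, but store NULL once it grows past the
-- limit; the CASE is applied to the aggregate result of each group.
SELECT
  file_path,
  CASE WHEN size(collect_set(clientIp)) > 100 THEN NULL
       ELSE collect_set(clientIp)
  END AS clientIp_set
FROM http_logs
GROUP BY file_path;

-- Query side: a NULL (dropped) set means the file may still contain the
-- value, so the file must not be skipped.
SELECT file_path
FROM skipping_index
WHERE clientIp_set IS NULL
   OR array_contains(clientIp_set, '127.0.0.1');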
