
[BUG] Value set cannot handle high cardinality column #197

Closed

dai-chen opened this issue Dec 14, 2023 · 1 comment
Labels: 0.2, bug

@dai-chen (Collaborator)

What is the bug?

In scenarios with high cardinality columns like clientIp in HTTP logs, the resulting value set can become exceedingly large. This leads to the creation of huge array fields within a single OpenSearch document.

How can one reproduce the bug?

CREATE SKIPPING INDEX ON http_logs
(clientIp VALUE_SET)
WITH (
  auto_refresh=true
);

What is the expected behavior?

Either raise an exception beforehand or provide another data structure, such as the BloomFilter proposed in #193.

@dai-chen (Collaborator, Author)

A workaround is to accept a maximum size for the value set. For any value set that exceeds it, an empty set is stored instead. Ref: https://clickhouse.com/docs/en/optimize/skipping-indexes#set

CREATE SKIPPING INDEX ON ...
(clientIp VALUE_SET(100))

Need to figure out:

  1. How to express this with the Spark COLLECT_SET() function (currently used by the skipping index build); see the sketch after this list
  2. How to ignore an empty value set in the OpenSearch query
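
A minimal sketch of both pieces in Spark SQL, assuming a limit of 100 and hypothetical names: clientIp_set for the stored field, file_path for whatever key the index build actually groups on, skipping_index for the index table, and '127.0.0.1' for the value being filtered. NULL is used here instead of a literal empty array so the query-side check stays a simple IS NULL; the empty-array variant would test size(clientIp_set) = 0 instead.

-- Build side: collect the set, but store NULL once it grows past the
-- limit; the CASE is applied to the aggregate result of each group.
SELECT
  file_path,
  CASE WHEN size(collect_set(clientIp)) > 100 THEN NULL
       ELSE collect_set(clientIp)
  END AS clientIp_set
FROM http_logs
GROUP BY file_path;

-- Query side: a NULL (dropped) set means the file may still contain the
-- value, so the file must not be skipped.
SELECT file_path
FROM skipping_index
WHERE clientIp_set IS NULL
   OR array_contains(clientIp_set, '127.0.0.1');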
