merge from upstream #1

Open · wants to merge 516 commits into base: master

Conversation

swamyrajamohan (Owner)

No description provided.

craigtaverner and others added 30 commits January 24, 2023 10:32
…367)

* Add geogrid queries and aggs to geo benchmarks

* Fixed aggs queries for geo tracks

* Fixed incorrect geo_grid query type
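For orientation, a geo grid aggregation of the kind referenced above looks roughly like the sketch below; the index name `osm` and field name `location` are assumptions for illustration, not taken from the track.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch of a geotile_grid aggregation, similar in spirit to the geo grid aggs
# added to the geo benchmarks. Index and field names are illustrative.
resp = es.search(
    index="osm",
    size=0,  # we only care about the aggregation buckets
    aggs={
        "grid": {
            "geotile_grid": {
                "field": "location",  # a geo_point field
                "precision": 7,       # tile zoom level
            }
        }
    },
)
print(resp["aggregations"]["grid"]["buckets"][:3])
```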
That way it will be split into 16 parts, but each part will run from beginning to end.
The `runtime-fields` challenge requires the track parameter `runtime_fields` to be set in order to exercise the runtime fields. We missed an issue with #357 during CI checks because the parameter was missing.
We drop 2 documents from the corpus so that the document count is a
multiple of 16 which will allow Rally to split at exact boundaries.
This commit refactors the exception handling and converts the
runners to use named arguments. Luckily it appears there's only 
one place the method signature has changed, meaning all but 
one of the runner changes are BWC. The custom ILM runner will
be made BWC via rally itself.
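For context, a Rally custom runner written in the keyword-argument style looks roughly like this sketch; the operation name, parameter keys, and ILM policy details are assumptions for illustration, not the actual runner changed here.

```python
# Minimal sketch of an async Rally runner calling the ES client with named
# arguments instead of a positional request body. All names are illustrative.
async def create_ilm_policy(es, params):
    await es.ilm.put_lifecycle(
        name=params["policy-name"],    # hypothetical parameter key
        policy=params["policy-body"],  # hypothetical parameter key
    )
    return {"weight": 1, "unit": "ops", "success": True}


def register(registry):
    # Rally discovers runners through the track's track.py register() hook.
    registry.register_runner("create-ilm-policy", create_ilm_policy, async_runner=True)
```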
The workaround is replaced by one that uses a fixed start/end time range in the template.

This way there is no need to update documents' @timestamp values with an ingest pipeline before indexing. The @timestamp field values can be used directly as they are. This is beneficial because it avoids adding the noise of an ingest pipeline to the challenge. Other tsdb challenges don't use a pipeline.

This should allow running with ingest_mode set to data_stream by default, also in Rally nightlies.

Co-authored-by: Quentin Pradet <[email protected]>
* Use only epoch_second format for parsed corpus.
* Use only strict_date_optional_time for runtime fields/unparsed data.
* Revert "Revert "http_logs: use suitable date formater for @timestamp (#388)" (#389)"

This reverts commit b1479ad.

* Use ISO 8601 date format when ingest_pipeline==grok
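The date format in question is controlled through the `@timestamp` mapping; a minimal sketch, assuming a hypothetical test index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: an explicit date format on @timestamp. "epoch_second" would be used for
# the parsed corpus, "strict_date_optional_time" for runtime fields / unparsed data.
es.indices.create(
    index="logs-parsed-test",  # hypothetical index name
    mappings={
        "properties": {
            "@timestamp": {"type": "date", "format": "epoch_second"}
        }
    },
)
```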
With this commit we lower the target throughput to 1.1
Upgrade geneve in a controlled fashion:

* test the tracks before the upgrade
* avoid overlapping with possibly ongoing investigations
* allow bisecting across geneve upgrades

Related to elastic/elasticsearch-benchmarks#1482
* Add frequent document update challenges to the nyc_taxis track
* Add detailed_results track param for update
* Disable assertions for nyc_taxis: update-aggs-only

New challenge documentation PR to follow.
This is important because running benchmarks against serverless requires at least 1 replica, hence this parameter needs to be configurable.
I have experienced cases where force merge takes a long time and causes
timeout issues; this should improve the situation.
This track indexes generated k8s pod and k8s container data based on the k8s integration and then tests the performance of searches that are based on k8s integration visualisations. See the README for more details.
Minor typo edit
* DLM support for elastic/logs:
  * Add a new 'lifecycle' track parameter accepting (ilm | dlm)
  * Support DLM in component templates (see the sketch below)
  * Document the lifecycle parameter
  * Do not set the DLM lifecycle retention
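A minimal sketch of how data stream lifecycle (DLM) can be enabled through a component template without setting a retention; the template name is an assumption:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: enable data stream lifecycle (DLM) via a component template. An empty
# lifecycle object enables DLM without pinning a retention period, matching the
# "do not set the DLM lifecycle retention" point above. The name is illustrative.
es.cluster.put_component_template(
    name="logs-settings@custom",
    template={"lifecycle": {}},
)
```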
Changed data_stream.dataset filter in status_per_pod_* queries from kubernetes.state_pod to kubernetes.pod.

The kubernetes.state_pod value doesn't exist in the two data streams and therefore this query currently never yields any result. This change should fix that.
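The corrected filter corresponds to a term query of roughly this shape (index pattern is an assumption):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: filter on the dataset the documents actually carry, i.e. kubernetes.pod
# instead of the non-existent kubernetes.state_pod. Index pattern is illustrative.
resp = es.search(
    index="metrics-kubernetes.pod-*",
    query={
        "bool": {
            "filter": [{"term": {"data_stream.dataset": "kubernetes.pod"}}]
        }
    },
)
```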
gareth-ellis and others added 30 commits September 12, 2024 18:03
* Update to use JDK 21 for build
* Parameterise timeout

* Update README.md
From ES v8.14 the default index type for dense_vectors is int8_hnsw.
This modifies our Rally tracks to reflect it.
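For reference, the int8_hnsw index type is selected through the dense_vector mapping's index_options; a sketch with assumed index name, field name, and dimensions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: a dense_vector field explicitly using the int8_hnsw index type, which
# became the default in ES 8.14. Index name, field name and dims are illustrative.
es.indices.create(
    index="vectors-test",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            }
        }
    },
)
```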
`copy_to` is used to copy from `kubernetes.event.message` to `message`.
Now it is supported in Elasticsearch 8.15 and we can benchmark the security
track including it. We also remove a parameter which was used to run a modified
workflow, which was using `kubernetes.event.message` instead of `message`.
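The copy_to relationship described above looks roughly like this in a mapping (field types and index name are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: copy kubernetes.event.message into the top-level message field at index
# time via copy_to. Field types and the index name are illustrative.
es.indices.create(
    index="k8s-events-test",
    mappings={
        "properties": {
            "kubernetes.event.message": {"type": "text", "copy_to": ["message"]},
            "message": {"type": "text"},
        }
    },
)
```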
This PR changes the security track so that we can enable LogsDB
in index templates. Note that the failure store is only available in serverless,
so we gate its usage and exclude it when the deployment is not serverless.

For LogsDB testing we rely on Kibana to install all other component/composable
templates. This is to make sure we need only limited changes to the Rally track.

While testing this new configuration we discovered that installation of (component)
templates by Kibana in Serverless only happens when a user interacts with it.
This means the (component) templates are not installed and the `elastic/security` track
execution fails as a result of using (component) templates that do not exist.
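A minimal sketch of enabling LogsDB through an index template; the template name, pattern, and priority are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: an index template that switches matching data streams to the logsdb
# index mode. Name, pattern and priority are illustrative.
es.indices.put_index_template(
    name="logs-logsdb@custom",
    index_patterns=["logs-*-*"],
    data_stream={},
    priority=500,
    template={"settings": {"index.mode": "logsdb"}},
)
```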
* `enable_logsdb` (default: false) Determines whether the logsdb index mode gets used. If set then index sorting is configured to only use `@timestamp` field and the `source_enabled` parameter will have no effect.
* `force_merge_max_num_segments` (default: unset): An integer specifying the maximum number of segments the force-merge operation should use.
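The force_merge_max_num_segments parameter maps to the max_num_segments option of the force merge API, roughly as follows (index pattern is an assumption):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: force merge down to a single segment per shard; the index pattern is
# illustrative and the value would come from force_merge_max_num_segments.
es.indices.forcemerge(index="logs-*", max_num_segments=1)
```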
If the `host.name` field does not exist, indices created as backing indices of a data stream
are injected with empty values of `host.name`. Sorting on `host.name` and `@timestamp` then
results in sorting just on `@timestamp`. Looking at some mappings I see a `host.hostname`
field exists. Also, a cardinality aggregation returns hundreds of distinct values, which suggests
the field is not empty.

We would like to test using a meaningful combination of fields to sort on. Ideally we expect
better benchmark results, even though other, more effective combinations of
fields might exist. In any case, we are interested in changes over time **given a valid set of fields
to sort on**.
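For reference, index sorting on a field combination is configured via index settings; a sketch with an assumed index name, using host.hostname plus @timestamp as one candidate combination:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: index sorting on a non-empty field plus @timestamp instead of the empty
# host.name. Index name and the chosen field are illustrative.
es.indices.create(
    index="logsdb-sort-test",
    settings={
        "index.sort.field": ["host.hostname", "@timestamp"],
        "index.sort.order": ["asc", "desc"],
    },
    mappings={
        "properties": {
            "host.hostname": {"type": "keyword"},
            "@timestamp": {"type": "date"},
        }
    },
)
```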
This PR introduces a new track parameter, `synthetic_source_keep`, which is used to control the
behaviour of synthetic source for all field types. It can have the values `none`, `arrays` or `all`
(`all` is not usable when set at index level).
See elastic/elasticsearch#112706 to understand the effect of each value.

Later on we will use this to change the behaviour in our nightlies and run benchmarks on both `elastic/logs`
and `elastic/security` using value `arrays`.
The addition of the index.mapping.synthetic_source_keep setting to tsdb is new. For http_logs it is not; previously the index.mapping.synthetic_source_keep setting was hard-coded to arrays. I will open a separate PR that adds the source_keep track param to nightly configs.

Having the source_keep param makes comparing benchmark results between the different source keep options easier.
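The setting behind the synthetic_source_keep track parameter can be applied at index level roughly like this (index name is an assumption; `all` is not valid at index level, as noted above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: keep arrays as they were ingested when using synthetic source.
# Index name is illustrative; valid index-level values are "none" and "arrays".
es.indices.create(
    index="synthetic-source-test",
    settings={"index.mapping.synthetic_source_keep": "arrays"},
)
```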
The parse tool requires the compressed file to be passed to it.
host.hostname has cardinality 100 while host.id has cardinality 50.
This happens because in the dataset there is one host.id for each pair of
hostnames, i.e. a single host.id with two hostnames
such as 'dustin.windows' and 'dustin.linux'. This is probably an artifact
of the data generation script.

Lower cardinality fields might:
* reduce sorting overhead due to fewer comparisons
* improve compression due to more data clustering together

This change should at least allow us to see whether there is any benefit in choosing
a lower cardinality field.
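The cardinality figures mentioned above can be checked with a cardinality aggregation, roughly as follows (index pattern is an assumption):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: compare the approximate number of distinct values of host.id (~50)
# and host.hostname (~100). The index pattern is illustrative.
resp = es.search(
    index="logs-*",
    size=0,
    aggs={
        "unique_host_ids": {"cardinality": {"field": "host.id"}},
        "unique_hostnames": {"cardinality": {"field": "host.hostname"}},
    },
)
print(resp["aggregations"])
```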
…zations (#693)

The recent optimization of ESQL distance-sort allows us to benchmark that properly. That PR also did a few bug-fixes to distance filtering, so we added an alternative version of the query that is now also optimized.
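An ESQL query of the kind being benchmarked looks roughly like the sketch below; the index, field name, reference point, and distance threshold are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: filter by distance and sort by the same distance expression, the pattern
# the distance-sort optimization targets. Index, field and constants are illustrative.
resp = es.esql.query(
    query="""
    FROM geo_points
    | EVAL dist = ST_DISTANCE(location, TO_GEOPOINT("POINT(5.12 52.37)"))
    | WHERE dist < 100000
    | SORT dist ASC
    | LIMIT 100
    """
)
```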
Skips Fleet component templates in elastic/logs when running with serverless
* Initial commit of a dbpedia_ranking relevance evaluation track

* Forcing a CI run

* Move dbpedia into a search/mteb directory

* Rename dbpedia_ranking to dbpedia

* Add performance metrics

* Remove accidental file

* Force CI to run

* Fix typo in track

* Remove dev.tsv

* Remove default to dev.tsv
Add support for source mode to elastic/logs, elastic/security and http_logs.
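As a rough sketch, source mode is selected per index through an index setting; the `index.mapping.source.mode` setting name is an assumption based on recent Elasticsearch releases and should be verified against the version under test:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: create an index with synthetic _source. Index name is illustrative and
# the setting name is an assumption based on recent Elasticsearch releases.
es.indices.create(
    index="source-mode-test",
    settings={"index.mapping.source.mode": "synthetic"},
)
```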
New `esql-ccs-snapshot` challenge in `elastic/logs` which reuses existing ESQL queries
but executes them in CCS context.
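In a CCS context the same ESQL queries target a remote cluster via the `cluster:index` pattern; a minimal sketch with an assumed cluster alias and index pattern:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local coordinating cluster

# Sketch: run an ESQL query against a remote cluster in a CCS setup.
# The remote cluster alias and index pattern are illustrative.
resp = es.esql.query(
    query="""
    FROM remote_cluster:logs-*
    | STATS count = COUNT(*) BY log.level
    | LIMIT 10
    """
)
```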
* Two more benchmarks for partial sorting with ESQL

These cannot be replicated in _search since that only supports what can be pushed down to lucene, and this feature explicitly only pushes down part of the sort, and then does the other part in the compute engine.

* Remove incorrectly added temporary benchmarks.