merge from upstream #1

Open · wants to merge 516 commits into base: master

Conversation

swamyrajamohan (Owner)

No description provided.

craigtaverner and others added 30 commits January 24, 2023 10:32
…367)

* Add geogrid queries and aggs to geo benchmarks

* Fixed aggs queries for geo tracks

* Fixed incorrect geo_grid query type
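For orientation, a geo grid aggregation of the kind referenced above looks roughly like the sketch below; the index name `osm` and field name `location` are assumptions for illustration, not taken from the track.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch of a geotile_grid aggregation, similar in spirit to the geo grid aggs
# added to the geo benchmarks. Index and field names are illustrative.
resp = es.search(
    index="osm",
    size=0,  # we only care about the aggregation buckets
    aggs={
        "grid": {
            "geotile_grid": {
                "field": "location",  # a geo_point field
                "precision": 7,       # tile zoom level
            }
        }
    },
)
print(resp["aggregations"]["grid"]["buckets"][:3])
```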
That way it will be split into 16 parts, but each part will run from beginning to end.
The `runtime-fields` challenge requires the track parameter `runtime_fields` to be set in order to exercise the runtime fields. We missed an issue with #357 during CI checks because the parameter was missing.
We drop 2 documents from the corpus so that the document count is a
multiple of 16 which will allow Rally to split at exact boundaries.
This commit refactors the exception handling and converts the
runners to use named arguments. Luckily it appears there's only 
one place the method signature has changed, meaning all but 
one of the runner changes are BWC. The custom ILM runner will
be made BWC via rally itself.
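For context, a Rally custom runner written in the keyword-argument style looks roughly like this sketch; the operation name, parameter keys, and ILM policy details are assumptions for illustration, not the actual runner changed here.

```python
# Minimal sketch of an async Rally runner calling the ES client with named
# arguments instead of a positional request body. All names are illustrative.
async def create_ilm_policy(es, params):
    await es.ilm.put_lifecycle(
        name=params["policy-name"],    # hypothetical parameter key
        policy=params["policy-body"],  # hypothetical parameter key
    )
    return {"weight": 1, "unit": "ops", "success": True}


def register(registry):
    # Rally discovers runners through the track's track.py register() hook.
    registry.register_runner("create-ilm-policy", create_ilm_policy, async_runner=True)
```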
The workaround is replaced by one that uses a fixed start/end time range in the template.

This way there is no need to update documents' @timestamp values with an ingest pipeline before indexing. The @timestamp field values can be used directly as they are. This is beneficial because it avoids adding the noise of an ingest pipeline to the challenge. Other tsdb challenges don't use a pipeline.

This should allow running with ingest_mode set to data_stream by default, also in Rally nightlies.

Co-authored-by: Quentin Pradet <[email protected]>
* Use only epoch_second format for parsed corpus.
* Use only strict_date_optional_time for runtime fields/unparsed data.
* Revert "Revert "http_logs: use suitable date formater for @timestamp (#388)" (#389)"

This reverts commit b1479ad.

* Use ISO 8601 date format when ingest_pipeline==grok
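The date format in question is controlled through the `@timestamp` mapping; a minimal sketch, assuming a hypothetical test index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: an explicit date format on @timestamp. "epoch_second" would be used for
# the parsed corpus, "strict_date_optional_time" for runtime fields / unparsed data.
es.indices.create(
    index="logs-parsed-test",  # hypothetical index name
    mappings={
        "properties": {
            "@timestamp": {"type": "date", "format": "epoch_second"}
        }
    },
)
```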
With this commit we lower the target throughput to 1.1
Upgrade geneve in a controlled fashion:

* test the tracks before the upgrade
* avoid overlapping with possibly ongoing investigations
* allow bisecting across geneve upgrades

Related to elastic/elasticsearch-benchmarks#1482
* Add frequent document update challenges to the nyc_taxis track
* Add detailed_results track param for update
* Disable assertions for nyc_taxis: update-aggs-only

New challenge documentation PR to follow.
This is important because running benchmarks against serverless requires at least 1 replica, hence this parameter needs to be configurable.
I have experienced cases where force merge takes a long time and causes
timeout issues; this should improve the situation.
This track indexes generated k8s pod and k8s container data based on the k8s integration and then tests the performance of searches that are based on k8s integration visualisations. See the README for more details.
Minor typo edit
* DLM support for elastic/logs:
  * Add a new 'lifecycle' track parameter accepting (ilm | dlm)
  * Support DLM in component templates (see the sketch below)
  * Document the lifecycle parameter
  * Do not set the DLM lifecycle retention
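A minimal sketch of how data stream lifecycle (DLM) can be enabled through a component template without setting a retention; the template name is an assumption:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: enable data stream lifecycle (DLM) via a component template. An empty
# lifecycle object enables DLM without pinning a retention period, matching the
# "do not set the DLM lifecycle retention" point above. The name is illustrative.
es.cluster.put_component_template(
    name="logs-settings@custom",
    template={"lifecycle": {}},
)
```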
Changed data_stream.dataset filter in status_per_pod_* queries from kubernetes.state_pod to kubernetes.pod.

The kubernetes.state_pod value doesn't exist in the two data streams and therefore this query currently never yields any result. This change should fix that.
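The corrected filter corresponds to a term query of roughly this shape (index pattern is an assumption):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: filter on the dataset the documents actually carry, i.e. kubernetes.pod
# instead of the non-existent kubernetes.state_pod. Index pattern is illustrative.
resp = es.search(
    index="metrics-kubernetes.pod-*",
    query={
        "bool": {
            "filter": [{"term": {"data_stream.dataset": "kubernetes.pod"}}]
        }
    },
)
```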
gareth-ellis and others added 30 commits September 12, 2024 18:03
* Update to use JDK 21 for build
* Parameterise timeout

* Update README.md
From ES v8.14 the default index type for dense_vectors is int8_hnsw.
This modifies our Rally tracks to reflect it.
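For reference, the int8_hnsw index type is selected through the dense_vector mapping's index_options; a sketch with assumed index name, field name, and dimensions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: a dense_vector field explicitly using the int8_hnsw index type, which
# became the default in ES 8.14. Index name, field name and dims are illustrative.
es.indices.create(
    index="vectors-test",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            }
        }
    },
)
```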
`copy_to` is used to copy from `kubernetes.event.message` to `message`.
Now it is supported in Elasticsearch 8.15 and we can benchmark the security
track including it. We also remove a parameter which was used to run a modified
workflow, which was using `kubernetes.event.message` instead of `message`.
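The copy_to relationship described above looks roughly like this in a mapping (field types and index name are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: copy kubernetes.event.message into the top-level message field at index
# time via copy_to. Field types and the index name are illustrative.
es.indices.create(
    index="k8s-events-test",
    mappings={
        "properties": {
            "kubernetes.event.message": {"type": "text", "copy_to": ["message"]},
            "message": {"type": "text"},
        }
    },
)
```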
This PR changes the security track so that we can enable LogsDB
in index templates. Note that the failure store is only available in serverless,
so we gate its usage and exclude it when the deployment is not serverless.

For LogsDB testing we rely on Kibana to install all other component/composable
templates. This is to make sure we need only limited changes to the Rally track.

While testing this new configuration we discovered that installation of (component)
templates by Kibana in Serverless only happens when a user interacts with it.
This means the (component) templates are not installed and the `elastic/security` track
execution fails as a result of using (component) templates that do not exist.
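A minimal sketch of enabling LogsDB through an index template; the template name, pattern, and priority are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: an index template that switches matching data streams to the logsdb
# index mode. Name, pattern and priority are illustrative.
es.indices.put_index_template(
    name="logs-logsdb@custom",
    index_patterns=["logs-*-*"],
    data_stream={},
    priority=500,
    template={"settings": {"index.mode": "logsdb"}},
)
```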
* `enable_logsdb` (default: false) Determines whether the logsdb index mode gets used. If set then index sorting is configured to only use `@timestamp` field and the `source_enabled` parameter will have no effect.
* `force_merge_max_num_segments` (default: unset): An integer specifying the maximum number of segments the force-merge operation should use.
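The force_merge_max_num_segments parameter maps to the max_num_segments option of the force merge API, roughly as follows (index pattern is an assumption):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: force merge down to a single segment per shard; the index pattern is
# illustrative and the value would come from force_merge_max_num_segments.
es.indices.forcemerge(index="logs-*", max_num_segments=1)
```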
If the `host.name` field does not exist, indices created as backing indices of a data stream
are injected with empty values of `host.name`. Sorting on `host.name` and `@timestamp` then
results in sorting just on `@timestamp`. Looking at some mappings I see a `host.hostname`
field exists. Also, a cardinality aggregation returns hundreds of distinct values, which suggests
the field is not empty.

We would like to test using a meaningful combination of fields to sort on. Ideally we expect
better benchmark results, even though other, more effective combinations of
fields might exist. In any case, we are interested in changes over time **given a valid set of fields
to sort on**.
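For reference, index sorting on a field combination is configured via index settings; a sketch with an assumed index name, using host.hostname plus @timestamp as one candidate combination:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: index sorting on a non-empty field plus @timestamp instead of the empty
# host.name. Index name and the chosen field are illustrative.
es.indices.create(
    index="logsdb-sort-test",
    settings={
        "index.sort.field": ["host.hostname", "@timestamp"],
        "index.sort.order": ["asc", "desc"],
    },
    mappings={
        "properties": {
            "host.hostname": {"type": "keyword"},
            "@timestamp": {"type": "date"},
        }
    },
)
```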
This PR introduces a new track parameter, `synthetic_source_keep`, which is used to control the
behaviour of synthetic source for all field types. It can have the values `none`, `arrays` or `all`
(`all` is not usable when set at index level).
See elastic/elasticsearch#112706 to understand the effect of each value.

Later on we will use this to change the behaviour in our nightlies and run benchmarks on both `elastic/logs`
and `elastic/security` using value `arrays`.
The addition of the index.mapping.synthetic_source_keep setting to tsdb is new. For http_logs it is not; previously the index.mapping.synthetic_source_keep setting was hard-coded to arrays. I will open a separate PR that adds the source_keep track param to nightly configs.

Having the source_keep param makes comparing benchmark results between the different source keep options easier.
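The setting behind the synthetic_source_keep track parameter can be applied at index level roughly like this (index name is an assumption; `all` is not valid at index level, as noted above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: keep arrays as they were ingested when using synthetic source.
# Index name is illustrative; valid index-level values are "none" and "arrays".
es.indices.create(
    index="synthetic-source-test",
    settings={"index.mapping.synthetic_source_keep": "arrays"},
)
```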
The parse tool requires the compressed file to be passed to it.
host.hostname has cardinality 100 while host.id has cardinality 50.
This happens because in the dataset there is one host.id for each pair of
hostnames, i.e. a single host.id with two hostnames
such as 'dustin.windows' and 'dustin.linux'. This is probably an artifact
of the data generation script.

Lower cardinality fields might:
* reduce sorting overhead due to fewer comparisons
* improve compression due to more data clustering together

This change should at least allow us to see whether there is any benefit in choosing
a lower cardinality field.
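The cardinality figures mentioned above can be checked with a cardinality aggregation, roughly as follows (index pattern is an assumption):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: compare the approximate number of distinct values of host.id (~50)
# and host.hostname (~100). The index pattern is illustrative.
resp = es.search(
    index="logs-*",
    size=0,
    aggs={
        "unique_host_ids": {"cardinality": {"field": "host.id"}},
        "unique_hostnames": {"cardinality": {"field": "host.hostname"}},
    },
)
print(resp["aggregations"])
```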
…zations (#693)

The recent optimization of ESQL distance-sort allows us to benchmark that properly. That PR also did a few bug-fixes to distance filtering, so we added an alternative version of the query that is now also optimized.
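An ESQL query of the kind being benchmarked looks roughly like the sketch below; the index, field name, reference point, and distance threshold are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: filter by distance and sort by the same distance expression, the pattern
# the distance-sort optimization targets. Index, field and constants are illustrative.
resp = es.esql.query(
    query="""
    FROM geo_points
    | EVAL dist = ST_DISTANCE(location, TO_GEOPOINT("POINT(5.12 52.37)"))
    | WHERE dist < 100000
    | SORT dist ASC
    | LIMIT 100
    """
)
```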
Skips Fleet component templates in elastic/logs when running with serverless
* Initial commit of a dbpedia_ranking relevance evaluation track

* Forcing a CI run

* Move dbpedia into a search/mteb directory

* Rename dbpedia_ranking to dbpedia

* Add performance metrics

* Remove accidental file

* Force CI to run

* Fix typo in track

* Remove dev.tsv

* Remove default to dev.tsv
Add support for source mode to elastic/logs, elastic/security and http_logs.
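As a rough sketch, source mode is selected per index through an index setting; the `index.mapping.source.mode` setting name is an assumption based on recent Elasticsearch releases and should be verified against the version under test:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local test cluster

# Sketch: create an index with synthetic _source. Index name is illustrative and
# the setting name is an assumption based on recent Elasticsearch releases.
es.indices.create(
    index="source-mode-test",
    settings={"index.mapping.source.mode": "synthetic"},
)
```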
New `esql-ccs-snapshot` challenge in `elastic/logs` which reuses existing ESQL queries
but executes them in CCS context.
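In a CCS context the same ESQL queries target a remote cluster via the `cluster:index` pattern; a minimal sketch with an assumed cluster alias and index pattern:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local coordinating cluster

# Sketch: run an ESQL query against a remote cluster in a CCS setup.
# The remote cluster alias and index pattern are illustrative.
resp = es.esql.query(
    query="""
    FROM remote_cluster:logs-*
    | STATS count = COUNT(*) BY log.level
    | LIMIT 10
    """
)
```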
* Two more benchmarks for partial sorting with ESQL

These cannot be replicated in _search since that only supports what can be pushed down to lucene, and this feature explicitly only pushes down part of the sort, and then does the other part in the compute engine.

* Remove incorrectly added temporary benchmarks.