Enable logsdb for http_logs #646

Draft
wants to merge 9 commits into
base: master
2 changes: 2 additions & 0 deletions http_logs/README.md
@@ -42,6 +42,8 @@ This track allows to overwrite the following parameters with Rally 0.8.0+ using
* `number_of_shards` (default: 5)
* `source_enabled` (default: true): A boolean defining whether the `_source` field is stored in the index.
* `index_settings`: A list of index settings. Index settings defined elsewhere (e.g. `number_of_replicas`) need to be overridden explicitly.
* `index_mode` (default: unset): Set to `logsdb` to enable indexing to [logs data streams](https://www.elastic.co/guide/en/elasticsearch/reference/master/logs-data-stream.html). If not enabled, Rally will not use logs data streams.
Member:

Alternatively, we could just enable logsdb on the plain index and not use data streams. Would this make the change a little simpler, since this is an index-oriented track?

But I think it is fine to use logsdb with data stream here.

Member Author:

TIL logsdb can be enabled on a plain index. From https://www.elastic.co/guide/en/elasticsearch/reference/current/logs-data-stream.html I had understood that a data stream was required.

I would like to leave data stream support for serverless, though I think combining the template into one may have complicated things a bit, so I am thinking of splitting the template into two files: one for a plain index, the other for data stream. WDYT?
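
To make the two options in this thread concrete, here is a minimal Python sketch of the request bodies involved. It assumes the `index.mode` index setting accepts `logsdb` on a regular index, as the reviewer indicates; `plain_index_body` and `data_stream_template_body` are hypothetical helper names, not part of Rally or this track.

```python
import json


def plain_index_body(number_of_shards=5):
    """Create-index body enabling logsdb directly on a regular index (no data stream)."""
    return {
        "settings": {
            "index.mode": "logsdb",  # assumption: setting name as used by Elasticsearch
            "index.number_of_shards": number_of_shards,
        }
    }


def data_stream_template_body(patterns, priority):
    """Composable template body: matching names become logsdb-backed data streams."""
    return {
        "index_patterns": patterns,
        "priority": priority,
        "data_stream": {},  # the presence of this key is what makes it a data stream template
        "template": {"settings": {"index.mode": "logsdb"}},
    }


print(json.dumps(plain_index_body(), indent=2))
print(json.dumps(data_stream_template_body(["logs-*"], 101), indent=2))
```

Splitting the track's template into two files, as proposed, would roughly correspond to keeping these two shapes separate instead of switching between them with conditionals in one file.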

* `index_type` (default: unset): Set to `data_stream` to enable indexing to data streams. `index_type` is not required when `index_mode` is set to `logsdb`.
* `cluster_health` (default: "green"): The minimum required cluster health.
* `ingest_pipeline`: Only applicable for `--challenge=append-index-only-with-ingest-pipeline`, selects which ingest
node pipeline to run. Valid options are `'baseline'` (default), `'grok'` and `'geoip'`. For example: `--challenge=append-index-only-with-ingest-pipeline --track-params="ingest_pipeline:'baseline'" `
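
The interplay between `index_mode` and `index_type` in the parameter list above can be sketched in Python. This only restates the Jinja condition used in the schedule files of this PR; it is not Rally code, and `resolve_operation_type` and `format_track_params` are hypothetical helper names.

```python
def resolve_operation_type(index_mode=None, index_type=None):
    """Mirror the Jinja condition from the schedule files (illustrative only)."""
    if index_mode == "logsdb" or index_type == "data_stream":
        return "create-data-stream"
    return "create-index"


def format_track_params(params):
    """Render a dict as a --track-params value, quoting strings as the README example does."""
    parts = []
    for key, value in params.items():
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        parts.append(f"{key}:{rendered}")
    return ",".join(parts)


# index_mode alone is enough to switch to data streams.
print(resolve_operation_type(index_mode="logsdb"))
# index_type selects data streams without logsdb.
print(resolve_operation_type(index_type="data_stream"))
# With neither parameter set, the track keeps using plain indices.
print(resolve_operation_type())
print(format_track_params({"index_mode": "logsdb", "number_of_shards": 1}))
```

The last line produces a string that could be passed as `--track-params="..."`, in the same style as the `ingest_pipeline` example above.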
39 changes: 36 additions & 3 deletions http_logs/challenges/common/default-schedule.json
@@ -1,11 +1,44 @@
{
"operation": "delete-index"
"operation": {
"name": "delete-data-stream",
"operation-type": "delete-data-stream",
"only-if-exists": false,
"data-stream": ["logs-181998", "logs-191998", "logs-201998", "logs-211998", "logs-221998", "logs-231998", "logs-241998", "reindexed-logs"]
},
"tags": ["setup"]
},
{
"operation": {
"name": "delete-index",
"operation-type": "delete-index",
"only-if-exists": false,
"index": ["logs-181998", "logs-191998", "logs-201998", "logs-211998", "logs-221998", "logs-231998", "logs-241998", "reindexed-logs"]
},
"tags": ["setup"]
},
{
"operation" : {
"name": "delete-all-index-templates",
"operation-type": "delete-composable-template"
},
"tags": ["setup"]
},
{
"operation": {
"operation-type": "create-index",
"name": "create-all-templates",
"operation-type": "create-composable-template"
},
"tags": ["setup"]
},
{
{%- if index_mode == "logsdb" or index_type == "data_stream" %}
{%- set indexing_operation_type = "create-data-stream" %}
{%- endif %}
"operation": {
"operation-type": {{ indexing_operation_type | default("create-index") | tojson }},
"settings": {{index_settings | default({}) | tojson}}
}
},
"tags": ["setup"]
},
{
"name": "check-cluster-health",
32 changes: 32 additions & 0 deletions http_logs/challenges/common/setup-schedule.json
@@ -0,0 +1,32 @@
{
"operation": {
"name": "delete-data-stream",
"operation-type": "delete-data-stream",
"only-if-exists": false,
"data-stream": ["logs-181998", "logs-191998", "logs-201998", "logs-211998", "logs-221998", "logs-231998", "logs-241998", "reindexed-logs"]
},
"tags": ["setup"]
},
{
"operation": {
"name": "delete-index",
"operation-type": "delete-index",
"only-if-exists": false,
"index": ["logs-181998", "logs-191998", "logs-201998", "logs-211998", "logs-221998", "logs-231998", "logs-241998", "reindexed-logs"]
},
"tags": ["setup"]
},
{
"operation" : {
"name": "delete-all-index-templates",
"operation-type": "delete-composable-template"
},
"tags": ["setup"]
},
{
"operation": {
"name": "create-all-templates",
"operation-type": "create-composable-template"
},
"tags": ["setup"]
}
71 changes: 46 additions & 25 deletions http_logs/challenges/default.json
@@ -17,14 +17,18 @@
"name": "append-no-conflicts-index-only",
"description": "Indexes the whole document corpus using Elasticsearch default settings. We only adjust the number of replicas as we benchmark a single node cluster and Rally will only start the benchmark if the cluster turns green. Document ids are unique so all index operations are append only.",
"schedule": [
{{ rally.collect(parts="common/setup-schedule.json") }},
{
"operation": "delete-index"
},
{
{%- if index_mode == "logsdb" or index_type == "data_stream" %}
{%- set indexing_operation_type = "create-data-stream" %}
{%- endif %}
"operation": {
"operation-type": "create-index",
"operation-type": {{ indexing_operation_type | default("create-index") | tojson }},
"settings": {{index_settings | default({}) | tojson}}
}
},
"tags": [
"setup"
]
},
{
"name": "check-cluster-health",
@@ -77,17 +81,21 @@
"name": "append-sorted-no-conflicts",
"description": "Indexes the whole document corpus in an index sorted by timestamp field in descending order (most recent first) and using a setup that will lead to a lower indexing throughput than the default settings. Document ids are unique so all index operations are append only.",
"schedule": [
{{ rally.collect(parts="common/setup-schedule.json") }},
{
"operation": "delete-index"
},
{
{%- if index_mode == "logsdb" or index_type == "data_stream" %}
{%- set indexing_operation_type = "create-data-stream" %}
{%- endif %}
"operation": {
"operation-type": "create-index",
"operation-type": {{ indexing_operation_type | default("create-index") | tojson }},
"settings": {%- if index_settings is defined %} {{index_settings | tojson}} {%- else %} {
"index.sort.field": "@timestamp",
"index.sort.order": "desc"
}{%- endif %}
}
},
"tags": [
"setup"
]
},
{
"name": "check-cluster-health",
@@ -140,14 +148,21 @@
"name": "append-index-only-with-ingest-pipeline",
"description": "Indexes the whole document corpus using Elasticsearch default settings. We only adjust the number of replicas as we benchmark a single node cluster and Rally will only start the benchmark if the cluster turns green. Document ids are unique so all index operations are append only. Runs the documents through an ingest node pipeline to parse the http logs. May require --elasticsearch-plugins='ingest-geoip' ",
"schedule": [
{{ rally.collect(parts="common/setup-schedule.json") }},
{
"operation": "delete-index"
},
{
{%- if index_mode == "logsdb" or index_type == "data_stream" %}
{%- set indexing_operation_type = "create-data-stream" %}
{%- endif %}
"operation": {
"operation-type": "create-index",
"settings": {{index_settings | default({}) | tojson}}
}
"operation-type": {{ indexing_operation_type | default("create-index") | tojson }},
"settings": {%- if index_settings is defined %} {{index_settings | tojson}} {%- else %} {
"index.sort.field": "@timestamp",
Member:

Looking at the template, I'm not sure what else would be a good sort field without knowing the data. I think sorting by timestamp alone is fine.

Contributor @salvatore-campagna, Sep 4, 2024:

There is basically nothing there: `@timestamp`, `message`, `clientip`, `request`, `status`, `size` and `geoip`. In the end it depends on the queries. The idea is that we sort to favor query latency, but sorting also determines how effective doc value compression is.

Contributor @salvatore-campagna, Sep 4, 2024:

As a result, I think we can use this as an example to understand what happens if `host.name` is missing. I would just remove the index sorting configuration and rely on the defaults. In practice, since `host.name` is missing, this would result in sorting on `@timestamp` only (the `host.name` field is injected and will be empty). LogsDB gracefully handles the case where the `host.name` field is missing.
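
The settings fallback under discussion can be sketched in Python as follows. This is an illustration of the Jinja `if index_settings is defined ... else ...` branch in the diff plus the reviewer's suggestion, not code from the PR; `resolve_settings` and the `rely_on_logsdb_defaults` flag are hypothetical names, and `None` stands in for "parameter not defined".

```python
DEFAULT_SORT = {"index.sort.field": "@timestamp", "index.sort.order": "desc"}


def resolve_settings(index_settings=None, index_mode=None, rely_on_logsdb_defaults=False):
    """Pick the settings for the create operation.

    - An explicit index_settings track parameter always wins.
    - With the reviewer's suggestion applied (rely_on_logsdb_defaults=True),
      logsdb runs send no sort settings and leave sorting to Elasticsearch.
    - Otherwise fall back to the explicit @timestamp descending sort from the diff.
    """
    if index_settings is not None:
        return dict(index_settings)
    if rely_on_logsdb_defaults and index_mode == "logsdb":
        return {}
    return dict(DEFAULT_SORT)


print(resolve_settings())
print(resolve_settings(index_mode="logsdb", rely_on_logsdb_defaults=True))
print(resolve_settings(index_settings={"index.codec": "best_compression"}))
```

Whether the empty-settings branch is the right choice depends on the queries in the track, as noted above: the sort order affects both query latency and doc value compression.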

"index.sort.order": "desc"
}{%- endif %}
},
"tags": [
"setup"
]
},
{
"name": "check-cluster-health",
@@ -201,10 +216,9 @@
},
{
"name": "update",
"description": "Perform bulk update operations. The update challenge is for standard index use only.",
"schedule": [
{
"operation": "delete-index"
},
{{ rally.collect(parts="common/setup-schedule.json") }},
{
"operation": {
"operation-type": "create-index",
@@ -268,14 +282,21 @@
"name": "append-no-conflicts-index-reindex-only",
"description": "Indexes the whole document corpus using Elasticsearch default settings. We only adjust the number of replicas as we benchmark a single node cluster and Rally will only start the benchmark if the cluster turns green. Document ids are unique so all index operations are append only. After indexing, same data are reindexed.",
"schedule": [
{{ rally.collect(parts="common/setup-schedule.json") }},
{
"operation": "delete-index"
},
{
{%- if index_mode == "logsdb" or index_type == "data_stream" %}
{%- set indexing_operation_type = "create-data-stream" %}
{%- endif %}
"operation": {
"operation-type": "create-index",
"settings": {{index_settings | default({}) | tojson}}
}
"operation-type": {{ indexing_operation_type | default("create-index") | tojson }},
"settings": {%- if index_settings is defined %} {{index_settings | tojson}} {%- else %} {
"index.sort.field": "@timestamp",
"index.sort.order": "desc"
}{%- endif %}
},
"tags": [
"setup"
]
},
{
"name": "check-cluster-health",
100 changes: 77 additions & 23 deletions http_logs/index-runtime-fields.json → http_logs/index-template.json
@@ -1,30 +1,84 @@
{
"settings": {
{# non-serverless-index-settings-marker-start #}{%- if build_flavor != "serverless" or serverless_operator == true -%}
"index.number_of_shards": {{ number_of_shards | default(5) }},
"index.number_of_replicas": {{ number_of_replicas | default(0) }},
"index.requests.cache.enable": false
{%- endif -%}{# non-serverless-index-settings-marker-end #}
},
"mappings": {
"dynamic": "strict",
"_source": {
"enabled": {{ source_enabled | default(true) | tojson }}
"priority": 101,
Contributor:

Member Author:

I used 101 to ensure the template took priority over the built-in one for logs-*. Should it be 100 for this?

Contributor:

I was referring to the fact that the standard ones should have 100, so it should be fine for this one to have 101 (and thus higher priority). I saw some CI failures reporting a template priority issue while merging templates.

Contributor:

I see

Cannot run task [create-all-templates]: Request returned an error. Error type: api, Description: illegal_argument_exception ({'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'index template [rally-http_logs] has index patterns [logs-*, reindexed-logs] matching patterns from existing templates [apm-index-template] with patterns (apm-index-template => [traces-apm*, logs-apm*, metrics-apm*]) that have the same priority [101], multiple index templates may not match during index creation, please use a different priority'}], 'type': 'illegal_argument_exception', 'reason': 'index template [rally-http_logs] has index patterns [logs-*, reindexed-logs] matching patterns from existing templates [apm-index-template] with patterns (apm-index-template => [traces-apm*, logs-apm*, metrics-apm*]) that have the same priority [101], multiple index templates may not match during index creation, please use a different priority'}, 'status': 400}), HTTP Status: 400
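
The conflict reported in this error can be reproduced in miniature with a rough Python sketch. The template names, patterns and priority 101 come from the error message; everything else is an assumption: `patterns_may_overlap` is a simple heuristic (it can miss some overlaps and is not how Elasticsearch compares patterns), `find_conflicts` is a hypothetical helper, and 102 is only an example of a different priority.

```python
from fnmatch import fnmatchcase


def patterns_may_overlap(a, b):
    """Heuristic: strip the wildcards from one pattern and match the rest against the other."""
    return fnmatchcase(a.replace("*", ""), b) or fnmatchcase(b.replace("*", ""), a)


def find_conflicts(new_patterns, new_priority, existing):
    """Return names of existing templates with the same priority and overlapping patterns."""
    conflicts = []
    for name, (patterns, priority) in existing.items():
        if priority != new_priority:
            continue  # different priorities never produce this error
        if any(patterns_may_overlap(p, q) for p in new_patterns for q in patterns):
            conflicts.append(name)
    return conflicts


# Values taken from the error message above.
existing = {"apm-index-template": (["traces-apm*", "logs-apm*", "metrics-apm*"], 101)}
new_patterns = ["logs-*", "reindexed-logs"]

print(find_conflicts(new_patterns, 101, existing))  # ['apm-index-template']
print(find_conflicts(new_patterns, 102, existing))  # []
```

Here `logs-*` overlaps with `logs-apm*`, so two templates at priority 101 could both match an index such as `logs-apm-foo`; that ambiguity is what the request rejects.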

"index_patterns": ["logs-*", "reindexed-logs"],
{%- if index_mode == "logsdb" or index_type == "data_stream" %}
"data_stream": {},
{%- endif %}
"template": {
"settings": {
{%- if index_mode %}
"mode": {{ index_mode | tojson }},
{%- endif -%}
{# non-serverless-index-settings-marker-start -#}
{%- if build_flavor != "serverless" %}
"index.number_of_replicas": {{ number_of_replicas | default(0) | tojson }},
{%- endif -%}
{%- if build_flavor != "serverless" or serverless_operator == true %}
"index.number_of_shards": {{ number_of_shards | default(5) | tojson }},
"index.requests.cache.enable": false
{%- endif -%}
{# non-serverless-index-settings-marker-end #}
},
"properties": {
"@timestamp": {
"format": "strict_date_optional_time",
"type": "date"
},
"message": {
"type": "wildcard",
"fields": {
"keyword": {
"type": "keyword"
"mappings": {
"dynamic": "strict",
{%- if index_mode != "logsdb" %}
"_source": {
"enabled": {{ source_enabled | default(true) | tojson }}
},
{%- endif %}
"properties": {
"@timestamp": {
{%- if (ingest_pipeline is defined and ingest_pipeline == "grok") or runtime_fields is defined %}
"format": "strict_date_optional_time",
{%- else %}
"format": "epoch_second",
{%- endif %}
"type": "date"
},
{%- if runtime_fields is defined %}
"message": {
"type": "wildcard",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
{%- else %}
"message": {
"type": "keyword",
"index": false,
"doc_values": false
},
{%- endif %}
"clientip": {
"type": "ip"
},
"request": {
"type": "match_only_text",
"fields": {
"raw": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"status": {
"type": "integer"
},
"size": {
"type": "integer"
},
"geoip" : {
"properties" : {
"country_name": { "type": "keyword" },
"city_name": { "type": "keyword" },
"location" : { "type" : "geo_point" }
}
}
}
},
}
{%- if runtime_fields is defined %},
"runtime": {
{%- set sources = [('source', 'message.source'), ('wildcard', 'message'), ('keyword', 'message.keyword')] %}
{%- for source_type, field in sources %}
@@ -97,6 +151,6 @@
"type": "keyword",
"script": "emit(params._source.message)"
}
}
}{% endif %}
}
}
55 changes: 0 additions & 55 deletions http_logs/index.json

This file was deleted.
