From ebec1a2fe2bc2b9fc40401074dbbb0dbcdc800bd Mon Sep 17 00:00:00 2001 From: Salvatore Campagna <93581129+salvatore-campagna@users.noreply.github.com> Date: Thu, 24 Oct 2024 18:25:38 +0200 Subject: [PATCH] Improve Logsdb docs including default values (#115205) This PR adds detailed documentation for `logsdb` mode, covering several key aspects of its default behavior and configuration options. It includes: - default settings for index sorting (`index.sort.field`, `index.sort.order`, etc.). - usage of synthetic `_source` by default. - information about specialized codecs and how users can override them. - default behavior for `ignore_malformed` and `ignore_above` settings, including precedence rules. - explanation of how fields without `doc_values` are handled and what we do if they are missing. --- docs/reference/data-streams/logs.asciidoc | 180 +++++++++++++++++++++- 1 file changed, 172 insertions(+), 8 deletions(-) diff --git a/docs/reference/data-streams/logs.asciidoc b/docs/reference/data-streams/logs.asciidoc index e870289bcf7be..6bb98684544a3 100644 --- a/docs/reference/data-streams/logs.asciidoc +++ b/docs/reference/data-streams/logs.asciidoc @@ -8,14 +8,6 @@ A logs data stream is a data stream type that stores log data more efficiently. In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data stream. The exact impact will vary depending on your data set. -The following features are enabled in a logs data stream: - -* <>, which omits storing the `_source` field. When the document source is requested, it is synthesized from document fields upon retrieval. - -* Index sorting. This yields a lower storage footprint. By default indices are sorted by `host.name` and `@timestamp` fields at index time. - -* More space efficient compression for fields with <> enabled. - [discrete] [[how-to-use-logsds]] === Create a logs data stream @@ -50,3 +42,175 @@ DELETE _index_template/my-index-template ---- // TEST[continued] //// + +[[logsdb-default-settings]] + +[discrete] +[[logsdb-synthtic-source]] +=== Synthetic source + +By default, `logsdb` mode uses <>, which omits storing the original `_source` +field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few +restrictions which you can read more about in the <> section dedicated to it. + +NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values +are preserved for <> reconstruction. In `logsdb`, the default value is `arrays`, +which retains both duplicate values and the order of entries but not necessarily the exact structure when it comes to +array elements or objects. Preserving duplicates and ordering could be critical for some log fields. This could be the +case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events. + +For more details on this setting and ways to refine or bypass it, check out <>. + +[discrete] +[[logsdb-sort-settings]] +=== Index sort settings + +The following settings are applied by default when using the `logsdb` mode for index sorting: + +* `index.sort.field`: `["host.name", "@timestamp"]` + In `logsdb` mode, indices are sorted by `host.name` and `@timestamp` fields by default. For data streams, the + `@timestamp` field is automatically injected if it is not present. + +* `index.sort.order`: `["desc", "desc"]` + The default sort order for both fields is descending (`desc`), prioritizing the latest data. + +* `index.sort.mode`: `["min", "min"]` + The default sort mode is `min`, ensuring that indices are sorted by the minimum value of multi-value fields. + +* `index.sort.missing`: `["_first", "_first"]` + Missing values are sorted to appear first (`_first`) in `logsdb` index mode. + +`logsdb` index mode allows users to override the default sort settings. For instance, users can specify their own fields +and order for sorting by modifying the `index.sort.field` and `index.sort.order`. + +When using default sort settings, the `host.name` field is automatically injected into the mappings of the +index as a `keyword` field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and +retrieved based on the `host.name` and `@timestamp` fields. + +NOTE: If `subobjects` is set to `true` (which is the default), the `host.name` field will be mapped as an object field +named `host`, containing a `name` child field of type `keyword`. On the other hand, if `subobjects` is set to `false`, +a single `host.name` field will be mapped as a `keyword` field. + +Once an index is created, the sort settings are immutable and cannot be modified. To apply different sort settings, +a new index must be created with the desired configuration. For data streams, this can be achieved by means of an index +rollover after updating relevant (component) templates. + +If the default sort settings are not suitable for your use case, consider modifying them. Keep in mind that sort +settings can influence indexing throughput, query latency, and may affect compression efficiency due to the way data +is organized after sorting. For more details, refer to our documentation on +<>. + +NOTE: For <>, the `@timestamp` field is automatically injected if not already present. +However, if custom sort settings are applied, the `@timestamp` field is injected into the mappings, but it is not +automatically added to the list of sort fields. + +[discrete] +[[logsdb-specialized-codecs]] +=== Specialized codecs + +`logsdb` index mode uses the `best_compression` <> by default, which applies {wikipedia}/Zstd[ZSTD] +compression to stored fields. Users are allowed to override it and switch to the `default` codec for faster compression +at the expense of slightly larger storage footprint. + +`logsdb` index mode also adopts specialized codecs for numeric doc values that are crafted to optimize storage usage. +Users can rely on these specialized codecs being applied by default when using `logsdb` index mode. + +Doc values encoding for numeric fields in `logsdb` follows a static sequence of codecs, applying each one in the +following order: delta encoding, offset encoding, Greatest Common Divisor GCD encoding, and finally Frame Of Reference +(FOR) encoding. The decision to apply each encoding is based on heuristics determined by the data distribution. +For example, before applying delta encoding, the algorithm checks if the data is monotonically non-decreasing or +non-increasing. If the data fits this pattern, delta encoding is applied; otherwise, the next encoding is considered. + +The encoding is specific to each Lucene segment and is also re-applied at segment merging time. The merged Lucene segment +may use a different encoding compared to the original Lucene segments, based on the characteristics of the merged data. + +The following methods are applied sequentially: + +* **Delta encoding**: + a compression method that stores the difference between consecutive values instead of the actual values. + +* **Offset encoding**: + a compression method that stores the difference from a base value rather than between consecutive values. + +* **Greatest Common Divisor (GCD) encoding**: + a compression method that finds the greatest common divisor of a set of values and stores the differences + as multiples of the GCD. + +* **Frame Of Reference (FOR) encoding**: + a compression method that determines the smallest number of bits required to encode a block of values and uses + bit-packing to fit such values into larger 64-bit blocks. + +For keyword fields, **Run Length Encoding (RLE)** is applied to the ordinals, which represent positions in the Lucene +segment-level keyword dictionary. This compression is used when multiple consecutive documents share the same keyword. + +[discrete] +[[logsdb-ignored-settings]] +=== `ignore_malformed`, `ignore_above`, `ignore_dynamic_beyond_limit` + +By default, `logsdb` index mode sets `ignore_malformed` to `true`. This setting allows documents with malformed fields +to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some +fields contain invalid or improperly formatted data. + +Users can override this setting by setting `index.mapping.ignore_malformed` to `false`. However, this is not recommended +as it might result in documents with malformed fields being rejected and not indexed at all. + +In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure +efficient storage and indexing of large keyword fields.The index-level default for `ignore_above` is set to 8191 +**characters**. If using UTF-8 encoding, this results in a limit of 32764 bytes, depending on character encoding. +The mapping-level `ignore_above` setting still takes precedence. If a specific field has an `ignore_above` value +defined in its mapping, that value will override the index-level `index.mapping.ignore_above` value. This default +behavior helps to optimize indexing performance by preventing excessively large string values from being indexed, while +still allowing users to customize the limit, overriding it at the mapping level or changing the index level default +setting. + +In `logsdb` index mode, the setting `index.mapping.total_fields.ignore_dynamic_beyond_limit` is set to `true` by +default. This allows dynamically mapped fields to be added on top of statically defined fields without causing document +rejection, even after the total number of fields exceeds the limit defined by `index.mapping.total_fields.limit`. The +`index.mapping.total_fields.limit` setting specifies the maximum number of fields an index can have (static, dynamic +and runtime). When the limit is reached, new dynamically mapped fields will be ignored instead of failing the document +indexing, ensuring continued log ingestion without errors. + +NOTE: When automatically injected, `host.name` and `@timestamp` contribute to the limit of mapped fields. When +`host.name` is mapped with `subobjects: true` it consists of two fields. When `host.name` is mapped with +`subobjects: false` it only consists of one field. + +[discrete] +[[logsdb-nodocvalue-fields]] +=== Fields without doc values + +When `logsdb` index mode uses synthetic `_source`, and `doc_values` are disabled for a field in the mapping, +Elasticsearch may set the `store` setting to `true` for that field as a last resort option to ensure that the field's +data is still available for reconstructing the document’s source when retrieving it via +<>. + +For example, this happens with text fields when `store` is `false` and there is no suitable multi-field available to +reconstruct the original value in <>. + +This automatic adjustment allows synthetic source to work correctly, even when doc values are not enabled for certain +fields. + +[discrete] +[[logsdb-settings-summary]] +=== LogsDB settings summary + +The following is a summary of key settings that apply when using `logsdb` index mode in Elasticsearch: + +* **`index.mode`**: `"logsdb"` + +* **`index.mapping.synthetic_source_keep`**: `"arrays"` + +* **`index.sort.field`**: `["host.name", "@timestamp"]` + +* **`index.sort.order`**: `["desc", "desc"]` + +* **`index.sort.mode`**: `["min", "min"]` + +* **`index.sort.missing`**: `["_first", "_first"]` + +* **`index.codec`**: `"best_compression"` + +* **`index.mapping.ignore_malformed`**: `true` + +* **`index.mapping.ignore_above`**: `8191` + +* **`index.mapping.total_fields.ignore_dynamic_beyond_limit`**: `true`