Merge branch 'main' into 41938-filestream-do-not-run-duplicated-id

elastic · Dec 30, 2024 · bea2d8e
2 parents 4c1db16 + 111a480
commit bea2d8e
Showing 78 changed files with 8,557 additions and 4,273 deletions.
diff --git a/CHANGELOG-developer.next.asciidoc b/CHANGELOG-developer.next.asciidoc
@@ -108,6 +108,7 @@ The list below covers the major changes between 7.0.0-rc2 and main only.
 - AWS CloudWatch Metrics record previous endTime to use for next collection period and change log.logger from cloudwatch to aws.cloudwatch. {pull}40870[40870]
 - Fix flaky test in cel and httpjson inputs of filebeat. {issue}40503[40503] {pull}41358[41358]
 - Fix documentation and implementation of raw message handling in Filebeat http_endpoint by removing it. {pull}41498[41498]
+- Fix flaky test in filebeat Okta entity analytics provider. {issue}42059[42059] {pull}42123[42123]
 
 ==== Added
 

diff --git a/CHANGELOG.next.asciidoc b/CHANGELOG.next.asciidoc
@@ -54,6 +54,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
 - Filebeat fails to start if there is any input with a duplicated ID. It logs the duplicated IDs and the offending inputs configurations. {pull}41731[41731]
 - Filestream inputs with duplicated IDs will fail to start. An error is logged showing the ID and the full input configuration. {issue}41938[41938] {pull}41954[41954]
 - Filestream inputs can define `allow_deprecated_id_duplication: true` to keep the previous behaviour of running inputs with duplicated IDs. {issue}41938[41938] {pull}41954[41954]
+- The Filestream input only starts to ingest a file when it is >= 1024 bytes in size. This happens because `fingerprint` is now the default file identity. To restore the previous behaviour, set `file_identity.native: ~` and `prospector.scanner.fingerprint.enabled: false`. {issue}40197[40197] {pull}41762[41762]
 
 *Heartbeat*
 
@@ -199,6 +200,8 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
 - Further rate limiting fix in the Okta provider of the Entity Analytics input. {issue}40106[40106] {pull}41977[41977]
 - Fix streaming input handling of invalid or empty websocket messages. {pull}42036[42036]
 - Fix awss3 document ID construction when using the CSV decoder. {pull}42019[42019]
+- The `_id` generation process for S3 events has been updated to incorporate the LastModified field. This enhancement ensures that the `_id` is unique. {pull}42078[42078]
+- Fix Netflow Template Sharing configuration handling. {pull}42080[42080]
 
 *Heartbeat*
 
@@ -233,7 +236,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
 - Log Cisco Meraki `getDevicePerformanceScores` errors without stopping metrics collection. {pull}41622[41622]
 - Don't skip first bucket value in GCP metrics metricset for distribution type metrics {pull}41822[41822]
 - Fixed `creation_date` scientific notation output in the `elasticsearch.index` metricset. {pull}42053[42053]
-
+- Fix bug where metricbeat unintentionally triggers Windows ASR. {pull}42177[42177]
 
 *Osquerybeat*
 
@@ -373,8 +376,10 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
 - Add support for SSL and Proxy configurations for websoket type in streaming input. {pull}41934[41934]
 - AWS S3 input registry cleanup for untracked s3 objects. {pull}41694[41694]
 - The environment variable `BEATS_AZURE_EVENTHUB_INPUT_TRACING_ENABLED: true` enables internal logs tracer for the azure-eventhub input. {issue}41931[41931] {pull}41932[41932]
+- The Filestream input now uses the `fingerprint` file identity by default. The state of files is automatically migrated if the previous file identity was `native` (the default) or `path`. If the `file_identity` is explicitly set, there is no change in behaviour. {issue}40197[40197] {pull}41762[41762]
 - Rate limiting operability improvements in the Okta provider of the Entity Analytics input. {issue}40106[40106] {pull}41977[41977]
 - Added default values in the streaming input for websocket retries and put a cap on retry wait time to be lesser than equal to the maximum defined wait time. {pull}42012[42012]
+- Rate limiting fault tolerance improvements in the Okta provider of the Entity Analytics input. {issue}40106[40106] {pull}42094[42094]
 
 *Auditbeat*
 
@@ -388,6 +393,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
 
 - Added status to monitor run log report.
 - Upgrade node to latest LTS v18.20.3. {pull}40038[40038]
+- Add support for RFC7231 methods to http monitors. {pull}41975[41975]
 
 *Metricbeat*
 

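Note on the Filestream changelog entry above: a minimal sketch of an input that opts out of the new default, using the two settings the entry names. The input ID and the path are hypothetical placeholders, not values taken from this commit.

```yaml
filebeat.inputs:
  - type: filestream
    id: my-filestream-id        # hypothetical ID, must be unique among inputs
    paths:
      - /var/log/app/*.log      # hypothetical path
    # Identify files by inode and device ID, as before this change.
    file_identity.native: ~
    # Disable fingerprinting in the scanner so that files smaller than
    # 1024 bytes are ingested without waiting for them to grow.
    prospector.scanner.fingerprint.enabled: false
```

According to the changelog entry, an input that sets `file_identity` explicitly keeps its current behaviour, so a configuration like this should not trigger a state migration.
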
diff --git a/NOTICE.txt b/NOTICE.txt
diff --git a/filebeat/_meta/config/filebeat.global.reference.yml.tmpl b/filebeat/_meta/config/filebeat.global.reference.yml.tmpl
@@ -15,6 +15,8 @@
 # batch of events has been published successfully. The default value is 1s.
 #filebeat.registry.flush: 1s
 
+# The interval at which to run the registry cleanup.
+#filebeat.registry.cleanup_interval: 5m
 
 # Starting with Filebeat 7.0, the registry uses a new directory format to store
 # Filebeat state. After you upgrade, Filebeat will automatically migrate a 6.x

diff --git a/filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl b/filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl
@@ -303,7 +303,7 @@ filebeat.inputs:
   # If enabled, instead of relying on the device ID and inode values when comparing files,
   # compare hashes of the given byte ranges in files. A file becomes an ingest target
   # when its size grows larger than offset+length (see below). Until then it's ignored.
-  #prospector.scanner.fingerprint.enabled: false
+  #prospector.scanner.fingerprint.enabled: true
 
   # If fingerprint mode is enabled, sets the offset from the beginning of the file
   # for the byte range used for computing the fingerprint value.
@@ -438,8 +438,9 @@ filebeat.inputs:
   #clean_removed: true
 
   # Method to determine if two files are the same or not. By default
-  # the Beat considers two files the same if their inode and device id are the same.
-  #file_identity.native: ~
+  # a fingerprint is generated using the first 1024 bytes of the file.
+  # If the fingerprints match, the files are considered the same.
+  #file_identity.fingerprint: ~
 
   # Optional additional fields. These fields can be freely picked
   # to add additional information to the crawled log files for filtering
@@ -770,25 +771,57 @@ filebeat.inputs:
 # Journald input is experimental.
 #- type: journald
   #enabled: true
-  #id: service-foo
 
-  # You may wish to have separate inputs for each service. You can use
-  # include_matches.or to specify a list of filter expressions that are
-  # applied as a logical OR. You may specify filter
-  #include_matches.match:
-    #- _SYSTEMD_UNIT=foo.service
+  # Unique ID among all inputs. If the ID changes, all entries
+  # will be re-ingested.
+  #id: my-journald-id
 
-  # List of syslog identifiers
-  #syslog_identifiers: ["audit"]
+  # Specify paths to read from custom journal files.
+  # Leave it unset to read the system's journal
+  # Glob based paths.
+  #paths:
+    #- /var/log/custom.journal
+
+  # The position to start reading from the journal, valid options are:
+  #  - head: Starts reading at the beginning of the journal.
+  #  - tail: Starts reading at the end of the journal.
+  #    This means that no events will be sent until a new message is written.
+  #  - since: Also set the `since` option to determine when to start reading.
+  #seek: head
+
+  # A time offset from the current time to start reading from.
+  # To use since, seek option must be set to since.
+  #since: -24h
 
   # Collect events from the service and messages about the service,
   # including coredumps.
-  #units: ["docker.service"]
+  #units:
+    #- docker.service
+
+  # List of syslog identifiers
+  #syslog_identifiers: ["audit"]
 
   # The list of transports (_TRANSPORT field of journald entries)
   #transports: ["audit"]
 
-  # Parsers are also supported, here is an example of the multiline
+  # Filter logs by facilities, they must be specified using their numeric code.
+  #facilities:
+    #- 1
+    #- 2
+
+  # You may wish to have separate inputs for each service. You can use
+  # include_matches.or to specify a list of filter expressions that are
+  # applied as a logical OR.
+  #include_matches.match:
+    #- _SYSTEMD_UNIT=foo.service
+
+  # Uses the original hostname of the entry instead of the one
+  # from the host running journald
+  #save_remote_hostname: false
+
+  # Parsers are also supported, the possible parsers are:
+  # container, include_message, multiline, ndjson, syslog.
+  # Here is an example of the multiline
   # parser.
   #parsers:
   #- multiline:

diff --git a/filebeat/_meta/config/filebeat.inputs.yml.tmpl b/filebeat/_meta/config/filebeat.inputs.yml.tmpl
@@ -41,3 +41,26 @@ filebeat.inputs:
   #fields:
   #  level: debug
   #  review: 1
+
+# journald is an input for collecting logs from Journald
+- type: journald
+
+  # Unique ID among all inputs. If the ID changes, all entries
+  # will be re-ingested.
+  id: my-journald-id
+
+  # The position to start reading from the journal, valid options are:
+  #  - head: Starts reading at the beginning of the journal.
+  #  - tail: Starts reading at the end of the journal.
+  #    This means that no events will be sent until a new message is written.
+  #  - since: Also set the `since` option to determine when to start reading.
+  #seek: head
+
+  # A time offset from the current time to start reading from.
+  # To use since, seek option must be set to since.
+  #since: -24h
+
+  # Collect events from the service and messages about the service,
+  # including coredumps.
+  #units:
+    #- docker.service
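
Note on the journald template above: a minimal sketch of the same input with the `since` seek mode enabled, using only options that appear in the template. The values shown are the template's own example values, not recommendations.

```yaml
filebeat.inputs:
  - type: journald
    # Unique ID among all inputs. Changing it re-ingests all entries.
    id: my-journald-id
    # Start reading from a point in time instead of head or tail.
    seek: since
    # Offset from the current time; only used when seek is set to since.
    since: -24h
    # Collect events from, and messages about, this unit.
    units:
      - docker.service
```

With `seek: tail` instead, no events would be sent until a new message is written to the journal, as the template comments describe.
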
diff --git a/filebeat/docs/faq.asciidoc b/filebeat/docs/faq.asciidoc
@@ -19,6 +19,10 @@ We do not recommend reading log files from network volumes. Whenever possible, i
 send the log files directly from there. Reading files from network volumes (especially on Windows) can have unexpected side
 effects. For example, changed file identifiers may result in {beatname_uc} reading a log file from scratch again.
 
+If it is not possible to read from the host, then using the
+<<filebeat-input-filestream-file-identity-fingerprint, `fingerprint`>>
+file identity is the next best option.
+
 [[filebeat-not-collecting-lines]]
 === {beatname_uc} isn't collecting lines from a file
 
@@ -71,6 +75,13 @@ By default states are never removed from the registry file. To resolve the inode
 
 You can use <<{beatname_lc}-input-log-clean-removed,`clean_removed`>> for files that are removed from disk. Be aware that `clean_removed` cleans the file state from the registry whenever a file cannot be found during a scan. If the file shows up again later, it will be sent again from scratch.
 
+Aside from that you should also change the
+<<filebeat-input-filestream-file-identity, `file_identity`>> to
+<<filebeat-input-filestream-file-identity-fingerprint,
+`fingerprint`>>. If you were using `native` (the default) or `path`,
+the state of the files will be automatically migrated to
+`fingerprint`.
+
 include::filebeat-log-rotation.asciidoc[]
 
 [[windows-file-rotation]]

diff --git a/filebeat/docs/inputs/input-filestream-file-options.asciidoc b/filebeat/docs/inputs/input-filestream-file-options.asciidoc
@@ -165,9 +165,9 @@ The default setting is 10s.
 [id="{beatname_lc}-input-{type}-scan-fingerprint"]
 ===== `prospector.scanner.fingerprint`
 
-Instead of relying on the device ID and inode values when comparing files, compare hashes of the given byte ranges of files.
-
-Enable this option if you're experiencing data loss or data duplication due to unstable file identifiers provided by the file system.
+Instead of relying on the device ID and inode values when comparing
+files, compare hashes of the given byte ranges of files. This is the
+default behaviour for {beatname_uc}.
 
 Following are some scenarios where this can happen:
 
@@ -557,34 +557,71 @@ indirectly set higher priorities on certain inputs by assigning a higher
 limit of harvesters.
 
 [float]
+[id="{beatname_lc}-input-{type}-file-identity"]
 ===== `file_identity`
 
 Different `file_identity` methods can be configured to suit the
 environment where you are collecting log messages.
 
-WARNING: Changing `file_identity` methods between runs may result in
-duplicated events in the output.
+IMPORTANT: Changing `file_identity` is only supported from `native` or
+`path` to `fingerprint`. In those cases {beatname_uc} will
+automatically migrate the state of the file when {type} starts.
+
+WARNING: Any unsupported change in `file_identity` methods between
+runs may result in duplicated events in the output.
+
+[id="{beatname_lc}-input-{type}-file-identity-fingerprint"]
+*`fingerprint`*:: The default behaviour of {beatname_uc} is to
+identify files based on content by hashing a specific range (0 to 1024
+bytes by default).
+
+WARNING: In order to use this file identity option, you must enable
+the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint
+option in the scanner>>. Once this file identity is enabled, changing
+the fingerprint configuration (offset, length, or other settings) will
+lead to a global re-ingestion of all files that match the paths
+configuration of the input.
+
+Please refer to the
+<<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint
+configuration for details>>.
+
+[source,yaml]
+----
+file_identity.fingerprint: ~
+----
 
-*`native`*:: The default behaviour of {beatname_uc} is to differentiate
-between files using their inodes and device ids.
+*`native`*:: Differentiates between files using their inodes and
+device ids.
 +
 In some cases these values can change during the lifetime of a file. 
-For example, when using the Linux link:https://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linux%29[LVM] (Logical Volume Manager), device numbers are allocated dynamically at module load (refer to link:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/logical_volume_manager_administration/lv#persistent_numbers[Persistent Device Numbers] in the Red Hat Enterprise Linux documentation). To avoid the possibility of data duplication in this case, you can set `file_identity` to `path` rather than `native`.
+For example, when using the Linux
+link:https://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linux%29[LVM]
+(Logical Volume Manager), device numbers are allocated dynamically at
+module load (refer to
+link:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/logical_volume_manager_administration/lv#persistent_numbers[Persistent
+Device Numbers] in the Red Hat Enterprise Linux documentation). To
+avoid the possibility of data duplication in this case, you can set
+`file_identity` to `fingerprint` rather than the default `native`.
++
+The states of files generated by `native` file identity can be migrated to `fingerprint`.
 
 [source,yaml]
 ----
 file_identity.native: ~
 ----
 
 *`path`*:: To identify files based on their paths use this strategy.
-
++
 WARNING: Only use this strategy if your log files are rotated to a folder
 outside of the scope of your input or not at all. Otherwise you end up
 with duplicated events.
-
++
 WARNING: This strategy does not support renaming files.
 If an input file is renamed, {beatname_uc} will read it again if the new path
 matches the settings of the input.
++
+The states of files generated by `path` file identity can be migrated to `fingerprint`.
 
 [source,yaml]
 ----
@@ -593,25 +630,14 @@ file_identity.path: ~
 
 *`inode_marker`*:: If the device id changes from time to time, you must use
 this method to distinguish files. This option is not supported on Windows.
-
++
 Set the location of the marker file the following way:
 
 [source,yaml]
 ----
 file_identity.inode_marker.path: /logs/.filebeat-marker
 ----
 
-*`fingerprint`*:: To identify files based on their content byte range.
-
-WARNING: In order to use this file identity option, you must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, or other settings) will lead to a global re-ingestion of all files that match the paths configuration of the input.
-
-Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.
-
-[source,yaml]
-----
-file_identity.fingerprint: ~
-----
-
 [[filestream-log-rotation-support]]
 [float]
 === Log rotation
@@ -624,6 +650,7 @@ When reading from rotating files make sure the paths configuration includes
 both the active file and all rotated files.
 
 By default, {beatname_uc} is able to track files correctly in the following strategies:
+
 * create: new active file with a unique name is created on rotation
 * rename: rotated files are renamed
 

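Note on the `file_identity` documentation above: a sketch of a filestream input that spells out the fingerprint defaults explicitly. The `offset` and `length` key names are assumed from the "offset, length" settings the documentation mentions, the values mirror the stated default range of 0 to 1024 bytes, and the ID and path are hypothetical.

```yaml
filebeat.inputs:
  - type: filestream
    id: my-filestream-id        # hypothetical ID
    paths:
      - /var/log/app/*.log      # hypothetical path
    # Identify files by a hash of part of their content.
    file_identity.fingerprint: ~
    # The scanner must have fingerprinting enabled for this identity.
    prospector.scanner.fingerprint.enabled: true
    # Assumed key names; the byte range that is hashed.
    prospector.scanner.fingerprint.offset: 0
    prospector.scanner.fingerprint.length: 1024
```

As the documentation warns, changing the offset or length after files have been ingested leads to a re-ingestion of every file that matches the input's paths, so these values are best fixed before the first run.
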
diff --git a/filebeat/docs/inputs/input-filestream.asciidoc b/filebeat/docs/inputs/input-filestream.asciidoc
@@ -34,6 +34,11 @@ The `log` writes the complete file state.
 
 7. Stale entries can be removed from the registry, even if there is no active input.
 
+8. The default behaviour is to identify files based on their contents
+using the <<filebeat-input-filestream-file-identity-fingerprint,
+`fingerprint`>> <<filebeat-input-filestream-file-identity,
+`file_identity`>>. This solves data duplication caused by inode reuse.
+
 To configure this input, specify a list of glob-based <<filestream-input-paths,`paths`>>
 that must be crawled to locate and fetch the log lines.
 
@@ -86,20 +91,32 @@ multiple input sections:
 [[filestream-file-identity]]
 ==== Reading files on network shares and cloud providers
 
-WARNING: Filebeat does not support reading from network shares and cloud providers.
+WARNING: Some file identity methods do not support reading from
+network shares and cloud providers. To avoid duplicating events, use
+the default `file_identity`: `fingerprint`.
+
+IMPORTANT: Changing `file_identity` is only supported when
+migrating from `native` or `path` to `fingerprint`.
+
+WARNING: Any unsupported change in `file_identity` methods between
+runs may result in duplicated events in the output.
 
-However, one of the limitations of these data sources can be mitigated
-if you configure Filebeat adequately.
+`fingerprint` is the default and recommended file identity because it does not
+rely on the file system or OS. It generates a hash from a portion of the
+file (the first 1024 bytes, by default) and uses that to identify the
+file. This works well with log rotation strategies that move or rename
+the file, and on Windows, where file identifiers might be more
+volatile. The downside is that {beatname_uc} will wait until a file
+reaches 1024 bytes before it starts ingesting that file.
 
-By default, {beatname_uc} identifies files based on their inodes and
-device IDs. However, on network shares and cloud providers these
-values might change during the lifetime of the file. If this happens
-{beatname_uc} thinks that file is new and resends the whole content
-of the file. To solve this problem you can configure the `file_identity` option. Possible
-values besides the default `inode_deviceid` are `path`, `inode_marker` and `fingerprint`.
+WARNING: Once this file identity is enabled, changing
+the fingerprint configuration (offset, length, etc.) will lead to a
+global re-ingestion of all files that match the paths configuration of
+the input.
 
-WARNING: Changing `file_identity` methods between runs may result in
-duplicated events in the output.
+Please refer to the
+<<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint
+configuration for details>>.
 
 Selecting `path` instructs {beatname_uc} to identify files based on their
 paths. This is a quick way to avoid rereading files if inode and device ids
@@ -117,13 +134,6 @@ example oneliner generates a hidden marker file for the selected mountpoint `/lo
 Please note that you should not use this option on Windows as file identifiers might be
 more volatile.
 
-Selecting `fingerprint` instructs {beatname_uc} to identify files based on their
-content byte range.
-
-WARNING: In order to use this file identity option, one must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, etc) will lead to a global re-ingestion of all files that match the paths configuration of the input.
-
-Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.
-
 ["source","sh",subs="attributes"]
 ----
 $ lsblk -o MOUNTPOINT,UUID | grep /logs | awk '{print $2}' >> /logs/.filebeat-marker