simdutf: simdutf_connector: in_tail: Implement UTF-16LE/UTF-16BE encoder #9468

Open
wants to merge 24 commits into
base: master


cosmo0920
Contributor

@cosmo0920 cosmo0920 commented Oct 7, 2024

On Windows, many programs use UTF-16LE. This is because "Unicode" on Windows means UTF-16LE with a BOM (Byte Order Mark).
In addition, there are many differences between UTF-16LE/UTF-16BE and UTF-8.
I added C, J, and subdivision flags test cases for converting from UTF-16LE/UTF-16BE to UTF-8 to the unit tests for the in_tail plugin. This is because in_tail is the main place where non-UTF-8 encodings are processed.
As a first step, we need to process the UTF-16LE and UTF-16BE encodings.
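
To illustrate the kind of conversion described above, here is a minimal sketch of BOM-based byte-order detection followed by transcoding to UTF-8. It uses Python's standard `codecs` module rather than simdutf, and it is not the implementation in this PR; the function name `utf16_to_utf8` and its `default` parameter are hypothetical names chosen for this example.

```python
import codecs


def utf16_to_utf8(raw: bytes, default: str = "utf-16-le") -> bytes:
    """Transcode UTF-16 bytes to UTF-8, using the BOM (if present) to pick the byte order."""
    if raw.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        text = raw[2:].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        text = raw[2:].decode("utf-16-be")
    else:
        # No BOM: fall back to a caller-chosen byte order (an assumption of this sketch).
        text = raw.decode(default)
    return text.encode("utf-8")


# "ロ" takes 2 bytes in UTF-16 but 3 bytes in UTF-8.
# "😀" lies outside the BMP: a surrogate pair (4 bytes) in UTF-16, 4 bytes in UTF-8.
sample = "ログ😀"
le = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")

print(utf16_to_utf8(le) == utf16_to_utf8(be))
```

Both inputs carry the same text in opposite byte orders, so after the BOM is consumed they should transcode to identical UTF-8 bytes.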

Note that the simdutf library is written in C++, so we also provide an option (FLB_UNICODE_ENCODER) to turn this feature on or off.

Closes #9321


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
[SERVICE]
   flush           1
   log_level       trace

[INPUT]
   Name              tail
   Path              <path/to/non-UTF-8_encoded_file.log>
   Read_from_Head    True
   Unicode.Encoding  auto

[OUTPUT]
   Name  stdout
   Match *
  • Debug log output from testing the change
Fluent Bit v4.0.0
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/01/15 10:27:22] [ info] Configuration:
[2025/01/15 10:27:22] [ info]  flush time     | 1.000000 seconds
[2025/01/15 10:27:22] [ info]  grace          | 5 seconds
[2025/01/15 10:27:22] [ info]  daemon         | 0
[2025/01/15 10:27:22] [ info] ___________
[2025/01/15 10:27:22] [ info]  inputs:
[2025/01/15 10:27:22] [ info]      tail
[2025/01/15 10:27:22] [ info] ___________
[2025/01/15 10:27:22] [ info]  filters:
[2025/01/15 10:27:22] [ info] ___________
[2025/01/15 10:27:22] [ info]  outputs:
[2025/01/15 10:27:22] [ info]      stdout.0
[2025/01/15 10:27:22] [ info] ___________
[2025/01/15 10:27:22] [ info]  collectors:
[2025/01/15 10:27:22] [ info] [fluent bit] version=4.0.0, commit=6d00ba1fde, pid=1537587
[2025/01/15 10:27:22] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2025/01/15 10:27:22] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/01/15 10:27:22] [ info] [simd    ] SSE2
[2025/01/15 10:27:22] [ info] [cmetrics] version=0.9.9
[2025/01/15 10:27:22] [ info] [ctraces ] version=0.5.7
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] initializing
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2025/01/15 10:27:22] [debug] [tail:tail.0] created event channels: read=25 write=26
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] adjusted buf_max_size to 4001
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] adjusted buf_chunk_size to 4001
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inotify watch fd=31
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170643 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log, inode 43170643
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log'
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170624 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log, inode 43170624
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log'
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170625 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log, inode 43170625
[2025/01/15 10:27:22] [ info] [output:stdout:stdout.0] worker #0 started
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log'
[2025/01/15 10:27:22] [debug] [stdout:stdout.0] created event channels: read=35 write=36
[2025/01/15 10:27:22] [ info] [sp] stream processor started
[2025/01/15 10:27:22] [trace] [input chunk] update output instances with new chunk size diff=123, records=1, input=tail.0
[2025/01/15 10:27:22] [trace] [input chunk] update output instances with new chunk size diff=109, records=1, input=tail.0
[2025/01/15 10:27:22] [trace] [input chunk] update output instances with new chunk size diff=196, records=1, input=tail.0
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] [static files] processed 290b
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170643 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log promote to TAIL_EVENT
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170643 watch_fd=1 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170624 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log promote to TAIL_EVENT
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170624 watch_fd=2 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170625 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log promote to TAIL_EVENT
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170625 watch_fd=3 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2025/01/15 10:27:22] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:22] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [task 0x617ac40] created (id=0)
[2025/01/15 10:27:23] [debug] [task] created task=0x617ac40 id=0 OK
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] sqlerrorlog: [[1736904442.640693144, {}], {"log"=>"🏴󠁧󠁢󠁥󠁮󠁧󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁷󠁬󠁳󠁿"}]
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[1] sqlerrorlog: [[1736904442.666284429, {}], {"log"=>"用汉字在 Fluent Bit 中处理日志,就像是一个梦一样😀"}]
[2] sqlerrorlog: [[1736904442.668104080, {}], {"log"=>"にほんごテストログふぁいる。文字エンコーディングをUnicodeにできる!?☕😀⚪⚫🔴🔵🟠🟡🟢🟣🟤🇺🇸🇯🇵"}]
[2025/01/15 10:27:23] [debug] [out flush] cb_destroy coro_id=0
[2025/01/15 10:27:23] [trace] [coro] destroy coroutine=0x6180fa0 data=0x6180fc0
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [engine] [task event] task_id=0 out_id=0 return=OK
[2025/01/15 10:27:23] [debug] [task] destroy task=0x617ac40 (task_id=0)
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:24] [trace] [sched] 0 timer coroutines destroyed
^C[2025/01/15 10:27:24] [engine] caught signal (SIGINT)
[2025/01/15 10:27:24] [trace] [engine] flush enqueued data
[2025/01/15 10:27:24] [ warn] [engine] service will shutdown in max 5 seconds
[2025/01/15 10:27:24] [ info] [input] pausing tail.0
[2025/01/15 10:27:24] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:24] [ info] [engine] service has stopped (0 pending tasks)
[2025/01/15 10:27:24] [ info] [input] pausing tail.0
[2025/01/15 10:27:24] [debug] [input:tail:tail.0] inode=43170643 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:24] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2025/01/15 10:27:24] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170643 watch_fd=1
[2025/01/15 10:27:24] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:24] [debug] [input:tail:tail.0] inode=43170624 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:24] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2025/01/15 10:27:24] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170624 watch_fd=2
[2025/01/15 10:27:24] [debug] [input:tail:tail.0] inode=43170625 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:24] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170625 watch_fd=3
  • Attached Valgrind output that shows no leaks or memory corruption was found
==1537587== 
==1537587== HEAP SUMMARY:
==1537587==     in use at exit: 0 bytes in 0 blocks
==1537587==   total heap usage: 3,465 allocs, 3,465 frees, 1,062,937 bytes allocated
==1537587== 
==1537587== All heap blocks were freed -- no leaks are possible
==1537587== 
==1537587== For lists of detected and suppressed errors, rerun with: -s
==1537587== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
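
To try the example configuration from the Testing section locally, a UTF-16 encoded input file is needed. The following is a minimal sketch for producing one in each byte order; the helper name `write_utf16_sample`, the file names, and the sample line are made up for illustration and are not the fixtures used by this PR's tests.

```python
import codecs
import os
import tempfile


def write_utf16_sample(path: str, text: str, big_endian: bool = False) -> None:
    """Write `text` to `path` as UTF-16 with a leading BOM in the chosen byte order."""
    bom = codecs.BOM_UTF16_BE if big_endian else codecs.BOM_UTF16_LE
    codec = "utf-16-be" if big_endian else "utf-16-le"
    with open(path, "wb") as f:
        f.write(bom + text.encode(codec))


# Hypothetical file names in a fresh temporary directory.
out_dir = tempfile.mkdtemp()
le_path = os.path.join(out_dir, "sample_utf16le.log")
be_path = os.path.join(out_dir, "sample_utf16be.log")

line = "UTF-16 test line ☕😀\n"
write_utf16_sample(le_path, line)
write_utf16_sample(be_path, line, big_endian=True)

print(le_path)
print(be_path)
```

Pointing `Path` in the `[INPUT]` section at one of the printed paths should then exercise the conversion, assuming the build has the feature enabled.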

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#1471

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-try-to-bundle-simdutf-amalgamation branch from d1b404a to 4053bbd Compare October 7, 2024 07:13
@cosmo0920 cosmo0920 force-pushed the cosmo0920-try-to-bundle-simdutf-amalgamation branch from 4053bbd to 2a515ea Compare October 7, 2024 07:17
Signed-off-by: Hiroshi Hatake <[email protected]>
…s not fully support C++11

Signed-off-by: Hiroshi Hatake <[email protected]>
Also, wait relatively longer than in the ordinary test cases.
This is because these Unicode test cases need to read their contents from
the filesystem.

Signed-off-by: Hiroshi Hatake <[email protected]>
@cosmo0920
Contributor Author

It seems that this PR is frozen?

The merge is just postponed.

@tguenneguez

Do you have any visibility on which agent version will include this change?

@patrick-stephens
Contributor

Do you have any visibility on which agent version will include this change?

master is currently targeting the 4.0 release: https://github.com/fluent/fluent-bit/wiki/Fluent-Bit-Roadmap

Labels
docs-required ok-package-test Run PR packaging tests
Development

Successfully merging this pull request may close these issues.

Add support for reading files encoded in UTF-16 for Tail Input
6 participants