Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter nest/lift is broken on Windows in 2.1.10 and 2.2.0 =>logs corrupted #8137

Closed
janlonsetteig opened this issue Nov 6, 2023 · 32 comments

Comments

@janlonsetteig
Copy link

Bug Report

Describe the bug
After upgrading from Fluent Bit 2.1.9 to 2.1.10 the nest/lift filter is no longer working and corrupts the logs

My case are Windows pods in k8s

Fluent Bit reads the log files and adds Kubernetes information. Then it nest/lift the Kubernetes fields

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        DNS_Retries         10
        DNS_Wait_Time       30
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Labels              On
        Annotations         Off
    [FILTER]
        Name                nest
        Match               kube.*
        Operation           lift
        Nested_under        kubernetes
        Add_prefix          _kubernetes.

The result in cr.fluentbit.io/fluent/fluent-bit:windows-2019-2.1.10 would look something like this:

{"time"=>"2023-11-06T14:04:33.8847011Z", "stream"=>"stdout", "_p"=>"F", "RequestHttpVerb"=>"GET", "RequestUri"=>"http://art.staging.firmglobal.net/identity/.well-known/openid-configuration", "ResponseStatusCode"=>"200", "TimeTakenMicros"=>"12622", "Service"=>"Confirmit.Identity", "ServiceVersion"=>"2.24.1753", "CorrelationId"=>"23015ddcfebe4f7e9fc2d180318e0adb", "InitiatingService"=>"Confirmit.Identity", "EventTime"=>"2023-11-06T14:04:33.863+00", "Severity"=>"INFO", "Logger"=>"Confirmit.Logging.Web.AccessLogger", "RequestHeaders"=>"Host={art.staging.firmglobal.net}", "_kubernetes.pod_name�identity3-api-7c5c56447f-hb6rv�|_kubernetes.namespace_n"=>97, 109=>101, "horizons"=>"_kubernetes.pod_id�$09cc3c99-a4ac-4f54-a884-02bf196b5069�<_k", 117=>98, 101=>114, 110=>101, 116=>101, 115=>46, 108=>97}

As you can see there are some weird characters in the log after doing nest/lift. And also some weird 117=>98, 101=>114, 110=>101, 116=>101, 115=>46, 108=>97.

All this works fine when I revert the container image to cr.fluentbit.io/fluent/fluent-bit:windows-2019-2.1.9

This is how the output looks in 2.1.9

{"time"=>"2023-11-06T14:34:58.7061335Z", "stream"=>"stdout", "_p"=>"F", "RequestHttpVerb"=>"GET", "RequestUri"=>"http://10.241.74.235/identity/.well-known/openid-configuration", "ResponseStatusCode"=>"200", "TimeTakenMicros"=>"34615", "Service"=>"Confirmit.Identity", "ServiceVersion"=>"2.24.1753", "CorrelationId"=>"b3f7e7854ae041ce9bb3baaeeeade543", "InitiatingService"=>"Confirmit.Identity", "EventTime"=>"2023-11-06T14:34:58.684+00", "Severity"=>"INFO", "Logger"=>"Confirmit.Logging.Web.AccessLogger", "RequestHeaders"=>"Connection={close}, Accept={*/*}, Host={10.241.74.235:80}", "_kubernetes.pod_name"=>"identity3-api-7c5c56447f-jkhhf", "_kubernetes.namespace_name"=>"horizons", "_kubernetes.pod_id"=>"b56eef39-ac81-4efc-959f-bea7dd3b0193", "_kubernetes.labels"=>{"app"=>"identity3-api", "app.confirmit.com/template"=>"iis-app", "app.confirmit.com/template-version"=>"11.2.0", "app.kubernetes.io/instance"=>"horizons", "app.kubernetes.io/managed-by"=>"Helm", "app.kubernetes.io/name"=>"identity3-api", "app.kubernetes.io/version"=>"2.24.1753", "helm.sh/chart"=>"identity3-api-2.24.1753", "pod-template-hash"=>"7c5c56447f", "version"=>"2.24.1753"}, "_kubernetes.host"=>"akswinu7f0000t2", "_kubernetes.container_name"=>"identity3-api", "_kubernetes.docker_id"=>"0a4d2404db4070675163eac1bf4cc7062d82ffd54c71fadbc14045436c5cee24", "_kubernetes.container_hash"=>"confirmithorizonsdev.azurecr.io/confirmit/identity3-api@sha256:0b6349dab35536371cc88385bacc33cfb1138ef9ddc99729bc6efb2a71870f9f", "_kubernetes.container_image"=>"confirmithorizonsdev.azurecr.io/confirmit/identity3-api:2.24.1753"}

For Linux containers nest/lift works as before after the upgrade. But I need it working for for Windows pods also

Let me know if you need more information to reproduce the issue

@janlonsetteig
Copy link
Author

anyone? :-)

@janlonsetteig janlonsetteig changed the title Filter nest/lift is broken on Windows in 2.1.10 Filter nest/lift is broken on Windows in 2.1.10 =>logs corrupted Nov 20, 2023
@janlonsetteig janlonsetteig changed the title Filter nest/lift is broken on Windows in 2.1.10 =>logs corrupted Filter nest/lift is broken on Windows in 2.1.10 and 2.2.0 =>logs corrupted Nov 20, 2023
@janlonsetteig
Copy link
Author

Confirmed also broken on version 2.2.0. Last working version is 2.1.9

@janlonsetteig
Copy link
Author

@cosmo0920 @nokute78
Hi. Just trying to tag some of you that contributed to the 2.1.10 release. Could you please take a look at this issue or tag someone that could know more about it. At the moment we are stuck with version 2.1.9 on the Windows nodes in our Kubernetes clusters. Both version 2.10 and 2.2.0 have this issue.

Let me know if I can provider more information to help out

@nokute78
Copy link
Collaborator

@cosmo0920 I don't have build environment on windows, but I tested it.
v2.1.10 also didn't work using following configuration.

[SERVICE]
    Log_Level debug

[INPUT]
    Name dummy
    Dummy {"map_data":{"key":"value", "id":123}}
    samples 1

[FILTER]
    Name nest
    Match *
    Operation lift
    Nested_under map_data
    Add_prefix _map_data.

[OUTPUT]
    name stdout
    match *

It will fail cause of decoder error and new chunk size is different.
chunk io was updated from v2.1.10. Is it a cause of this issue ?

v2.1.10:

/11/23 09:40:09] [debug] [filter:nest:nest.0] Lift : Outer map size is 1, will be 2, lifting 1 record(s)
[2023/11/23 09:40:09] [debug] [input chunk] update output instances with new chunk size diff=61, records=1, input=dummy.0
[2023/11/23 09:40:10] [debug] [task] created task=0000027358FB7D50 id=0 OK
[2023/11/23 09:40:10] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/11/23 09:40:10] [error] [output:stdout:stdout.0] Log event decoder error : -12

v2.1.9

[2023/11/23 09:38:41] [debug] [filter:nest:nest.0] Lift : Outer map size is 1, will be 2, lifting 1 record(s)
[2023/11/23 09:38:41] [debug] [input chunk] update output instances with new chunk size diff=60, records=1, input=dummy.0
[2023/11/23 09:38:42] [debug] [task] created task=00000140C6B53EE0 id=0 OK
[0] dummy.0: [[1700699921.774481500, {}], {"_map_data.key"=>"value", "_map_data.id"=>123}]

@janlonsetteig
Copy link
Author

@nokute78 Thanks for looking into this and providing a repro

Any idea on how we can get it fixed?

@anasalsalah
Copy link

+1 facing this issue as well. Could you please fix? Thank you!

@jfaale
Copy link

jfaale commented Dec 8, 2023

+1 Thank you!

@maksnester
Copy link

Could someone pick this up, please? 🙏

@nokute78
Copy link
Collaborator

nokute78 commented Dec 9, 2023

Note: Removing Add_prefix is working on v2.1.10.
Hmm.

[SERVICE]
    Log_Level debug

[INPUT]
    Name dummy
    Dummy {"map_data":{"key":"value", "id":123}}
    samples 1

[FILTER]
    Name nest
    Match *
    Operation lift
    Nested_under map_data
#    Add_prefix _map_data.

[OUTPUT]
    name stdout
    match *

@nokute78
Copy link
Collaborator

nokute78 commented Dec 10, 2023

I tested out_file to get raw msgpack output.
v2.1.10 is wrong str8 for "_map_data.key" that length is 13.

Version Raw Note
v2.1.9 ad 5f 6d 61 70 5f 64 61 74 61 2e 6b 65 79 ad means fixstr and length is 13(0xd)
v2.1.10 d9 22 5f 6d 61 70 5f 64 61 74 61 2e 6b d9 22 means str8 and length is 34 (0x22)

Output in hex is
v2.1.9:

00000000  dd 00 00 00 02 dd 00 00  00 02 d7 00 00 00 00 02  |................|
00000010  00 61 b0 98 80 df 00 00  00 02 ad 5f 6d 61 70 5f  |.a........._map_|
00000020  64 61 74 61 2e 6b 65 79  a5 76 61 6c 75 65 ac 5f  |data.key.value._|
00000030  6d 61 70 5f 64 61 74 61  2e 69 64 7b              |map_data.id{|
0000003c

v2.1.10:

00000000  dd 00 00 00 02 dd 00 00  00 02 d7 00 00 00 00 02  |................|
00000010  00 41 32 a0 80 df 00 00  00 02 d9 22 5f 6d 61 70  |.A2........"_map|
00000020  5f 64 61 74 61 2e 6b 65  79 a5 76 61 6c 75 65 ba  |_data.key.value.|
00000030  5f 6d 61 70 5f 64 61 74  61 2e 69 64 7b           |_map_data.id{|
0000003d
[SERVICE]
    Log_Level debug

[INPUT]
    Name dummy
    Dummy {"map_data":{"key":"value", "id":123}}
    samples 1
    Start_time_sec 1

[FILTER]
    Name nest
    Match *
    Operation lift
    Nested_under map_data
    Add_prefix _map_data.

[OUTPUT]
    Name file
    Format msgpack

@nokute78
Copy link
Collaborator

@leonardo-albertovich Could you check this issue ?
This issue was happended from v2.1.10 only on Windows.

We modified log_event_encoder issue for Windows. I think it relates this issue.
#7971
https://fluentbit.io/announcements/v2.1.10/

The issue is that Add_prefix of filter_nest doesn't work.
helper_pack_string_add_prefix may not work that using flb_log_event_encoder.
https://github.com/fluent/fluent-bit/blob/v2.1.10/plugins/filter_nest/nest.c#L196

static void helper_pack_string_add_prefix(struct flb_log_event_encoder *log_encoder,
        struct filter_nest_ctx *ctx,
        const char *str,
        int len)
{
    flb_log_event_encoder_append_body_values(
        log_encoder,
        FLB_LOG_EVENT_STRING_LENGTH_VALUE(ctx->prefix_len + len),
        FLB_LOG_EVENT_STRING_BODY_VALUE(ctx->prefix, ctx->prefix_len),
        FLB_LOG_EVENT_STRING_BODY_VALUE(str, len));
}

The flb_log_event_encoder patch is to fix following flb_log_event_encoder_append_body_values and string issue.
It is similar to helper_pack_string_add_prefix.
#7736 (comment)

@edsiper
Copy link
Member

edsiper commented Dec 11, 2023

assigned to @leonardo-albertovich

@ThomasRikardsen
Copy link

+1

@TSead
Copy link

TSead commented Dec 18, 2023

+1 Thank you

@janlonsetteig
Copy link
Author

@leonardo-albertovich @edsiper
Are there any ETA on a fix for this issue? We are still blocked from upgrading fluent-bit on our Windows nodes

@janlonsetteig
Copy link
Author

Anyone that can pick up this one?

@nokute78
Copy link
Collaborator

nokute78 commented Feb 4, 2024

@leonardo-albertovich I think this issue is caused by a type of flb_log_event_encoder_size_t.
It is a size_t* on Windows x64.

Following function adds flb_log_event_encoder_size_t. ctx->prefix_len + len

static void helper_pack_string_add_prefix(struct flb_log_event_encoder *log_encoder,
        struct filter_nest_ctx *ctx,
        const char *str,
        int len)
{
    flb_log_event_encoder_append_body_values(
        log_encoder,
        FLB_LOG_EVENT_STRING_LENGTH_VALUE(ctx->prefix_len + len),
        FLB_LOG_EVENT_STRING_BODY_VALUE(ctx->prefix, ctx->prefix_len),
        FLB_LOG_EVENT_STRING_BODY_VALUE(str, len));
}

ctx->prefix_len is 10 and len is 3 when fluent-bit uses following configuration.
#8137 (comment)
However ctx->prefix_len + len is 34.

This is caused by 10 + sizeof(size_t) * 3 since flb_log_event_encoder_size_t is a pointer of size_t.
sizeof(size_t) is 8 on Windows x64.

@nokute78
Copy link
Collaborator

nokute78 commented Feb 4, 2024

@leonardo-albertovich
Following code is a test code for this issue.
It will output Len: size=34

#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <stdint.h>

typedef size_t* flb_log_event_encoder_size_t;
#define FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE           1
#define FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE             2

#define FLB_LOG_EVENT_STRING_LENGTH_VALUE(length) \
            (int) FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE, \
            (flb_log_event_encoder_size_t) length
#define FLB_LOG_EVENT_STRING_BODY_VALUE(buffer, length) \
            (int) FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE, \
            (char *) buffer, \
            (flb_log_event_encoder_size_t) length
void Test(int loop_cnt, ...)
{
    va_list arg;
    int type1;
    flb_log_event_encoder_size_t size;
    char* str;
    int i;
    va_start(arg, loop_cnt);

    for (i = 0; i < loop_cnt; i++) {
        printf("i=%d\n", i);

        type1 = va_arg(arg, int);
        if (type1 != FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE && type1 != FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE) {
            printf("%d:type is not length or body\n", i);
        }

        if (type1 == FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE) {
            size = va_arg(arg, flb_log_event_encoder_size_t);
            printf("Len: size=%ld\n", size);
        }
        else if (type1 == FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE) {
            str = va_arg(arg, char*);
            printf("Body: str=%s\n", str);

            size = va_arg(arg, flb_log_event_encoder_size_t);
            printf("Body: size=%ld\n", size);
            
        }
    }
    va_end(arg);
}

int main(void) {
    int cnt = 3;
    size_t prefix_len = 10;
    size_t len = 3;

    Test(cnt,
        FLB_LOG_EVENT_STRING_LENGTH_VALUE(prefix_len+len),
        FLB_LOG_EVENT_STRING_BODY_VALUE("_map_data.", prefix_len),
        FLB_LOG_EVENT_STRING_BODY_VALUE("key", len)
         );
    return 0;
}

@nokute78
Copy link
Collaborator

nokute78 commented Feb 4, 2024

I don't have build environment for Windows but I sent a patch #8454
Could you try it ?

@janlonsetteig
Copy link
Author

@nokute78 Is there a GitHub build for that so that I can pull a container image with the patch for testing?

@nokute78
Copy link
Collaborator

@janlonsetteig Sorry I don't know.

@patrick-stephens Any ideas to get a following patched container image for windows ?
#8454 ?

@patrick-stephens
Copy link
Contributor

We do not currently build Windows images for PRs (they take an absolute age and Windows runners cost more as well). Potentially you may be able to tweak the PR to include a CI update to build them though.

I'll check if I can trigger a build manually as well, there are some additional workflows for that but not used them for a Windows container before...

@patrick-stephens
Copy link
Contributor

@nokute78 if we have a branch on this repo (the main OSS one) then we can trigger a build of all images via: https://github.com/fluent/fluent-bit/actions/workflows/build-branch-containers.yaml

These will then be build and uploaded with a custom tag for people to pull from.

@patrick-stephens
Copy link
Contributor

Or you can trigger the same workflow on your side @nokute78? It'll be a public image under ghcr.io//... but that should not matter.

@nokute78
Copy link
Collaborator

@patrick-stephens Thank you for information.
I tried to run action https://github.com/nokute78/fluent-bit/actions/runs/8012956051

@nokute78
Copy link
Collaborator

@patrick-stephens Action was failed because of OpenSSL. Could you check it ?
https://github.com/nokute78/fluent-bit/actions/runs/8012956051/job/21889201105

@cosmo0920
Copy link
Contributor

Hi!! Recent changes could fix this OpenSSL breakage:
9652b0d

Could you rebase your branch, @nokute78 ?

@nokute78
Copy link
Collaborator

@cosmo0920 Thank you for information.

I rebased and run the action.
https://github.com/nokute78/fluent-bit/pkgs/container/fluent-bit%2Ftest%2Ffix_8137

@janlonsetteig
Copy link
Author

@nokute78 I'm running ghcr.io/nokute78/fluent-bit/test/fix_8137:windows-2019-fix_8137 in one of our clusters now. It seems to fix the issue. I see logs working again now with nest/lift

@janlonsetteig
Copy link
Author

@nokute78 Any progress in verifying and releasing the fix?

@patrick-stephens
Copy link
Contributor

It is in the review queue, once merged it will go into the next release.

@janlonsetteig
Copy link
Author

Confirmed fixed in latest version

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests