Cannot convert msgpack generated from nginx access log #32

Open
okkez opened this issue May 11, 2020 · 2 comments

okkez commented May 11, 2020

I want to use the msgpack format in the fluent-plugin-s3 <format> section, but I cannot see *.parquet files on S3.
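
For context, one possible shape of such a configuration is sketched below; the match tag, bucket and buffer settings are placeholders/omitted, only the <format> part matters here:

<match nginx.access>
  @type s3
  # bucket, path, credentials and <buffer> settings omitted
  <format>
    @type msgpack
  </format>
</match>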

Here is a small reproducible case for this issue:

$ ~/wc/src/github.com/reproio/columnify/columnify -recordType msgpack -schemaType avro -schemaFile nginx_access_log.avsc x.msgpack
PAR12020/05/11 17:32:33 Failed to write: reflect: call of reflect.Value.Elem on uint8 Value
$ echo $?
1

single.msgpack.log

I use the following Avro schema definition:

{
  "name": "NginxAccessLog",
  "type": "record",
  "fields": [
    {
      "name": "container_id",
      "type": "string"
    },
    {
      "name": "container_name",
      "type": "string"
    },
    {
      "name": "source",
      "type": "string"
    },
    {
      "name": "log",
      "type": "string"
    },
    {
      "name": "__fluentd_address__",
      "type": "string"
    },
    {
      "name": "__fluentd_host__",
      "type": "string"
    },
    {
      "name": "role",
      "type": "string"
    },
    {
      "name": "host",
      "type": "string"
    },
    {
      "name": "remote_ip",
      "type": "string"
    },
    {
      "name": "request_host",
      "type": "string"
    },
    {
      "name": "user",
      "type": "string"
    },
    {
      "name": "method",
      "type": "string"
    },
    {
      "name": "path",
      "type": "string"
    },
    {
      "name": "status",
      "type": "string"
    },
    {
      "name": "size",
      "type": "string"
    },
    {
      "name": "referer",
      "type": "string"
    },
    {
      "name": "agent",
      "type": "string"
    },
    {
      "name": "duration",
      "type": "string"
    },
    {
      "name": "country_code",
      "type": "string"
    },
    {
      "name": "token_param",
      "type": "string"
    },
    {
      "name": "idfv_param",
      "type": "string"
    },
    {
      "name": "tag",
      "type": "string"
    },
    {
      "name": "time",
      "type": "string"
    }
  ]
}

syucream self-assigned this May 11, 2020

syucream (Contributor) commented

@okkez In summary, this is caused by mismatched types: the __fluentd_address__ and __fluentd_host__ field values are actually encoded as the bytes type in msgpack. Once you adjust the Avro schema accordingly, it should work fine. I also think columnify should check that records match the given schema and convert them if necessary, in this case bytes -> string.

These values start with 0xc4 (the bin 8 marker) in the msgpack message.

$ hexdump -C single.msgpack.log | less
...
000001f0  30 34 39 20 2d 20 2d 20  2d b3 5f 5f 66 6c 75 65  |049 - - -.__flue|
00000200  6e 74 64 5f 61 64 64 72  65 73 73 5f 5f c4 0c 31  |ntd_address__..1|
00000210  37 32 2e 33 31 2e 39 2e  31 35 38 b0 5f 5f 66 6c  |72.31.9.158.__fl|
00000220  75 65 6e 74 64 5f 68 6f  73 74 5f 5f c4 34 73 66  |uentd_host__.4sf|
00000230  72 2d 72 65 70 72 6f 2d  64 65 76 2d 73 74 61 67  |r-repro-dev-stag|
...

ref. https://github.com/msgpack/msgpack/blob/master/spec.md#bin-format-family
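
To make the difference concrete, here is a minimal hand-rolled Go sketch (no msgpack library, only values shorter than 256 bytes) of how the same value looks in the str 8 vs. bin 8 formats; the bin family is what produces the leading 0xc4 seen above:

package main

import "fmt"

// Minimal encoders for two msgpack formats, just to show the marker byte:
//   str 8: 0xd9, length, UTF-8 bytes -> maps naturally to Avro "string"
//   bin 8: 0xc4, length, raw bytes   -> maps naturally to Avro "bytes"
func encodeStr8(s string) []byte {
	return append([]byte{0xd9, byte(len(s))}, s...)
}

func encodeBin8(b []byte) []byte {
	return append([]byte{0xc4, byte(len(b))}, b...)
}

func main() {
	v := "172.31.9.158"
	fmt.Printf("str 8: % x\n", encodeStr8(v))         // d9 0c 31 37 32 ...
	fmt.Printf("bin 8: % x\n", encodeBin8([]byte(v))) // c4 0c 31 37 32 ...
}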

I changed these field types in the .avsc to bytes:

...
    },
    {
      "name": "__fluentd_address__",
      "type": "bytes"
    },
    {
      "name": "__fluentd_host__",
      "type": "bytes"
    },
    {
...

Then the conversion passed:

$ ./columnify -recordType msgpack -schemaType avro -schemaFile accesslog.avsc single.msgpack.log > /dev/null
[ryo@Macintosh] $ echo $?
0

So for now you can make it work correctly by modifying the schema as above. Additionally, I believe columnify should handle schemas more rigorously (ref. #27), e.g. report schema mismatch errors to users appropriately and convert value types in records where possible. The current implementation delegates this to parquet-go, which doesn't give us helpful error causes.
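
For illustration only (this is not columnify's actual code), the coercion could look roughly like the sketch below, applied to each decoded record before it is handed to parquet-go; coerceToSchema and stringFields are hypothetical names:

package main

import "fmt"

// coerceToSchema is a hypothetical helper: for fields the Avro schema declares
// as "string", convert []byte values (decoded from the msgpack bin family)
// into string so the downstream writer sees the expected type.
func coerceToSchema(record map[string]interface{}, stringFields map[string]bool) map[string]interface{} {
	for k, v := range record {
		if b, ok := v.([]byte); ok && stringFields[k] {
			record[k] = string(b)
		}
	}
	return record
}

func main() {
	// Example values, not taken from the attached log.
	rec := map[string]interface{}{
		"__fluentd_address__": []byte("172.31.9.158"), // arrived as bin 8
		"path":                "/index.html",          // arrived as str
	}
	fixed := coerceToSchema(rec, map[string]bool{"__fluentd_address__": true})
	fmt.Printf("%T\n", fixed["__fluentd_address__"]) // string
}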


okkez commented May 12, 2020

Thank you for the explanation! I got it.
I would like an error message like the one below:

error: the value type is invalid for the key: KEYNAME. expected type is EXPECTED_TYPE, but got INVALID_TYPE.

plus additional info such as the line number, or a dump of the record that has the mismatched value type.
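
Something along these lines, purely as a sketch of the desired error shape (TypeMismatchError is a made-up name, not an existing columnify type):

package main

import "fmt"

// TypeMismatchError is a hypothetical error type sketching the message format
// requested above; columnify does not currently define it.
type TypeMismatchError struct {
	Key      string
	Expected string
	Got      string
	Record   map[string]interface{}
}

func (e *TypeMismatchError) Error() string {
	return fmt.Sprintf(
		"error: the value type is invalid for the key: %s. expected type is %s, but got %s (record: %v)",
		e.Key, e.Expected, e.Got, e.Record)
}

func main() {
	err := &TypeMismatchError{
		Key:      "__fluentd_address__",
		Expected: "string",
		Got:      "bytes",
		Record:   map[string]interface{}{"__fluentd_address__": []byte("172.31.9.158")},
	}
	fmt.Println(err)
}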
