Cannot convert msgpack generated from nginx access log #32

Open
okkez opened this issue May 11, 2020 · 2 comments

okkez commented May 11, 2020

I want to use the msgpack format in the fluent-plugin-s3 <format> section, but I cannot see *.parquet files on S3.
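
For context, one possible shape of such a configuration is sketched below; the match tag, bucket and buffer settings are placeholders/omitted, only the <format> part matters here:

<match nginx.access>
  @type s3
  # bucket, path, credentials and <buffer> settings omitted
  <format>
    @type msgpack
  </format>
</match>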

Here is a small reproducible case for this issue:

$ ~/wc/src/github.com/reproio/columnify/columnify -recordType msgpack -schemaType avro -schemaFile nginx_access_log.avsc x.msgpack
PAR12020/05/11 17:32:33 Failed to write: reflect: call of reflect.Value.Elem on uint8 Value
$ echo $?
1

single.msgpack.log

I use the following Avro schema definition:

{
  "name": "NginxAccessLog",
  "type": "record",
  "fields": [
    {
      "name": "container_id",
      "type": "string"
    },
    {
      "name": "container_name",
      "type": "string"
    },
    {
      "name": "source",
      "type": "string"
    },
    {
      "name": "log",
      "type": "string"
    },
    {
      "name": "__fluentd_address__",
      "type": "string"
    },
    {
      "name": "__fluentd_host__",
      "type": "string"
    },
    {
      "name": "role",
      "type": "string"
    },
    {
      "name": "host",
      "type": "string"
    },
    {
      "name": "remote_ip",
      "type": "string"
    },
    {
      "name": "request_host",
      "type": "string"
    },
    {
      "name": "user",
      "type": "string"
    },
    {
      "name": "method",
      "type": "string"
    },
    {
      "name": "path",
      "type": "string"
    },
    {
      "name": "status",
      "type": "string"
    },
    {
      "name": "size",
      "type": "string"
    },
    {
      "name": "referer",
      "type": "string"
    },
    {
      "name": "agent",
      "type": "string"
    },
    {
      "name": "duration",
      "type": "string"
    },
    {
      "name": "country_code",
      "type": "string"
    },
    {
      "name": "token_param",
      "type": "string"
    },
    {
      "name": "idfv_param",
      "type": "string"
    },
    {
      "name": "tag",
      "type": "string"
    },
    {
      "name": "time",
      "type": "string"
    }
  ]
}

syucream self-assigned this May 11, 2020

syucream (Contributor) commented

@okkez In summary, this is caused by mismatched types: the __fluentd_address__ and __fluentd_host__ field values are actually encoded as the bytes type in msgpack. Once you adjust the Avro schema accordingly, it should work fine. I also think columnify should check that records match the given schema and convert them if necessary, in this case bytes -> string.

These values start with 0xc4 (the bin 8 marker) in the msgpack message.

$ hexdump -C single.msgpack.log | less
...
000001f0  30 34 39 20 2d 20 2d 20  2d b3 5f 5f 66 6c 75 65  |049 - - -.__flue|
00000200  6e 74 64 5f 61 64 64 72  65 73 73 5f 5f c4 0c 31  |ntd_address__..1|
00000210  37 32 2e 33 31 2e 39 2e  31 35 38 b0 5f 5f 66 6c  |72.31.9.158.__fl|
00000220  75 65 6e 74 64 5f 68 6f  73 74 5f 5f c4 34 73 66  |uentd_host__.4sf|
00000230  72 2d 72 65 70 72 6f 2d  64 65 76 2d 73 74 61 67  |r-repro-dev-stag|
...

ref. https://github.com/msgpack/msgpack/blob/master/spec.md#bin-format-family
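
To make the difference concrete, here is a minimal hand-rolled Go sketch (no msgpack library, only values shorter than 256 bytes) of how the same value looks in the str 8 vs. bin 8 formats; the bin family is what produces the leading 0xc4 seen above:

package main

import "fmt"

// Minimal encoders for two msgpack formats, just to show the marker byte:
//   str 8: 0xd9, length, UTF-8 bytes -> maps naturally to Avro "string"
//   bin 8: 0xc4, length, raw bytes   -> maps naturally to Avro "bytes"
func encodeStr8(s string) []byte {
	return append([]byte{0xd9, byte(len(s))}, s...)
}

func encodeBin8(b []byte) []byte {
	return append([]byte{0xc4, byte(len(b))}, b...)
}

func main() {
	v := "172.31.9.158"
	fmt.Printf("str 8: % x\n", encodeStr8(v))         // d9 0c 31 37 32 ...
	fmt.Printf("bin 8: % x\n", encodeBin8([]byte(v))) // c4 0c 31 37 32 ...
}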

I changed these field types in the .avsc to bytes:

...
    },
    {
      "name": "__fluentd_address__",
      "type": "bytes"
    },
    {
      "name": "__fluentd_host__",
      "type": "bytes"
    },
    {
...

Then the conversion passed:

$ ./columnify -recordType msgpack -schemaType avro -schemaFile accesslog.avsc single.msgpack.log > /dev/null
[ryo@Macintosh] $ echo $?
0

So for now you can make it work correctly by modifying the schema as above. Additionally, I believe columnify should handle schemas more rigorously (ref. #27), e.g. report schema mismatch errors to users appropriately and convert value types in records where possible. The current implementation delegates this to parquet-go, which doesn't give us helpful error causes.
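
For illustration only (this is not columnify's actual code), the coercion could look roughly like the sketch below, applied to each decoded record before it is handed to parquet-go; coerceToSchema and stringFields are hypothetical names:

package main

import "fmt"

// coerceToSchema is a hypothetical helper: for fields the Avro schema declares
// as "string", convert []byte values (decoded from the msgpack bin family)
// into string so the downstream writer sees the expected type.
func coerceToSchema(record map[string]interface{}, stringFields map[string]bool) map[string]interface{} {
	for k, v := range record {
		if b, ok := v.([]byte); ok && stringFields[k] {
			record[k] = string(b)
		}
	}
	return record
}

func main() {
	// Example values, not taken from the attached log.
	rec := map[string]interface{}{
		"__fluentd_address__": []byte("172.31.9.158"), // arrived as bin 8
		"path":                "/index.html",          // arrived as str
	}
	fixed := coerceToSchema(rec, map[string]bool{"__fluentd_address__": true})
	fmt.Printf("%T\n", fixed["__fluentd_address__"]) // string
}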


okkez commented May 12, 2020

Thank you for the explanation! I got it.
I would like an error message like the one below:

error: the value type is invalid for the key: KEYNAME. expected type is EXPECTED_TYPE, but got INVALID_TYPE.

plus additional info such as the line number, or a dump of the record that has the mismatched value type.
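
Something along these lines, purely as a sketch of the desired error shape (TypeMismatchError is a made-up name, not an existing columnify type):

package main

import "fmt"

// TypeMismatchError is a hypothetical error type sketching the message format
// requested above; columnify does not currently define it.
type TypeMismatchError struct {
	Key      string
	Expected string
	Got      string
	Record   map[string]interface{}
}

func (e *TypeMismatchError) Error() string {
	return fmt.Sprintf(
		"error: the value type is invalid for the key: %s. expected type is %s, but got %s (record: %v)",
		e.Key, e.Expected, e.Got, e.Record)
}

func main() {
	err := &TypeMismatchError{
		Key:      "__fluentd_address__",
		Expected: "string",
		Got:      "bytes",
		Record:   map[string]interface{}{"__fluentd_address__": []byte("172.31.9.158")},
	}
	fmt.Println(err)
}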
