Adding timestamp to records with structured data #922

hariso · 2023-03-01T17:01:07Z

hariso
Mar 1, 2023
Maintainer

This is in context of #866.

The underlying issue is with standalone connectors using gRPC and google.protobuf.Struct to represent a record. The mentioned Protobuf message doesn't support timestamps. The supported values are documented here.

I see a few options:

Custom marshallers (which would automatically serialize a timestamp into "something" and also deserialize it).
A special instance of google.protobuf.Struct which describes a timestamp (e.g. {"type": "timestamp", "value": "12345"})
Completely replace google.protobuf.Struct with a message of similar structure but with a timestamp.

It looks like custom marshallers are not possible (see this and this).

Encoding timestamps using special structs is possible, but I'm worried about performance. It would require recursively going into a record to check most of its fields but also nested fields.

Replacing google.protobuf.Struct would look something like this:

// Copied from google.protobuf.Struct
+message MyStruct {
+  map<string, MyValue> fields = 1;
+}

// copied from google.protobuf.Value
message MyValue {
  // The kind of value.
  oneof kind {
    // Represents a null value.
    NullValue null_value = 1;
    // Represents a double value.
    double number_value = 2;
    // Represents a string value.
    string string_value = 3;
    // Represents a boolean value.
    bool bool_value = 4;
    // Represents a structured value.
    Struct struct_value = 5;
    // Represents a repeated `Value`.
    ListValue list_value = 6;
+    // Represents a timestamp
+    TimestampValue timestamp_value = 7
  }
}

The above is also backwards compatible (connectors using the current version of the API, i.e. google.protobuf.Struct are still able to send it since the field tags are kept). This buys us some time until all connectors are updated.

It also gives us more control over the struct and more flexibility to change it in future (if needed).

lovromazgon · 2023-03-01T18:10:51Z

lovromazgon
Mar 1, 2023
Maintainer

I like the option in which we replace google.protobuf.Struct, it's the cleanest and safest. Instead of creating our own TimestampValue we can import google/protobuf/timestamp.proto and use the Timestamp message defined there. That said, I'm afraid this is not the main problem.

I think the bigger issue is the support for time.Time. We still have to convert any time.Time value to timestamppb.Timestamp (or whatever type we choose to encode it). We can make it simple for the connector developer and convert all time.Time values automatically, but as you said, we would take a performance hit because of that (even in records that don't carry any time object). The other option is to put that burden on the developer and say that if they want to pass a timestamp to Conduit they need to convert it to timestamppb.Timestamp. Regardless of what we do on the connector side, we will have to still recurse over the record in Conduit and decode any proto timestamp to time.Time, I'm afraid we just don't get around this 😕

The last option would be to do nothing. By "do nothing" I mean that we would document the supported types in a record (which are the types that are supported by google.protobuf.Struct) and tell connector developers to make sure StructuredData contains only supported types. We will need to do this anyway, because even if we add support for time.Time there's a ton of other types that we don't support. The core of the problem is that StructuredData currently makes it seem like it supports just any type, which is not the case. If only Go had union types...

I think it's valid to ask "What do we gain if we support time.Time?". It seems to me that I would only care about having a time.Time value if I wanted to do some transformation on it. Then again, if I want to transform it I can just as well parse the timestamp myself in the transform 🤷‍♂️

I'm wondering if record schemas would help us solve this issue. If a source would supply a schema for the payload and we cache it, we could safely figure out if there's a timestamp anywhere in the record and encode/decode it with less overhead (first time we see a schema we can prepare the encoding/decoding function and then reuse it). At that point we could support any type we wanted, since the encoding and the type information would be separated.

0 replies

hariso · 2023-03-02T13:05:58Z

hariso
Mar 2, 2023
Maintainer Author

I think it's valid to ask "What do we gain if we support time.Time?".

I'd say primarily developer experience and to a smaller extent also making it easier to auto-generate schemas (which we do in the Kafka Connect wrapper). I believe that, ideally, the SDK makes it possible to use all the built-in types (minus some which make no sense, such as channels). Timestamps are used often and with that, I think it makes sense to make them available. In most Java frameworks, for example, this is done out of the box, and if customization is needed, it can easily be done with custom (un)marshallers, which are applied as JSON is being parsed (so there's no performance hit).

I also realized that what was being used as a workaround in some connectors (to use raw payloads - JSON strings) is actually not the best solution. We can do better by returning structured data and using a custom serialization for just the fields which are not supported (e.g. timestamps). The conversion of timestamps to strings happens anyway in the raw payloads so we're not losing anything on that side but still get structured data.

The last option would be to do nothing. By "do nothing" I mean that we would document the supported types in a record (which are the types that are supported by google.protobuf.Struct) and tell connector developers to make sure StructuredData contains only supported types.

I'm now for this option.:) As I mentioned above, I believe that having the built-in types available makes sense, but it's worth the performance cost. I tried finding a way to write a custom marshaller, but failed to do so.

So with the above, I think not doing anything

Developers having to transform from and to timestamppb.Timestamp (or whatever type we choose) is not much different from simply letting them serialize and deserialize timestamps as strings for example.

3 replies

hariso Mar 2, 2023
Maintainer Author

Sidenote: I also noticed that using lists and maps as values in StructuredData is not possible. E.g. something like:

{
    "a_slice": ["foo", "bar"]
}

is not possible, even though google.protobuf.Struct itself supports it.

lovromazgon Mar 3, 2023
Maintainer

Ok not supporting a slice is really bad 😞 Apparently we need to do something here. I have an idea, but I need to check if it would work.

hariso Mar 3, 2023
Maintainer Author

As we're working on a Databricks destination connector, I found that Databricks has an array, a map, and a struct type.

hariso · 2023-03-06T14:05:37Z

hariso
Mar 6, 2023
Maintainer Author

Related to this: golang/protobuf#414

0 replies

lovromazgon · 2023-04-19T14:06:47Z

lovromazgon
Apr 19, 2023
Maintainer

I found some more information on this:

Maps and slices are supported, as long as the values are of type interface{} (i.e. map[string]interface{} and []interface{}). A full list of supported types is available in godocs for structpb.NewValue.
We use structpb.NewStruct to create the protobuf value, which already traverses the whole payload to create a record out of it (here). So if we decided that we want to support time.Time we could go with the initially proposed solution and write our own NewValue function that also supports time.Time.
There is a closed issue for adding support for custom types and an explanation why they refused to add it (types/known:structpb: support more custom types in constructors golang/protobuf#1302). There is also a proposed workaround to use json.Marshal and structpb.Value.UnmarshalJSON (meta: add helper to convert custom types to a Value type lyft/clutch#1426).

I think at this point the question is if time.Time is special enough to add additional handling to it. I'm currently leaning towards not treating it in a special way and instead using the approach through JSON. We will provide full support for the types mentioned in godocs for structpb.NewValue, any other type information will be lost and converted to one of the supported types using the json package. Specifically, this means that:

time.Time values will be converted to a string using format RFC3339Nano (2006-01-02T15:04:05.999999999Z07:00)
Slices of concrete types will be converted to []interface{}
Maps of concrete types will be converted to map[string]interface{}

What do we think about this approach?

0 replies

lovromazgon · 2024-08-08T14:42:16Z

lovromazgon
Aug 8, 2024
Maintainer

Support for timestamps was added as part of schema support.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding timestamp to records with structured data #922

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Adding timestamp to records with structured data #922

hariso Mar 1, 2023 Maintainer

Replies: 5 comments · 3 replies

lovromazgon Mar 1, 2023 Maintainer

hariso Mar 2, 2023 Maintainer Author

hariso Mar 2, 2023 Maintainer Author

lovromazgon Mar 3, 2023 Maintainer

hariso Mar 3, 2023 Maintainer Author

hariso Mar 6, 2023 Maintainer Author

lovromazgon Apr 19, 2023 Maintainer

lovromazgon Aug 8, 2024 Maintainer

hariso
Mar 1, 2023
Maintainer

Replies: 5 comments 3 replies

lovromazgon
Mar 1, 2023
Maintainer

hariso
Mar 2, 2023
Maintainer Author

hariso Mar 2, 2023
Maintainer Author

lovromazgon Mar 3, 2023
Maintainer

hariso Mar 3, 2023
Maintainer Author

hariso
Mar 6, 2023
Maintainer Author

lovromazgon
Apr 19, 2023
Maintainer

lovromazgon
Aug 8, 2024
Maintainer