-
Yes, I've been thinking about Recap more as a metadata toolkit than a data catalog, and in that lens I can see contract support. Two things come to mind:
I think these two things would be good building blocks. There's obviously a lot more, but these would give you enough to do compatibility enforcement.
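To make the compatibility-enforcement idea concrete, here's a minimal, hypothetical sketch assuming schemas are modeled as plain column-name-to-type mappings; the function name and compatibility rules are illustrative, not anything Recap currently ships:

```python
# Hypothetical compatibility check over a generic column -> type mapping.
# Rule sketched here: a change is backward compatible if no existing column
# is removed or changes type (adding new columns is allowed).
def is_backward_compatible(old: dict[str, str], new: dict[str, str]) -> bool:
    """Return True if data written under `old` is still readable under `new`."""
    return all(new.get(col) == typ for col, typ in old.items())


# Adding a column is fine; dropping or retyping one is not.
ok = is_backward_compatible({"id": "int64"}, {"id": "int64", "name": "string"})
bad = is_backward_compatible({"id": "int64"}, {"id": "string"})
```

Real contract enforcement would add more modes (forward, full, transitive), but even this one-function version is enough to gate a schema change in CI.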
-
Yeah, this is subtle, and not something I articulated well in the PR. I spent the last week fiddling with the API in a lot of different ways, and I concluded that I want Recap to be opinionated about things. My gripe with the existing analyzers is that their metadata schemas are all one-off: the SQLAlchemy columns schema is different from the frictionless columns schema, which is in turn different from the duckdb schema. On and on. So that's thought 1.

Thought 2 was your data contract comment on #184. I had some conversations with other data contract folks, and the theme of conversion between schemas keeps coming up. @gunnarmorling also mentioned something similar a while ago. This led me to the idea of cloning Kafka Connect's Schema class and writing converters for the various schemas (much like the Kafka Connect converters). That would give Recap an opinionated schema that works with files, databases, and messaging IDLs (Protobuf, Avro, and such).

Thought 3 is that there's really only a fairly tractable set of interesting things about data that I think we want to cover. Namely: schema, lineage, access, data profiles, and a couple of other things. Probably 7 ± 2. Again, that seems like another argument for a strong opinion rather than a ton of flexibility. Moreover, many of these things already have generic representations that people have thought about (schema = the Kafka Connect schema, lineage = OpenLineage or an equivalent (there are others), access = RBAC/IAM/etc.). My inclination is to pick a generic version for each and use those as the opinionated data models for Recap metadata.

So, where does that leave us with analyzers? What I'm thinking is that we'll have 7 ± 2 analyzer interfaces that map 1:1 with the 7 ± 2 metadata models (e.g. SchemaAnalyzer, LineageAnalyzer, and so on). We can then implement the analyzers for various systems.
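The "clone Kafka Connect's Schema, write converters" idea could look something like this. This is a hypothetical sketch, not Recap's actual API: the class and converter names are made up, and the type mapping is deliberately tiny, but it shows how one opinionated Schema with per-system converters replaces the one-off metadata schemas:

```python
# Hypothetical sketch: an opinionated Schema type modeled loosely on Kafka
# Connect's Schema, plus a converter protocol for mapping system-specific
# schemas into it. All names here are illustrative, not Recap's real API.
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Any, Protocol


class Type(Enum):
    """A small, closed set of logical types, like Kafka Connect's Schema.Type."""
    INT64 = "int64"
    FLOAT64 = "float64"
    BOOLEAN = "boolean"
    STRING = "string"
    BYTES = "bytes"
    STRUCT = "struct"


@dataclass(frozen=True)
class Field:
    name: str
    schema: Schema


@dataclass(frozen=True)
class Schema:
    type: Type
    optional: bool = False
    fields: tuple[Field, ...] = ()  # only populated when type == Type.STRUCT


class Converter(Protocol):
    """Converts a system-specific schema into the opinionated Schema."""

    def to_recap(self, native_schema: Any) -> Schema: ...


class SqlAlchemyConverter:
    """Example converter: maps a {column: SQL type string} dict to a STRUCT.

    A real converter would walk SQLAlchemy Table objects; the dict input
    keeps the sketch self-contained.
    """

    _TYPES = {"INTEGER": Type.INT64, "VARCHAR": Type.STRING, "BOOLEAN": Type.BOOLEAN}

    def to_recap(self, native_schema: dict[str, str]) -> Schema:
        fields = tuple(
            Field(name, Schema(self._TYPES.get(sql_type, Type.BYTES)))
            for name, sql_type in native_schema.items()
        )
        return Schema(Type.STRUCT, fields=fields)
```

With a duckdb, frictionless, or Avro converter targeting the same Schema, downstream tooling (diffing, compatibility checks, contracts) only ever has to understand one model.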
And where does that leave the stuff that doesn't fit the model I've described? (The SQLAlchemy analyzers had a bunch of index, foreign key, and partition stuff.) I think some of it will just get cut, and the rest will get folded into something more generic (e.g. db indexes look a lot like cluster/partition keys in BigQuery if you squint). That's my current line of thinking. My hope is that these changes will make Recap a lot more usable as a Python API, make the REST API more approachable, and make Recap useful as more than "just" a data catalog service. WDYT? cc @nahumsa @gunnarmorling
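A rough sketch of the "one analyzer interface per metadata model" shape, under the same caveat: every name here (the interfaces, the metadata models, the BigQuery example) is hypothetical, meant only to show how each interface returns exactly one opinionated model and systems implement whichever interfaces apply:

```python
# Hypothetical sketch: 7 +/- 2 analyzer interfaces mapping 1:1 to 7 +/- 2
# metadata models. Two of them are shown; names are illustrative.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SchemaMetadata:
    columns: dict[str, str]
    # System-specific index/clustering info folded into a generic field
    # instead of a one-off per-system schema.
    cluster_keys: tuple[str, ...] = ()


@dataclass(frozen=True)
class LineageMetadata:
    upstream: tuple[str, ...] = ()
    downstream: tuple[str, ...] = ()


class SchemaAnalyzer(ABC):
    @abstractmethod
    def analyze_schema(self, path: str) -> SchemaMetadata: ...


class LineageAnalyzer(ABC):
    @abstractmethod
    def analyze_lineage(self, path: str) -> LineageMetadata: ...


class BigQueryAnalyzer(SchemaAnalyzer):
    """One system implementing one interface. A real version would query
    BigQuery's INFORMATION_SCHEMA; this returns canned data to stay runnable."""

    def analyze_schema(self, path: str) -> SchemaMetadata:
        return SchemaMetadata(
            columns={"id": "INT64", "created_at": "TIMESTAMP"},
            cluster_keys=("id",),  # where a db index would get folded in
        )
```

The payoff is that a caller can ask any system for `SchemaMetadata` and get the same shape back, whether it came from BigQuery clustering keys or a Postgres index.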
-
It is hard to enforce data contracts, and there is little to no technical tooling for it either. Is there a world where something around data contracts gets added to Recap?