-
Yes, I've been thinking about Recap more as a metadata toolkit than a data catalog, and in that lens I can see contract support. Two things come to mind:
I think these two things would be good building blocks. There's obviously a lot more, but these would give you enough to do compatibility enforcement.
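To make the compatibility-enforcement idea concrete, here's a minimal, hypothetical sketch assuming schemas are modeled as plain column-name-to-type mappings; the function name and compatibility rules are illustrative, not anything Recap currently ships:

```python
# Hypothetical compatibility check over a generic column -> type mapping.
# Rule sketched here: a change is backward compatible if no existing column
# is removed or changes type (adding new columns is allowed).
def is_backward_compatible(old: dict[str, str], new: dict[str, str]) -> bool:
    """Return True if data written under `old` is still readable under `new`."""
    return all(new.get(col) == typ for col, typ in old.items())


# Adding a column is fine; dropping or retyping one is not.
ok = is_backward_compatible({"id": "int64"}, {"id": "int64", "name": "string"})
bad = is_backward_compatible({"id": "int64"}, {"id": "string"})
```

Real contract enforcement would add more modes (forward, full, transitive), but even this one-function version is enough to gate a schema change in CI.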
-
Yeah, this is subtle, and not something I articulated well in the PR. I spent the last week fiddling with the API in a lot of different ways, and I concluded that I want Recap to be opinionated about things. My gripe with the existing analyzers is that their metadata schemas are all one-off: the SQLAlchemy columns schema is different from the frictionless columns schema, which is in turn different from the duckdb schema. On and on. So that's thought 1.

Thought 2 was your data contract comment on #184. I had some conversations with other data contract folks, and the theme of conversion between schemas keeps coming up. @gunnarmorling also mentioned something similar a while ago. This led me to the idea of cloning Kafka Connect's Schema class and writing converters for the various schemas (much like the Kafka Connect converters). That would give Recap an opinionated schema that works with files, databases, and messaging IDLs (Protobuf, Avro, and such).

Thought 3 is that there's really only a fairly tractable set of interesting things about data that I think we want to cover. Namely: schema, lineage, access, data profiles, and a couple of other things. Probably 7 ± 2. Again, that seems like another argument for a strong opinion rather than a ton of flexibility. Moreover, many of these things already have generic representations that people have thought about (schema = the Kafka Connect schema, lineage = OpenLineage or an equivalent (there are others), access = RBAC/IAM/etc.). My inclination is to pick a generic version for each and use those as the opinionated data models for Recap metadata.

So, where does that leave us with analyzers? What I'm thinking is that we'll have 7 ± 2 analyzer interfaces that map 1:1 with the 7 ± 2 metadata models (e.g. SchemaAnalyzer, LineageAnalyzer, and so on). We can then implement the analyzers for various systems.
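The "clone Kafka Connect's Schema, write converters" idea could look something like this. This is a hypothetical sketch, not Recap's actual API: the class and converter names are made up, and the type mapping is deliberately tiny, but it shows how one opinionated Schema with per-system converters replaces the one-off metadata schemas:

```python
# Hypothetical sketch: an opinionated Schema type modeled loosely on Kafka
# Connect's Schema, plus a converter protocol for mapping system-specific
# schemas into it. All names here are illustrative, not Recap's real API.
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Any, Protocol


class Type(Enum):
    """A small, closed set of logical types, like Kafka Connect's Schema.Type."""
    INT64 = "int64"
    FLOAT64 = "float64"
    BOOLEAN = "boolean"
    STRING = "string"
    BYTES = "bytes"
    STRUCT = "struct"


@dataclass(frozen=True)
class Field:
    name: str
    schema: Schema


@dataclass(frozen=True)
class Schema:
    type: Type
    optional: bool = False
    fields: tuple[Field, ...] = ()  # only populated when type == Type.STRUCT


class Converter(Protocol):
    """Converts a system-specific schema into the opinionated Schema."""

    def to_recap(self, native_schema: Any) -> Schema: ...


class SqlAlchemyConverter:
    """Example converter: maps a {column: SQL type string} dict to a STRUCT.

    A real converter would walk SQLAlchemy Table objects; the dict input
    keeps the sketch self-contained.
    """

    _TYPES = {"INTEGER": Type.INT64, "VARCHAR": Type.STRING, "BOOLEAN": Type.BOOLEAN}

    def to_recap(self, native_schema: dict[str, str]) -> Schema:
        fields = tuple(
            Field(name, Schema(self._TYPES.get(sql_type, Type.BYTES)))
            for name, sql_type in native_schema.items()
        )
        return Schema(Type.STRUCT, fields=fields)
```

With a duckdb, frictionless, or Avro converter targeting the same Schema, downstream tooling (diffing, compatibility checks, contracts) only ever has to understand one model.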
And where does that leave the stuff that doesn't fit the model I've described? (The SQLAlchemy analyzers had a bunch of index, foreign key, and partition stuff.) I think some of it will just get cut, and the rest will get folded into something more generic (e.g. db indexes look a lot like cluster/partition keys in BigQuery if you squint). That's my current line of thinking. My hope is that these changes will make Recap a lot more usable as a Python API, make the REST API more approachable, and make Recap useful as more than "just" a data catalog service. WDYT? cc @nahumsa @gunnarmorling
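A rough sketch of the "one analyzer interface per metadata model" shape, under the same caveat: every name here (the interfaces, the metadata models, the BigQuery example) is hypothetical, meant only to show how each interface returns exactly one opinionated model and systems implement whichever interfaces apply:

```python
# Hypothetical sketch: 7 +/- 2 analyzer interfaces mapping 1:1 to 7 +/- 2
# metadata models. Two of them are shown; names are illustrative.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SchemaMetadata:
    columns: dict[str, str]
    # System-specific index/clustering info folded into a generic field
    # instead of a one-off per-system schema.
    cluster_keys: tuple[str, ...] = ()


@dataclass(frozen=True)
class LineageMetadata:
    upstream: tuple[str, ...] = ()
    downstream: tuple[str, ...] = ()


class SchemaAnalyzer(ABC):
    @abstractmethod
    def analyze_schema(self, path: str) -> SchemaMetadata: ...


class LineageAnalyzer(ABC):
    @abstractmethod
    def analyze_lineage(self, path: str) -> LineageMetadata: ...


class BigQueryAnalyzer(SchemaAnalyzer):
    """One system implementing one interface. A real version would query
    BigQuery's INFORMATION_SCHEMA; this returns canned data to stay runnable."""

    def analyze_schema(self, path: str) -> SchemaMetadata:
        return SchemaMetadata(
            columns={"id": "INT64", "created_at": "TIMESTAMP"},
            cluster_keys=("id",),  # where a db index would get folded in
        )
```

The payoff is that a caller can ask any system for `SchemaMetadata` and get the same shape back, whether it came from BigQuery clustering keys or a Postgres index.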
-
It is hard to enforce data contracts, and there is little to no technical tooling for it either. Is there a world where something around data contracts gets added to Recap?