Replies: 36 comments
-
This is a great write-up. Thank you for taking the time to do it! First, I think it might be helpful for me to clarify some things.
Some comments inline:
💯 Here's what I propose: Avro, Protobuf, JSON schema, Parquet. Again, I don't think Recap should have 100% coverage of all of these schemas; I think it needs to have enough coverage of each to do:
Indeed. Arrow's data types look pretty robust, and the Recap spec covers most of them. I do like that Arrow models "large" and "small" variants for things like string, binary, and list. I think this addresses a bit of your "one thing to be conscious of is that these are OLAP/DB-centric" point. It's worth going through the exercise of mapping this schema set to the IDLs I outlined above (Avro, Proto, and JSON schema); at a glance I think it should map fairly well, though I notice it lacks
I think this is left to the transpiler for types that are unsupported. For supported types, the coercion should be defined in the Recap spec.
Yea, it does strike me that your definition of "schema" is a bit OLAP/DB centric. I'm not sure IDL folks would agree with your definition. I'll try and clean it up.
I need to think on this. At face value, I agree. OTOH, a complex type system can, itself, be hard to implement properly (yes, there can be test suites, but when I look at something like CUE lang, I don't think I'd want to implement it in another language; in fact, that's probably why it's only in Go right now 🤷). /cc @gunnarmorling
-
This is great and very helpful for me to understand the scope. I had in my mind a schema that has to be equally expressive for transportation and storage/querying.
100% agree on this; the tricky and important part is finding the right balance. CUE has a rich type system, but it also ends up offering a (non-Turing-complete) language you have to use to do anything, and that complicates things a lot. I do think, though, that the goal of making the transpiler as simple as possible is important from a DX perspective, as it allows a developer to iterate faster and avoid bugs at the same time.
100% agree on this too, but the semantics have to be explicit; that's another reason I prefer not to have the transpiler decide some of that stuff. Think of the case where someone builds pipelines for financial data: say the SEC requires nanosecond granularity, and by accident someone transforms the data into millisecond granularity. Can we safeguard the user from introducing these bugs?
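A rough sketch of what such a safeguard could look like, purely illustrative (the unit table and the `allow_lossy` flag are assumptions, not part of any actual Recap API):

```python
# Hypothetical sketch: refuse a lossy timestamp conversion unless the user opts in.
TIME_UNIT_EXPONENT = {"s": 0, "ms": 3, "us": 6, "ns": 9, "ps": 12}

def convert_timestamp_unit(source_unit: str, target_unit: str, allow_lossy: bool = False) -> str:
    """Return the target unit, or raise if the conversion would lose precision."""
    if TIME_UNIT_EXPONENT[target_unit] < TIME_UNIT_EXPONENT[source_unit] and not allow_lossy:
        raise ValueError(
            f"converting timestamp({source_unit}) to timestamp({target_unit}) loses precision; "
            "pass allow_lossy=True to do it anyway"
        )
    return target_unit

# convert_timestamp_unit("ns", "ms")                    -> raises ValueError
# convert_timestamp_unit("ns", "ms", allow_lossy=True)  -> "ms"
```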
This looks good! (BTW, Athena is based on Trino, so supporting one of the two should cover the other.) I had to edit my comment to add this, but we can probably just substitute BQ, Snowflake, Athena, Redshift, MySQL, PSQL, and Trino with standard (ANSI) SQL. It will simplify the design and should cover the majority of what these systems offer.
this looks great!
Yeah, OLAP/DBs do not have enums, and they are an important part of IDLs, but I think you have defined Union, right? Enums are syntactic sugar over unions (happy to be corrected by PL people on that).
That's my main objection, but mainly because I've been burned so many times by data-quality issues, and I feel that safety when transforming from one format to another is really important.
it's completely OLAP/DB centric!! That's where I come from :D
I totally agree that a CUE lang situation should be avoided (although I like CUE). Finding the right balance between expressivity and simplicity is very important in building a successful spec and it will def require iterations. But it's a fun and exciting opportunity!
-
Was thinking more on this last night. Two thoughts:
💯 Excellent. So then: Avro, Protobuf, JSON schema, Parquet, ANSI SQL, Arrow. I am putting a Google doc together that lists the types for each schema format, and how they should convert to/from Recap.
-
OK, so I have a Google doc that compares Avro, Protobuf, JSON Schema, ANSI SQL, Parquet, Arrow, and CUE. It's a little fuzzy and hand-wavy in places, but it was an informative exercise. https://docs.google.com/spreadsheets/d/1_gXOf8yjodZGFuNpskKd0KBL7Zv6Qjg_Ed-4vfEerTk/edit?usp=sharing Some takeaways:
-
Noodling on (2) a bit more, I think there are two approaches to describing schemas:
CUE relies heavily on the type system and can model, like, everything. Arrow is more pragmatic: it defines the common stuff and lets the chips fall where they may. I think I favor the latter. I'm going to try to fiddle with Recap to get it to look like Arrow.
-
I totally agree. I think Arrow's pragmatic approach, together with its adoption, makes it the best choice for what you're trying to achieve.
-
W.r.t. Arrow, there's still a question over whether to model things like Schema.fbs does or like the language APIs do (e.g. Python). For example (Schema.fbs):
vs. Arrow's Python API:
The former falls between CUE and more traditional IDLs. The latter is definitely a standard IDL style.
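For illustration (this does not reproduce the original examples), the contrast looks roughly like this, using pyarrow for the Python side; the Schema.fbs snippet is paraphrased in the comments:

```python
# Arrow's Schema.fbs models integers as one parameterized type, roughly:
#   table Int { bitWidth: int; is_signed: bool; }
# The Python API instead exposes concrete, pre-parameterized types:
import pyarrow as pa

schema = pa.schema([
    ("id", pa.int32()),      # Schema.fbs equivalent: Int(bitWidth=32, is_signed=True)
    ("count", pa.uint64()),  # Schema.fbs equivalent: Int(bitWidth=64, is_signed=False)
    ("name", pa.string()),
])
```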
-
Here's what types look like with explicit lengths set (e.g.
-
Here is the
It's much more compact. I think with sane defaults (e.g.
-
Why not have both? type = "int" as the foundation, and then provide standard types like int32 mapped to the above for convenience. Some kind of syntactic sugar to make working with the type system easier for developers who use it to implement concrete services, while the more expressive version can be used by library developers who want to extend the type system.
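A hypothetical sketch of that layering, with a parameterized `Int` as the foundation and familiar aliases as sugar on top (none of these names are the actual Recap API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Int:
    bits: int
    signed: bool = True

# Sugar for everyday use:
def int32() -> Int:
    return Int(bits=32, signed=True)

def uint64() -> Int:
    return Int(bits=64, signed=False)

# Library authors extending the type system can still construct the general form:
int24 = Int(bits=24, signed=True)
```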
-
The more I think about it, the more I believe that when it comes to the design of a type system like Recap's, you need to consider two types of users.
These two have slightly different needs. (1) needs an experience with more guardrails; the type system is there to guide her in the right direction, removing boilerplate and ensuring that bugs won't happen. (2) needs more expressivity and a system that can accommodate future requirements, e.g. now you want to add support for Thrift and it has some important features that weren't encountered in the previous specs. This person would trade guardrails for expressivity. IMO both are important for the success of the project, and if you can address both needs in an elegant way, it would be awesome. I think Arrow does something similar, but in their case there are very clear boundaries between (1) and (2): (1) only has to interface with the host-language API of the spec, so the abstractions are created there, while (2) deals with the core spec, where the system is more expressive.
-
Hah! I have been noodling this as well over the last couple of days.
100%. The only wrinkle I've been wrestling with is that I want to introduce some primitive check constraints to the SDL. This is something Arrow doesn't have to wrestle with; its checks are encoded as "bits" and "signed" attributes in Schema.fbs. I bounce back and forth about whether we want (2) to just use a CUE-like type system or to box them in with Arrow types. The upside of CUE-like (or KCL-like) check constraints is that it's pretty easy to add new types as you add new schemas. The downside is that it feels like it makes it harder for maintainers to implement the actual library (they need to implement some boolean-logic parser for check constraints). I might need to prototype it out to prove that point. For (1), for sure they need to deal with concrete types (float32, uint32, etc.).
-
Who's going to be enforcing the constraints? I think an issue to consider here is performance. CUE focuses mainly on validating configurations, which allows it to have a rich constraint-definition mechanism that doesn't have to worry much about validation performance. Now consider the opposite extreme, a firewall: the constraints you want to apply to packets have to be extremely predictable and low latency. If you plan to validate schemas only, you don't have to worry about perf, but if the plan is to also validate data, then this has to be explicitly considered as part of the design of the system.
-
I think Recap should be agnostic to when validation occurs. I think there are three levels:
As far as expressiveness goes, I think I only want to cover very basic checks. Basic boolean checks plus maybe. So, performance should be in line with the runtime checks that the validators I enumerated in (2) do. I think that's doable.
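To make "very basic checks" concrete, here is a toy sketch under that assumption: each check is a cheap predicate and validation is just AND-ing them, which is roughly the cost profile of the runtime validators mentioned above (all names are illustrative):

```python
from typing import Any, Callable

Check = Callable[[Any], bool]

def greater_or_equal(limit: float) -> Check:
    return lambda value: value >= limit

def less_than(limit: float) -> Check:
    return lambda value: value < limit

def validate(value: Any, checks: list[Check]) -> bool:
    # AND all checks together -- the same logic runtime validators
    # (JSON Schema validators, DB CHECK constraints, etc.) apply.
    return all(check(value) for check in checks)

assert validate(42, [greater_or_equal(0), less_than(100)])
```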
-
Catching up on all of this, good thread
Also in favor of this: it avoids a lot of added complexity and still covers 90% of what people realistically want to do.
I like this as well - similar to the comment above, I think you should design with the 90% use case in mind, and this lets you do that while still preserving flexibility. Someone looking at a large schema file full of
This seems pretty reasonable to me - behavior of supported types is part of the base spec, behavior of unsupported types either gets added to the spec along with a validation suite, or is undefined. As part of the spec, transpilers are required to warn the user in some way when they've implemented an unsupported type not defined in the spec (maybe fail by default, and require a special flag so there's no way someone can accidentally use it?).
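A hypothetical sketch of that "fail by default, opt in explicitly" behavior (the exception name and flag are illustrative only, not spec'd anywhere):

```python
class UnsupportedTypeError(Exception):
    pass

SUPPORTED = {"int32", "int64", "float64", "string", "bytes", "bool"}

def convert_type(type_name: str, allow_unsupported: bool = False) -> str:
    if type_name not in SUPPORTED:
        if not allow_unsupported:
            raise UnsupportedTypeError(
                f"{type_name!r} has no defined behavior in the spec; "
                "pass allow_unsupported=True to emit it anyway"
            )
        # From here on the spec leaves behavior undefined; at minimum
        # the transpiler should warn the user.
    return type_name
```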
-
And, even MORE CUE-like: I removed the
-
I really like the shape it's taking! Great job!
-
Thanks! I'm working on a Python MVP right now to validate that it's ergonomic for transpilers.
-
Ok, so I spent some time in Python today trying to implement the spec. Here's what the types look like: https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/__init__.py
And here's what an Avro transpiler might look like: https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/avro.py
It's by no means complete, and it's definitely got some bugs and TODOs. Still, it was an informative exercise, and it does seem like it'll work. The main pain point in the implementation was, unsurprisingly, having to deal with the constraints. I only implemented the
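For readers skimming the thread, here is a much-simplified illustration of the from_avro direction; the linked repo has the real implementation, and the Recap type names on the right are hypothetical:

```python
# Map Avro primitive type names to hypothetical Recap type names.
AVRO_PRIMITIVES = {
    "boolean": "Bool",
    "int": "Int32",
    "long": "Int64",
    "float": "Float32",
    "double": "Float64",
    "bytes": "Bytes",
    "string": "String",
}

def from_avro_primitive(avro_type: str) -> str:
    try:
        return AVRO_PRIMITIVES[avro_type]
    except KeyError:
        raise ValueError(f"no Recap mapping defined for Avro type {avro_type!r}")
```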
-
This is a schema transpiler only, right? The output is going to be a valid Avro schema? How is this going to be used by the user? I'm wondering if the transpiling phase is just one step of the overall workflow, where after that the Recap tooling can also provide integration with the rest of the Avro tooling (i.e. using the schema to generate code in a target language). Regarding the constraints: yes, it's kind of ugly and feels brittle. I was wondering if having more abstraction in place would help here. Constraints, in the end, are boolean expressions chained with and/or predicates. Maybe constraints at the schema level could be represented as a boolean expression that has to evaluate to true, with a small DSL for these expressions specifically. I think Databricks and the Delta table spec have one of the richest constraint systems I've seen; in that case SQL is the constraint DSL. I do think that what Delta/Databricks has is overkill and not the best experience, but it might give some ideas.
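A toy sketch of the "constraints are boolean expressions chained with and/or" idea: a tiny expression tree plus an evaluator. Purely illustrative, not the Recap constraint model:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class GreaterOrEqual:
    limit: Any
    def evaluate(self, value: Any) -> bool:
        return value >= self.limit

@dataclass
class LessThan:
    limit: Any
    def evaluate(self, value: Any) -> bool:
        return value < self.limit

@dataclass
class And:
    left: Any
    right: Any
    def evaluate(self, value: Any) -> bool:
        return self.left.evaluate(value) and self.right.evaluate(value)

@dataclass
class Or:
    left: Any
    right: Any
    def evaluate(self, value: Any) -> bool:
        return self.left.evaluate(value) or self.right.evaluate(value)

# Example: age must be in [0, 150)
age_constraint = And(GreaterOrEqual(0), LessThan(150))
assert age_constraint.evaluate(42)
```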
-
Yes. Avro -> Recap -> Avro.
from_avro's output is a Recap in-memory class hierarchy. to_avro is a valid Avro schema.
Yep, that's what I'm doing.
Yea 100%. I'm (re)learning the AST stuff on the fly, so my impl is certainly not the best. I had a fever dream last night and am re-working things a bit to make the typing more pure.
Yea, this is what I was trying to avoid, because it seems nasty to make every library maintainer (me) implement it. But now I'm discovering that NOT implementing it is even worse. :P I'm quickly discovering exactly why CUE has the operators it has.
I think this is actually a sign of failure. A rich constraint system (or at least, one with a lot of operators) is a sign of a less "pure" type system, compared to something like CUE, which is much more compact but incredibly expressive. The trick is making something like a CUE type system approachable to mere mortals like me. I'm experimenting right now with eliminating the
Still tinkering on exactly the right balance between user and library. For example, if I eliminate
-
So, one thought is to decouple the in-memory AST from the Recap YAML/TOML/ text-based spec. I had been thinking about things this way anyway, but hadn't made it explicit.
-
... if you squint, this is kinda what Arrow does with their Schema.fbs vs. language APIs. Their Schema.fbs is just a much more constrained type system than I think we want.
-
(I'm also willing to admit that I've been locked in a garage talking to myself for 2 months. Maybe we should just go back to explicit types--
-
Hahaha, or maybe I should come and take you out of the garage and go for a beer! That stuff usually accelerates development in the end! Jokes aside, I do agree that the Databricks approach is a sign of failure and that the right type system can improve the experience A LOT. Although DB also has to deal with the SQL reality, which is another thing.
-
Okay! I've implemented a pretty pure type system: https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/__init__.py
It supports all of the types described in the spec, and it supports the following constraints:
It passes some very basic smoke tests for converting to and from Avro. There are for sure some bugs in the type system, as I haven't implemented any tests for it, but it does seem to work in the REPL for happy-path cases. (You might notice that this looks very similar to CUE's operators. It's not by accident. The only things missing are a couple of the functions that CUE supports--length and such.)
-
Ok, more progress. I've implemented a fairly full-featured Avro transpiler. I even managed to implement enums using the type system:

```python
case EnumSchema():
    return schema.Union(
        types=[
            schema.String64(constraints=[schema.Equal(symbol)])
            for symbol in avro_schema.symbols
        ]
    )
```

(If it's an Avro EnumSchema, then use a union of constant string types--one for each Avro enum symbol.) Still a ton to do, but I'm growing more confident that this approach is going to work. 😄 If I can do Avro and something database-ish, I think we can move forward with the spec.
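A hedged guess at what the reverse direction could look like: when converting back to Avro, a union whose members are all constant strings can be collapsed into an enum. This reuses the schema.Union / schema.String64 / schema.Equal names from the snippet above; the attribute access (.types, .constraints, .value) and the import path are assumptions, and the actual to_avro logic in the repo may differ:

```python
from typing import Optional
from recap import schema  # import path assumed from the linked repo layout

def union_to_enum_symbols(union: "schema.Union") -> Optional[list]:
    """Return the enum symbols if the union is enum-shaped, else None."""
    symbols = []
    for member in union.types:                # .types assumed
        if not isinstance(member, schema.String64):
            return None                       # not an enum-shaped union
        equals = [c for c in member.constraints if isinstance(c, schema.Equal)]
        if len(equals) != 1:
            return None
        symbols.append(equals[0].value)       # .value on Equal is assumed
    return symbols
```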
-
If y'all want to poke around, the types are here: https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/__init__.py
And the Avro transpilers are here: https://github.com/recap-cloud/recap-schema-python/blob/main/recap/schema/avro.py
Side note: transpiler or converter?
-
I would go with converter instead of transpiler. I'll play around with the code hopefully tomorrow. Good job!
-
Okay! I've moved the Python implementation over to Recap: See the notes in the commit for where things stand. I'm going to move this awesome thread over to https://github.com/recap-cloud/recap and close this repo out. As for the text-based spec, I'm going to leave that for another day. All I really need for Recap right now is the in-memory model. I also made the Recap base
-
Intro
A useful distinction for data schemas is the following:
The first category includes representations like Avro, Protobuf, Thrift, and JSON.
The second category includes anything relational, as well as serializations like Parquet and ORC.
This distinction is important because the standards in each category tend to focus on different things, as they address different use cases and needs.
But being able to transform from one category to the other is very important. Data will be shared through an API as JSON, moved into a Kafka topic as Avro, land on S3 as Avro, get transformed into Parquet, flow through different relational pipelines, and finally be served as JSON through a cache or something similar to be consumed by an app.
This high-level data lifecycle is common in every org that is seriously working with data. It takes many different forms depending on the use case, but what remains the same is the need to switch from one data serialization to another.
Type System
When it comes to defining a type system that can cover the schema, and potentially the data, transformation from any of the above serializations to another, we need to consider the minimum possible subset of types that allows us to do that.
To do that, I find it useful to distinguish types into the following categories:
This is important!!
To figure out the set of types that can serve Recap well, it is useful to first consider a fixed set of formats that we want to be able to represent with it. For example, Thrift supports a Set type but Avro and Protobuf don't. Does it make sense to include a Set type in Recap?
My suggestion here is to start by defining a set of serializations we always want to be compatible with, and to keep this set as minimal as possible while maximizing the surface area of use cases it covers.
Parameterized Types
In SQL it's pretty common to have parameterized types; think of VARCHAR(120), for example, or many of the timestamp types. Do we want the type system to support these, and if yes, how do we deal with type coercion? Trino supports timestamps with up to picosecond resolution and no other engine does, so how would we map a timestamp(picoseconds) to a timestamp(milliseconds)?
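A hypothetical sketch of how parameterized types could be represented so the parameters survive a round trip (class and field names are illustrative, not the Recap spec):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VarChar:
    max_length: int   # e.g. VARCHAR(120) -> VarChar(max_length=120)

@dataclass(frozen=True)
class Timestamp:
    unit: str         # "s", "ms", "us", "ns", "ps", ...

# Trino's picosecond timestamps have no lossless mapping in most engines, so
# the coercion rule (truncate, reject, or warn) has to live in the spec,
# not in each transpiler.
trino_ts = Timestamp(unit="ps")
```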
Primitive Types
I believe a good idea here is to use Arrow's core types as a guide for designing Recap's primitive types. The main reason is to be opportunistic about Arrow's adoption: as more systems adopt it, Arrow's core types will be used more widely, which will make it easier to maintain Recap and to add new mappings in the future.
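For reference, here is a non-exhaustive sample of Arrow's core types via pyarrow, i.e. the kind of primitive set Recap could anchor on:

```python
import pyarrow as pa

arrow_core_samples = [
    pa.bool_(),
    pa.int8(), pa.int32(), pa.int64(),
    pa.uint32(), pa.uint64(),
    pa.float32(), pa.float64(),
    pa.string(), pa.large_string(),
    pa.binary(), pa.large_binary(),
    pa.timestamp("ms"), pa.timestamp("ns"),
    pa.decimal128(38, 9),
    pa.list_(pa.int32()),
    pa.struct([("x", pa.float64()), ("y", pa.float64())]),
]
```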
The Recap Schema
The use of "schema" in Recap is a bit confusing. To me, a schema is a logical grouping of unique entities, but the way it is used in Recap, a schema can also be what is usually referred to as a record in other languages.
I would recommend trying to be consistent with the more generally adopted semantics around these terms, to avoid confusion and help with the adoption of the spec. So,
Transpilers
I believe the transpilers should be as simple as possible, and as much of the transpiling logic as possible should be part of the type system and schema spec.
Allowing the transpiler to decide how to transform a type can easily lead to semantic issues and reduces the value that a standard like Recap can offer.
For this reason I think that some types, like enums, which are supported by pretty much every serialization in use today, should be core types of the spec.
In general, transpilers should be as dumb as possible and the spec should encapsulate as many semantics as possible.