Hi there! I wanted to start by thanking all the individuals who have contributed to this project so far and delivered so much value to the Spark community. I have a few questions that are somewhat related.

**Model Stability:** A common pattern I've seen among organizations leveraging Spline is to use only the Spark-Agent component and implement a custom dispatcher that maps the Spline ExecutionPlan to their own DTO and does whatever is needed from there. Based on the documentation, modeling directly off the ExecutionPlan is essentially an anti-pattern, since it does not go through one of the Spline "public APIs". This is further evidenced by heavy breaking changes from lower Spline versions to higher ones (performing a migration in my case broke our transformation model quite badly). Our biggest challenge was the removal of the domain model constants/classes, which appear to be represented as plain maps in the 1.1+ data types. Are there any plans for similar changes in the future? The inclusion of the versioned ModelMapper seems helpful here, but any mapping based on the current version's DTOs feels dangerous from an upgrade perspective, and that has already bitten us in the 0.3 -> 0.7 upgrade. From some digging around in the recent 1.2 model changes it looks a bit more stable, but I worry about having to redo our lineage mapping on essentially every Spline upgrade.

**Per Column Lineage:** With the model present in 1.1+, do you have any suggestions for the cleanest way to derive per-column lineage from the ExecutionPlan (i.e., finding only the columns that influence one specific written column, ignoring the type of operation)? Looking at the model, I could either go top-down from the WriteOperation -> DataOperations -> ReadOperations and their children, or bottom-up from the ReadOperations -> DataOperations -> WriteOperation, but I see a cardinality issue like the one below (fields left out for brevity; this is close to an example output):
"operations": {
"write": {
"id": "op-0",
"childIds": [
"op-1"
]
},
"reads": [
{
"id": "op-3",
"output": [
"attr-0",
"attr-1"
]
}
],
"other": [
{
"id": "op-2",
"output": [
"attr-2",
"attr-3"
]
}
]
}
}
In the above example, I can easily crawl down from op-0 and see that attr-2 and attr-3 in my WriteOperation are derived from attr-0 and attr-1 in op-3's ReadOperation. But since a transformation happens in between, how do I know explicitly that attr-0 -> attr-2 and attr-1 -> attr-3, and not something like attr-0 -> attr-3? Even after looking through the expressions and functions, this is unclear to me.
@rycowhi We had a similar requirement to derive per-column lineage for any write operation: #937. We have been using the snippet below to get it; posting it here for reference.
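In rough outline, the trick is that (if I read the 1.1 model right) the per-column links do not live on the operations at all, but in the plan's top-level `attributes` and `expressions` sections: each attribute carries `childRefs` pointing at the attribute(s) or expression(s) it was computed from. Below is a minimal sketch of that walk; the case classes are simplified stand-ins that mirror the shape of the JSON (not the actual Spline producer DTOs), and `sourceColumnsOf` is a hypothetical helper name, so treat this as an illustration rather than the exact snippet.

```scala
// Simplified stand-ins for the 1.1+ model shapes, NOT the real Spline DTOs.
// A reference points at either another attribute or an expression.
sealed trait Ref
final case class AttrRef(attrId: String) extends Ref
final case class ExprRef(exprId: String) extends Ref

final case class Attribute(id: String, childRefs: Seq[Ref])
final case class Expression(id: String, childRefs: Seq[Ref])

object PerColumnLineage {

  /**
   * For one written column, walk the attribute/expression dependency graph
   * down to the attributes produced by read operations.
   *
   * @param writeAttrId id of the written column's attribute (e.g. "attr-2")
   * @param attributes  all attributes of the plan, keyed by id
   * @param expressions all expressions of the plan, keyed by id
   * @param readAttrIds ids of attributes emitted by read operations
   * @return the read-side attribute ids that influence the written column
   */
  def sourceColumnsOf(
      writeAttrId: String,
      attributes: Map[String, Attribute],
      expressions: Map[String, Expression],
      readAttrIds: Set[String]
  ): Set[String] = {

    // Depth-first walk; `seen` guards against revisits and cycles.
    @scala.annotation.tailrec
    def loop(pending: List[Ref], seen: Set[String], acc: Set[String]): Set[String] =
      pending match {
        case Nil =>
          acc
        case AttrRef(id) :: rest if seen(id) =>
          loop(rest, seen, acc)
        case AttrRef(id) :: rest if readAttrIds(id) =>
          // Reached an attribute emitted by a read operation: record it as a source.
          loop(rest, seen + id, acc + id)
        case AttrRef(id) :: rest =>
          // Intermediate attribute: keep walking through whatever it was computed from.
          val children = attributes.get(id).map(_.childRefs.toList).getOrElse(Nil)
          loop(children ::: rest, seen + id, acc)
        case ExprRef(id) :: rest if seen(id) =>
          loop(rest, seen, acc)
        case ExprRef(id) :: rest =>
          // Expression node (function call, alias, literal, ...): descend into its inputs.
          val children = expressions.get(id).map(_.childRefs.toList).getOrElse(Nil)
          loop(children ::: rest, seen + id, acc)
      }

    loop(List(AttrRef(writeAttrId)), Set.empty, Set.empty)
  }
}
```

With the example plan above, `sourceColumnsOf("attr-2", attrsById, exprsById, Set("attr-0", "attr-1"))` would return only the read attributes that attr-2's `childRefs` chain actually reaches (e.g. `Set("attr-0")`), which is what lets you state attr-0 -> attr-2 explicitly instead of attributing every read column to every written column at the operation level.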
Spline is still a fairly young project, and changes in its model and architecture can happen. For the Producer API, however, we strive to maintain mutual backward compatibility between different versions of the agent and the server, specifically to make upgrades as easy as possible for users. The ultimate goal is to support all versions of the Producer model from 1.0 (corresponding to Spline version 0.4) upwards for as long as possible. This requirement is dictated by the vision of the Spline server as a central, company-wide lineage tracking system. We understand that it might be difficult to control the versions of the agent instances used by a potentially large number of jobs throughout a big organization…