Hi there! I wanted to start by thanking all the individuals who have contributed to this project so far and delivered so much value to the Spark community. I have a few questions that are somewhat related.

**Model Stability:** A common pattern I've seen among organizations leveraging Spline is to use only the Spark-Agent component and implement a custom dispatcher that maps the Spline ExecutionPlan to their own DTO and does whatever is needed from there. Based on the documentation, modeling directly off the ExecutionPlan is essentially an anti-pattern, since it does not go through one of the Spline "public APIs". This is further evidenced by heavy breaking changes from lower Spline versions to higher ones (performing a migration in my case broke our transformation model quite badly). Our biggest challenge was the removal of the domain model constants/classes, which appear to be represented as plain maps in the 1.1+ data types. Are there any plans for similar changes in the future? The inclusion of the versioned ModelMapper seems helpful here, but any mapping based on the current version's DTOs feels dangerous from an upgrade perspective, and that has already bitten us in the 0.3 -> 0.7 upgrade. From some digging around in the recent 1.2 model changes it looks a bit more stable, but I worry about having to redo our lineage mapping on essentially every Spline upgrade.

**Per Column Lineage:** With the model present in 1.1+, do you have any suggestions for the cleanest way to derive per-column lineage from the ExecutionPlan (i.e., finding only the columns that influence one specific written column, ignoring the type of operation)? Looking at the model, I could either go top-down from the WriteOperation -> DataOperations -> ReadOperations and their children, or bottom-up from the ReadOperations -> DataOperations -> WriteOperation, but I see a cardinality issue like the one below (fields left out for brevity; this is close to an example output):
"operations": {
"write": {
"id": "op-0",
"childIds": [
"op-1"
]
},
"reads": [
{
"id": "op-3",
"output": [
"attr-0",
"attr-1"
]
}
],
"other": [
{
"id": "op-2",
"output": [
"attr-2",
"attr-3"
]
}
]
}
}
In the above example, I can easily crawl down from op-0 and see that attr-2 and attr-3 in my WriteOperation are derived from attr-0 and attr-1 in op-3's ReadOperation. But since a transformation happens in between, how do I know explicitly that attr-0 -> attr-2 and attr-1 -> attr-3, and not something like attr-0 -> attr-3? Even after looking through the expressions and functions, this is unclear to me.
@rycowhi We had a similar requirement to derive per-column lineage for any write operation: #937. We have been using the snippet below to get it; posting it here for reference.
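In rough outline, the trick is that (if I read the 1.1 model right) the per-column links do not live on the operations at all, but in the plan's top-level `attributes` and `expressions` sections: each attribute carries `childRefs` pointing at the attribute(s) or expression(s) it was computed from. Below is a minimal sketch of that walk; the case classes are simplified stand-ins that mirror the shape of the JSON (not the actual Spline producer DTOs), and `sourceColumnsOf` is a hypothetical helper name, so treat this as an illustration rather than the exact snippet.

```scala
// Simplified stand-ins for the 1.1+ model shapes, NOT the real Spline DTOs.
// A reference points at either another attribute or an expression.
sealed trait Ref
final case class AttrRef(attrId: String) extends Ref
final case class ExprRef(exprId: String) extends Ref

final case class Attribute(id: String, childRefs: Seq[Ref])
final case class Expression(id: String, childRefs: Seq[Ref])

object PerColumnLineage {

  /**
   * For one written column, walk the attribute/expression dependency graph
   * down to the attributes produced by read operations.
   *
   * @param writeAttrId id of the written column's attribute (e.g. "attr-2")
   * @param attributes  all attributes of the plan, keyed by id
   * @param expressions all expressions of the plan, keyed by id
   * @param readAttrIds ids of attributes emitted by read operations
   * @return the read-side attribute ids that influence the written column
   */
  def sourceColumnsOf(
      writeAttrId: String,
      attributes: Map[String, Attribute],
      expressions: Map[String, Expression],
      readAttrIds: Set[String]
  ): Set[String] = {

    // Depth-first walk; `seen` guards against revisits and cycles.
    @scala.annotation.tailrec
    def loop(pending: List[Ref], seen: Set[String], acc: Set[String]): Set[String] =
      pending match {
        case Nil =>
          acc
        case AttrRef(id) :: rest if seen(id) =>
          loop(rest, seen, acc)
        case AttrRef(id) :: rest if readAttrIds(id) =>
          // Reached an attribute emitted by a read operation: record it as a source.
          loop(rest, seen + id, acc + id)
        case AttrRef(id) :: rest =>
          // Intermediate attribute: keep walking through whatever it was computed from.
          val children = attributes.get(id).map(_.childRefs.toList).getOrElse(Nil)
          loop(children ::: rest, seen + id, acc)
        case ExprRef(id) :: rest if seen(id) =>
          loop(rest, seen, acc)
        case ExprRef(id) :: rest =>
          // Expression node (function call, alias, literal, ...): descend into its inputs.
          val children = expressions.get(id).map(_.childRefs.toList).getOrElse(Nil)
          loop(children ::: rest, seen + id, acc)
      }

    loop(List(AttrRef(writeAttrId)), Set.empty, Set.empty)
  }
}
```

With the example plan above, `sourceColumnsOf("attr-2", attrsById, exprsById, Set("attr-0", "attr-1"))` would return only the read attributes that attr-2's `childRefs` chain actually reaches (e.g. `Set("attr-0")`), which is what lets you state attr-0 -> attr-2 explicitly instead of attributing every read column to every written column at the operation level.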
Spline is still a fairly young project, and changes in its model and architecture can happen. For the Producer API, however, we strive to maintain mutual backward compatibility between different versions of the agent and the server, specifically to make upgrades as easy as possible for users. The ultimate goal is to support all versions of the Producer model from 1.0 (corresponding to Spline version 0.4) upwards for as long as possible. This requirement is dictated by the vision of the Spline server as a central, company-wide lineage tracking system. We understand that it might be difficult to control the versions of the agent instances used by a potentially large number of jobs throughout a big organization…