Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/dbt): dbt column-level lineage #8991

Merged
merged 26 commits into from
Nov 14, 2023

Conversation

hsheth2
Copy link
Collaborator

@hsheth2 hsheth2 commented Oct 11, 2023

Changes stacked on top of #8989

Caveats

  • if you're referencing tables directly (outside of dbt's ref or source), those won't get CLL

TODOs

  • add tests for schema inference
  • add tests for CLL generation

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@maggiehays maggiehays added the hacktoberfest-accepted Acceptance for hacktoberfest https://hacktoberfest.com/participation/ label Oct 31, 2023
Copy link
Collaborator

@asikowitz asikowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core functionality looked like it was in _infer_schemas_and_update_cll, which made enough sense but I didn't really get a full picture of the process. I don't like how this logic is decently different from our CLL in other sources, but I'm guessing there's some key differences between the two.

and should_fetch_target_node_schema
and graph
):
schema_metadata = graph.get_aspect(target_node_urn, SchemaMetadata)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't cached right? Is it possible we end up querying this multiple times for the same urn? Could we do a bulk fetch instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it won't ever query for the same urn multiple times

if self.config.include_column_lineage and sql_result:
# We only save the debug info here. We're report errors based on it later, after
# applying the configured node filters.
node.cll_debug_info = sql_result.debug_info
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attaching lineage info to nodes is different than we do most other sources. How come you're doing it this way here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lineage is determined by the view definition and the schema of the upstreams. in most other sources, we have the schemas available, but in the case of dbt ephemeral models, we have to infer the schemas. that means we need to do it topographical order, so it's easier to do it all in one go

@hsheth2 hsheth2 merged commit 19aa215 into datahub-project:master Nov 14, 2023
51 checks passed
@hsheth2 hsheth2 deleted the dbt-cll branch November 14, 2023 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest-accepted Acceptance for hacktoberfest https://hacktoberfest.com/participation/ ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants