feat(ingest/datahub): Add way to filter soft deleted entities #11738

Merged (6 commits) on Oct 30, 2024

Conversation

treff7es
Contributor

  • Add a way to filter soft-deleted entities
  • Add a way to filter aspects
  • Fix SQL query pagination so that it does not return duplicate entries

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Add way to filter aspects
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Oct 29, 2024
exclude_aspects: Set[str] = Field(
    default_factory=set,
    description="Set of aspect names to exclude from ingestion",
)
Collaborator

I assume we use a set here instead of an AllowDenyPattern so that we can push down the filters to sql?

Contributor Author

exactly
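
A minimal sketch of what "pushing the filter down to SQL" can look like with a plain set of aspect names. It is not the code from this PR: it uses the standard-library `sqlite3` driver instead of the source's real database driver, the table name `metadata_aspect_v2` and the helper `fetch_aspects` are assumptions for illustration, and the placeholders are expanded by hand because `sqlite3` does not expand a tuple for `NOT IN`.

```python
import sqlite3
from typing import List, Set, Tuple


def fetch_aspects(
    conn: sqlite3.Connection, exclude_aspects: Set[str]
) -> List[Tuple[str, str]]:
    """Return (urn, aspect) rows, filtering excluded aspects inside the query."""
    query = "SELECT urn, aspect FROM metadata_aspect_v2"
    params: List[str] = []
    if exclude_aspects:
        # One "?" per excluded name; the values are bound, never interpolated.
        placeholders = ", ".join("?" for _ in exclude_aspects)
        query += f" WHERE LOWER(aspect) NOT IN ({placeholders})"
        params = sorted(name.lower() for name in exclude_aspects)
    query += " ORDER BY urn, aspect"
    return conn.execute(query, params).fetchall()


# Hypothetical in-memory data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata_aspect_v2 (urn TEXT, aspect TEXT)")
conn.executemany(
    "INSERT INTO metadata_aspect_v2 VALUES (?, ?)",
    [
        ("urn:li:dataset:a", "status"),
        ("urn:li:dataset:a", "datasetProfile"),
        ("urn:li:dataset:b", "ownership"),
    ],
)
rows = fetch_aspects(conn, {"datasetProfile"})
```

An exact-match set maps directly onto `NOT IN`, so excluded rows never leave the database. A regex-based allow/deny pattern would generally have to be evaluated client-side after the rows are fetched.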

{"" if self.config.include_all_versions else "AND mav.version = 0"}
{"" if not self.config.exclude_aspects else f"AND LOWER(mav.aspect) NOT IN %(exclude_aspects)s"}
AND mav.createdon >= %(since_createdon)s
ORDER BY createdon, urn, aspect, version
Collaborator

can we indent the subquery here - otherwise the sql is hard to read

also - I like how we format sql queries in the fivetran source - I'd like to move towards this type of formatting everywhere

AND mav.createdon >= %(since_createdon)s
ORDER BY createdon, urn, aspect, version
) as t
where row_id > %(last_id)s
Collaborator

why are we doing manual pagination instead of just using db cursors?
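
For context, a rough sketch of the manual pagination pattern that the `row_id > %(last_id)s` fragment suggests. It is an illustration only: it assumes `row_id` comes from `ROW_NUMBER()` over the same total ordering used in the `ORDER BY`, runs against `sqlite3` rather than the real database, and uses a hypothetical `metadata_aspect_v2` table and `iter_rows` helper.

```python
import sqlite3
from typing import Iterator, Tuple

# The subquery numbers rows in one deterministic order; each page then
# resumes strictly after the last row_id that was already returned.
PAGE_QUERY = """
SELECT row_id, urn, aspect, version
FROM (
    SELECT
        ROW_NUMBER() OVER (ORDER BY createdon, urn, aspect, version) AS row_id,
        urn, aspect, version
    FROM metadata_aspect_v2
) AS t
WHERE row_id > :last_id
ORDER BY row_id
LIMIT :batch_size
"""


def iter_rows(
    conn: sqlite3.Connection, batch_size: int = 2
) -> Iterator[Tuple[int, str, str, int]]:
    last_id = 0
    while True:
        page = conn.execute(
            PAGE_QUERY, {"last_id": last_id, "batch_size": batch_size}
        ).fetchall()
        if not page:
            return
        yield from page
        last_id = page[-1][0]  # remember where this page ended


# Hypothetical data with repeated createdon values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata_aspect_v2 "
    "(urn TEXT, aspect TEXT, version INTEGER, createdon TEXT)"
)
conn.executemany(
    "INSERT INTO metadata_aspect_v2 VALUES (?, ?, ?, ?)",
    [
        ("urn:li:dataset:a", "status", 0, "2024-10-01"),
        ("urn:li:dataset:a", "ownership", 0, "2024-10-01"),
        ("urn:li:dataset:b", "status", 0, "2024-10-01"),
        ("urn:li:dataset:b", "status", 1, "2024-10-02"),
        ("urn:li:dataset:c", "status", 0, "2024-10-02"),
    ],
)
result = list(iter_rows(conn))
```

Because the ordering covers `urn`, `aspect`, and `version` in addition to `createdon`, rows that share a timestamp still get a stable position, which is what keeps pages from overlapping. The alternative raised in the comment would be to run the query once and read it in chunks with the DB-API `cursor.fetchmany(size)` call.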

@@ -92,6 +92,23 @@ def __init__(
            **connection_config.options,
        )

    @property
    def soft_deleted_urns(self) -> str:
Collaborator

Suggested change
- def soft_deleted_urns(self) -> str:
+ def soft_deleted_urns_query(self) -> str:

"Cannot exclude soft deleted entities without a database connection"
)
soft_deleted_urns = [
row[0] for row in database_reader.get_soft_deleted_rows()
Collaborator

will this read the soft-deleted urns list twice?
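
One way to rule out a double read, sketched under assumptions: the classes `FakeDatabaseReader` and `Source` below are hypothetical stand-ins, not the PR's classes, and only the `get_soft_deleted_rows()` name is taken from the diff. The idea is to materialize the list once with `functools.cached_property` and reuse it.

```python
from functools import cached_property
from typing import List, Tuple


class FakeDatabaseReader:
    """Hypothetical reader that counts how often the rows are fetched."""

    def __init__(self) -> None:
        self.calls = 0

    def get_soft_deleted_rows(self) -> List[Tuple[str]]:
        self.calls += 1
        return [("urn:li:dataset:a",), ("urn:li:dataset:b",)]


class Source:
    """Hypothetical source holding a reader."""

    def __init__(self, database_reader: FakeDatabaseReader) -> None:
        self.database_reader = database_reader

    @cached_property
    def soft_deleted_urns(self) -> List[str]:
        # Computed on first access, then stored on the instance.
        return [row[0] for row in self.database_reader.get_soft_deleted_rows()]


reader = FakeDatabaseReader()
source = Source(reader)
first = source.soft_deleted_urns
second = source.soft_deleted_urns  # served from the cache, no second query
```

Whether the merged code needed this depends on how the list is consumed, which the fragments here do not show.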

@treff7es
Contributor Author

Merging as the error is unrelated

@treff7es treff7es merged commit b33ad0a into datahub-project:master Oct 30, 2024
73 of 75 checks passed
@treff7es treff7es deleted the datahub_source_improvements branch October 30, 2024 16:41