
fix(ingest/teradata): Teradata speed up changes #9059

Merged: 19 commits merged into master on Nov 24, 2023

Conversation

treff7es (Contributor):
Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes, an entry has been made in Updating DataHub

@asikowitz (Collaborator) left a comment:
Thanks for getting this out so fast. I'd like to hold off on merging this because it's a bit hacky; perhaps we should refactor our SQL common connector into overridable methods instead. But great to have this.

Comment on lines +350 to +741:

```python
setattr(  # noqa: B010
    inspector,
    "get_table_names",
    lambda schema: [
        i.name
        for i in filter(
            lambda t: t.object_type != "View", self._tables_cache[schema]
        )
    ],
)
yield from super().loop_tables(inspector, schema, sql_config)
```
Collaborator:

This seems kinda hacky lol, do we need to make our SQL common source more extensible?
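The refactor hinted at here could be an overridable hook on the common source rather than a `setattr` monkey-patch. A minimal sketch; the class and method names below are illustrative, not DataHub's actual API:

```python
from typing import List


class SQLAlchemySource:
    """Stand-in for the common SQL source with an overridable hook."""

    def get_table_names(self, inspector, schema: str) -> List[str]:
        # Default behavior: delegate to the SQLAlchemy inspector.
        return inspector.get_table_names(schema)


class TeradataSource(SQLAlchemySource):
    """Teradata override that answers from the prefetched cache."""

    def __init__(self, tables_cache):
        # Maps schema name -> list of cached table/view entries.
        self._tables_cache = tables_cache

    def get_table_names(self, inspector, schema: str) -> List[str]:
        # Serve table names from the cache, filtering out views,
        # instead of patching inspector.get_table_names at runtime.
        return [
            t.name
            for t in self._tables_cache.get(schema, [])
            if t.object_type != "View"
        ]
```

With this shape, `loop_tables` in the base class calls `self.get_table_names(...)` and each dialect source overrides it cleanly.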

Comment on lines 234 to 239:

```sql
WHERE DatabaseName NOT IN ('All', 'Crashdumps', 'DBC', 'dbcmngr',
    'Default', 'External_AP', 'EXTUSER', 'LockLogShredder', 'PUBLIC',
    'Sys_Calendar', 'SysAdmin', 'SYSBAR', 'SYSJDBC', 'SYSLIB',
    'SystemFe', 'SYSUDTLIB', 'SYSUIF', 'TD_SERVER_DB', 'TDStats',
    'TD_SYSGPL', 'TD_SYSXML', 'TDMaps', 'TDPUSER', 'TDQCD',
    'tdwm', 'SQLJ', 'TD_SYSFNLIB', 'SYSSPATIAL')
```
Collaborator:
Would be nice to reuse this list
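One way to honor this suggestion is to hoist the list into a module-level constant and render the `NOT IN` clause from it wherever a query needs it. The constant and helper names below are my own, not from the PR; the database names are copied from the query above:

```python
# Teradata system databases excluded from ingestion.
EXCLUDED_DATABASES = (
    "All", "Crashdumps", "DBC", "dbcmngr",
    "Default", "External_AP", "EXTUSER", "LockLogShredder", "PUBLIC",
    "Sys_Calendar", "SysAdmin", "SYSBAR", "SYSJDBC", "SYSLIB",
    "SystemFe", "SYSUDTLIB", "SYSUIF", "TD_SERVER_DB", "TDStats",
    "TD_SYSGPL", "TD_SYSXML", "TDMaps", "TDPUSER", "TDQCD",
    "tdwm", "SQLJ", "TD_SYSFNLIB", "SYSSPATIAL",
)


def excluded_databases_clause() -> str:
    # Render the shared list as a SQL "DatabaseName NOT IN (...)" fragment.
    quoted = ", ".join(f"'{db}'" for db in EXCLUDED_DATABASES)
    return f"DatabaseName NOT IN ({quoted})"
```

Every query that filters system databases can then interpolate `excluded_databases_clause()` instead of duplicating the literal list.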


```python
use_cached_metadata: bool = Field(
    default=True,
    description="Whether to use cached metadata. This reduce the number of queries to the database but requires to have cached metadata.",
)
```
Collaborator:
Suggested change:

```diff
-    description="Whether to use cached metadata. This reduce the number of queries to the database but requires to have cached metadata.",
+    description="Whether to use cached metadata. This reduces the number of queries to the database but requires storing all tables in memory.",
```

Comment on lines 283 to 596:

```python
# self.loop_tables = self.cached_loop_tables
# self.loop_views = self.cached_loop_views
# self.get_table_properties = self.cached_get_table_properties
```
Collaborator:

Suggested change (remove the commented-out code):

```diff
-        # self.loop_tables = self.cached_loop_tables
-        # self.loop_views = self.cached_loop_views
-        # self.get_table_properties = self.cached_get_table_properties
```

Comment on lines 287 to 288:

```python
# self._get_columns = lambda dataset_name, inspector, schema, table: []
# self._get_foreign_keys = lambda dataset_name, inspector, schema, table: []
```
Collaborator:

Suggested change (remove the commented-out code):

```diff
-        # self._get_columns = lambda dataset_name, inspector, schema, table: []
-        # self._get_foreign_keys = lambda dataset_name, inspector, schema, table: []
```

Comment on lines +327 to +710:

```python
# url = self.config.get_sql_alchemy_url(current_db=db)
# with create_engine(url, **self.config.options).connect() as conn:
#     inspector = inspect(conn)
```
Collaborator:

Suggested change (remove the commented-out code):

```diff
-        # url = self.config.get_sql_alchemy_url(current_db=db)
-        # with create_engine(url, **self.config.options).connect() as conn:
-        #     inspector = inspect(conn)
```

```python
self.report.num_queries_parsed += 1
if self.report.num_queries_parsed % 1000 == 0:
    logger.info(f"Parsed {self.report.num_queries_parsed} queries")

print(entry.query)
```
treff7es (Contributor, Author):
You left this print here, I don't know if it was intentional
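If the stray print was meant as debug output, it could be folded into the logging that the surrounding lines already use. A sketch, assuming a report object with a `num_queries_parsed` counter as in the snippet above; the helper name is illustrative:

```python
import logging

logger = logging.getLogger(__name__)


def count_and_log_query(report, query: str, interval: int = 1000) -> None:
    # Mirror the snippet above: bump the counter, log progress every
    # `interval` queries, and send the per-query text to debug level
    # instead of print().
    report.num_queries_parsed += 1
    if report.num_queries_parsed % interval == 0:
        logger.info("Parsed %d queries", report.num_queries_parsed)
    logger.debug("Parsing query: %s", query)
```

At the default log level the per-query text disappears from normal runs but stays recoverable with `--debug`-style logging.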


```python
if self.config.use_cached_metadata:
    if self.config.include_tables or self.config.include_views:
```
Collaborator:

Do we not make any of these queries if include_tables and include_views are false?

```diff
 ) -> Iterable[MetadataWorkUnit]:
     result = sqlglot_lineage(
-        sql=query,
+        # With this clever hack we can make the query parser to not fail on queries with CASESPECIFIC
+        sql=query.replace("(NOT CASESPECIFIC)", ""),
```
Collaborator:

I actually have a whole host of query parsing hacks to add, somewhere, lol
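For the CASESPECIFIC hack in the diff above, a regex-based strip is slightly more tolerant than an exact string replace, since spacing and casing of the Teradata column attribute vary in real DDL. A sketch of a standalone preprocessing helper, separate from DataHub's `sqlglot_lineage` call; the names are my own:

```python
import re

# Match the Teradata column attribute "(NOT CASESPECIFIC)" with flexible
# internal whitespace and any casing.
_CASESPECIFIC_RE = re.compile(r"\(\s*NOT\s+CASESPECIFIC\s*\)", re.IGNORECASE)


def strip_casespecific(sql: str) -> str:
    # Remove every "(NOT CASESPECIFIC)" attribute so the parser does not
    # choke on Teradata-specific DDL before lineage extraction.
    return _CASESPECIFIC_RE.sub("", sql)
```

The cleaned string can then be passed where the diff currently inlines `query.replace("(NOT CASESPECIFIC)", "")`.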

@treff7es treff7es merged commit 5ccb30e into datahub-project:master Nov 24, 2023
52 checks passed
Labels: hacktoberfest-accepted (Acceptance for hacktoberfest, https://hacktoberfest.com/participation/), ingestion (PR or Issue related to the ingestion of metadata)
3 participants