
fix(ingest/teradata): Teradata speed up changes #9059

Merged: 19 commits merged into master on Nov 24, 2023

Conversation

treff7es (Contributor):
Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes, an entry has been made in Updating DataHub

@asikowitz (Collaborator) left a comment:
Thanks for getting this out so fast. I'd like to hold off on merging this because it's a bit hacky; perhaps we should refactor our SQL common connector into overridable methods instead. But great to have this.

Comment on lines +350 to +741:

```python
setattr(  # noqa: B010
    inspector,
    "get_table_names",
    lambda schema: [
        i.name
        for i in filter(
            lambda t: t.object_type != "View", self._tables_cache[schema]
        )
    ],
)
yield from super().loop_tables(inspector, schema, sql_config)
```
Collaborator:

This seems kinda hacky lol, do we need to make our SQL common source more extensible?
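The refactor hinted at here could be an overridable hook on the common source rather than a `setattr` monkey-patch. A minimal sketch; the class and method names below are illustrative, not DataHub's actual API:

```python
from typing import List


class SQLAlchemySource:
    """Stand-in for the common SQL source with an overridable hook."""

    def get_table_names(self, inspector, schema: str) -> List[str]:
        # Default behavior: delegate to the SQLAlchemy inspector.
        return inspector.get_table_names(schema)


class TeradataSource(SQLAlchemySource):
    """Teradata override that answers from the prefetched cache."""

    def __init__(self, tables_cache):
        # Maps schema name -> list of cached table/view entries.
        self._tables_cache = tables_cache

    def get_table_names(self, inspector, schema: str) -> List[str]:
        # Serve table names from the cache, filtering out views,
        # instead of patching inspector.get_table_names at runtime.
        return [
            t.name
            for t in self._tables_cache.get(schema, [])
            if t.object_type != "View"
        ]
```

With this shape, `loop_tables` in the base class calls `self.get_table_names(...)` and each dialect source overrides it cleanly.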

Comment on lines 234 to 239:

```sql
WHERE DatabaseName NOT IN ('All', 'Crashdumps', 'DBC', 'dbcmngr',
    'Default', 'External_AP', 'EXTUSER', 'LockLogShredder', 'PUBLIC',
    'Sys_Calendar', 'SysAdmin', 'SYSBAR', 'SYSJDBC', 'SYSLIB',
    'SystemFe', 'SYSUDTLIB', 'SYSUIF', 'TD_SERVER_DB', 'TDStats',
    'TD_SYSGPL', 'TD_SYSXML', 'TDMaps', 'TDPUSER', 'TDQCD',
    'tdwm', 'SQLJ', 'TD_SYSFNLIB', 'SYSSPATIAL')
```
Collaborator:
Would be nice to reuse this list
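One way to honor this suggestion is to hoist the list into a module-level constant and render the `NOT IN` clause from it wherever a query needs it. The constant and helper names below are my own, not from the PR; the database names are copied from the query above:

```python
# Teradata system databases excluded from ingestion.
EXCLUDED_DATABASES = (
    "All", "Crashdumps", "DBC", "dbcmngr",
    "Default", "External_AP", "EXTUSER", "LockLogShredder", "PUBLIC",
    "Sys_Calendar", "SysAdmin", "SYSBAR", "SYSJDBC", "SYSLIB",
    "SystemFe", "SYSUDTLIB", "SYSUIF", "TD_SERVER_DB", "TDStats",
    "TD_SYSGPL", "TD_SYSXML", "TDMaps", "TDPUSER", "TDQCD",
    "tdwm", "SQLJ", "TD_SYSFNLIB", "SYSSPATIAL",
)


def excluded_databases_clause() -> str:
    # Render the shared list as a SQL "DatabaseName NOT IN (...)" fragment.
    quoted = ", ".join(f"'{db}'" for db in EXCLUDED_DATABASES)
    return f"DatabaseName NOT IN ({quoted})"
```

Every query that filters system databases can then interpolate `excluded_databases_clause()` instead of duplicating the literal list.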


```python
use_cached_metadata: bool = Field(
    default=True,
    description="Whether to use cached metadata. This reduce the number of queries to the database but requires to have cached metadata.",
)
```
Collaborator:
Suggested change:

```diff
-    description="Whether to use cached metadata. This reduce the number of queries to the database but requires to have cached metadata.",
+    description="Whether to use cached metadata. This reduces the number of queries to the database but requires storing all tables in memory.",
```

Comment on lines 283 to 596:

```python
# self.loop_tables = self.cached_loop_tables
# self.loop_views = self.cached_loop_views
# self.get_table_properties = self.cached_get_table_properties
```
Collaborator:

Suggested change (remove the commented-out code):

```diff
-        # self.loop_tables = self.cached_loop_tables
-        # self.loop_views = self.cached_loop_views
-        # self.get_table_properties = self.cached_get_table_properties
```

Comment on lines 287 to 288:

```python
# self._get_columns = lambda dataset_name, inspector, schema, table: []
# self._get_foreign_keys = lambda dataset_name, inspector, schema, table: []
```
Collaborator:

Suggested change (remove the commented-out code):

```diff
-        # self._get_columns = lambda dataset_name, inspector, schema, table: []
-        # self._get_foreign_keys = lambda dataset_name, inspector, schema, table: []
```

Comment on lines +327 to +710:

```python
# url = self.config.get_sql_alchemy_url(current_db=db)
# with create_engine(url, **self.config.options).connect() as conn:
#     inspector = inspect(conn)
```
Collaborator:

Suggested change (remove the commented-out code):

```diff
-        # url = self.config.get_sql_alchemy_url(current_db=db)
-        # with create_engine(url, **self.config.options).connect() as conn:
-        #     inspector = inspect(conn)
```

```python
self.report.num_queries_parsed += 1
if self.report.num_queries_parsed % 1000 == 0:
    logger.info(f"Parsed {self.report.num_queries_parsed} queries")

print(entry.query)
```
treff7es (Contributor, Author):
You left this print here, I don't know if it was intentional
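If the stray print was meant as debug output, it could be folded into the logging that the surrounding lines already use. A sketch, assuming a report object with a `num_queries_parsed` counter as in the snippet above; the helper name is illustrative:

```python
import logging

logger = logging.getLogger(__name__)


def count_and_log_query(report, query: str, interval: int = 1000) -> None:
    # Mirror the snippet above: bump the counter, log progress every
    # `interval` queries, and send the per-query text to debug level
    # instead of print().
    report.num_queries_parsed += 1
    if report.num_queries_parsed % interval == 0:
        logger.info("Parsed %d queries", report.num_queries_parsed)
    logger.debug("Parsing query: %s", query)
```

At the default log level the per-query text disappears from normal runs but stays recoverable with `--debug`-style logging.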


```python
if self.config.use_cached_metadata:
    if self.config.include_tables or self.config.include_views:
```
Collaborator:

Do we not make any of these queries if include_tables and include_views are false?

```diff
 ) -> Iterable[MetadataWorkUnit]:
     result = sqlglot_lineage(
-        sql=query,
+        # With this clever hack we can make the query parser to not fail on queries with CASESPECIFIC
+        sql=query.replace("(NOT CASESPECIFIC)", ""),
```
Collaborator:

I actually have a whole host of query parsing hacks to add, somewhere, lol
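For the CASESPECIFIC hack in the diff above, a regex-based strip is slightly more tolerant than an exact string replace, since spacing and casing of the Teradata column attribute vary in real DDL. A sketch of a standalone preprocessing helper, separate from DataHub's `sqlglot_lineage` call; the names are my own:

```python
import re

# Match the Teradata column attribute "(NOT CASESPECIFIC)" with flexible
# internal whitespace and any casing.
_CASESPECIFIC_RE = re.compile(r"\(\s*NOT\s+CASESPECIFIC\s*\)", re.IGNORECASE)


def strip_casespecific(sql: str) -> str:
    # Remove every "(NOT CASESPECIFIC)" attribute so the parser does not
    # choke on Teradata-specific DDL before lineage extraction.
    return _CASESPECIFIC_RE.sub("", sql)
```

The cleaned string can then be passed where the diff currently inlines `query.replace("(NOT CASESPECIFIC)", "")`.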

@treff7es treff7es merged commit 5ccb30e into datahub-project:master Nov 24, 2023
52 checks passed
Labels: hacktoberfest-accepted (Acceptance for hacktoberfest, https://hacktoberfest.com/participation/), ingestion (PR or Issue related to the ingestion of metadata)
3 participants