Enable users to store sample-lookup-optimized tables #573

samanvp · 2020-03-31T15:40:31Z

This PR includes the following changes:

Add a new input flag and validate its value.
Add BQ queries to flatten the call column.
Extract the schema of the flattened table.
Add unit tests to verify the correctness of the extracted schema.
Create tables from the extracted schema, partitioned on the call_sample_id column.
Copy data from the variant-lookup-optimized tables to the sample-lookup-optimized tables.

We will add integration tests in a follow-up PR.
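
To make the flattening step concrete, here is what it does to a single row, modeled in plain Python rather than BigQuery. This is a minimal sketch and not the code in this PR: `flatten_calls` is a hypothetical helper, the sample row is made up, and the `call_<field>` naming is inferred from the `call_sample_id` column mentioned above.

```python
def flatten_calls(variant_row, call_column='call'):
  """Yields one flat dict per call, repeating the variant-level fields."""
  variant_fields = {
      key: value for key, value in variant_row.items() if key != call_column}
  for call in variant_row.get(call_column, []):
    flat_row = dict(variant_fields)
    for field, value in call.items():
      # Nested call fields become top-level columns prefixed with "call_".
      flat_row['{}_{}'.format(call_column, field)] = value
    yield flat_row


# Made-up variant row with two nested calls.
row = {
    'reference_name': 'chr1',
    'start_position': 100,
    'call': [
        {'sample_id': 11, 'genotype': [0, 1]},
        {'sample_id': 22, 'genotype': [1, 1]},
    ],
}
flat_rows = list(flatten_calls(row))
```

Each element of `flat_rows` carries its own `call_sample_id`, which is what makes partitioning on that column possible, while the variant-level fields are repeated once per call.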

tneymanov · 2020-05-14T12:42:59Z

cloudbuild_CI.yaml

@@ -42,9 +42,7 @@ steps:
      - '--project ${PROJECT_ID}'
      - '--image_tag ${COMMIT_SHA}'
      - '--run_unit_tests'
-      - '--run_preprocessor_tests'


Please restore.

tneymanov · 2020-05-14T13:32:39Z

gcp_variant_transforms/options/variant_transform_options.py

+                              sharding_config_path, append):
+    if (output_table_base_name !=
+        bigquery_util.get_table_base_name(output_table_base_name)):
+      raise ValueError(('Output table cannot contain "{}" we reserve this  '


Nit carried over from the previous implementation: use 'Output table cannot contain "{}". We reserve this ' (end the sentence after "{}" and drop one of the two trailing spaces).

tneymanov · 2020-05-14T13:44:10Z

gcp_variant_transforms/libs/bigquery_util.py

@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id):
      SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
  _run_table_creation_command(bq_command)

+class FlattenCallColumn(object):


This file is getting unruly, don't you think? Maybe move this to a new file? If you do so, please make sure to update copyright accordingly.

I agree, I'd like to move both this class and LoadAvro to a new module. Let's make this change in a separate PR; this one is big already.
Submitted #598

tneymanov · 2020-05-14T13:44:27Z

gcp_variant_transforms/libs/bigquery_util.py

@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id):
      SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
  _run_table_creation_command(bq_command)

+class FlattenCallColumn(object):


Please add a docstring.

tneymanov · 2020-05-15T00:55:57Z

gcp_variant_transforms/libs/bigquery_util.py

 _GCS_DELETE_FILES_COMMAND = 'gsutil -m rm -f -R {ROOT_PATH}'
-_BQ_LOAD_JOB_NUM_RETRIES = 5
+_BQ_NUM_RETRIES = 3


Why are we reducing retry times?

debug artifact, removed.

tneymanov · 2020-05-15T10:16:39Z

gcp_variant_transforms/libs/bigquery_util.py

+        break
+    logging.info('Copy to table query was successful: %s', output_table_id)
+
+  def _create_temp_flatten_table(self):


Hmm, the naming doesn't really reflect what's happening here, but there is no really good name...

How about _create_temp_flatten_table_with_1_row at least?

tneymanov · 2020-05-15T10:18:21Z

gcp_variant_transforms/libs/bigquery_util.py

+        SCHEMA_FILE_PATH=schema_file_path)
+    result = os.system(bq_command)
+    if result != 0:
+      logging.error('Failed to extract flatten table schema using "%s" command',


We log errors but don't throw exceptions? Below makes more sense, but here?

This method returns a boolean to indicate whether it completed successfully. The caller, vcf_to_bq.py, raises an exception when it returns False.
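
The split described in this reply, where the helper logs and returns a boolean and the caller decides whether to raise, might look like the following. This is a minimal sketch under stated assumptions: `extract_schema` and `run_pipeline_step` are hypothetical names, the commands are stand-ins, and `subprocess.call` is used in place of the `os.system` call shown in the diff.

```python
import logging
import subprocess
import sys


def extract_schema(command):
  """Runs `command`; logs and returns False if it exits with a non-zero code."""
  result = subprocess.call(command)
  if result != 0:
    logging.error('Failed to extract schema using "%s" command',
                  ' '.join(command))
    return False
  return True


def run_pipeline_step(command):
  """Caller side: turns a failed extraction into an exception."""
  if not extract_schema(command):
    raise ValueError('Schema extraction failed: {}'.format(' '.join(command)))


# Stand-in commands so the sketch runs anywhere Python is installed.
ok_command = [sys.executable, '-c', 'import sys; sys.exit(0)']
bad_command = [sys.executable, '-c', 'import sys; sys.exit(1)']
```

Keeping the helper exception-free lets the caller pick the exception type and message that fit the pipeline stage.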

tneymanov · 2020-05-15T10:24:30Z

gcp_variant_transforms/options/variant_transform_options.py

+    parser.add_argument(
+        '--sample_lookup_optimized_output_table',
+        default='',
+        help=('In addition to the default output tables (which are optimized '


Does it make sense to allow either output_table, sample_lookup_optimized_output_table or both instead of enforcing output_table?

Not really: we fill sample_lookup_optimized_output_table using BQ queries based on the schema and content of output_table.
On a more conceptual level, we assume VT always creates variant-lookup-optimized tables, while sample-lookup-optimized tables are needed by only some users.

tneymanov · 2020-05-15T10:26:59Z

gcp_variant_transforms/options/variant_transform_options.py

+        client, parsed_args.output_table,
+        parsed_args.sharding_config_path, parsed_args.append)
+
+    if parsed_args.sample_lookup_optimized_output_table:


We should make sure sample_lookup_optimized_output_table does not equal output_table.

Thanks for spotting this corner case!
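
The two constraints raised in review of this flag (it depends on output_table, and it must name a different table) could be checked as follows. This is a minimal sketch only: `parse_and_validate` is a hypothetical helper, the error wording is illustrative, and the PR's actual validation code is not shown here.

```python
import argparse


def parse_and_validate(argv):
  """Parses the two table flags and enforces the constraints between them."""
  parser = argparse.ArgumentParser()
  parser.add_argument('--output_table', default='')
  parser.add_argument('--sample_lookup_optimized_output_table', default='')
  args = parser.parse_args(argv)

  sample_table = args.sample_lookup_optimized_output_table
  if sample_table:
    if not args.output_table:
      # The sample-optimized tables are filled from output_table.
      raise ValueError(
          '--sample_lookup_optimized_output_table requires --output_table.')
    if sample_table == args.output_table:
      raise ValueError(
          '--sample_lookup_optimized_output_table must differ from '
          '--output_table.')
  return args
```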

tneymanov · 2020-05-15T10:29:53Z

gcp_variant_transforms/options/variant_transform_options.py

+        help=('In addition to the default output tables (which are optimized '
+              'for variant look up queries), you can store a second copy of '
+              'your data in BigQuery tables that are optimized for sample '
+              'look up queries. Note that setting this option will double your '


Since the tables are going to be flattened, are their storage costs going to be higher than those of the original output_tables? Or do they end up roughly equal?

If the data is not joint genotyped, then it's approximately equal.
I updated the help text to make this clear; thanks for spotting it.
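
The reasoning behind "approximately equal" can be checked with a row-count estimate. This is a rough sketch with made-up inputs, and row counts are only a proxy for storage: flattening emits one row per call, so a variant row carrying N calls becomes N rows that each repeat the variant-level columns. `flattened_row_count` and `row_growth_factor` are hypothetical helpers.

```python
def flattened_row_count(calls_per_variant):
  """One output row per call."""
  return sum(calls_per_variant)


def row_growth_factor(calls_per_variant):
  """Ratio of flattened rows to original variant rows."""
  return float(flattened_row_count(calls_per_variant)) / len(calls_per_variant)


# Not joint genotyped: every variant row carries a single call.
single_sample = [1, 1, 1, 1]
# Joint genotyped across three samples: every row carries three calls.
joint_genotyped = [3, 3, 3, 3]
```

With one call per row the factor is 1, which matches the reply above; with joint-genotyped input the variant-level columns are repeated once per sample.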

samanvp

All done, thanks!

samanvp · 2020-05-15T22:03:06Z

gcp_variant_transforms/libs/bigquery_util.py

 _GCS_DELETE_FILES_COMMAND = 'gsutil -m rm -f -R {ROOT_PATH}'
-_BQ_LOAD_JOB_NUM_RETRIES = 5
+_BQ_NUM_RETRIES = 3


debug artifact, removed.

samanvp · 2020-05-15T22:12:19Z

gcp_variant_transforms/libs/bigquery_util.py

@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id):
      SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
  _run_table_creation_command(bq_command)

+class FlattenCallColumn(object):


I agree, I'd like to move both this class and LoadAvro to a new module. Let's make this change in a separate PR; this one is big already.
Submitted #598

samanvp · 2020-05-17T21:26:56Z

cloudbuild_CI.yaml

@@ -42,9 +42,7 @@ steps:
      - '--project ${PROJECT_ID}'
      - '--image_tag ${COMMIT_SHA}'
      - '--run_unit_tests'
-      - '--run_preprocessor_tests'


samanvp · 2020-05-18T04:53:30Z

gcp_variant_transforms/libs/bigquery_util.py

@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id):
      SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
  _run_table_creation_command(bq_command)

+class FlattenCallColumn(object):


samanvp · 2020-05-18T04:54:38Z

gcp_variant_transforms/libs/bigquery_util.py

+        break
+    logging.info('Copy to table query was successful: %s', output_table_id)
+
+  def _create_temp_flatten_table(self):


samanvp · 2020-05-18T04:56:52Z

gcp_variant_transforms/libs/bigquery_util.py

+        SCHEMA_FILE_PATH=schema_file_path)
+    result = os.system(bq_command)
+    if result != 0:
+      logging.error('Failed to extract flatten table schema using "%s" command',


This method returns a boolean to indicate whether it completed successfully. The caller, vcf_to_bq.py, raises an exception when it returns False.

samanvp · 2020-05-18T05:00:35Z

gcp_variant_transforms/options/variant_transform_options.py

+    parser.add_argument(
+        '--sample_lookup_optimized_output_table',
+        default='',
+        help=('In addition to the default output tables (which are optimized '


Not really: we fill sample_lookup_optimized_output_table using BQ queries based on the schema and content of output_table.
On a more conceptual level, we assume VT always creates variant-lookup-optimized tables, while sample-lookup-optimized tables are needed by only some users.

samanvp · 2020-05-18T05:09:07Z

gcp_variant_transforms/options/variant_transform_options.py

+        help=('In addition to the default output tables (which are optimized '
+              'for variant look up queries), you can store a second copy of '
+              'your data in BigQuery tables that are optimized for sample '
+              'look up queries. Note that setting this option will double your '


If the data is not joint genotyped, then it's approximately equal.
I updated the help text to make this clear; thanks for spotting it.

samanvp · 2020-05-18T05:13:33Z

gcp_variant_transforms/options/variant_transform_options.py

+        client, parsed_args.output_table,
+        parsed_args.sharding_config_path, parsed_args.append)
+
+    if parsed_args.sample_lookup_optimized_output_table:


Thanks for spotting this corner case!

samanvp · 2020-05-18T05:15:12Z

gcp_variant_transforms/options/variant_transform_options.py

+                              sharding_config_path, append):
+    if (output_table_base_name !=
+        bigquery_util.get_table_base_name(output_table_base_name)):
+      raise ValueError(('Output table cannot contain "{}" we reserve this  '



tneymanov

LGTM, aside from minor changes.

tneymanov · 2020-05-19T16:07:22Z

gcp_variant_transforms/libs/bigquery_util.py

@@ -508,6 +527,199 @@ def create_sample_info_table(output_table_id):
      SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
  _run_table_creation_command(bq_command)

+class FlattenCallColumn(object):
+  """Flattens call column to convert varinat opt tables to sample opt tables."""


nit: s/varinat/variant

tneymanov · 2020-05-19T16:33:05Z

gcp_variant_transforms/libs/bigquery_util.py

+                                          CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS)
+    cp_query += ' LIMIT 1'  # We need this table only to extract its schema.
+    self._copy_to_flatten_table(full_output_table_id, cp_query)
+    logging.info('A new table with 1 row was crated: %s', full_output_table_id)


nit: s/crated/created

tneymanov · 2020-05-19T17:06:27Z

gcp_variant_transforms/libs/bigquery_util.py

+    select_list = []
+    for column in column_names:
+      if column != ColumnKeyConstants.CALLS:
+        select_list.append(_MAIN_TABLE_ALIAS + '.' + column + ' AS `'+


This should be done in template.
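
One way to read this suggestion: move the column expression into a module-level format string next to the other query templates, instead of concatenating pieces inline. This is a minimal sketch with hypothetical names (`_COLUMN_AS_TEMPLATE`, `build_select_list`); the alias and call column values are placeholders because the real constants are not shown in this diff, and aliasing each column to its own name is an assumption about the truncated line.

```python
# Placeholder values; the real constants are not shown in the diff.
_MAIN_TABLE_ALIAS = 'main_table'
_CALL_COLUMN = 'call'

# Hypothetical template for one entry of the SELECT list.
_COLUMN_AS_TEMPLATE = '{TABLE_ALIAS}.{COL} AS `{COL_NAME}`'


def build_select_list(column_names):
  """Formats every non-call column through the template."""
  return [
      _COLUMN_AS_TEMPLATE.format(
          TABLE_ALIAS=_MAIN_TABLE_ALIAS, COL=column, COL_NAME=column)
      for column in column_names
      if column != _CALL_COLUMN
  ]


select_list = build_select_list(['reference_name', 'start_position', 'call'])
```

A named template keeps the backtick quoting in one place, so a quoting fix does not have to be repeated at every concatenation site.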

tneymanov · 2020-05-19T17:08:01Z

gcp_variant_transforms/libs/bigquery_util.py

+                                          MAIN_TABLE_ALIAS=_MAIN_TABLE_ALIAS,
+                                          CALL_COLUMN=ColumnKeyConstants.CALLS,
+                                          CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS)
+    cp_query += ' LIMIT 1'  # We need this table only to extract its schema.


Maybe put this into const too.

I don't think this will be used in any other method in this module. So for now, if you don't mind, I will keep it here.

samanvp

All done, thanks!

samanvp · 2020-05-19T20:23:27Z

gcp_variant_transforms/libs/bigquery_util.py

@@ -508,6 +527,199 @@ def create_sample_info_table(output_table_id):
      SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
  _run_table_creation_command(bq_command)

+class FlattenCallColumn(object):
+  """Flattens call column to convert varinat opt tables to sample opt tables."""


samanvp · 2020-05-19T20:39:54Z

gcp_variant_transforms/libs/bigquery_util.py

+    select_list = []
+    for column in column_names:
+      if column != ColumnKeyConstants.CALLS:
+        select_list.append(_MAIN_TABLE_ALIAS + '.' + column + ' AS `'+


samanvp · 2020-05-19T20:41:00Z

gcp_variant_transforms/libs/bigquery_util.py

+                                          MAIN_TABLE_ALIAS=_MAIN_TABLE_ALIAS,
+                                          CALL_COLUMN=ColumnKeyConstants.CALLS,
+                                          CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS)
+    cp_query += ' LIMIT 1'  # We need this table only to extract its schema.


I don't think this will be used in any other method in this module. So for now, if you don't mind, I will keep it here.

samanvp · 2020-05-19T20:41:08Z

gcp_variant_transforms/libs/bigquery_util.py

+                                          CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS)
+    cp_query += ' LIMIT 1'  # We need this table only to extract its schema.
+    self._copy_to_flatten_table(full_output_table_id, cp_query)
+    logging.info('A new table with 1 row was crated: %s', full_output_table_id)


* Enable users to store sample look up optimized tables
  + Add input flag and validate its value.
  + Add BQ queries to flatten call column.
  + Extract the schema of the flatten table.
  + Add unit tests to verify the correctness of extracted schema.
* First round of comments
* Sync with googlegenomics#596
* Second round of comments

samanvp force-pushed the sample_opt_tables branch from 4d177c5 to 6cb91f7 Compare March 31, 2020 21:55

samanvp force-pushed the sample_opt_tables branch 10 times, most recently from c74ad88 to 279906d Compare April 25, 2020 05:30

samanvp force-pushed the sample_opt_tables branch 14 times, most recently from 7dcad29 to 3f3666a Compare May 14, 2020 04:04

samanvp requested a review from tneymanov May 14, 2020 04:12

tneymanov reviewed May 15, 2020


samanvp mentioned this pull request May 15, 2020

Delete empty tables after AVRO copy stage #596

Merged

samanvp commented May 18, 2020


samanvp force-pushed the sample_opt_tables branch from 4f6e0f6 to 0197e0b Compare May 19, 2020 03:16

samanvp force-pushed the sample_opt_tables branch 2 times, most recently from 243eea9 to fab844d Compare May 19, 2020 03:50

samanvp added 2 commits May 19, 2020 07:37

Enable users to store sample look up optimized tables

26215fb

+ Add input flag and validate its value.
+ Add BQ queries to flatten call column.
+ Extract the schema of the flatten table.
+ Add unit tests to verify the correctness of extracted schema.

First round of comments

d54adb8

samanvp force-pushed the sample_opt_tables branch 3 times, most recently from 924ff45 to 79bbd6e Compare May 19, 2020 11:55

Sync with googlegenomics#596

d4d290e

samanvp force-pushed the sample_opt_tables branch from 79bbd6e to d4d290e Compare May 19, 2020 11:57

tneymanov approved these changes May 19, 2020


samanvp commented May 19, 2020


Second round of comments

5ae8f27

samanvp merged commit 0154b55 into googlegenomics:master May 19, 2020

samanvp deleted the sample_opt_tables branch May 19, 2020 20:57
