-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable users to store sample-lookup-optimized tables #573
Conversation
4d177c5
to
6cb91f7
Compare
c74ad88
to
279906d
Compare
7dcad29
to
3f3666a
Compare
cloudbuild_CI.yaml
Outdated
@@ -42,9 +42,7 @@ steps: | |||
- '--project ${PROJECT_ID}' | |||
- '--image_tag ${COMMIT_SHA}' | |||
- '--run_unit_tests' | |||
- '--run_preprocessor_tests' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please restore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
sharding_config_path, append): | ||
if (output_table_base_name != | ||
bigquery_util.get_table_base_name(output_table_base_name)): | ||
raise ValueError(('Output table cannot contain "{}" we reserve this ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit from previous implementation: 'Output table cannot contain "{}". We reserve this ' (remove 1 space at the end, stop sentence at {})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id): | |||
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH) | |||
_run_table_creation_command(bq_command) | |||
|
|||
class FlattenCallColumn(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is getting unruly, don't you think? Maybe move this to a new file? If you do so, please make sure to update copyright accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I'd like to move both this class and LoadAvro
to a new module. Let's do this change in a separate PR, this one is big already.
Submitted #598
@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id): | |||
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH) | |||
_run_table_creation_command(bq_command) | |||
|
|||
class FlattenCallColumn(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
_GCS_DELETE_FILES_COMMAND = 'gsutil -m rm -f -R {ROOT_PATH}' | ||
_BQ_LOAD_JOB_NUM_RETRIES = 5 | ||
_BQ_NUM_RETRIES = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we reducing retry times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug artifact, removed.
break | ||
logging.info('Copy to table query was successful: %s', output_table_id) | ||
|
||
def _create_temp_flatten_table(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm naming doesn't really represent what's happening here, but there is no really good name...
How about _create_temp_flatten_table_with_1_row at least?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
SCHEMA_FILE_PATH=schema_file_path) | ||
result = os.system(bq_command) | ||
if result != 0: | ||
logging.error('Failed to extract flatten table schema using "%s" command', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We log errors but don't throw exceptions? Below makes more sense, but here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method returns a boolean to indicate the successful completion of its task. I throw an exception in the caller vcf_to_bq.py
.
parser.add_argument( | ||
'--sample_lookup_optimized_output_table', | ||
default='', | ||
help=('In addition to the default output tables (which are optimized ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to allow either output_table, sample_lookup_optimized_output_table or both instead of enforcing output_table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, we fill sample_lookup_optimized_output_table
using BQ queries based on the schema and content of output_table
.
On a more conceptual level, we assume VT is expected to always create variant lookup optimized tables while sample lookup optimized tables are needed by only some of the users.
client, parsed_args.output_table, | ||
parsed_args.sharding_config_path, parsed_args.append) | ||
|
||
if parsed_args.sample_lookup_optimized_output_table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure sample_lookup_optimized_output_table does not equal output_table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spotting this corner case!
help=('In addition to the default output tables (which are optimized ' | ||
'for variant look up queries), you can store a second copy of ' | ||
'your data in BigQuery tables that are optimized for sample ' | ||
'look up queries. Note that setting this option will double your ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the tables are going to be unflattened, are their storage costs going to be more than the original output_tables? Or do they end up roughly equal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If data is not joint genotyped then it's approximately equal.
I updated the help to make this issue clear, thanks for spotting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All done, thanks!
_GCS_DELETE_FILES_COMMAND = 'gsutil -m rm -f -R {ROOT_PATH}' | ||
_BQ_LOAD_JOB_NUM_RETRIES = 5 | ||
_BQ_NUM_RETRIES = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug artifact, removed.
@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id): | |||
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH) | |||
_run_table_creation_command(bq_command) | |||
|
|||
class FlattenCallColumn(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I'd like to move both this class and LoadAvro
to a new module. Let's do this change in a separate PR, this one is big already.
Submitted #598
cloudbuild_CI.yaml
Outdated
@@ -42,9 +42,7 @@ steps: | |||
- '--project ${PROJECT_ID}' | |||
- '--image_tag ${COMMIT_SHA}' | |||
- '--run_unit_tests' | |||
- '--run_preprocessor_tests' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id): | |||
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH) | |||
_run_table_creation_command(bq_command) | |||
|
|||
class FlattenCallColumn(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
break | ||
logging.info('Copy to table query was successful: %s', output_table_id) | ||
|
||
def _create_temp_flatten_table(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
SCHEMA_FILE_PATH=schema_file_path) | ||
result = os.system(bq_command) | ||
if result != 0: | ||
logging.error('Failed to extract flatten table schema using "%s" command', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method returns a boolean to indicate the successful completion of its task. I throw an exception in the caller vcf_to_bq.py
.
parser.add_argument( | ||
'--sample_lookup_optimized_output_table', | ||
default='', | ||
help=('In addition to the default output tables (which are optimized ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, we fill sample_lookup_optimized_output_table
using BQ queries based on the schema and content of output_table
.
On a more conceptual level, we assume VT is expected to always create variant lookup optimized tables while sample lookup optimized tables are needed by only some of the users.
help=('In addition to the default output tables (which are optimized ' | ||
'for variant look up queries), you can store a second copy of ' | ||
'your data in BigQuery tables that are optimized for sample ' | ||
'look up queries. Note that setting this option will double your ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If data is not joint genotyped then it's approximately equal.
I updated the help to make this issue clear, thanks for spotting it.
client, parsed_args.output_table, | ||
parsed_args.sharding_config_path, parsed_args.append) | ||
|
||
if parsed_args.sample_lookup_optimized_output_table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spotting this corner case!
sharding_config_path, append): | ||
if (output_table_base_name != | ||
bigquery_util.get_table_base_name(output_table_base_name)): | ||
raise ValueError(('Output table cannot contain "{}" we reserve this ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
4f6e0f6
to
0197e0b
Compare
243eea9
to
fab844d
Compare
+ Add input flag and validate its value. + Add BQ queries to flatten call column. + Extract the schema of the flatten table. + Add unit tests to verify the correctness of extracted schema.
924ff45
to
79bbd6e
Compare
79bbd6e
to
d4d290e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, aside from minor changes.
@@ -508,6 +527,199 @@ def create_sample_info_table(output_table_id): | |||
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH) | |||
_run_table_creation_command(bq_command) | |||
|
|||
class FlattenCallColumn(object): | |||
"""Flattens call column to convert varinat opt tables to sample opt tables.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/varinat/variant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS) | ||
cp_query += ' LIMIT 1' # We need this table only to extract its schema. | ||
self._copy_to_flatten_table(full_output_table_id, cp_query) | ||
logging.info('A new table with 1 row was crated: %s', full_output_table_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/crated/created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
select_list = [] | ||
for column in column_names: | ||
if column != ColumnKeyConstants.CALLS: | ||
select_list.append(_MAIN_TABLE_ALIAS + '.' + column + ' AS `'+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done in template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
MAIN_TABLE_ALIAS=_MAIN_TABLE_ALIAS, | ||
CALL_COLUMN=ColumnKeyConstants.CALLS, | ||
CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS) | ||
cp_query += ' LIMIT 1' # We need this table only to extract its schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put this into const too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will be used in any other method in this module. So for now, if you don't mine, I will keep it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All done, thanks!
@@ -508,6 +527,199 @@ def create_sample_info_table(output_table_id): | |||
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH) | |||
_run_table_creation_command(bq_command) | |||
|
|||
class FlattenCallColumn(object): | |||
"""Flattens call column to convert varinat opt tables to sample opt tables.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
select_list = [] | ||
for column in column_names: | ||
if column != ColumnKeyConstants.CALLS: | ||
select_list.append(_MAIN_TABLE_ALIAS + '.' + column + ' AS `'+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
MAIN_TABLE_ALIAS=_MAIN_TABLE_ALIAS, | ||
CALL_COLUMN=ColumnKeyConstants.CALLS, | ||
CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS) | ||
cp_query += ' LIMIT 1' # We need this table only to extract its schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will be used in any other method in this module. So for now, if you don't mine, I will keep it here.
CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS) | ||
cp_query += ' LIMIT 1' # We need this table only to extract its schema. | ||
self._copy_to_flatten_table(full_output_table_id, cp_query) | ||
logging.info('A new table with 1 row was crated: %s', full_output_table_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
) * Enable users to store sample look up optimized tables + Add input flag and validate its value. + Add BQ queries to flatten call column. + Extract the schema of the flatten table. + Add unit tests to verify the correctness of extracted schema. * First round of comments * Sync with googlegenomics#596 * Second round of comments
This PR includes the following changes:
call_sample_id
column.We will add integration tests in a follow up PR.