Adds large tests to integration tests #132
Conversation
Pull Request Test Coverage Report for Build 407
💛 - Coveralls
Thanks, Bashir! Looks good! RE quota: We should have a pretty large quota for our test project in the 'us-central1' region (the default region). However, I noticed that we run the 'pipelines runner' in us-west1, so perhaps you are hitting quotas there. I requested an increase in that region as well.
},
{
    "query": ["SUM_START_QUERY"],
    "expected_result": {"sum_start": 24107891293483855}
huh, I did not know BQ can actually handle such large sums! cool!
Indeed :-)
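As a hedged aside (not part of the original thread): the asserted value fits easily within BigQuery's signed 64-bit INT64 range, which is why the SUM_START_QUERY assertion works without overflow. A quick illustrative check in Python:

```python
# Illustrative check only (not from the PR): the expected sum is well within
# BigQuery's signed 64-bit INTEGER (INT64) range, so SUM() does not overflow.
expected_sum = 24107891293483855  # value asserted by SUM_START_QUERY
int64_max = 2 ** 63 - 1           # 9223372036854775807

assert expected_sum < int64_max
print(int64_max // expected_sum)  # 382 -- plenty of headroom
```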
"assertion_configs": [ | ||
{ | ||
"query": ["NUM_ROWS_QUERY"], | ||
"expected_result": {"num_rows": 309551691} |
Actually, I just noticed that this should be updated, as it's from the no-merge option.
Right, I forgot to update it, and for this case it never reached table validation before. Thanks for pointing it out.
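As a hedged sketch of how assertion configs like these could be checked (this is my illustration, not the project's actual test harness; the query templates, column names, and the validate_table helper are assumptions):

```python
# Illustrative sketch only: runs each assertion query against the output table
# and compares the first result row to expected_result. The query templates
# and column names below are assumptions, not the project's actual harness.
from google.cloud import bigquery

# Hypothetical query templates keyed by the names used in the config.
QUERIES = {
    "NUM_ROWS_QUERY": "SELECT COUNT(*) AS num_rows FROM `{table}`",
    "SUM_START_QUERY": "SELECT SUM(start_position) AS sum_start FROM `{table}`",
}

def validate_table(table, assertion_configs):
    """Runs each assertion query and checks it against expected_result."""
    client = bigquery.Client()
    for config in assertion_configs:
        for query_name in config["query"]:
            sql = QUERIES[query_name].format(table=table)
            row = list(client.query(sql).result())[0]
            for column, expected in config["expected_result"].items():
                actual = row[column]
                assert actual == expected, (
                    f"{query_name}: expected {column}={expected}, got {actual}")
```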
Re. quota issues, the error I got was a disk-space quota exceeded error in us-central1. It actually succeeded with a reduced number of workers (5*16) this morning (after I sent the PR) but took 38 minutes (just for the Dataflow job; ~1 hour total). The error did not make total sense to me because the disk quota I could see in the dashboard was an order of magnitude higher, and I was too tired :-)
I will rerun with more workers and check again.
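A rough, hedged back-of-the-envelope estimate (my assumption about Dataflow defaults, not something stated in the PR): each batch worker attaches its own persistent disk, so regional Persistent Disk quota can be exhausted before CPU quota when scaling out:

```python
# Back-of-the-envelope sketch (assumed defaults, not from the PR): each
# Dataflow batch worker attaches one persistent disk, so total PD usage grows
# linearly with the worker count and can hit the regional disk quota first.
workers = 20                 # the max_num_workers used for the large test
disk_gb_per_worker = 250     # assumed default batch-worker disk size
print(workers * disk_gb_per_worker)  # 5000 GB of persistent disk requested
```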
Update: This actually succeeded with 20*16 workers in ~26 minutes (this is without image building, since I use --skip_build, but even with that it should be under half an hour, which I think is reasonable to run once for every commit). I am rerunning to check if there is a flaky issue, but it is quite possible that in the early morning hours I was too tired and misinterpreted the error. If I cannot reproduce the quota-related issue in a few runs, I would like to commit this as is. PTAL.
Thanks! SG!
"input_pattern": "gs://genomics-public-data/platinum-genomes/vcf/*.vcf", | ||
"runner": "DataflowRunner", | ||
"worker_machine_type": "n1-standard-16", | ||
"max_num_workers": "20", |
Actually, I wonder if we can get faster performance with num_workers instead of max_num_workers (autoscaling adds several minutes).
The reason I chose max_num_workers is that using num_workers alone tends to produce more variability in both run-time and resource usage. I guess part of the reason is that the actual number of workers can go higher (for example, check this run with num_workers=20 but no max_num_workers).
I want this to be fairly consistent in run-time so that performance metric graphs/alerts become more meaningful. So I ended up experimenting with setting both max_num_workers and num_workers to get both extra speed-up and fair consistency. The new config uses more resources (e.g., CPU), but that is expected because auto-scaling does a better job.
A platinum merge run with the new config took ~19 mins.
A platinum merge run with only max_num_workers took ~24 mins.
PTAL.
Hi Bashir (@bashir2) and Asha (@arostamianfar),
Though I'm sure you already know this: if you specify only one of num_workers or max_num_workers (but not both), the Dataflow service will actually set the unspecified one, as noted here (0 is the equivalent of unspecified for num_workers):
The link to this part of the documentation is the following:
You will need to inspect what the current number of workers for the job is with getCurrentNumWorkers():
Hope it helps,
Paul
Thanks Paul (@pgrosu) for your comment. Yes, I think setting one of these and leaving the other unset will leave the choice to the autoscaling algorithm. One of the use cases for these large tests is to monitor the performance of the pipeline and catch regressions quickly. For that to work reliably, I am thinking that maybe we should even turn autoscaling off. But I think setting both of these to the same value has a similar effect.
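To make the flag discussion concrete, here is an illustrative sketch (not the actual test config) of Dataflow pipeline options in Python that pin the worker count, either by setting num_workers equal to max_num_workers or by disabling autoscaling outright; the project, region, and bucket values are placeholders:

```python
# Illustrative sketch only (not the test config itself): Dataflow pipeline
# options that pin the worker count so run times stay comparable across runs.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-test-project",           # placeholder
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--worker_machine_type=n1-standard-16",
    "--num_workers=20",                    # start at 20 workers...
    "--max_num_workers=20",                # ...and never scale past 20
    # or turn autoscaling off entirely:
    # "--autoscaling_algorithm=NONE",
])
```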
Interesting! I did not realize that setting just num_workers still leaves autoscaling enabled. We should probably update our docs too.
Hi Bashir and Asha,
I totally agree that auto-scaling must be off, as that is zone-specific overhead that can be very unpredictable, which pre-allocation solves. Regarding the Dataflow performance metric: why is it taking 19 minutes? That seems kind of long for 30 GB of (compressed) data. What does the Dataflow Monitor analysis indicate as a bottleneck? Are there shuffling opportunities or fusion possibilities for your execution graph? Are the cores maximally utilized? Are any I/O opportunities being overlooked? With 320 cores (20*16) and 1,200 GB RAM (20*60) this should go much faster than 26 minutes. Take a look at my analysis of 1000 Genomes with Dion here, which could be done in seconds, especially since your transfer rate is quite fast within a datacenter (zone):
googlegenomics/utils-java#62 (comment)
Regarding checking any regression cases, maybe utilizing the GATK test cases from here might help:
https://github.com/broadinstitute/gatk/tree/master/src/test
Others, which I'm sure you already know, can be:
NCBI:
- ftp://ftp.ncbi.nih.gov/snp/organisms/
- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/
EBI:
- ftp://ftp.ensembl.org/pub/release-91/variation/vcf/
I'm not sure combining large-scale performance testing and regression testing is optimal, since both can be significant rabbit holes, and limiting the analysis to one variable while building up the pipeline might save you unexpected headaches. Some scale-up benchmarks for different zones at different times of the day might also help users, but that could be quite a lot of work to perform.
Hope it helps,
Paul
Thanks for mentioning the thread, Paul (@pgrosu). If I understand the reference pipeline correctly, I believe the main difference between the VT pipeline and the one you referenced is that we are not just reading data from VCF; we are also performing several transformations, like parsing all fields in each VCF record, merging variants across files, and writing to BQ. Just reading records from VCF (i.e., treating them as a 'text' file) is much faster and can be comparable to the pipeline you referenced (the Reads/Variants APIs already contained data in a parsed/merged form).
Having said that, we haven't yet enabled Cython (#110) and haven't really done any optimizations, so our code can definitely run faster.
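For readers following along, here is a heavily simplified, hypothetical sketch of the stages described above (read VCF lines, parse fields, merge variants across files, write to BigQuery). The parsing/merging logic, table name, and schema are placeholders, not the actual gcp-variant-transforms code:

```python
# Schematic sketch only -- NOT the actual gcp-variant-transforms pipeline.
# It mirrors the stages described above: read VCF lines, parse each record,
# merge records that share a locus, and write rows to BigQuery.
import apache_beam as beam

def parse_vcf_line(line):
    """Placeholder parser: turn one VCF data line into a dict of fields."""
    fields = line.split("\t")
    return {
        "reference_name": fields[0],
        "start_position": int(fields[1]),  # placeholder; real code handles offsets
        "names": fields[2],
    }

def merge_variants(kv):
    """Placeholder merge: collapse records that share (chrom, pos)."""
    _locus, records = kv
    merged = dict(list(records)[0])
    # ... real merging would reconcile calls/INFO fields across files ...
    return merged

with beam.Pipeline() as p:
    _ = (
        p
        | "ReadVcf" >> beam.io.ReadFromText(
            "gs://genomics-public-data/platinum-genomes/vcf/*.vcf")
        | "DropHeaders" >> beam.Filter(lambda line: not line.startswith("#"))
        | "Parse" >> beam.Map(parse_vcf_line)
        | "KeyByLocus" >> beam.Map(
            lambda r: ((r["reference_name"], r["start_position"]), r))
        | "GroupByLocus" >> beam.GroupByKey()
        | "Merge" >> beam.Map(merge_variants)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.variants",  # placeholder output table
            schema="reference_name:STRING,start_position:INTEGER,names:STRING")
    )
```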
Hi Asha (@arostamianfar), you might be surprised, but your pipeline can actually be fast even without Cython, using branching/loop optimizations. Since you have so much memory, most of the data can just be mapped there even without an LRU optimization. Dataflow will do a lot for you, but sometimes it needs guidance. Parsing is equivalent to reading text files, and merging (sort) is O(n*lg(n)) in the worst case. So you can still come close to my scenario :)
Interesting! We'll definitely consider this when we get to optimizing the pipeline! For now, we've been mostly occupied with feature development and foundational tasks like integration testing, as the majority of users are OK with the current performance but want a robust pipeline.
Thanks so much for the feedback and comments though! Please do continue to engage with us as we're building VT. We really appreciate it!
Glad to help out :)
Tested:
Manually ran deploy_and_run_tests.sh --include_large_tests
Issue: #99