
Adds large tests to integration tests #132

Merged (4 commits into googlegenomics:master from large_test on Mar 9, 2018)

Conversation

@bashir2 (Member) commented Mar 8, 2018

Tested:
Manually ran deploy_and_run_tests.sh --include_large_tests

Issue: #99

@coveralls commented Mar 8, 2018

Pull Request Test Coverage Report for Build 407

  • 1 of 8 (12.5%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.1%) to 90.082%

Changes missing coverage (covered lines / changed or added lines, %):
  • gcp_variant_transforms/testing/integration/run_tests.py: 1 / 8 (12.5%)

Files with coverage reduction (new missed lines, resulting %):
  • gcp_variant_transforms/testing/integration/run_tests.py: 2 (26.09%)

Totals:
  • Change from base Build 404: -0.1%
  • Covered Lines: 3179
  • Relevant Lines: 3529

💛 - Coveralls

@arostamianfar (Contributor) previously approved these changes on Mar 8, 2018 and left a comment:

Thanks, Bashir! Looks good! Re quota: we should have a pretty large quota for our test project in the 'us-central1' region (the default region). However, I noticed that we run the 'pipelines runner' in us-west1, so perhaps you are hitting quotas there. I have requested an increase in that region as well.

},
{
"query": ["SUM_START_QUERY"],
"expected_result": {"sum_start": 24107891293483855}
@arostamianfar (Contributor):

Huh, I did not know BQ could actually handle such large sums! Cool!

@bashir2 (Member, Author):

Indeed :-)

"assertion_configs": [
{
"query": ["NUM_ROWS_QUERY"],
"expected_result": {"num_rows": 309551691}
@arostamianfar (Contributor):

Actually, I just noticed that this should be updated, since it is from the no-merge option.

@bashir2 (Member, Author):

Right, I forgot to update it; in this case it never reached table validation before. Thanks for pointing it out.
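For context, a minimal sketch of how an assertion entry like those above could be checked against BigQuery, assuming the standard google-cloud-bigquery client. This is illustrative only; the actual run_tests.py logic and query templates may differ, and the query text and project name below are hypothetical placeholders.

from google.cloud import bigquery

def check_assertion(client, query, expected_result):
    # Run the query and verify that the single result row matches expected_result.
    rows = list(client.query(query).result())
    assert len(rows) == 1, 'Expected exactly one result row, got %d' % len(rows)
    row = dict(rows[0].items())
    for key, expected in expected_result.items():
        assert row[key] == expected, (
            '%s: expected %s, got %s' % (key, expected, row[key]))

# Example usage with the expected value from this PR's config (query and
# project are hypothetical):
# client = bigquery.Client(project='test-project')
# check_assertion(client,
#                 'SELECT SUM(start_position) AS sum_start FROM `dataset.table`',
#                 {'sum_start': 24107891293483855})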

@bashir2 (Member, Author) left a comment:

Re: quota issues, the error I got was a disk-space quota exceeded error in us-central1. The run actually succeeded with the reduced number of workers (5*16) this morning (after I sent the PR), but it took 38 minutes (just for the Dataflow job; ~1 hour total). The error did not make total sense to me, because the disk quota I could see in the dashboard was an order of magnitude higher, and I was too tired :-)
I will rerun with more workers and check again.


@bashir2 (Member, Author) commented Mar 8, 2018

Update: this actually succeeded with 20*16 workers in ~26 minutes (this is without image building, since I use --skip_build, but even with that it should be under half an hour, which I think is reasonable to run once for every commit). I am rerunning to check whether there is a flakiness issue, but it is quite possible that in the early morning hours I was too tired and misinterpreted the error. If I cannot reproduce the quota-related issue in a few runs, I would like to commit this as is.

PTAL.

@arostamianfar (Contributor) previously approved these changes on Mar 8, 2018 and left a comment:

Thanks! SG!

"input_pattern": "gs://genomics-public-data/platinum-genomes/vcf/*.vcf",
"runner": "DataflowRunner",
"worker_machine_type": "n1-standard-16",
"max_num_workers": "20",
@arostamianfar (Contributor):

Actually, I wonder if we can get faster performance with num_workers instead of max_num_workers (autoscaling adds several minutes).

@bashir2 (Member, Author), Mar 9, 2018:

The reason I chose max_num_workers is that using num_workers alone tends to produce more variability in both run-time and resource usage. I guess part of the reason is that the actual number of workers can go higher (for example, check this run with num_workers=20 but no max_num_workers).

I want this to be fairly consistent in terms of run-time so that performance metric graphs/alerts become more meaningful. So I ended up experimenting with setting both max_num_workers and num_workers, to get both extra speed-up and fair consistency. The new config uses more resources (e.g., CPU), but that is expected because auto-scaling does a better job.

  • A platinum merge run with the new config took ~19 mins.
  • A platinum merge run with only max_num_workers took ~24 mins.

PTAL.
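For reference, a minimal sketch of how the pinned worker count described above maps onto standard Beam/Dataflow pipeline flags. This is an illustration only, not the exact contents of the test config; the helper function itself is hypothetical.

# Hedged sketch of a hypothetical helper; the flag names are standard
# Beam/Dataflow pipeline options.
def build_pipeline_args(num_workers=20, machine_type='n1-standard-16'):
    return [
        '--runner=DataflowRunner',
        '--worker_machine_type=%s' % machine_type,
        # Setting num_workers and max_num_workers to the same value limits
        # autoscaling-induced variability in run-time and resource usage.
        '--num_workers=%d' % num_workers,
        '--max_num_workers=%d' % num_workers,
    ]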

@pgrosu:

Hi Bashir (@bashir2) and Asha (@arostamianfar),

Though I'm sure you already know this: if you specify only one of num_workers and max_num_workers (but not both), the Dataflow service will set the unspecified one, as noted here (0 is the equivalent of unspecified for num_workers):

https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options

To inspect the current number of workers for a job (currentNumWorkers), you can use getCurrentNumWorkers():

https://developers.google.com/resources/api-libraries/documentation/dataflow/v1b3/java/latest/com/google/api/services/dataflow/model/AutoscalingEvent.html

Hope it helps,
Paul

@bashir2 (Member, Author):

Thanks, Paul (@pgrosu), for your comment. Yes, I think setting one of these and leaving the other unset leaves the choice to the autoscaling algorithm. One of the use cases for these large tests is to monitor the performance of the pipeline and catch regressions quickly. For that to work reliably, I am thinking that maybe we should even turn autoscaling off, but I think setting both of these to the same value has a similar effect.
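For reference, disabling autoscaling explicitly is supported through a standard Dataflow flag. A minimal sketch using Beam pipeline options follows; the values mirror the config discussed above, and this is an illustration rather than the project's actual code.

from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative only: pin the worker count and disable autoscaling for
# reproducible run-times.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--worker_machine_type=n1-standard-16',
    '--num_workers=20',
    '--max_num_workers=20',
    '--autoscaling_algorithm=NONE',
])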

@arostamianfar (Contributor):

Interesting! I did not realize that just setting num_workers can still use autoscaling as well. We should probably update our docs too.

@pgrosu:

Hi Bashir and Asha,

I totally agree that auto-scaling must be off, as that is zone-specific overhead which can be very unpredictable and which pre-allocation solves. Regarding the Dataflow performance metric: why is it taking 19 minutes? That seems kind of long for 30 GB of (compressed) data. What does the Dataflow Monitor analysis indicate as the bottleneck? Are there shuffling opportunities or fusion possibilities for your execution graph? Are the cores maximally utilized? Which I/O opportunities are being overlooked? With 320 cores (20*16) and 1,200 GB of RAM (20*60), this should go much faster than 26 minutes. Take a look at my analysis of 1000 Genomes with Dion here, which could be done in seconds, especially since your transfer rate is quite fast within a datacenter (zone):

googlegenomics/utils-java#62 (comment)

Regarding checking any regression cases, maybe utilizing the GATK test cases from here might help:

https://github.com/broadinstitute/gatk/tree/master/src/test

Other sources, which I'm sure you already know, can be:

NCBI:

  • ftp://ftp.ncbi.nih.gov/snp/organisms/
  • ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/

EBI:

  • ftp://ftp.ensembl.org/pub/release-91/variation/vcf/

I'm not sure that combining large-scale performance testing and regression testing is an optimal combination, since both can be significant rabbit holes; limiting the analysis to one variable while building up the pipeline might save you unexpected headaches. Also, some scale-up benchmarks for different zones at different times of the day might help users, but that could be quite a lot of work to perform.

Hope it helps,
Paul

@arostamianfar (Contributor):

Thanks for mentioning the thread, Paul (@pgrosu). If I understand the reference pipeline correctly, I believe the main difference between the VT pipeline and the one you referenced is that we are not just reading data from VCF files; we are also performing several transformations, like parsing all fields in each VCF record, merging variants across files, and writing to BQ. Just reading records from VCF (i.e. treating them as a 'text' file) is much faster and can be comparable to the pipeline you referenced (the Reads/Variants APIs already contained data in a parsed/merged form).
Having said that, we haven't yet enabled Cython (#110) and haven't really done any optimizations, so our code can definitely run faster.
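For concreteness, a highly simplified Beam sketch of the stages described above. This is illustrative only and is not the actual Variant Transforms code; parse_vcf_line and merge_to_bq_row are hypothetical helpers, and BigQuery schema handling is omitted.

import apache_beam as beam

def build_pipeline(p, input_pattern, output_table):
    # Read VCF lines as text, parse each record, group variants that share a
    # key across files, merge each group, and write the rows to BigQuery.
    return (
        p
        | 'ReadVcfLines' >> beam.io.ReadFromText(input_pattern)
        | 'DropHeaders' >> beam.Filter(lambda line: not line.startswith('#'))
        | 'ParseRecord' >> beam.Map(parse_vcf_line)  # hypothetical parser
        | 'KeyByPosition' >> beam.Map(
            lambda v: ((v['reference_name'], v['start']), v))
        | 'GroupVariants' >> beam.GroupByKey()
        | 'MergeToBqRow' >> beam.Map(merge_to_bq_row)  # hypothetical merger
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(output_table)
    )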

@pgrosu:

Hi Asha (@arostamianfar), you might be surprised, but your pipeline can actually be fast even without Cython, with branching/loop optimizations. Since you have so much memory, most of the data can just be mapped there, even without an LRU optimization. Dataflow will do a lot for you, but sometimes it needs guidance. Parsing is equivalent to reading text files, and merging (sorting) is O(n log n) in the worst case. So you can still come close to my scenario :)

@arostamianfar (Contributor):

Interesting! We'll definitely keep this in mind when we get to optimizing the pipeline. For now, we've been mostly occupied with feature development and foundational tasks like integration testing, as the majority of users are OK with the current performance but want a robust pipeline.
Thanks so much for the feedback and comments, though! Please do continue to engage with us as we're building VT. We really appreciate it!

@pgrosu:

Glad to help out :)

@bashir2 merged commit 28e5817 into googlegenomics:master on Mar 9, 2018
@bashir2 deleted the large_test branch on March 9, 2018 at 17:53