Adds large tests to integration tests #132
Conversation
Pull Request Test Coverage Report for Build 407
💛 - Coveralls
Thanks, Bashir! Looks good! RE quota: We should have a pretty large quota for our test project in the 'us-central1' region (the default region). However, I noticed that we run the 'pipelines runner' in us-west1, so perhaps you are hitting quotas there. I requested an increase in that region as well.
},
{
    "query": ["SUM_START_QUERY"],
    "expected_result": {"sum_start": 24107891293483855}
huh, I did not know BQ can actually handle such large sums! cool!
Indeed :-)
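As a hedged aside (not part of the original thread): the asserted value fits easily within BigQuery's signed 64-bit INT64 range, which is why the SUM_START_QUERY assertion works without overflow. A quick illustrative check in Python:

```python
# Illustrative check only (not from the PR): the expected sum is well within
# BigQuery's signed 64-bit INTEGER (INT64) range, so SUM() does not overflow.
expected_sum = 24107891293483855  # value asserted by SUM_START_QUERY
int64_max = 2 ** 63 - 1           # 9223372036854775807

assert expected_sum < int64_max
print(int64_max // expected_sum)  # 382 -- plenty of headroom
```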
"assertion_configs": [ | ||
{ | ||
"query": ["NUM_ROWS_QUERY"], | ||
"expected_result": {"num_rows": 309551691} |
Actually, I just noticed that this should be updated, as it's from the no-merge option.
Right, I forgot to update it, and for this case it never reached table validation before. Thanks for pointing it out.
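As a hedged sketch of how assertion configs like these could be checked (this is my illustration, not the project's actual test harness; the query templates, column names, and the validate_table helper are assumptions):

```python
# Illustrative sketch only: runs each assertion query against the output table
# and compares the first result row to expected_result. The query templates
# and column names below are assumptions, not the project's actual harness.
from google.cloud import bigquery

# Hypothetical query templates keyed by the names used in the config.
QUERIES = {
    "NUM_ROWS_QUERY": "SELECT COUNT(*) AS num_rows FROM `{table}`",
    "SUM_START_QUERY": "SELECT SUM(start_position) AS sum_start FROM `{table}`",
}

def validate_table(table, assertion_configs):
    """Runs each assertion query and checks it against expected_result."""
    client = bigquery.Client()
    for config in assertion_configs:
        for query_name in config["query"]:
            sql = QUERIES[query_name].format(table=table)
            row = list(client.query(sql).result())[0]
            for column, expected in config["expected_result"].items():
                actual = row[column]
                assert actual == expected, (
                    f"{query_name}: expected {column}={expected}, got {actual}")
```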
Re. quota issues, the error I got was a disk-space quota exceeded error in us-central1. It actually succeeded with a reduced number of workers (5*16) this morning (after I sent the PR) but took 38 minutes (just for the Dataflow job; ~1 hour total). The error did not make total sense to me because the disk quota I could see in the dashboard was an order of magnitude higher, and I was too tired :-)
I will rerun with more workers and check again.
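A rough, hedged back-of-the-envelope estimate (my assumption about Dataflow defaults, not something stated in the PR): each batch worker attaches its own persistent disk, so regional Persistent Disk quota can be exhausted before CPU quota when scaling out:

```python
# Back-of-the-envelope sketch (assumed defaults, not from the PR): each
# Dataflow batch worker attaches one persistent disk, so total PD usage grows
# linearly with the worker count and can hit the regional disk quota first.
workers = 20                 # the max_num_workers used for the large test
disk_gb_per_worker = 250     # assumed default batch-worker disk size
print(workers * disk_gb_per_worker)  # 5000 GB of persistent disk requested
```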
Update: This actually succeeded with 20*16 workers in ~26 minutes (this is without image building, since I use --skip_build, but even with that it should be under half an hour, which I think is reasonable to run once for every commit). I am rerunning to check if there is a flaky issue, but it is quite possible that in the early morning hours I was too tired and misinterpreted the error. If I cannot reproduce the quota-related issue in a few runs, I would like to commit this as is. PTAL.
Thanks! SG!
"input_pattern": "gs://genomics-public-data/platinum-genomes/vcf/*.vcf", | ||
"runner": "DataflowRunner", | ||
"worker_machine_type": "n1-standard-16", | ||
"max_num_workers": "20", |
Actually, I wonder if we can get faster performance with num_workers instead of max_num_workers (autoscaling adds several minutes).
The reason I chose max_num_workers is that using num_workers alone tends to produce more variability in both run-time and resource usage. I guess part of the reason is that the actual number of workers can go higher (for example, check this run with num_workers=20 but no max_num_workers).
I want this to be fairly consistent in run-time so that performance metric graphs/alerts become more meaningful. So I ended up experimenting with setting both max_num_workers and num_workers to get both extra speed-up and fair consistency. The new config uses more resources (e.g., CPU), but that is expected because auto-scaling does a better job.
A platinum merge run with the new config took ~19 mins.
A platinum merge run with only max_num_workers took ~24 mins.
PTAL.
Hi Bashir (@bashir2) and Asha (@arostamianfar),
Though I'm sure you already know this: if you specify only one of num_workers or max_num_workers (but not both), the Dataflow service will actually set the unspecified one, as noted here (0 is the equivalent of unspecified for num_workers):
The link to this part of the documentation is the following:
You will need to inspect what the current number of workers for the job is with getCurrentNumWorkers():
Hope it helps,
Paul
Thanks Paul (@pgrosu) for your comment. Yes, I think setting one of these and leaving the other unset will leave the choice to the autoscaling algorithm. One of the use cases for these large tests is to monitor the performance of the pipeline and catch regressions quickly. For that to work reliably, I am thinking that maybe we should even turn autoscaling off. But I think setting both of these to the same value has a similar effect.
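To make the flag discussion concrete, here is an illustrative sketch (not the actual test config) of Dataflow pipeline options in Python that pin the worker count, either by setting num_workers equal to max_num_workers or by disabling autoscaling outright; the project, region, and bucket values are placeholders:

```python
# Illustrative sketch only (not the test config itself): Dataflow pipeline
# options that pin the worker count so run times stay comparable across runs.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-test-project",           # placeholder
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--worker_machine_type=n1-standard-16",
    "--num_workers=20",                    # start at 20 workers...
    "--max_num_workers=20",                # ...and never scale past 20
    # or turn autoscaling off entirely:
    # "--autoscaling_algorithm=NONE",
])
```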
Interesting! I did not realize that setting just num_workers still leaves autoscaling enabled. We should probably update our docs too.
Hi Bashir and Asha,
I totally agree that auto-scaling must be off, as that is zone-specific overhead that can be very unpredictable, which pre-allocation solves. Regarding the Dataflow performance metric: why is it taking 19 minutes? That seems kind of long for 30 GB of (compressed) data. What does the Dataflow Monitor analysis indicate as a bottleneck? Are there shuffling opportunities or fusion possibilities for your execution graph? Are the cores maximally utilized? Are any I/O opportunities being overlooked? With 320 cores (20*16) and 1,200 GB RAM (20*60) this should go much faster than 26 minutes. Take a look at my analysis of 1000 Genomes with Dion here, which could be done in seconds, especially since your transfer rate is quite fast within a datacenter (zone):
googlegenomics/utils-java#62 (comment)
Regarding checking any regression cases, maybe utilizing the GATK test cases from here might help:
https://github.com/broadinstitute/gatk/tree/master/src/test
Others, which I'm sure you already know, can be:
NCBI:
- ftp://ftp.ncbi.nih.gov/snp/organisms/
- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/
EBI:
- ftp://ftp.ensembl.org/pub/release-91/variation/vcf/
I'm not sure combining large-scale performance testing and regression testing is optimal, since both can be significant rabbit holes, and limiting the analysis to one variable while building up the pipeline might save you unexpected headaches. Some scale-up benchmarks for different zones at different times of the day might also help users, but that could be quite a lot of work to perform.
Hope it helps,
Paul
Thanks for mentioning the thread, Paul (@pgrosu). If I understand the reference pipeline correctly, I believe the main difference between the VT pipeline and the one you referenced is that we are not just reading data from VCF; we are also performing several transformations, like parsing all fields in each VCF record, merging variants across files, and writing to BQ. Just reading records from VCF (i.e., treating them as a 'text' file) is much faster and can be comparable to the pipeline you referenced (the Reads/Variants APIs already contained data in a parsed/merged form).
Having said that, we haven't yet enabled Cython (#110) and haven't really done any optimizations, so our code can definitely run faster.
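For readers following along, here is a heavily simplified, hypothetical sketch of the stages described above (read VCF lines, parse fields, merge variants across files, write to BigQuery). The parsing/merging logic, table name, and schema are placeholders, not the actual gcp-variant-transforms code:

```python
# Schematic sketch only -- NOT the actual gcp-variant-transforms pipeline.
# It mirrors the stages described above: read VCF lines, parse each record,
# merge records that share a locus, and write rows to BigQuery.
import apache_beam as beam

def parse_vcf_line(line):
    """Placeholder parser: turn one VCF data line into a dict of fields."""
    fields = line.split("\t")
    return {
        "reference_name": fields[0],
        "start_position": int(fields[1]),  # placeholder; real code handles offsets
        "names": fields[2],
    }

def merge_variants(kv):
    """Placeholder merge: collapse records that share (chrom, pos)."""
    _locus, records = kv
    merged = dict(list(records)[0])
    # ... real merging would reconcile calls/INFO fields across files ...
    return merged

with beam.Pipeline() as p:
    _ = (
        p
        | "ReadVcf" >> beam.io.ReadFromText(
            "gs://genomics-public-data/platinum-genomes/vcf/*.vcf")
        | "DropHeaders" >> beam.Filter(lambda line: not line.startswith("#"))
        | "Parse" >> beam.Map(parse_vcf_line)
        | "KeyByLocus" >> beam.Map(
            lambda r: ((r["reference_name"], r["start_position"]), r))
        | "GroupByLocus" >> beam.GroupByKey()
        | "Merge" >> beam.Map(merge_variants)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.variants",  # placeholder output table
            schema="reference_name:STRING,start_position:INTEGER,names:STRING")
    )
```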
Hi Asha (@arostamianfar), you might be surprised, but your pipeline can actually be fast even without Cython, using branching/loop optimizations. Since you have so much memory, most of the data can just be mapped there even without an LRU optimization. Dataflow will do a lot for you, but sometimes it needs guidance. Parsing is equivalent to reading text files, and merging (sort) is O(n*lg(n)) in the worst case. So you can still come close to my scenario :)
Interesting! We'll definitely consider this when we get to optimizing the pipeline! For now, we've been mostly occupied with feature development and foundational tasks like integration testing, as the majority of users are OK with the current performance but want a robust pipeline.
Thanks so much for the feedback and comments though! Please do continue to engage with us as we're building VT. We really appreciate it!
Glad to help out :)
Tested:
Manually ran deploy_and_run_tests.sh --include_large_tests
Issue: #99