Poor performance of reading from GS buckets #1755
Comments
It's not a surprise that a local disk is faster than a remote one, but the
magnitude of the difference is a lot more than I would expect. I remember
at the time using direct GCS access to get the best possible performance in
the bit I was working on, but I don't remember exactly how much of a
difference it made.
From my desktop it takes 2m25s to download the whole file, so the ~6min
difference seems really excessive; something is broken. One thing to look
into is whether the sharding is working correctly (are we getting the
correct number of parallel downloads?).
Presumably this code is using the HDFS adapter. It'll be interesting to
compare against the NIO version (and then the optimized NIO version once I
write it).
|
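(For context on the NIO path mentioned above: with the google-cloud-nio provider on the classpath, a gs:// URI can be opened through the standard java.nio APIs. The sketch below is only a minimal illustration of that read path, not the GATK code; the bucket and object names are made up.)

import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NioGcsRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical object; the "gs" scheme resolves once google-cloud-nio is on the classpath.
        Path path = Paths.get(URI.create("gs://my-bucket/large-input.bam"));
        long total = 0;
        try (SeekableByteChannel chan = Files.newByteChannel(path)) {
            ByteBuffer buf = ByteBuffer.allocate(8 * 1024 * 1024);
            int n;
            while ((n = chan.read(buf)) >= 0) {
                total += n;
                buf.clear();
            }
        }
        System.out.println("Read " + total + " bytes");
    }
}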
@akiezun Can you determine whether you're using the HDFS -> GCS adapter in your test case? The adapter historically did have performance problems of this magnitude. As @jean-philippe-martin mentioned, we should benchmark the new NIO -> GCS support as well. |
how? |
It's not clear to me what code path you're going through when using a |
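(One way to answer the "which adapter" question, assuming the job runs with a normal Hadoop Configuration: ask Hadoop which FileSystem implementation it resolves for the gs:// scheme. This is just a sketch with a made-up bucket name; on Dataproc the answer is normally the GCS connector's GoogleHadoopFileSystem.)

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WhichGsFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Prints whatever class the gs:// scheme is mapped to (e.g. via fs.gs.impl).
        FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf);
        System.out.println("gs:// is handled by " + fs.getClass().getName());
    }
}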
Could this be related to having sliced objects in the gsutil buckets but not using a code path that goes through a native CRC implementation? I ask because I noticed that when I try to download the file
with gsutil, I get this error:
Could the GATK command path be computing all of the CRC hashes in Java code, slowing it down? |
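(If we wanted to check how expensive a pure-Java CRC32C pass is, compared with gsutil's native crcmod, a minimal sketch using Guava's CRC32C implementation is below; the local file name is hypothetical and this is not the GATK code path, just a way to time the hash on its own.)

import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TimeCrc32c {
    public static void main(String[] args) throws Exception {
        long startMs = System.currentTimeMillis();
        Hasher hasher = Hashing.crc32c().newHasher();
        // Hypothetical local copy of the downloaded object.
        try (InputStream in = Files.newInputStream(Paths.get("coverageRefScore-chr1-GS000015172-ASM.tsv.bz2"))) {
            byte[] buf = new byte[1 << 20];
            int n;
            while ((n = in.read(buf)) > 0) {
                hasher.putBytes(buf, 0, n);
            }
        }
        System.out.println("crc32c=" + hasher.hash() + " in " + (System.currentTimeMillis() - startMs) + " ms");
    }
}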
Did we change the name of the files since the initial bug was filed? Because my earlier comment talks about downloading to my desktop in 2.5min, but retrying now it takes about an hour! The file is 34.56 GiB. I guess that's what I get for moving to a different office. Assuming this time is correct, I see that NIO (when multithreaded) matches gsutil performance. Output below. gsutil:
NIO (via ParallelCountBytes)
(3354s is 55min54s) Next step: try on a Google Compute Engine computer, to get datacenter speeds. |
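(The "multithreaded NIO" measurement above comes from ParallelCountBytes; the sketch below is not that code, only a minimal illustration of the same idea: several workers each open their own channel against the gs:// path and read disjoint byte ranges. Bucket, object, and thread count are made up.)

import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelGcsRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical object; the gs:// scheme assumes google-cloud-nio is on the classpath.
        Path path = Paths.get(URI.create("gs://my-bucket/large-input.bam"));
        long size = Files.size(path);
        int threads = 8;
        long chunk = (size + threads - 1) / threads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final long start = i * chunk;
            final long end = Math.min(start + chunk, size);
            parts.add(pool.submit(() -> {
                long count = 0;
                // Each worker opens its own channel and reads a disjoint byte range.
                try (SeekableByteChannel chan = Files.newByteChannel(path)) {
                    chan.position(start);
                    ByteBuffer buf = ByteBuffer.allocate(4 * 1024 * 1024);
                    while (count < end - start) {
                        buf.clear();
                        buf.limit((int) Math.min(buf.capacity(), end - start - count));
                        int n = chan.read(buf);
                        if (n < 0) break;
                        count += n;
                    }
                }
                return count;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get();
        pool.shutdown();
        System.out.println("Read " + total + " bytes using " + threads + " threads");
    }
}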
I tried from a GCE instance and got 6m12s for the gsutil copy, and 5m21s for the NIO code. So we know that using the NIO code would match the gsutil performance (in this case, about 100MB/s). As to the original question of how this translates to Spark performance, well, this test just fails to prove that GCS/NIO are too slow; more investigation is needed. The cluster we're using has 10 machines, so it may be able to run up to 10x faster than this single-machine test, i.e. only about 30s to load the data via NIO. Of course, the program does more than that, and we still have to demonstrate that we're not going to bottleneck the GCS servers. |
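(Sanity-checking the ~100MB/s figure: 34.56 GiB is about 37.1 GB, so 37.1 GB over 372 s (6m12s) is roughly 100 MB/s for gsutil, and over 321 s (5m21s) roughly 115 MB/s for the NIO code.)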
It all depends on the network you are running from, as noted in the following discussion: googlegenomics/utils-java#9 (comment) So the closer you are to the data the better, for example going through GCE and launching the GCE instance in a zone relatively close to the data, as in this run:

$ gsutil ls -l gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2
1563675749 2014-04-24T20:26:25Z gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2
TOTAL: 1 objects, 1563675749 bytes (1.46 GiB)
$
$ time(gsutil cp -L transfer_statistics.txt gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2 . )
Copying gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2...
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
WARNING: Found no hashes to validate object downloaded to ./coverageRefScore-chr1-GS000015172-ASM.tsv.bz2. Integrity cannot be assured without hashes.
real 0m31.112s
user 0m25.286s
sys 0m21.582s
$

Hope it helps, |
So anyway, given that it takes 6min to download the file, the time to download + run on HDFS would be 7.15min, vs the listed time of 7.5min to run directly on GCS. That doesn't sound so bad, does it? |
Not bad, but why can't the BAM be split as multiple objects in the same bucket, where the directory is the name of the BAM? I was having this discussion with Dion in the following thread: googlegenomics/utils-java#62 (comment) You can have a folder in the GS location be the name of the BAM, and even sort the shards like a distributed B-tree. This way you can even process reads simultaneously as new data streams in from the GS location. The Google disk IOPS are as follows, based on this link: https://cloud.google.com/compute/docs/disks/performance#type_comparison
So it all depends on the perspective of what folks prefer; in this case it means we can minimize the 6 min component. Then comes the 1.5 min portion of HDFS, which can occur in parallel and could also be memory-mapped and/or SSD-backed. So there are still ways to improve the access and processing time, but it depends on how fast - or how instantaneous - folks want the results processed and returned. |
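(To make the "one folder per BAM, one object per shard" idea concrete: with the google-cloud-storage client, workers could list the shard objects under the BAM's prefix and fan them out for parallel reads. This is only a sketch; the bucket name, prefix, and layout are hypothetical.)

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.Storage.BlobListOption;
import com.google.cloud.storage.StorageOptions;

public class ListBamShards {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        // Hypothetical layout: gs://my-bucket/NA12878.bam/shard-00000, shard-00001, ...
        for (Blob shard : storage.list("my-bucket", BlobListOption.prefix("NA12878.bam/")).iterateAll()) {
            System.out.println(shard.getName() + "\t" + shard.getSize() + " bytes");
        }
    }
}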
I vote to close. As per my earlier remark, performance reading from GCS buckets (at least in the case highlighted in this ticket) is just fine. The root cause was that the comparison neglected the 6min it takes to place the GCS file onto the cluster. |
@jean-philippe-martin Will files always be preemptively placed on GCS, so this time-delta will not be experienced? |
Latest results show that GCS buckets perform even better than fine. Closing this issue. |
@jean-philippe-martin This was referring to gs:// inputs in Spark, wasn't it? I think we still have work to do on that, don't we? |
We do, but not on performance. The performance reading from GCS buckets is just fine, when we take into account the time it would otherwise take to copy the data over from the bucket. |
Running on a cluster created by Dataproc, the GATK Spark tools run much faster on HDFS than on files stored on GCS:
HDFS 1.15 minutes
GCS 7.50 minutes
@lbergelson @jean-philippe-martin is this a known thing? Is this expected?
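(For reference, the HDFS-vs-GCS comparison above is between the same Spark read going through two different Hadoop FileSystems. The snippet below is not the GATK code path, just a generic illustration that only the URI scheme changes; the paths are hypothetical.)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class GsVersusHdfsRead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("gs-vs-hdfs");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Same Spark API; the GCS connector serves gs://, the cluster's DataNodes serve hdfs://.
            long fromGcs = sc.textFile("gs://my-bucket/reads.sam").count();
            long fromHdfs = sc.textFile("hdfs:///data/reads.sam").count();
            System.out.println("gs lines=" + fromGcs + ", hdfs lines=" + fromHdfs);
        }
    }
}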