
Poor performance of reading from GS buckets #1755

Closed
akiezun opened this issue Apr 21, 2016 · 16 comments

@akiezun (Contributor) commented Apr 21, 2016

Running on a cluster created by Dataproc, the GATK Spark tools run much faster on files in HDFS than on files stored in GCS:

HDFS 1.15 minutes

./gatk-launch CountReadsSpark -I /user/akiezun/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam --apiKey <myAPIKEY> -- --sparkRunner GCS --cluster dataproc-cluster-3 --executor-cores 3 --executor-memory 25G --conf spark.yarn.executor.memoryOverhead=2500

GCS 7.50 minutes

./gatk-launch CountReadsSpark -I gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam --apiKey <myAPIKEY> -- --sparkRunner GCS --cluster dataproc-cluster-3 --executor-cores 3 --executor-memory 25G --conf spark.yarn.executor.memoryOverhead=2500

@lbergelson @jean-philippe-martin is this a known thing? Is this expected?

@jean-philippe-martin (Contributor) commented Apr 21, 2016 via email

@droazen (Contributor) commented Apr 21, 2016

@akiezun Can you determine whether you're using the HDFS -> GCS adapter in your test case? The adapter historically did have performance problems of this magnitude. As @jean-philippe-martin mentioned, we should benchmark the new NIO -> GCS support as well.

@akiezun (Contributor, Author) commented Apr 21, 2016

how?

@akiezun akiezun closed this as completed Apr 21, 2016
@akiezun akiezun reopened this Apr 21, 2016

@droazen (Contributor) commented Apr 21, 2016

It's not clear to me what code path you're going through when using a gs:// URI for the input BAM in your second test. CountReadsSpark calls GATKSparkTool.getReads(), which calls JavaSparkContext.newAPIHadoopFile(), but the question is how Hadoop-BAM handles your gs:// URI. In other parts of the GATK (e.g., ReferenceTwoBitSource) we call into BucketUtils.openFile(), which handles GCS URIs directly by calling into GcsUtil.open().
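
To make the two code paths concrete, here is a minimal sketch (my own illustration with placeholder names, not GATK's actual BucketUtils/Hadoop-BAM code) of dispatching on the URI scheme, sending gs:// paths through a GCS-aware NIO provider and everything else through Hadoop's FileSystem:

```java
// Illustrative sketch only: not GATK's actual implementation.
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class InputDispatchSketch {

    public static InputStream open(String input) throws IOException {
        if (input.startsWith("gs://")) {
            // Assumes the google-cloud-nio provider is on the classpath so that
            // java.nio can resolve gs:// URIs directly.
            return Files.newInputStream(Paths.get(URI.create(input)));
        }
        // hdfs://, file://, or bare local paths fall through to Hadoop.
        Path path = new Path(input);
        FileSystem fs = path.getFileSystem(new Configuration());
        return fs.open(path);
    }
}
```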

@cwhelan (Member) commented Jun 6, 2016

Could this be related to having sliced (composite) objects in the GCS bucket, but not using a code path that goes through a native CRC implementation? I ask because I noticed that when I try to download the file

gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam

with gsutil, I get this error:

CommandException: 
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see:

  $ gsutil help crcmod

To download regardless of crcmod performance or to skip slow integrity
checks, see the "check_hashes" option in your boto config file.

Could the GATK command path be computing all of the CRC hashes in Java code, slowing it down?
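
For a sense of what that cost could look like, here is a minimal sketch (my own illustration, not necessarily the code path GATK or the GCS connector actually takes) of a software-only CRC32C over a downloaded file using Guava's Hashing.crc32c(); hashing ~37 GB byte-by-byte in the JVM is where the time could go if no fast native CRC32C implementation is involved:

```java
// Illustration only: software-only CRC32C over a local file with Guava.
import com.google.common.hash.Hashing;
import com.google.common.hash.HashingInputStream;
import com.google.common.io.ByteStreams;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class Crc32cSketch {

    public static String crc32cOf(String localPath) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(localPath));
             HashingInputStream hashing = new HashingInputStream(Hashing.crc32c(), in)) {
            // Read every byte, hashing as we go; discard the data itself.
            ByteStreams.copy(hashing, ByteStreams.nullOutputStream());
            // This value would be compared against the object's stored CRC32C.
            return hashing.hash().toString();
        }
    }
}
```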

@jean-philippe-martin (Contributor) commented:

Did we change the name of the files since the initial bug was filed? My earlier comment mentions downloading to my desktop in 2.5 min, but retrying now takes about an hour! The file is 34.56 GiB. I guess that's what I get for moving to a different office.

Assuming this time is correct, I see that NIO (when multithreaded) matches gsutil performance. Output below.

gsutil:

$ time gsutil cp gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam .
Copying gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam...
Downloading ..../CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam: 8.64 GiB/8.64 GiB      
Downloading ..../CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam: 8.64 GiB/8.64 GiB    
Downloading ..../CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam: 8.64 GiB/8.64 GiB    
Downloading ..../CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam: 8.64 GiB/8.64 GiB    

real  55m11.517s
user  11m27.584s
sys 13m44.240s

NIO (via ParallelCountBytes)

Reading the whole file using 4 threads...
Read all 37107966478 bytes in 3354s. 
The MD5 is: 0x73B5B38156DF9E2FF25FC57DD858DA0F

(3354s is 55min54s)
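
For reference, here is a minimal single-threaded sketch of this NIO read path (ParallelCountBytes does essentially the same thing across several threads). It assumes the google-cloud-nio provider is on the classpath, and the gs:// URI is only a placeholder:

```java
// Single-threaded byte count over a GCS object via java.nio.
import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class CountBytesSketch {

    public static void main(String[] args) throws IOException {
        // Placeholder bucket/object; resolved by the google-cloud-nio provider.
        Path bam = Paths.get(URI.create("gs://example-bucket/example.bam"));
        ByteBuffer buf = ByteBuffer.allocate(1 << 20);  // 1 MiB reads
        long total = 0;
        try (SeekableByteChannel channel = Files.newByteChannel(bam)) {
            int n;
            while ((n = channel.read(buf)) != -1) {
                total += n;
                buf.clear();
            }
        }
        System.out.println("Read " + total + " bytes");
    }
}
```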

Next step: try on a Google Compute Engine computer, to get datacenter speeds.

@jean-philippe-martin (Contributor) commented:

I tried from a GCE instance and got 6m12s for the gsutil copy, and 5m21s for the NIO code.

So we know that using the NIO code would match the gsutil performance (in this case, about 100MB/s).

As to the original question of how this translates to Spark performance: this test fails to show that GCS/NIO are too slow, so more investigation is needed.

The cluster we're using has 10 machines, so it may be able to run up to 10x faster than this single-machine test, i.e. only about 30s to load the data via NIO. Of course, the program does more than just load data, and we still have to demonstrate that we won't bottleneck the GCS servers.

@pgrosu commented Jun 23, 2016

It all depends on the network you are running from, as noted in the following discussion:

googlegenomics/utils-java#9 (comment)

The closer you are to the data (for example, running from a GCE instance in a zone near the bucket, such as us-east1-b, us-east1-c, or us-east1-d), the quicker the result, although launching the instance can itself take some time. I don't have a setup like the Broad's to run the same test and determine what might be happening, but I just re-ran the following test on an external (non-GCE) cluster. Below are the results for a 1.46 GiB file, which come closer to @jean-philippe-martin's most recent results (projected from my throughput, a 34.56 GiB file would take about 13 min 38 sec, not 55 min):

$ gsutil ls -l gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2

1563675749  2014-04-24T20:26:25Z  gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2
TOTAL: 1 objects, 1563675749 bytes (1.46 GiB)

$
$ time(gsutil cp -L transfer_statistics.txt gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2 . )

Copying gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2...
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
WARNING: Found no hashes to validate object downloaded to ./coverageRefScore-chr1-GS000015172-ASM.tsv.bz2. Integrity cannot be assured without hashes.

real    0m31.112s
user    0m25.286s
sys     0m21.582s

$

Hope it helps,
Paul

@jean-philippe-martin (Contributor) commented:

So anyway, given that it takes 6 min to download the file, the time to download + run on HDFS would be 7.15 min, vs. the listed time of 7.5 min to run directly on GCS. That doesn't sound so bad, does it?

@pgrosu commented Jul 26, 2016

Not bad, but why can't the BAM be split into multiple objects in the same bucket, with the directory named after the BAM? I was having this discussion with Dion in the following thread:

googlegenomics/utils-java#62 (comment)

You could have a folder in the GCS location named after the BAM, and even sort the objects like a distributed B-tree. That way you could process reads while new data is still streaming in from GCS. Google's disk IOPS are as follows, based on this link:

https://cloud.google.com/compute/docs/disks/performance#type_comparison

Read IOPS    Write IOPS
3000         0
2250         3750
1500         7500
750          11250
0            15000

So it all depends on what folks prefer; in this case it means we could minimize the 6 min download component.

Then comes the ~1.5 min HDFS portion, which can run in parallel and could also use memory mapping and/or SSD access.

So there are still ways to improve access and processing time, but it depends on how quickly (or instantaneously) folks want the results processed and returned.

@jean-philippe-martin (Contributor) commented:

I vote to close. As per my earlier remark, performance reading from GCS buckets (at least in the case highlighted in this ticket) is just fine. The root cause was that the comparison neglected the 6 min it takes to place the GCS file onto the cluster.

@pgrosu commented Dec 5, 2016

@jean-philippe-martin Will files always be preemptively placed on GCS, so that this time delta will not be experienced?

@jean-philippe-martin (Contributor) commented:

Latest results show that GCS buckets perform even better than fine. Closing this issue.

@droazen droazen added this to the beta milestone Feb 27, 2017
@lbergelson (Member) commented:

@jean-philippe-martin This was referring to gs:// inputs in Spark, wasn't it? I think we still have work to do on that, don't we?

@jean-philippe-martin (Contributor) commented:

We do, but not on performance. The performance reading from GCS buckets is just fine, when we take into account the time it would otherwise take to copy the data over from the bucket.
