
Poor performance of reading from GS buckets #1755

Closed
akiezun opened this issue Apr 21, 2016 · 16 comments

@akiezun (Contributor) commented Apr 21, 2016

Running on a cluster created by Dataproc, the GATK Spark tools run much faster on files in HDFS than on files stored in GCS:

HDFS 1.15 minutes

./gatk-launch CountReadsSpark -I /user/akiezun/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam --apiKey <myAPIKEY> -- --sparkRunner GCS --cluster dataproc-cluster-3 --executor-cores 3 --executor-memory 25G --conf spark.yarn.executor.memoryOverhead=2500

GCS 7.50 minutes

./gatk-launch CountReadsSpark -I gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam --apiKey <myAPIKEY> -- --sparkRunner GCS --cluster dataproc-cluster-3 --executor-cores 3 --executor-memory 25G --conf spark.yarn.executor.memoryOverhead=2500

@lbergelson @jean-philippe-martin is this a known thing? Is this expected?

@jean-philippe-martin (Contributor) commented Apr 21, 2016 via email

@droazen (Contributor) commented Apr 21, 2016

@akiezun Can you determine whether you're using the HDFS -> GCS adapter in your test case? The adapter historically did have performance problems of this magnitude. As @jean-philippe-martin mentioned, we should benchmark the new NIO -> GCS support as well.

@akiezun (Contributor, Author) commented Apr 21, 2016

how?

@akiezun akiezun closed this as completed Apr 21, 2016
@akiezun akiezun reopened this Apr 21, 2016

@droazen (Contributor) commented Apr 21, 2016

It's not clear to me what code path you're going through when using a gs:// URI for the input BAM in your second test. CountReadsSpark calls GATKSparkTool.getReads(), which calls JavaSparkContext.newAPIHadoopFile(), but the question is how Hadoop-BAM handles your gs:// URI. In other parts of the GATK (e.g., ReferenceTwoBitSource) we call into BucketUtils.openFile(), which handles GCS URIs directly by calling into GcsUtil.open().
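
To make the two code paths concrete, here is a minimal sketch (my own illustration with placeholder names, not GATK's actual BucketUtils/Hadoop-BAM code) of dispatching on the URI scheme, sending gs:// paths through a GCS-aware NIO provider and everything else through Hadoop's FileSystem:

```java
// Illustrative sketch only: not GATK's actual implementation.
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class InputDispatchSketch {

    public static InputStream open(String input) throws IOException {
        if (input.startsWith("gs://")) {
            // Assumes the google-cloud-nio provider is on the classpath so that
            // java.nio can resolve gs:// URIs directly.
            return Files.newInputStream(Paths.get(URI.create(input)));
        }
        // hdfs://, file://, or bare local paths fall through to Hadoop.
        Path path = new Path(input);
        FileSystem fs = path.getFileSystem(new Configuration());
        return fs.open(path);
    }
}
```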

@cwhelan (Member) commented Jun 6, 2016

Could this be related to having sliced (composite) objects in the GCS bucket, but not using a code path that goes through a native CRC implementation? I ask because I noticed that when I try to download the file

gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam

with gsutil, I get this error:

CommandException: 
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see:

  $ gsutil help crcmod

To download regardless of crcmod performance or to skip slow integrity
checks, see the "check_hashes" option in your boto config file.

Could the GATK command path be computing all of the CRC hashes in Java code, slowing it down?
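
For a sense of what that cost could look like, here is a minimal sketch (my own illustration, not necessarily the code path GATK or the GCS connector actually takes) of a software-only CRC32C over a downloaded file using Guava's Hashing.crc32c(); hashing ~37 GB byte-by-byte in the JVM is where the time could go if no fast native CRC32C implementation is involved:

```java
// Illustration only: software-only CRC32C over a local file with Guava.
import com.google.common.hash.Hashing;
import com.google.common.hash.HashingInputStream;
import com.google.common.io.ByteStreams;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class Crc32cSketch {

    public static String crc32cOf(String localPath) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(localPath));
             HashingInputStream hashing = new HashingInputStream(Hashing.crc32c(), in)) {
            // Read every byte, hashing as we go; discard the data itself.
            ByteStreams.copy(hashing, ByteStreams.nullOutputStream());
            // This value would be compared against the object's stored CRC32C.
            return hashing.hash().toString();
        }
    }
}
```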

@jean-philippe-martin (Contributor) commented:

Did we change the name of the files since the initial bug was filed? My earlier comment mentions downloading to my desktop in 2.5 min, but retrying now takes about an hour! The file is 34.56 GiB. I guess that's what I get for moving to a different office.

Assuming this time is correct, I see that NIO (when multithreaded) matches gsutil performance. Output below.

gsutil:

$ time gsutil cp gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam .
Copying gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam...
Downloading ..../CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam: 8.64 GiB/8.64 GiB      
Downloading ..../CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam: 8.64 GiB/8.64 GiB    
Downloading ..../CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam: 8.64 GiB/8.64 GiB    
Downloading ..../CEUTrio.HiSeq.WEx.b37.NA12892.readnamesort.bam: 8.64 GiB/8.64 GiB    

real  55m11.517s
user  11m27.584s
sys 13m44.240s

NIO (via ParallelCountBytes)

Reading the whole file using 4 threads...
Read all 37107966478 bytes in 3354s. 
The MD5 is: 0x73B5B38156DF9E2FF25FC57DD858DA0F

(3354s is 55min54s)
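
For reference, here is a minimal single-threaded sketch of this NIO read path (ParallelCountBytes does essentially the same thing across several threads). It assumes the google-cloud-nio provider is on the classpath, and the gs:// URI is only a placeholder:

```java
// Single-threaded byte count over a GCS object via java.nio.
import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class CountBytesSketch {

    public static void main(String[] args) throws IOException {
        // Placeholder bucket/object; resolved by the google-cloud-nio provider.
        Path bam = Paths.get(URI.create("gs://example-bucket/example.bam"));
        ByteBuffer buf = ByteBuffer.allocate(1 << 20);  // 1 MiB reads
        long total = 0;
        try (SeekableByteChannel channel = Files.newByteChannel(bam)) {
            int n;
            while ((n = channel.read(buf)) != -1) {
                total += n;
                buf.clear();
            }
        }
        System.out.println("Read " + total + " bytes");
    }
}
```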

Next step: try on a Google Compute Engine computer, to get datacenter speeds.

@jean-philippe-martin (Contributor) commented:

I tried from a GCE instance and got 6m12s for the gsutil copy, and 5m21s for the NIO code.

So we know that using the NIO code would match the gsutil performance (in this case, about 100MB/s).

As to the original question of how this translates to Spark performance: this test fails to show that GCS/NIO are too slow, so more investigation is needed.

The cluster we're using has 10 machines, so it may be able to run up to 10x faster than this single-machine test, i.e. only about 30s to load the data via NIO. Of course, the program does more than just load data, and we still have to demonstrate that we won't bottleneck the GCS servers.

@pgrosu commented Jun 23, 2016

It all depends on the network you are running from, as noted in the following discussion:

googlegenomics/utils-java#9 (comment)

The closer you are to the data (for example, running from a GCE instance in a zone near the bucket, such as us-east1-b, us-east1-c, or us-east1-d), the quicker the result, although launching the instance can itself take some time. I don't have a setup like the Broad's to run the same test and determine what might be happening, but I just re-ran the following test on an external (non-GCE) cluster. Below are the results for a 1.46 GiB file, which come closer to @jean-philippe-martin's most recent results (projected from my throughput, a 34.56 GiB file would take about 13 min 38 sec, not 55 min):

$ gsutil ls -l gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2

1563675749  2014-04-24T20:26:25Z  gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2
TOTAL: 1 objects, 1563675749 bytes (1.46 GiB)

$
$ time(gsutil cp -L transfer_statistics.txt gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2 . )

Copying gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/REF/coverageRefScore-chr1-GS000015172-ASM.tsv.bz2...
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
Downloading ..././coverageRefScore-chr1-GS000015172-ASM.tsv.bz2: 372.81 MiB/372.81 MiB
WARNING: Found no hashes to validate object downloaded to ./coverageRefScore-chr1-GS000015172-ASM.tsv.bz2. Integrity cannot be assured without hashes.

real    0m31.112s
user    0m25.286s
sys     0m21.582s

$

Hope it helps,
Paul

@jean-philippe-martin (Contributor) commented:

So anyway, given that it takes 6 min to download the file, the time to download + run on HDFS would be 7.15 min, vs. the listed time of 7.5 min to run directly on GCS. That doesn't sound so bad, does it?

@pgrosu commented Jul 26, 2016

Not bad, but why can't the BAM be split into multiple objects in the same bucket, with the directory named after the BAM? I was having this discussion with Dion in the following thread:

googlegenomics/utils-java#62 (comment)

You could have a folder in the GCS location named after the BAM, and even sort the objects like a distributed B-tree. That way you could process reads while new data is still streaming in from GCS. Google's disk IOPS are as follows, based on this link:

https://cloud.google.com/compute/docs/disks/performance#type_comparison

Read IOPS    Write IOPS
3000         0
2250         3750
1500         7500
750          11250
0            15000

So it all depends on what folks prefer; in this case it means we could minimize the 6 min download component.

Then comes the ~1.5 min HDFS portion, which can run in parallel and could also use memory mapping and/or SSD access.

So there are still ways to improve access and processing time, but it depends on how quickly (or instantaneously) folks want the results processed and returned.

@jean-philippe-martin (Contributor) commented:

I vote to close. As per my earlier remark, performance reading from GCS buckets (at least in the case highlighted in this ticket) is just fine. The root cause was that the comparison neglected the 6 min it takes to place the GCS file onto the cluster.

@pgrosu commented Dec 5, 2016

@jean-philippe-martin Will files always be preemptively placed on GCS, so that this time delta will not be experienced?

@jean-philippe-martin (Contributor) commented:

Latest results show that GCS buckets perform even better than fine. Closing this issue.

@droazen droazen added this to the beta milestone Feb 27, 2017
@lbergelson (Member) commented:

@jean-philippe-martin This was referring to gs:// inputs in Spark, wasn't it? I think we still have work to do on that, don't we?

@jean-philippe-martin (Contributor) commented:

We do, but not on performance. The performance reading from GCS buckets is just fine, when we take into account the time it would otherwise take to copy the data over from the bucket.
