Implement a read + variant streaming-specific heuristic for calculating shard sizes #62
Per https://cloud.google.com/genomics/reference/rpc/google.genomics.v1#google.genomics.v1.StreamReads, "strict" shard semantics are now supported server-side. It's advisable to use small shards, since the current retry semantics will no longer apply. For overlapping semantics or custom sources (which split shards dynamically), the current sharding approach should still be used.
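For concreteness, here is a minimal sketch (not this repository's actual code) of what strict, non-overlapping sharding over a genomic interval could look like. The Shard class and the shard width used in main are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split [start, end) on one reference into
// contiguous, non-overlapping shards of at most basesPerShard bases.
public class StrictShardingSketch {

  // Hypothetical value type; the real pipeline likely has its own shard/contig class.
  static final class Shard {
    final String referenceName;
    final long start;  // inclusive
    final long end;    // exclusive

    Shard(String referenceName, long start, long end) {
      this.referenceName = referenceName;
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return referenceName + ":" + start + "-" + end;
    }
  }

  static List<Shard> shardRegion(String referenceName, long start, long end, long basesPerShard) {
    List<Shard> shards = new ArrayList<>();
    for (long shardStart = start; shardStart < end; shardStart += basesPerShard) {
      long shardEnd = Math.min(shardStart + basesPerShard, end);
      shards.add(new Shard(referenceName, shardStart, shardEnd));
    }
    return shards;
  }

  public static void main(String[] args) {
    // Example: chromosome 20 (~63 Mb) in 1 Mb shards -> 64 shards (63 full + 1 partial).
    List<Shard> shards = shardRegion("20", 0, 63_025_520, 1_000_000);
    System.out.println(shards.size() + " shards, first = " + shards.get(0));
  }
}
```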
Hi Dion (@dionloy), What would be the average number of parallel gRPC streams one can assume at one time for a collection of data? The reason I ask is that, looking at the total read count in 1000 Genomes based on this link, it is 2.47123*(10^14): ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.sequence.index Thus, if we have 100 parallel gRPC streams at 10^6 reads per stream per second, it would take almost 29 days to process. Sequencing is only going to grow, and integrative analysis across multiple datasets is usually assumed. Thanks,
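A back-of-the-envelope check of that estimate, using the read count and throughput figures quoted above:

```java
// Check of the "almost 29 days" estimate quoted above.
public class ThroughputEstimate {
  public static void main(String[] args) {
    double totalReads = 2.47123e14;      // total 1000 Genomes read count quoted above
    double streams = 100;                // parallel gRPC streams
    double readsPerStreamPerSec = 1e6;   // assumed per-stream throughput

    double seconds = totalReads / (streams * readsPerStreamPerSec);
    System.out.printf("%.1f days%n", seconds / 86_400);  // ~28.6 days
  }
}
```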
I don't have exact numbers, but we were running a Dataflow over the Autism dataset (and even at 8 hours I think we can improve the performance significantly).
That sounds more reasonable, but could you give it a try with 2 million gRPC connections per worker as a comparison? Basically, the analysis will change significantly if we have 10^9 gRPC connections (yes, that would be a billion overall connections). Do you think we can scale to that, given that the number of workers is limited to 1000, with 4 cores each, for Dataflow? Thanks,
Hi Paul, no, we do not have the capacity to scale to a billion overall connections. On a Linux machine I'm guessing you're theoretically restricted to ~64K connections, even assuming zero I/O contention on your own network.
Hi Dion,

I am assuming this would be through GCE, to take advantage of the fast internal transfer rates. As GCE instances are VMs, additional NICs can be added, since the maximum of 65,536 connections is per IP address. So if we have 1000 GCE instances with 16 NICs each, we can reach 1 billion connections with only 62,500 concurrent connections per NIC. The alternative is to use IP aliasing to assign 16 IP addresses to a single NIC. I am assuming you can change the bandwidth limit on the NICs for a VM. What are the upper limits for VM<->GS and VM<->REST_Server transmission?

Previously it sounded like we could have tens of thousands of parallel RPC calls at the same time from several tens of thousands of workers:

If need be, we can work to make the transfer even faster, by compressing the transmission and/or generating replicated, inverted-indexed data structures for the more frequently accessed fields in GS buckets. All I am saying is that we can approach the same response time for next-generation disease discovery by using the same concepts that have been implemented for Google Search (i.e., find me a similar variant that has previously been used for a differently labeled disease, as new data is constantly streaming in). Otherwise we have the same tools we have had for years, just getting the same results a little faster, rather than changing the way analysis is performed to enable a dramatic shift in how Personalized Medicine analysis can be performed compared to today.

Paul
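Checking the connection arithmetic in that setup (the instance, NIC, and per-IP figures are the ones quoted above):

```java
// Check of the connection math quoted above: instances x NICs x connections-per-IP.
public class ConnectionCapacity {
  public static void main(String[] args) {
    long instances = 1_000;          // GCE VMs
    long nicsPerInstance = 16;       // NICs (or aliased IPs) per VM
    long connectionsPerIp = 62_500;  // well under the ~65,536 per-IP port limit

    long total = instances * nicsPerInstance * connectionsPerIp;
    System.out.println(total);       // 1,000,000,000 connections
  }
}
```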
Indeed, the scaling should just be by machine (though I think by then you will be CPU-limited for that number of connections). On the server side we cannot support a billion connections at once (again, limited by the number of machines deployed), but thousands should definitely be supported (the comment you quoted was referring to REST calls, not the persistent streaming calls, which carry a higher tax).
Hi Dion,

It's too bad that the server cannot support 1 billion connections, but I think I have a way this can still be achieved. After importing the data through the Google Genomics API, you then just export the JSON-formatted pages to GS buckets. In fact, you don't even have to store them first; the data can be written directly to GS storage in JSON format. Then, with 1000 GCE instances each holding 1,000,000 simultaneous connections to the GS buckets, that would reach the 1 billion mark. What is the read/write I/O performance from a GCE instance to a GS bucket?

I just benchmarked 1 million concurrent connections on my laptop using NGINX as a server - so that it is event-driven - and my CPU kept hovering at 40%. I think this is similar to what is suggested here: https://cloud.google.com/solutions/https-load-balancing-nginx In fact, I can make it even faster by bypassing the kernel's TCP/IP stack and reading the packets directly in user mode with a driver I would write, along with a few other optimizations that would take too long to detail here. So on a 4-core system, if I boot the kernel with

Thus, for 1000 Genomes with a total of 2.47123E+14 reads at 1 million reads/second per connection - assuming GS-direct access is at least that fast - the above setup would process all the reads in less than a second, which is the same response time as a Google Search. Let me know what you think.

Thanks,
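The sub-second figure follows from the numbers above, assuming 1 billion connections each streaming 1 million reads/second:

```java
// Check of the "less than a second" claim above for the full 1000 Genomes read count.
public class SubSecondEstimate {
  public static void main(String[] args) {
    double totalReads = 2.47123e14;         // total read count quoted above
    double connections = 1e9;               // 1000 instances x 1M connections each
    double readsPerConnectionPerSec = 1e6;  // assumed per-connection throughput

    double seconds = totalReads / (connections * readsPerConnectionPerSec);
    System.out.printf("%.3f seconds%n", seconds);  // ~0.247 s
  }
}
```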
Paul, I don't dispute that it's possible; all it takes is machines and $ to serve it, whether through Genomics, Cloud Storage, or Xyz =).
Hi Dion, As they say, if you build it, they will come :) In the meantime, I will try to figure out a solution for this, and please let me know if there is any other way I can help you out. Thank you and have a great weekend!
Looks like we will need a version bump here in order to pull in the StreamReads sharding support. |
Proposal:

Flags (a hypothetical sketch of these options follows below):
- --calculateShardSizeBasedOnCoverage (default true)
- --basesPerShard (overrides if the above is false)

Reads

Variants
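A minimal sketch of how those two flags might be declared as Dataflow pipeline options. The interface name, the basesPerShard default, and the descriptions are illustrative assumptions, not this repository's actual options class:

```java
import com.google.cloud.dataflow.sdk.options.Default;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;

// Hypothetical options interface for the proposed flags; the flag names and the
// default for calculateShardSizeBasedOnCoverage follow the proposal above,
// everything else is an assumption.
public interface ShardingOptions extends PipelineOptions {

  @Description("If true, compute shard sizes from a coverage-based heuristic; "
      + "if false, fall back to a fixed number of bases per shard.")
  @Default.Boolean(true)
  boolean getCalculateShardSizeBasedOnCoverage();
  void setCalculateShardSizeBasedOnCoverage(boolean value);

  @Description("Fixed shard width in bases; only consulted when "
      + "--calculateShardSizeBasedOnCoverage=false. Default is illustrative.")
  @Default.Long(1_000_000L)
  long getBasesPerShard();
  void setBasesPerShard(long value);
}
```

Parsing would then follow the usual pattern, e.g. `ShardingOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(ShardingOptions.class);`.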