-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Low-latency storage backends #237
Comments
This would be very interesting.
I would be concerned about the maximum size of the low-latency storage limiting scaling. For example I noticed:
With 100MB chunks, this means the maximum size Zarr array you could process in one step before losing parallelism would be 2.5TB. |
Jacob Tomlinson on twitter suggested momento, Dragonfly, and ElastiCache as possibilities too. |
Also relevant to this issue is the work going on in Zarr to improve performance. |
Amazon S3 Express One Zone might have solved this problem. |
This looks very interesting to me! It would be great to test it. Some clarifying questions:
|
Not yet, by the look of it. I had a quick dig and found that fsspec uses aiobotocore, which in turn depends on botocore. The latest version of aiobotocore is pinned to botocore < 1.33.2, and support for Amazon S3 Express One Zone was added in botocore 1.33.2. Also, quoting from aio-libs/aiobotocore#1057 (comment):
|
Yikes! Reading a bit more to see why I went ahead and filed aio-libs/aiobotocore#1065 to have a place to track the S3 Express One Zone support work in |
Let's just create python bindings for Rust's object store crate and start using that instead. Looks very powerful and performant. |
More performance would be welcome! It doesn't support S3 Express One Zone yet either: apache/arrow-rs#5140 |
Ha, was just looking at that: fastlmm/bed-reader#22. There is in fact a Python interface already: https://github.com/roeap/object-store-python. It's a bit stale. I have been thinking for a while that the |
To defend fsspec: what do you think could possibly be getting in the way? Methods like Does object-store-python support anything async? I again refer readers to rfsspec: rust/tokio does not magically make things much faster, it may help a little. |
A month ago I was doing comparative benchmarks between Spark/Dask/DuckDB/Polars on cloud data. My observations were that, as long as the projects don't do anything dumb (a big assumption) the game was entirely about optimizing S3 access. I starting pushing on Arrow timings (see apache/arrow#38389) and found that it has much less to do with the language that is used, but rather due to how gently one handles S3. What I found at the time was that S3 liked ...
On an m6i.xlarge (4 cores) I could get about 60 MB/s from a single read (regardless of whether or not it was coming from boto, fsspec, or arrow (C++ backed)), or about 500 MB/s if I fully saturated the machine. This was all done before the new S3 single-zone stuff was launched. I came to the following conclusions:
All this is to say, I wouldn't expect "Just reimplement everything in Rust" to solve the problem. I think that you'll need to be more clever than that here. However I also don't think it would hurt. I found myself leaning more towards Arrow over fsspec in profiling because things felt simpler to profile. I was pretty confident that there wasn't some background IOLoop or GIL issue slowing me down (although I never had solid evidence that there was). In Cubed's case I'd recommend trying to build some concurrency in to a single function call. After talking to devs in other projects, the 2-3x threads over cores approach is common in Dask (or will be soon), DuckDB, and Polars. |
Just a note that concurrency X threads is normal and should be expected in zarr-on-dask workflows. fsspec supports concurrency, so this achieved by setting the dask partition to be a few (4-20) zarr chunks, depending on memory.
That of course is the tricky thing, in whichever language. If you need a sync gate at all (because the whole application is not async), then you need to decide where that gate is. |
This is quite opposite to claims that were made before. IF we can reach near equivalent performance in python, the simplicity of the code and speed of development are big benefits. My experiments in rfsspec showed there is probably little difference.
These are interesting findings that are hard to tease out from the linked very long thread. Perhaps there can be some sort of summary of findings? |
A summary of the finding sounds great. If someone wants to do that that would be welcome. |
This notebook might also be of use https://gist.github.com/mrocklin/c1fd89575b40c055a9be77b2a47894df |
Batched function calls
Yeah, I agree: batched async provides a bunch of benefits. On the design for Zarr-Python v3.0, there's some discussion of implementing an async Compiled languages and millions of IOPSOn the topic of implementing IO libraries in compiled languages (like Rust)... There are use-cases which would benefit from being able to read millions of small chunks per second to a single machine (for example, here's an overly long blog (!) post I wrote about my main use-case: training large ML models from multi-dimensional data). To achieve that kind of performance, it would appear necessary to implement at least some of the code in a compiled language. (And to use all the tricks discussed in the thread above: especially using native async IO that modern operating systems provide. I'm tinkering with some ideas in Rust in this repo). But I recognise that this current GitHub discussion is about low-latency cloud storage buckets (using cubed). And my main focus is somewhat different: my main focus is on local, modern SSDs (which, today, can do up to 3 million IOPS. And the performance is rapidly improving). So - I apologise - I'm a bit off-topic. But I wanted to make the point that some use-cases would benefit from implementing IO in a compiled language. As far as I can tell, it's not yet possible to perform millions of IOPS to a single machine from a cloud storage bucket (although Amazon S3 Express One Zone gets us to 100,000 IOPS... although I assume that performance is only possible when talking to the bucket from many VMs). And, of course, if you're using large chunks sizes (100 MB) then you're probably only doing on the order of 10 IOPS per machine, and so your performance is dominated by the network bandwidth, and no one cares if you can shave a few milliseconds off your code's runtime per IO operation! But there are folks (like me) who require on-prem hardware and fast, local storage. Amazing things are happening in the storage industry right now: I've heard it said that there's been more innovation in storage in the last two years than in the last twenty years! Folks like me need code which can keep up with these innovations in hardware! And - who knows - maybe cloud storage buckets will soon deliver millions of IOPS (to a single machine). But - wait - is it even possible to do that over a network? Well... even the lowly 1 Gbps NIC in your laptop can do almost 1.5 million packets per second (PPS)... and 400 Gbps NICs are available, which can do almost 600 million PPS. And, in the future, it'd be nice if cloud storage APIs allowed us to submit multiple disk operations per packet... you know... like HTTP has been capable of since 1996 🙂. |
Thanks for all the comments! Really interesting.
That's a really good point @mrocklin. There is a PR in aiobotocore to support Amazon S3 Express One Zone, so hopefully we'll be able to try it out (via fsspec) soon. That would be the first thing I would try to address this issue, which is particularly acute in Cubed as it uses cloud storage for intermediate results (i.e. it doesn't use S3 very gently).
Agreed. When we start reading multiple chunks per task in more sophisticated ways (e.g. for tree-reduce in #284, #331) then we'll need the ability to load N chunks concurrently - sometimes in a streaming fashion - and no more than N chunks to meet Cubed's memory guarantees. It sounds like fsspec can already do that, which is great, also there's active discussion about how to improve this more generally (like the Zarr V3 discussion @JackKelly linked to). |
The newest release of |
botocore>=1.33.2,<1.33.14 is what I see now, so give it a go |
I assume not... But, just to check: would cubed be happy with high latency if it got high bandwidth and a large number of IO operations per second? |
This seems to have been done https://github.com/developmentseed/obstore |
And it's almost implemented in Zarr itself: zarr-developers/zarr-python#1661 Would also love to plug in Icechunk to Cubed. |
So far we've only used cloud storage (S3 and GCS) for storing intermediate Zarr data in Cubed. It would be interesting to try other storage backends that have lower latency.
Here are some examples, I'm sure there are more:
Note that Google Cloud Memorystore is not serverless, so you'd need to start a cluster before running your computation, and shut it down afterwards.
We'd also want to be more aggressive in deleting intermediate data that is no longer needed in Cubed. Currently we just let it get deleted automatically in the background, well after the computation has completed. Low-latency storage has smaller capacity and is a lot more expensive, so we'd want to delete data that is no longer a direct dependency of any node in the DAG immediately as the computation progresses.
cc @TomNicholas @rabernat
The text was updated successfully, but these errors were encountered: