
[WIP] Add large svd draft #38

Merged
merged 16 commits into dask:gh-pages from svd on May 13, 2020

Conversation

mrocklin
Member

@mrocklin commented Jul 5, 2019

cc @alimanfoo

I need to put in actual code samples and probably do a screencast, but in the meantime, if you'd be willing to look through this post for the general structure, I'd welcome the collaboration.

@mrocklin changed the title from "Add large svd draft" to "[WIP] Add large svd draft" on Jul 5, 2019
@alimanfoo
Contributor

Cool, thanks @mrocklin. I took a scan through and it looks good; I'll try to follow up with more detailed feedback.

@mrocklin
Member Author

mrocklin commented Jul 5, 2019

If you want to give it a try, see this notebook https://gist.github.com/mrocklin/3af8a428e18a3047ab501161ffe17cf9

It should run fine on the single-node GPU machine to which you have access today (sending the address by e-mail)

@pentschev
Member

No %%time? :(

@mrocklin
Member Author

@quasiben @jakirkham @pentschev I suspect that you all are busy, but it would be fun to try this experiment again, now with the new and improved UCX integration.

@quasiben
Member

I gave this a whirl on a DGX2 (16 GPUs) with NVLink enabled. Here's a performance report:

https://gistcdn.githack.com/quasiben/78033ef8130afb7b4c5f5f2def36d388/raw/59eb22d47bef5462bdb433ebfe3ae099cc64ccf1/dask-performance-large-svd.html

In general things are looking ok -- I am a bit concerned with the NumPy transfers (don't know where they are coming from yet) and there is a fairly large gap in the middle of the workflow. But things worked!

@pentschev
Member

I am a bit concerned with the NumPy transfers (don't know where they are coming from yet)

Probably metadata in UCX send/recv, which is mostly being addressed in dask/distributed#3732 .

@mrocklin
Member Author

I am a bit concerned with the NumPy transfers (don't know where they are coming from yet)

It would be nice to know how much was transferred (this is mostly a mental note to me to maybe improve diagnostics)

@mrocklin
Member Author

Can I ask that you maybe also call v.compute(). That might slightly change things.

@quasiben
Member

Can I ask that you maybe also call v.compute(). That might slightly change things.

Do you mean u.compute()? I called v.

@mrocklin
Member Author

From the performance report it looks like we're spending essentially all of our time generating random numbers.

There are a few things that we might experiment with here to improve the focus on SVD and also make this example a bit more realistic. cc'ing @alimanfoo, @eric-czech, and @jakirkham, who I think might all care about this problem.

  1. We can mimic a genomics problem I think by using random integers uniformly distributed between 0 and 3, stored as uint8.
  2. We might consider storing these somewhere. We could try something small that fits in GPU memory (which might be easier given that we're storing the data somewhat compactly). So we would persist the input dask array to fill up memory (we probably have to reduce the size a bit) and then call svd, hoping that the working memory it needs isn't too much greater.
  3. For something larger, we could move to disk. I suspect that reading from disk will be more expensive than regenerating the data, but this is still a useful thing to highlight and will make a nice section. <puts on NVIDIA hat> It's also a great opportunity to mention NVIDIA IO technologies </takes off NVIDIA hat>
  4. If we wanted to experiment a bit, we could play with compressing the data in memory just before we persist it. This would require us to have functions that converted a cupy array to a compressed form, and then decompressed that cupy array. Do we have such a function?
  5. We could fit everything into memory if we just added more machines :)

Anyway, those are some ideas. I'm actually pretty excited about trying them all.
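Item 1 above can be sketched in a few lines. This is a toy NumPy version with placeholder sizes, not the real experiment; with dask.array the analogous da.random.randint call would add a chunks= argument.

```python
import numpy as np

# Genotype-like data: random integers in [0, 4), stored compactly as uint8,
# so each entry costs one byte instead of eight.
x = np.random.randint(0, 4, size=(1_000, 400), dtype=np.uint8)
print(x.dtype, x.nbytes)
```
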

@mrocklin
Member Author

@quasiben I'm somewhat concerned about the 2GB/s bandwidth for cupy arrays. This seems low?

I hope that once we remove the random number generation that this becomes our next bottleneck to squash.

@mrocklin
Member Author

mrocklin commented Apr 22, 2020 via email

@alimanfoo
Contributor

  1. We can mimic a genomics problem I think by using random integers uniformly distributed between 0 and 3, stored as uint8.

FWIW in a real genomics problem the integers can take values 0, 1 or 2 (if working with a diploid organism like humans or mosquitoes). They are also not uniformly distributed. These may or may not make a difference, depending on what you're testing. The biggest difference I expect would be compressibility (real data will be much more compressible than uniform random data). Give me a shout if you'd like some real data to play with; I can point you at some zarrs on GCS.

@mrocklin
Member Author

mrocklin commented Apr 25, 2020 via email

@mrocklin
Member Author

Here is an example using Zarr with dask array together to compress in-memory data.

https://gist.github.com/mrocklin/ae18a1b51ccf90a2fa9872de23144ffa
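The idea in the gist (keep the data compressed in memory, decompress on demand) can be sketched with the standard library's zlib standing in for Zarr's codecs. Here pack and unpack are hypothetical helper names, not Zarr or Dask API:

```python
import zlib
import numpy as np

def pack(x: np.ndarray) -> bytes:
    # Compress the raw bytes of an array (stand-in for a Zarr codec)
    return zlib.compress(x.tobytes())

def unpack(buf: bytes, shape, dtype) -> np.ndarray:
    # Decompress and restore the original array on demand
    return np.frombuffer(zlib.decompress(buf), dtype=dtype).reshape(shape)

x = np.random.randint(0, 4, size=(100, 100), dtype=np.uint8)
buf = pack(x)
assert (unpack(buf, x.shape, x.dtype) == x).all()
print(len(buf), "compressed bytes for", x.nbytes, "raw bytes")
```

Because genotype-like data uses only a handful of symbol values, the compressed form is considerably smaller than the raw bytes.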

@alimanfoo
Contributor

Here's a gist showing the location of some genome variation data in zarr on GCS, and also the couple of preprocessing steps required before it can be run through SVD:

https://gist.github.com/alimanfoo/ef6d9e70750b5273f95cbf681d162de8

Hth :-)

@alimanfoo
Contributor


Btw this example is "tall and skinny" with ~10 million features and ~1,000 individuals, but the data are growing to become more square, e.g., biobank datasets are more like ~10 million features and ~1 million individuals. So don't optimise for the tall and skinny case. At some point it would probably be useful to simulate a biobank-scale dataset.

@alimanfoo
Contributor

Here is an example using Zarr with dask array together to compress in-memory data.

https://gist.github.com/mrocklin/ae18a1b51ccf90a2fa9872de23144ffa

That's cunning :-)

@mrocklin
Member Author

That's cunning :-)

Yeah, I then followed it up by actually swapping out zarr with functions that do bit-twiddling

import numpy as np

def compress(x: np.ndarray) -> np.ndarray:
    # Pack four 2-bit values (0-3) into each output byte
    out = np.zeros_like(x, shape=(x.shape[0], x.shape[1] // 4))
    out += x[:, 0::4]
    out += x[:, 1::4] << 2
    out += x[:, 2::4] << 4
    out += x[:, 3::4] << 6
    return out

def decompress(out: np.ndarray) -> np.ndarray:
    # Recover the four 2-bit values from each byte
    back = np.zeros_like(out, shape=(out.shape[0], out.shape[1] * 4))
    back[:, 0::4] = out & 0b00000011
    back[:, 1::4] = (out & 0b00001100) >> 2
    back[:, 2::4] = (out & 0b00110000) >> 4
    back[:, 3::4] = (out & 0b11000000) >> 6
    return back

This is a bit faster, and it also works with CuPy (we don't have good compression algorithms with Zarr on the GPU yet). This lets us store a dataset pretty compactly and expand it on demand. It didn't quite work, I suspect because of a known memory-pressure issue with the dask scheduler (a few other groups have run into something similar, so I hope there is good pressure to solve it).
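A quick round-trip check of this packing scheme (restated compactly here, with the loop replacing the unrolled shifts; it assumes uint8 values in [0, 4) and a column count divisible by 4):

```python
import numpy as np

def compress(x):
    # Pack four 2-bit values per byte, same layout as above
    out = np.zeros_like(x, shape=(x.shape[0], x.shape[1] // 4))
    for i in range(4):
        out += x[:, i::4] << (2 * i)
    return out

def decompress(out):
    # Unpack each byte back into four 2-bit values
    back = np.zeros_like(out, shape=(out.shape[0], out.shape[1] * 4))
    for i in range(4):
        back[:, i::4] = (out >> (2 * i)) & 0b11
    return back

x = np.random.randint(0, 4, size=(50, 200), dtype=np.uint8)
assert (decompress(compress(x)) == x).all()
assert compress(x).nbytes == x.nbytes // 4  # 4x smaller in memory
```
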

Speaking of CuPy, @quasiben and I were playing with a DGX-2 (16 GPUs, half a terabyte of device memory) last night and solved something like a 10,000,000 by 20,000 SVD in 10 seconds or so (if I recall correctly). That's after it was in RAM though. Probably the main cost will be collecting it from wherever it's stored.

Btw this example is "tall and skinny" with ~10 million features and ~1,000 individuals, but the data are growing to become more square, e.g., biobank datasets are more like ~10 million features and ~1 million individuals. So don't optimise for the tall and skinny case. At some point it would probably be useful to simulate a biobank-scale dataset.

That's good to know.

So far it looks like we're not overly sensitive to chunking on the y-axis, assuming that we're comfortable with approximate results.

@eric-czech

swapping out zarr with functions that do bit-twiddling

@mrocklin do you have any thoughts on trying to do bitpacking like that with the dask API rather than as map_blocks functions (or zarr filters, even assuming they worked with cupy)? Looking at that example, I'm immediately seeing some ways to do useful math on the packed representations so if applying the compression filters at that level instead and relying on bitwise da.Array functions rather than custom numpy functions isn't a bad idea, I'd like to explore that some more.

@alimanfoo mentioned experimenting with something like that in the past a bit but I'd love to know if you think doing lots of bit-fiddling at the dask level would be a bad practice.

Apologies for a somewhat off-topic post but your example shook some thoughts loose so thanks for sharing it!
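To make "useful math on the packed representations" concrete, here is a toy NumPy sketch that sums all genotype values directly from the packed bytes, without materializing the unpacked array. The same bitwise operators exist on dask and CuPy arrays, so in principle this could be written against da.Array too; the packing layout matches the compress function above.

```python
import numpy as np

x = np.random.randint(0, 4, size=(64, 64), dtype=np.uint8)
# Pack four 2-bit values per byte
packed = sum(x[:, i::4] << (2 * i) for i in range(4)).astype(np.uint8)

# Sum every genotype value by extracting each 2-bit field bitwise
total = sum(int(((packed >> (2 * i)) & 0b11).sum()) for i in range(4))
assert total == int(x.sum())
```
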

@mrocklin
Member Author

do you have any thoughts on trying to do bitpacking like that with the dask API rather than as map_blocks functions

In principle, yes, this could be made to work. Dask array isn't as smooth with in-place operations, so to make the packing process here as fast we might want something like dask/dask#6123. If you want to pack with map_blocks and then operate on the packed representation with the Dask array API, though, then yeah, I imagine that would be easier.

@mrocklin
Member Author

In general, the experiment with @quasiben was interesting in order to see how much of this data we could effectively operate on in a fixed amount of device memory. Things are very fast if we can stay in that RAM, but there are a lot of interesting complications to doing so. This was more for us to play with technology than to solve an actual science problem.

This leads to a larger question of "what are the actual science problems here?" For example, you all may not even care about doing this quickly, but instead just care about being able to do it at all. I encourage folks to speak up about what would be motivating.

@alimanfoo
Contributor

This leads to a larger question of "what are the actual science problems here?" For example, you all may not even care about doing this quickly, but instead just care about being able to do it at all. I encourage folks to speak up about what would be motivating.

I'd say for genomics the main driver is to be able to do it at all. If it can then be done in an interactive environment, even better, where "interactive time" is anything less than the time it takes to make a cup of tea. I.e., you don't need to leave your normal working environment (Jupyter notebook) or worry about any long-running tasks. Also probably relevant as analysis moves to the cloud is cost, i.e., it is not an expensive computation.

Sorry this is a bit vague, but hopefully useful. Main point is we are not running SVDs over and over again all day, so this is not something where there is strong pressure to run as fast and cheaply as possible. But making it possible and straightforward to setup and run large SVDs is valuable.

@quasiben
Member

quasiben commented Apr 28, 2020

@alimanfoo can you make tea in 20 seconds?

In my last experiment I was able to push to 400 GB of in-memory data (10_000_000, 30_000). The SVD at this size was also not greatly impacted: I measured a compute time of 19.3 seconds. This is probably close to the limit (as of now) of the amount of data we can naively perform an SVD on.

To return to my snarky tea comment, and as Matt said earlier, reading from disk will be somewhat burdensome. It took roughly a minute to generate the data in memory -- this will most likely be longer when reading from disk, and push us into tea-construction time.

There are two things which could be done here to ease the performance and memory constraints:

  • Optimized GPU readers from disk which integrate seamlessly with Zarr or other formats
  • Increase single GPU RAM (don't know how hard this is) or move to a multinode setup (I am working on this now)

Another interesting thing to note is that after loading data into memory, SVD is a communication-heavy process. I didn't expect that. A distributed SVD spends more time transferring than computing. Using a box like a DGX2 is hugely impactful here, as we can leverage NVLink for communication between GPUs (NVLink has 50 GB/s of bandwidth).

Breakdown of time spent (in seconds) for an SVD:

defaultdict(int,
            {'compute': 55.52175450325012,
             'transfer': 141.12734627723694,
             'disk-read': 0.00934290885925293})

I also wanted to note that working on these problems is also of general interest to NVIDIA.
ClaraGenomics would be an example of NVIDIA in the genetics space: https://github.com/clara-genomics/Compute4COVID

@mrocklin
Member Author

mrocklin commented Apr 28, 2020 via email

@alimanfoo
Contributor

@alimanfoo can you make tea in 20 seconds?

Nope :-) Just to elaborate a little, obviously interactive is ideal, but tea break would be OK too.

Another way to think about it might be: imagine a researcher who has access to a pangeo-like or saturncloud-like research computing environment, and can fire up notebook servers and dask-kubernetes clusters as needed based on standard node types offered by public cloud providers. They need to run an SVD over a large dataset, and ideally (a) it's convenient to fire up the necessary compute resources and set up the computation, (b) it doesn't take too long to run, and (c) cloud costs are small enough to be negligible, i.e., they don't require any special accounting.

Hth.

@mrocklin
Member Author

mrocklin commented Apr 28, 2020

Another way to think about it might be: imagine a researcher who has access to a pangeo-like or saturncloud-like research computing environment, and can fire up notebook servers and dask-kubernetes clusters as needed based on standard node types offered by public cloud providers. They need to run an SVD over a large dataset, and ideally (a) it's convenient to fire up the necessary compute resources and set up the computation, (b) it doesn't take too long to run, and (c) cloud costs are small enough to be negligible, i.e., they don't require any special accounting.

Like this?

Notebook: https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/mrocklin/0c4e97b9164c55d85ce457fce0b87d41/raw/90a07e2f6fd1563aa65eb83f190e4c853494a4dd/coiled-compressed-svd.ipynb
Screencast: https://www.youtube.com/watch?v=qaJcAvhgLy4

@alimanfoo
Contributor

Added quasiben#1 with a bit more explanation of the genomics.

@quasiben
Member

quasiben commented May 5, 2020

Thanks @alimanfoo! I added them directly here. Matt gave me write access and thought it best to centralize now that we can.

@quasiben
Member

quasiben commented May 5, 2020

@mrocklin when you have some time can you take a pass over this?

On a DGX2 we can calculate an SVD of a 200GB Dask array in between 10 and 15 seconds: [SVD Multi-GPU Notebook](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4)

To see this run, we recommend viewing
[the attached screencast](https://youtu.be/4X5yky2lvEw)
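For reference, the "compressed" SVD behind these timings can be sketched in plain NumPy as a randomized SVD in the style of Halko et al., which is the approach dask.array.linalg.svd_compressed takes for large chunked arrays. The function name and parameters below are illustrative, not Dask's actual API:

```python
import numpy as np

def svd_compressed(a, k, oversample=10, seed=0):
    # Randomized SVD: project onto a random subspace, orthonormalize,
    # then take an exact SVD of the small projected matrix.
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((a.shape[1], k + oversample))
    q, _ = np.linalg.qr(a @ omega)          # approximate the range of a
    u_small, s, vt = np.linalg.svd(q.T @ a, full_matrices=False)
    return (q @ u_small)[:, :k], s[:k], vt[:k]

# Sanity check: exact recovery on a rank-5 matrix
rng = np.random.default_rng(1)
a = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 80))
u, s, vt = svd_compressed(a, k=5)
assert np.allclose(u * s @ vt, a)
```

The oversampling parameter trades a little extra work for a much more accurate range approximation, which is why approximate results are not very sensitive to chunking along the short axis.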
@mrocklin (Member Author)

My computer may be borked? Can someone else check if this has audio?

@mrocklin
Member Author

mrocklin commented May 6, 2020

Thanks @quasiben I took a look. Some high level thoughts:

  1. Can you check the screencast and verify that it has audio? It didn't play on my machine, but my audio has also been strange recently. It didn't play on my phone either, which has me concerned.
  2. The GPU part of this post assumes knowledge of terms like DGX2 and host memory that we might not be able to rely on with the majority of our readership. It might also be good to somehow bracket that section in "now let's enter GPU land (oh and as a disclaimer, one of the authors works for NVIDIA)"
  3. I'd like to also include the bits on compression, why it's useful and why it didn't work
  4. I'd also like to include bits on moving to the cloud (with appropriate disclaimer there as well)
  5. We should probably shift language from "I" to "we". I did this in a few places as I was going through but I probably missed some. I also added you to the author list, and decided to sort alphabetically by first name.

@quasiben
Member

quasiben commented May 6, 2020

@mrocklin thanks for the review. I just recently upgraded to Catalina and I think that changed various app permissions. I'll re-record and upload.

I can also update the post with your suggestions as well. Thanks

layout: post
title: Very Large SVDs
tagline: Dask + CuPy + Zarr + Genomics
author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled)
@alimanfoo (Contributor)

Suggested change
author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled)
author: Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled), Alistair Miles (Oxford University)

I don't deserve to be first author here :)

@mrocklin (Member)

I think we should do it alphabetically by first name. I believe that happens to coincide with the current ordering of names

@alimanfoo (Contributor)

That would feel a bit strange for me, my contribution to this nice work is very minor.

Member

changed the author order

@quasiben
Member

This is looking really good. @mrocklin left a few small comments on the latest commits

@quasiben
Member

Read through the post again and I'm good with posting. @alimanfoo?

@alimanfoo
Contributor

LGTM :-)

@mrocklin
Member Author

I will plan to merge this and tweet about it tomorrow morning US Eastern time

@mrocklin merged commit 79ac035 into dask:gh-pages May 13, 2020
@mrocklin deleted the svd branch May 13, 2020 14:08