
For GPU version, use cudf or dask-cudf for the input dataframe instead of pandas #1362

Open
webcoderz opened this issue Jun 29, 2023 · 8 comments

Comments

@webcoderz

webcoderz commented Jun 29, 2023

Problem
To enable full GPU data science, it would be really useful to support dask-cudf for dataframes larger than single-GPU memory, or cudf for dataframes that fit entirely on a single GPU, instead of having to move the data to the CPU for input into model training.

Solution
Allow a dask dataframe or a cudf dataframe (or both) as input to the model.
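As a rough illustration of what this could look like, here is a minimal sketch of an input-normalization helper. The function name and control flow are hypothetical (not NeuralProphet's actual API); it only shows how the entry point could dispatch on pandas, cudf, and dask dataframes without hard-requiring RAPIDS to be installed:

```python
# Hypothetical sketch of an input-normalization helper for NeuralProphet.
# The function name `normalize_input_df` and its flow are illustrative,
# not the library's actual API.
import importlib.util
import pandas as pd

def normalize_input_df(df):
    """Accept a pandas, cudf, or dask(-cudf) dataframe.

    GPU frames are kept on-device; lazy dask frames are materialized here
    (a fuller integration would keep them lazy for longer). This only
    sketches the type dispatch."""
    mod = type(df).__module__
    if mod.startswith("cudf"):
        # Already a single-GPU cudf.DataFrame: keep it on the device.
        return df
    if mod.startswith("dask"):
        # Lazy dask / dask-cudf frame: trigger computation only when the
        # model actually needs concrete data.
        return df.compute()
    if isinstance(df, pd.DataFrame):
        if importlib.util.find_spec("cudf") is not None:
            import cudf  # move to GPU when RAPIDS is available
            return cudf.DataFrame.from_pandas(df)
        return df
    raise TypeError(f"Unsupported dataframe type: {type(df)!r}")
```

On systems without RAPIDS this degrades gracefully to plain pandas, which matters for a library that must also run CPU-only.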

@leoniewgnr
Collaborator

Hi @webcoderz, thanks for bringing this up! How would this work? Would the dataframe always stay a dask or cudf dataframe throughout the whole model? And what are the benefits? Would this be faster than pandas?

@webcoderz
Author

Basically it would enable a full GPU pipeline, give a speedup, and allow much larger dataframes to pass through. A dask dataframe can hold a very large dataset: when reading GBs of data, it splits the data into some number of partitions (each a pd.DataFrame), which reduces bottlenecks from data size and avoids the delays of offramping from GPU to CPU. The cool thing about dask is that it runs in parallel and supports lazy loading, so you can delay compute until the end of your function chain (or wherever is more efficient) and then execute the whole chain in parallel. The original Prophet was GIL-locked so this wasn't possible there, but it should be here.

A lot of considerations have to be made when dealing with large data. For example, one of your df_utils makes a copy of the dataframe, which could blow out memory on a really large dataset. My suggestion would be to start at https://ml.dask.org/pytorch.html and see whether dask-ml supports some of what you're doing here out of the box before going further down this path. I can try to help a bit; I have to read more of the code to understand it better, but it's typically completely possible to use torch and dask together.

@ourownstory
Owner

@webcoderz Thank you for your excellent suggestion, and for highlighting the device-transfer issue. @leoniewgnr is currently parallelizing some of our older code paths to help speed up data processing (she is mostly done). What you are suggesting sounds like the appropriate next step; I think you are right that most of our compute is data-pipeline and device-transfer bottlenecked. If you are up for the challenge, we would love to have a chat with you and discuss how to proceed. BTW, our dev core team (all of us are open-source volunteers) is open to new members. :)

@webcoderz
Author

Hi @ourownstory, yeah, I'd be down! I have a lot on my plate currently but would be happy to help! I think it would be super cool to run this on hundreds of millions of rows of data 😀

@beckernick

beckernick commented Jul 5, 2023

Hi! I came across this issue due to the cuDF / Dask mentions (I'm part of the RAPIDS team at NVIDIA that develops cuDF, Dask-CUDA, and a variety of other projects for accelerated computing).

We'd love to see the NeuralProphet community contribute this functionality to the package! I'd be happy to join any discussions on this topic and help try to answer any questions that may come up.

@webcoderz
Author

Perfect! This is excellent.

@ourownstory
Owner

@webcoderz Please let me know if you may still be game for this challenge. I'd be happy to hop on a call to discuss it.

@webcoderz
Author

I am all here for it! I gave @leoniewgnr my email if you want to reach out to schedule something!
