Hero workload brainstorm #5
@rabernat @jbusecke there's the Dask Distributed Summit coming up this May. From the summit's landing page, it's a virtual conference "where users, contributors, and newcomers can share experiences to learn from one another and grow together". This would be a great space for Pangeo folks to share their experience using Dask. For example, we could share the results of a hero workload, or if we're not quite there yet, discuss some of the successes and pain points that we've encountered so far as part of such a workload. The CFP closes in a couple of days (May 28) and the talks themselves (30 minutes) are May 19-21. Do either of you, or other Pangeo folks, have an interest in submitting a talk?
Hi @jrbourbeau! I'm a bit confused about the Dask Summit. There are two "workshop proposals" that I know about: Pangeo and Xarray. But I haven't seen an actual call for talk proposals. Did we miss that deadline?
Regarding the hero workload, I have an idea coming together. The core of the method is described in this recent paper by @dhruvbalwada. We want to compute the vorticity / strain / divergence joint distribution from ocean surface velocities. However, we want to do it using the dataset described here - https://medium.com/pangeo/petabytes-of-ocean-data-part-1-nasa-ecco-data-portal-81e3c5e077be - which is mirrored on Google Cloud here - https://catalog.pangeo.io/browse/master/ocean/LLC4320/. We could also introduce some scale dependence into the calculation by doing filtering with the new software package coming together here - https://github.com/ocean-eddy-cpt/gcm-filters/. Our default way of implementing these calculations - using xgcm - would not be able to leverage GPUs. However, we could easily hand-code some of the routines using cupy. One big challenge is that the data live in the Google Cloud US-Central region. Would it be possible to get a Coiled GPU Dask cluster going there?
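For concreteness, here is a minimal sketch of the core kernel, assuming plain 2D velocity arrays on a uniform grid; the real LLC4320 calculation would go through xgcm with proper C-grid metrics, and swapping `numpy` for `cupy` is what would let the same kernel run on GPUs:

```python
# Illustrative sketch only: u, v are 2D surface velocity arrays and dx, dy are
# grid spacings (placeholders, not the staggered LLC4320 grid machinery).
import numpy as np
# import cupy as np  # the same kernel could run on GPUs by swapping in cupy

def vort_strain_div(u, v, dx, dy):
    """Relative vorticity, strain magnitude, and divergence from surface velocities."""
    du_dx = np.gradient(u, axis=-1) / dx
    du_dy = np.gradient(u, axis=-2) / dy
    dv_dx = np.gradient(v, axis=-1) / dx
    dv_dy = np.gradient(v, axis=-2) / dy

    vorticity = dv_dx - du_dy        # zeta = dv/dx - du/dy
    divergence = du_dx + dv_dy       # delta = du/dx + dv/dy
    # total strain rate: sqrt(normal strain**2 + shear strain**2)
    strain = np.sqrt((du_dx - dv_dy) ** 2 + (dv_dx + du_dy) ** 2)
    return vorticity, strain, divergence
```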
In addition to workshops there are also talks and tutorials at the Dask summit. Apologies, I probably should have posted something earlier here about the summit. Fortunately, I submitted an aspirational talk about processing a petabyte of data with Dask in the cloud. The hero workload sounds like it would be a good fit for that talk : )
We're in the process of rolling out GCP support in US-Central at the moment -- I can let you know when things are up and running. Also cc'ing @gjoseph92 for visibility
Following up on the conversation at today's meeting. The idea is to do the following
We could also consider an interactive version of this, where a user can draw a bounding box on a map and then have the histogram generated on the fly! 😱 🤩 I propose that anyone interested (cc @ocean-transport/members + @dhruvbalwada + @jrbourbeau) have a meeting dedicated to this topic. I am available this Friday 12-1pm and 2:45-3:30pm (EST). If those times don't work, we can look at next week.
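As a rough sketch of what the histogram step could look like (not an actual implementation: the dim names, bins, and bounding box below are placeholders, and it assumes `vorticity` and `strain` have already been computed as dask-backed xarray DataArrays, e.g. normalized by f), using xhistogram:

```python
# Sketch only: `vorticity` and `strain` are assumed to exist as dask-backed
# xarray DataArrays; coordinate names and bin choices are placeholders.
import numpy as np
from xhistogram.xarray import histogram

vort_bins = np.linspace(-5, 5, 101)     # vorticity / f
strain_bins = np.linspace(0, 5, 101)    # strain / f

# The interactive version would just update this box from whatever the user
# draws on the map, then recompute the (lazy) histogram on the cluster.
region = dict(lat=slice(-60, -40), lon=slice(0, 30))

joint = histogram(
    vorticity.sel(**region),
    strain.sel(**region),
    bins=[vort_bins, strain_bins],
    dim=["time", "lat", "lon"],
)
joint.compute()   # this is where the Dask cluster does the heavy lifting
```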
I can go both those times this week, with a slight preference for the second slot.
I am very interested and can do both slots.
I combed through the xgcm issue tracker and I think it might be helpful to fast-track this issue. My thinking was that there might be opportunities to optimize these higher-level operators without calling the lower-level ones (diff/interp) in a chained manner. But for that to work it might be helpful to have this functionality implemented and properly tested, so that we have a "ground truth" before messing with the internals. Just a thought; we can chat about it during the meeting.
I would like to attend and both time slots work for me as well. I will read Dhruv's paper before then.
I am also interested, free both times, but need to read Dhruv's paper to understand!
Count me in too -- both time slots this Friday work for me
Ok, invite sent for 2:45. @dhruvbalwada - perhaps you could walk us through the code you used in your paper? No need to prepare anything fancy though...
Seems like there might be an overlap with what Qiyu is already doing? At least maybe worth inviting him...
Good idea, he just started looking at the vorticity-strain histograms in LLC4320 to compare against idealized simulations with seasonality. @qyxiao are you available tomorrow at 2:45pm eastern time?
Hi Dhruv,
Yes I am, what is this about?
I have shared the meeting invite with Qiyu, so he can join in as well.
Thanks so much @cspencerjones for making this connection. It's actually perfect because we are looking for a scientific lead for this project. It sounds like @qyxiao is a natural fit for this. We just have to make sure all our goals are aligned.
So in #12, I provided a simple example of how xgcm's operations can be optimized. (This was my TODO item from our last meeting.) I'm trying to figure out the next step: how to actually get such optimizations to work via xgcm. I see a few different paths we could take:
I'd be curious to hear the Coiled folks' views on the feasibility of 3, using my example notebook as a reference problem.
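To make the "fused kernel" idea concrete, here is a hedged sketch of what such an optimization could look like (my guess at the flavor, not the code from the example notebook): a single numpy function computes the whole vorticity stencil in one pass, and `xarray.apply_ufunc` maps it over the dask chunks instead of chaining separate diff/interp steps.

```python
# Hypothetical sketch: ds.u / ds.v and the "j"/"i" dim names are placeholders.
# Assumes chunking only along time/face, so each (j, i) plane is one chunk;
# horizontally chunked data would need an overlap-aware approach instead
# (e.g. dask's map_overlap).
import numpy as np
import xarray as xr

dx = dy = 1.0  # placeholder uniform grid spacings

def fused_vorticity(u, v, dx, dy):
    # one pass over the arrays: dv/dx - du/dy, with no intermediate dask layers
    return np.gradient(v, axis=-1) / dx - np.gradient(u, axis=-2) / dy

zeta = xr.apply_ufunc(
    fused_vorticity,
    ds.u, ds.v,                         # assumed dask-backed DataArrays
    kwargs=dict(dx=dx, dy=dy),
    input_core_dims=[["j", "i"], ["j", "i"]],
    output_core_dims=[["j", "i"]],
    dask="parallelized",
    output_dtypes=[ds.u.dtype],
)
```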
FYI, I also just shared an example of how to do an "interp with land" operation using the kernel + apply_ufunc approach here: xgcm/xgcm#324 |
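Not the actual code from xgcm/xgcm#324, just a guess at the general shape of that approach: a numpy kernel does the neighbor average while ignoring NaN (land) points, and `xarray.apply_ufunc` wraps it so it runs lazily over dask chunks.

```python
# Hypothetical sketch of an "interp with land" kernel; the "i" dim name and the
# repeat-last-value boundary choice are placeholders, not the xgcm implementation.
import numpy as np
import xarray as xr

def interp_ignoring_land(u):
    # average neighboring points along the last axis, treating NaN as land
    left, right = u[..., :-1], u[..., 1:]
    total = np.nan_to_num(left) + np.nan_to_num(right)
    count = (~np.isnan(left)).astype(u.dtype) + (~np.isnan(right)).astype(u.dtype)
    out = np.where(count > 0, total / np.maximum(count, 1), np.nan)
    # repeat the last value so the output length matches the input
    return np.concatenate([out, u[..., -1:]], axis=-1)

u_center = xr.apply_ufunc(
    interp_ignoring_land,
    ds.u,                               # assumed dask-backed DataArray
    input_core_dims=[["i"]],
    output_core_dims=[["i"]],
    dask="parallelized",
    output_dtypes=[ds.u.dtype],
)
```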
Yesterday in the group meeting it was mentioned that we should come up with a so-called "hero workload" that both
I'm opening this issue as a place for us to brainstorm about what such a workload might look like. @rabernat @jbusecke do you have any initial thoughts about huge datasets and/or computation-heavy workloads the geoscience community would be excited to see us work on?