-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop support for cythonized scheduler #5685
Comments
Could someone please explain what is meant by "stalled"? AIUI this is done. We could of course always find more things to do. Though if there are other things we would like this group to do, maybe we can spend some time discussing those. Are there any metrics for build times? In particular CI runtime with/without? Given we are only compiling one file, wouldn't expect this to take very long (and that hadn't been my experience either). Yes MyPy and Cython don't work well today. This was one of the concerns I raised when MyPy was being explored. Personally would still suggest not using this on a Cythonized file. Cython developers continue to work on typing that is more natural to Python though this likely will take time to do. Can someone please share some examples of what is deemed unreadable along with an explanation of why that is found to be the case? Same question with compilation issues. In terms of building a package this could be done at this point. We held off initially as we didn't have CI coverage, which we since added. We did some benchmarking here. There have been other attempts to explore this. The main challenge here is a proper benchmark likely involves a multinode cluster, which is a non-trivial ask for a regularly run benchmark. Curious what others think about benchmarking. This goes beyond Cython as we really need a better idea of runtime and the impact of changes (though this will add to CI runtime). |
We haven't done anything on this front in months. I doubt it is generating value for many people but I might be wrong, of course. It's probably fair to say that an initial increment in this effort is done but what comes next? The discussions around scheduler performance shifted and we haven't moved forward on this in a long time. If we indeed want to support it we should consider next steps. If not, we should consider dropping it in favour of ease of development and reduced maintenance.
We don't have any proper statistics for this. The only thing I can say is that installation + build takes about 1-2min looking at a few recent builds. That's about 3-10% of runtime (given that a job runs for about 20-30min). That fluctuates, of course. I can't say if it impacts test runtime itself. I wouldn't want this argument to get derailed due to this, that's just one more small cut.
I think this is a bit hard to produce. At this point it is actually very hard to distinguish what kind of annotations are in there for mypy reasons and which are there because cython requires it. I personally cannot make a strong case for this but it was raised as a concern.
Also hard to pin down exactly. Searching the PRs we can at least filter out the ones that mention cython and quite a few mention some subtle problems impacting the code. https://github.com/dask/distributed/pulls?q=is%3Apr+cython
I think this is coupled strongly to the outcome of this issue. If we continue to keep this code and want to support this moving forward, we have to publish this to make it accessible to users. If we don't the value of keeping it is too small imo.
I personally think that these changes have to be monitored automatically. Many of the cythonization changes can break easily, for instance if cython falls back to python code due to typing issues. We don't have proper visibility into this and obviously not the bandwidth to check this manually for every change. |
There are various parts of the codebase that don't see attention for a period of time. This doesn't say much about their usefulness. Serializing NumPy arrays as an arbitrary example (the last change occurred there almost 1yr ago). Despite the lack of attention many people happily use this (whether they know it or not). Idk what can be determined about a piece of code's value using this metric. That said, taking a step back, I think it is worth looking at how the optimization work has progressed. Initially we thought the way to improve things was to use HLGs to build up more complex task graphs with fewer nodes, ship these to the Scheduler, and then unpack and execute them on Workers. Though this would move more of the burden to the Scheduler, which was already having some issues. Hence we spent time optimizing the Scheduler. After a fair amount of optimization, the remaining pain points were communication and serialization. On the communication front there have been a number of improvements. The most recent of which was adding support with asyncio with uvloop. More gains could likely be gotten by patching CPython. Serialization has seen a number of improvements, most notably removing extra unneeded passes over the data ( #4531 ). There are other improvements that could be made like ( #4699 ) or ( #4923 ), but perhaps this gets handle along with other serialization work ( #5581 ). To summarize, we started down an optimization path, which led us to some different issues. Now that we have done some work on these other issues, it seems worthwhile to revisit the big picture and see where things stand.
Again I'm not sure this is how we should be evaluating code's utility, but ignoring that. A next step would be separating out Beyond that some of the objects in the Scheduler are opaque. This affects Python and Cython code in terms of general maintainability and performance. It also shows up in communication and serialization. Namely we use Outside of this am hoping the feedback you and others have provided here could be translated into a series of scoped issues that we could then spend time working on. Admittedly a lot of this is cleanup work that requires a modest effort and will likely yield some performance improvements along the way. There wasn't much appetite for this clean up work before as things were largely driven by improving performance. Now that there is more appetite for cleanup work, it seems reasonable to spend some time on this.
Not to derail the argument, but I do think we should have concrete information here if we are going to make decisions based on it. It seems a first step here would be measuring the runtime of different steps in CI and identifying the significant time consuming steps. Ideally this would be scoped into an issue someone could work on to address this concern.
Again it would help to have concrete examples to discuss. If there are particular issues, someone can work on addressing those. Perhaps whoever voiced this concern can share some examples?
Sorry to be a broken record here, but again concrete examples would help. That said, went through the search results. Here are the first 3 PRs that come up. This goes back to October 2021. So a few months. Numbered by most recent PR to oldest.
Before that are the MyPy PRs, which we have already discussed above so will skip that. Nothing is really sticking out to me, but feel free to point out anything I've missed.
Agree that publishing packages seems like a reasonable ask. This is being tracked in issue ( #4442 ). Maybe we can discuss there more what we would like to see.
Honestly this is a bit separate from Cythonization itself. Dask really needs regularly run benchmarks regardless. There has been work in When developing the Cythonized Scheduler, we used these benchmarks. RAPIDS also has GPU-BDB. There are two challenges:
1 ends up being harder than a project like say NumPy or Pandas as Dask does a lot under-the-hood when performing any kind of computation. Building a graph, optimizing the graph, shipping that to the Scheduler, distributing that workload to Workers, checking on the Workers status and progress, managing memory, etc. Then there are questions about the size of computation, chunks, and cluster it runs on. Trying to get the right benchmark that measures the right thing is tricky. 2 comes with its own set of challenges. First where does this run? Perhaps someone can run on a desktop they have lying around. This tends towards a bus factor of 1 though. It could be run in the cloud, but then this becomes a question of who pays for this. Then comes the question of what the machine needs in order to run in terms of hardware, OS, drivers, etc. Multinode comes up a lot for workloads that scale and seems desirable for benchmarking, but again difficult to get access to (and possibly expensive!). Finally there is a question of frequency. While it would be nice to run on every PR, that is likely impractical. Similarly every commit may prove tricky depending on what our expectations are. Running daily may be more doable. Anyways if we want to discuss this further, which I'm supportive of, would suggest raising a new issue as it seems like a whole topic unto its own.
Taking this out of order, but I hope it makes sense why. I think we need a better handle on what the actual issues are. It seems they are captured in a general way, but it would help to find more concrete examples. ATM it doesn't appear we have them. If they are coming up regularly though, it shouldn't be too difficult to share examples here. Once we have those examples, it may be easier to make some decisions based on them. Certainly one decision might be to drop. Another may simply be to fix the issues. |
Bringing this back up. We have a group meeting next week. In that meeting I plan to advocate for reverting the Cythonization work, and hopefully (if people feel up for it) continuing that work in a separate branch. In its current state Cythonization doesn't provide much value to Dask users, but does significantly complicate the codebase, which makes it harder to debug and work on. It has been in this state for a long while, and we don't seem to have a plan to get it out of this state without causing even further change to main (which I don't think is appropriate). With that in mind, I propose that we ...
I would still love to see experimentation happen in this space. I think that we would probably just want that experimentation to happen in either a separate branch, or in a way that made main more sensible and with less churn. I'd love to engage in async conversation here a bit, see if we can nail down the parameters of this decision, and then make this decision next week. |
+1 on this -- I have a test suite failing on Python 3.8 because https://github.com/dask/distributed/runs/5812749893?check_suite_focus=true |
In todays developer meeting we reached the decision to move forward with this and drop cython. Briefly summarized, our current prioritization does not allow us to focus on developing this further and the maintenance introduces overhead. While there are benefits for certain workloads to run on the cythonized code, we believe that these benefits do not outweigh the cost for maintenance. We may pick up Cython at a later point in time again if we believe it will benefit the project. |
@fjetter , @jakirkham , any appetite to take this on? I can so it, if not. |
Anything left for uncythonizing or can we close this ? |
Maybe small remaining comments and annotations, but certainly nothing blocking. |
In a call last week we discussed dropping the support for cythonization of the scheduler code.
The cythonization project was an attempt to improve the scheduler performance. While it indeed delivered improved performance it was not as significant as originally hoped.
The cythonization annotations introduce a few subtle problems to maintenance and development
Considering that the project is currently stalled, the costs are ongoing and the value for users is very limited, I suggest dropping support for cythonization.
We may want to pick up cythonization at a later point in time for specific applications again, for instance as specialized c extensions.
@quasiben, @jakirkham, @jrbourbeau, @jcrist
xref Drop PyPy #5681
The text was updated successfully, but these errors were encountered: