-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added property long_eta #59
base: master
Are you sure you want to change the base?
Conversation
For tasks that run hours not seconds, a total average eta generally becomes more stable and accurate as the task progresses.
Add return
IMO, the bug in this project is that the moving average window is too short (sometimes only several ms) -- the higher the next() call frequency is, the worse results the progress bar shows. Correct moving average calculation should have some minimal time window period to give "stable" results (see e.g. https://github.com/python-progress/python-progress where the window is 2 seconds by default, and decide whether you need the That said, "long_eta" calculated across the whole period (hours) would be rarely useful; the upload/download process is dynamic and the speed changes dramatically -- giving the user incorrect estimation. The useful info is "what has been the average speed in last few seconds". |
@praiskup Model the progress as a scatter plot, sampling the "remaining" at intervals t. Actually this particular method doesn't even require that the intervals be regularly spaced. What you wind up with is a trapezoid that probably overestimates most of the time, but is acceptable for a rough time of a long running job. Right now I have a big data job, and the estimate is using this logic. It looks like this. That being said there are some better schemes that can be made using the long estimate, which will converge to a result faster. I actually made some better logic that uses a weighted shift to prevent spikes from moving the estimate too much. But the logic for that was more involved, so I didn't put it in this particular PR. For example, I'm using a 60/40 weighted split:
Note that this simple linear extrapolation is a time/memory tradeoff. Not too much time and memory should be spent calculating progress, so even though calculating a linear regression might be the way to go, the manner in which long_eta works is acceptable for long tasks and scales O(1) unlike a regression that scales O(n) in time and space. Thus this is a reasonable compromise. If this PR is accepted, I might try making a |
Ah, I thought that you want fix the broken "moving average" speed calculation by taking "average" speed into account, but you have different (pretty corner) use-case in my opinion. Unless you process something which has "uniform" speed over the whole period, you never want to take absolute "average" into account, otherwise it gives you very misleading info. And if you do, you probably don't want to use "simple moving average" calculator/project. But I can also accept that it is matter of taste. I'm +1 for additional functionality - since I see that it is useful for you, but if progress should support both calculations now - I'd suggest you to make clear distinction between "moving" average and normal average in documentation and in property names, so it is absolutely clear what is what. Seeing "avg_long" in "simple moving average" project isn't descriptive enough. |
For tasks that run hours not seconds, a total average eta generally becomes more stable and accurate as the task progresses.
A small running average does the unstable "windows file transfer" sort of estimate where the estimated time is wildly unstable, whereas a long running average starts unstable and then converges to an ok approximation.