-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Schedulers other than depth-first unlikely to set TaskRun created time correctly #35
Comments
Hi, Thanks a lot for the feedback. Actually, you have been really thoughtful about it. Regarding the random scheduler: we don't support it in the current PYBOSSA setup (since October 2017). It is included as a demo plugin, so people can build their own PYBOSSA schedulers. However, you are right about the issue with the random scheduler. Since almost 8 months ago, we're not using a lot PYBOSSA.JS as we have now more flexibility in the API to build schedulers on the frontend. Basically, the api now allows you to requests all the tasks for the user from /api/task with a flag, to mimick as well the newtask enpoint. Basically, you can request all the tasks for a given project for a given user with this flag ?participated=1 In this way, you can order the tasks by date, by attribute, get 100 tasks at a time, use last_id for keyset pagination, etc. You have many more options, but you are right, while this new flag gives us more flexibility we still have to validate the tasks before submitting one. Thus, how do we do it? Exactly as you said, calling the API endpoint: /<short_name>/task/. Thus, if you want to implement a random scheduler int he frontend you can do the following:
Point 3 will save you to request tasks all the time ;-) However, if you have too many users you can end up with tasks that have been completed by the crowd, so you should add some logic behind the scenes to remove those tasks that have been completed from the localStorage. Thus, I think we don't need to implement anything new. You can pass the header application/json and you will get the response in JSON format to know how it went the request. We can however, add a new flag to this point to allow just the refresh, and return only the confirmation, but do we need this? I'm not sure about it. What do you think? PS: This is the way we're building our projects now, using this method. |
Hi Daniel, Thanks for the details. I didn't realize the random scheduler was removed, but I believe this issue could arise with breadth first as well, as a race condition while other tasks are being completed. To restate my concern more succinctly, there is a code path that results in incorrect
If However, the simple fix would be to add a function that calls the project presenter endpoint at the time the user starts working on the task. The purpose is solely to update the
and call it here:
This will almost never run if depth first is the scheduler, but if it does run, it will set the time stamp correctly. The code flow in Pybossa.js is very complex, but semantically, the goal is this:
If the |
Overall, my suggestion is that when people are caching tasks on the front-end, there should be clear documented recommendation about the best way to tell the back-end to set the |
To clarify for others reading this who are not familiar with the Pybossa schema, task_run.finish_time is set when the taskrun is saved.
|
Totally agree. Can you send a PR with your code? We will merge it. Loving your solution for PYBOSSA.JS. |
Regarding task_run.finish_time. We decided to take the measurements always in the server, as it is a bit more "precise" When we started developing PYBOSSA timers on the browser were not really good, now we can use https://developer.mozilla.org/en-US/docs/Web/API/Performance/now#Syntax Thus, it could be something you can control as well on the task_run.info to measure it. What do you think? |
Hmm. So you are suggesting that the server API allow posting TaskRun.created and TaskRun.finish_time as measured by the browser? Basically by removing the two fields from My first instinct is that it makes it easier to fake the timing, but I guess it's not that hard to fake out the server side timers anyway. It would have the advantage that the Redis database would only be used for caching tasks and queues, and it would be less of a problem to reset the Redis db. (Right now I lose the Redis db when I deploy to new machines in the cluster, so I lose the timing keys discussed above, which caused me a lot of confusion before I figured it out. With the cluster solution I am using, it would be some work for me to put the Redis DB on persistent storage.) My initial impression is it would be good to move responsibility for timing to the front end, especially as it would be simpler for front-end task caches to be responsible to track the start time. Also, we wouldn't have to worry about keys expiring, people could leave a task open in a browser all week if they wanted, and come back to it and keep going, without getting a timer key expired error. Seems like a good idea. |
The random scheduler won't set TaskRun
created
time correctly, breadth firstcreated
time will have race conditions with other users active, the default depth first scheduler is settingcreated
time correctly with a little luck.Before Pybossa.js presents a task, it pre-loads the next task. At the time of pre-load, the back-end sets a Redis key for the pre-loaded task with a time stamp, even though the user hasn't started the task yet.
You can easily verify this by going to your Pybossa back-end environment:
You will see that current task and next task are both set, as expected.
But they have almost identical time stamps, so we must examine when or if the time stamp is updated for the pre-loaded task, so we know it matches when the user sees the task, not when it is pre-loaded.
When you save the first task, if you are using a depth first scheduler (default), the call to _fetchNewTask here will get the task that is currently being presented, because, after all, the user has not completed it.
https://github.com/Scifabric/pybossa.js/blob/master/pybossa.js#L147
Luckily,
/project/<project_id>/newtask
callsguard.stamp(task, get_user_id_or_ip())
which updates the Redis key that will be used as the current task's timestamp when it is saved.Then there is an
if (previousTask && task.id === previousTask.id)
statement in the pre-load , that says, oops, same task, let's get the one after.So with a depth first scheduler, the current task just being presented will be loaded (a second time), and secondTry will always be run to get the actual next task.
This is good because it has the effect of setting the time correctly for the current task right when it is presented. But it relies on the scheduler returning the same task again - a RANDOM scheduler will not luckily refresh the
created
time of the current task.BREADTH FIRST may return a different task than the first time if in the meantime, another user has completed tasks, shuffling what is next in line.
In either case, we would expect the TaskRun
created
time stamp to incorrectly be saved with the time as of the pre-load, thus incorrectly including all the time of the prior task with the task now being saved.Possible Solution
Some API call has to be made at the time the task is presented to update the Redis key with the created time stamp. This is the only other routine on the backend that also sets the time stamp, so it could be called just to refresh the timestamp, and discard the result.
https://github.com/Scifabric/pybossa/blob/master/pybossa/view/projects.py#L852
Or a new lightweight API could be made that returned nothing, just refreshes the time stamp. It would be a bad idea to make
/api/task/<task_id>
set the time stamp, as it would break existing expectations and not be backward compatible with older Pybossa.js. For example it is called inpybossa.saveTask
and you wouldn't want to set thecreated
stamp at the moment you are saving the taskrun.The text was updated successfully, but these errors were encountered: