there is possible race condition with job start callback #93

Open
volodymyrss opened this issue Mar 4, 2024 · 12 comments
@volodymyrss
Member

volodymyrss commented Mar 4, 2024

This can happen because the callback is sent very quickly after the query is sent to the backend.

reported by @dsavchenko

@dsavchenko
Member

This callback is added in nb2workflow by oda-hub/nb2workflow#135

@dsavchenko
Member

It was also mentioned here: oda-hub/dispatcher-app#665 (comment)

@burnout87
Contributor

burnout87 commented Mar 4, 2024

I think I see why the race condition happens:

  • the first time run_analysis is called (for instance, when the frontend sends the request for the first time), the call_back is also called, and this happens before run_analysis completes
  • so the call_back finishes by writing the progress status in the scratch_dir, but that status might then be overwritten with submitted when run_analysis completes, so the progress status is never returned
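The interleaving described in the two points above can be sketched deterministically. This is only an illustration: `write_status`, `read_status`, the `job_status.json` file name, and the two-step shape of `run_analysis_submit` are hypothetical stand-ins, not the dispatcher's actual code.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical helpers: the real dispatcher functions and file layout differ.
def write_status(scratch_dir: Path, status: str) -> None:
    (scratch_dir / "job_status.json").write_text(json.dumps({"status": status}))

def read_status(scratch_dir: Path) -> str:
    return json.loads((scratch_dir / "job_status.json").read_text())["status"]

def run_analysis_submit(scratch_dir: Path):
    # Step 1: the query is sent to the backend, which may call back right away.
    yield "query_sent"
    # Step 2: only now does run_analysis record its own status.
    write_status(scratch_dir, "submitted")
    yield "finished"

def run_call_back(scratch_dir: Path) -> None:
    # The callback records that the job has started progressing.
    write_status(scratch_dir, "progress")

with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp)
    analysis = run_analysis_submit(scratch)
    next(analysis)           # the query is sent
    run_call_back(scratch)   # the early callback writes "progress"
    next(analysis)           # run_analysis completes and writes "submitted"
    final_status = read_status(scratch)

print(final_status)
```

With this ordering the last writer wins, so the stored status ends up as "submitted" even though the job has already reported progress.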

I also tested this behavior locally, and I could see that the call-back is not started before the completion of the first run_analysis, but this is expected

This is my guess

@dsavchenko
Member

What can we do about it? I'm thinking of a lock file (it won't help by itself, but is still good to have) and also the condition in the job manager that it can't "lower" the status

@burnout87
Contributor

How would you implement the lock? I guess it'd be on the file.

So, to make sure that, in relation to what is described here, run_analysis completes first and only then run_call_back executes? Then we'd have a consistent sequence of states?

also the condition in the job manager that it can't "lower" the status

And about this, when do you see this needed? In the case the first run_call_back called from the nb2service happens to finish before the first call to run_analysis? I will look into it anyway

@dsavchenko
Member

I just tried again to track down possible causes in the code (because I'm not 100% sure about it), but I got lost because I don't really understand the purpose and logic of "job aliasing". Could someone explain?
Could this aliasing also lead to a similar problem, e.g. the callback and run_query operating in different dirs?

And about this, when do you see this needed? In the case the first run_call_back called from the nb2service happens to finish before the first call to run_analysis?

Yes, this particular case. Probably, just one restriction that "progress" can't become "submitted" would be enough.
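A minimal sketch of such a restriction, assuming the statuses can be ranked. The `STATUS_RANK` table and the `next_status` helper are hypothetical; the real job manager may use other status names or more states.

```python
from typing import Optional

# Assumed ordering, for illustration only.
STATUS_RANK = {"submitted": 0, "progress": 1, "done": 2}

def next_status(current: Optional[str], requested: str) -> str:
    """Return the status to store, never moving to a lower-ranked one."""
    if current is None:
        # Nothing stored yet, so accept whatever is requested.
        return requested
    if STATUS_RANK[requested] < STATUS_RANK[current]:
        # A late writer tries to lower the status: keep the current one.
        return current
    return requested

print(next_status("progress", "submitted"))
print(next_status("submitted", "progress"))
```

Here a late "submitted" write leaves "progress" in place, while the normal "submitted" to "progress" transition still goes through.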

How would you implement the lock? I guess it'd be on the file.

I just thought of using a library, like https://py-filelock.readthedocs.io/en/latest/index.html
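As a rough sketch of the locking idea, the example below uses the standard-library `fcntl.flock` (POSIX only) as a stand-in; the `FileLock` context manager from the library linked above could take the place of the `status_lock` helper. A lock only serializes the writers and does not decide which write wins, so the sketch pairs it with a rank check. The file names, `status_lock`, `update_status`, and the `RANK` table are all assumptions made for illustration.

```python
import fcntl
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def status_lock(scratch_dir: Path):
    """Hold an exclusive advisory lock on a sidecar lock file (POSIX only)."""
    with open(scratch_dir / "job_status.lock", "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)

def update_status(scratch_dir: Path, requested: str, rank: dict) -> str:
    """Read, compare, and write the status while holding the lock."""
    path = scratch_dir / "job_status.json"
    with status_lock(scratch_dir):
        current = json.loads(path.read_text())["status"] if path.exists() else None
        if current is not None and rank[requested] < rank[current]:
            return current  # refuse to lower the stored status
        path.write_text(json.dumps({"status": requested}))
        return requested

RANK = {"submitted": 0, "progress": 1}  # assumed ordering
with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp)
    first = update_status(scratch, "progress", RANK)    # early callback
    second = update_status(scratch, "submitted", RANK)  # late run_analysis write

print(first, second)
```

Doing the read, the comparison, and the write inside one locked section keeps another writer from slipping in between the check and the update.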

@dsavchenko
Member

Another possible race condition: the progress report callback may overwrite the "done" status. It's not fully confirmed, but it's possible, and I suspect it may be the reason the frontend gets stuck in "progress" until a re-request with status "new". We observe this occasionally.

@volodymyrss
Member Author

Yeah, that's what I see. It looks very weird: there is a product modal, but the only product that can be viewed is "progress". Eventually the actual product also appears.

@burnout87
Contributor

OK, good to know, I will also observe it and test locally.

@burnout87
Contributor

Another possible race condition: the progress report callback may overwrite the "done" status. It's not fully confirmed, but it's possible, and I suspect it may be the reason the frontend gets stuck in "progress" until a re-request with status "new". We observe this occasionally.

oda-hub/dispatcher-app#670 is intended to fix this

@volodymyrss
Member Author

Let's close this until it is confirmed that it happens again.

@dsavchenko
Member

I will reopen; at least I see this on staging. Example: the PhotoZ instrument, Run_phosphoros_basic. It doesn't internally report progress from the notebook, and the job is "submitted" but not "progressing" up until the notebook is completed (or failed).
A consequent inconvenience is that the notebook preview isn't available until the result is there.
