Files in transformations will stay assigned for ever #7116

Open
chaen opened this issue Jul 24, 2023 · 8 comments · Fixed by #7741

@chaen
Contributor

chaen commented Jul 24, 2023

This is quite an edge case. It can happen whenever you have a transformation that creates multiple operations on files.

Suppose the following request is created

- req1: Waiting
   - op1: Waiting
     - LFN1: Waiting
     - LFN2: Waiting
   - op2: Waiting
     - LFN1: Waiting
     - LFN2: Waiting

If op1 fails for LFN1 but succeeds for LFN2, we will have the following:

- req1: Failed
   - op1: Failed
     - LFN1: Failed
     - LFN2: Done
   - op2: Waiting
     - LFN1: Waiting
     - LFN2: Waiting

If you then call getRequestFileStatus, you will get:

LFN1: Failed
LFN2: Waiting

And that's the best it can do really.
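
To make the example concrete, here is a minimal sketch of an aggregation rule that reproduces the result above. It is an assumption made for illustration, not the actual getRequestFileStatus code: the function name `aggregate_file_status` and the plain dictionaries standing in for operations are hypothetical.

```python
# Hypothetical model, NOT the real getRequestFileStatus implementation.
# Rule assumed here: walk the operations in order, keep the latest status
# seen for each LFN, but never overwrite a "Failed".
def aggregate_file_status(operations):
    result = {}
    for op_files in operations:
        for lfn, status in op_files.items():
            if result.get(lfn) != "Failed":
                result[lfn] = status
    return result


# The request from the example: op1 failed for LFN1, op2 never started.
op1 = {"LFN1": "Failed", "LFN2": "Done"}
op2 = {"LFN1": "Waiting", "LFN2": "Waiting"}

summary = aggregate_file_status([op1, op2])
print(summary)  # {'LFN1': 'Failed', 'LFN2': 'Waiting'}
```

With a single status per file, there is no way to express "Done in op1, never attempted in op2", which is why LFN2 ends up reported as Waiting.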

The problem then is when updating the FileTask status:

    for lfn, newStatus in statusDict["Value"].items():
        if newStatus == "Done":
            updateDict[lfn] = TransformationFilesStatus.PROCESSED
        elif newStatus == "Failed":
            updateDict[lfn] = TransformationFilesStatus.PROBLEMATIC

We do not take into account the case of LFN2.
It is very hard to know what to do. The Request is in a final state (it will never change anymore), but the file is not (it only went half way through the process it has to follow).
Setting it to Problematic is maybe an option...
I increasingly think that whenever we have a Transformation with multiple operations per request, we should have a flag saying whether the requests are reentrant or not (i.e. whether they can be re-run from the beginning at any point in time). If they are not, even resetting the task should be forbidden.

Opinion @andresailer ?

@chaen chaen self-assigned this Jul 24, 2023
@chaen
Contributor Author

chaen commented Jul 24, 2023

@sfayer @marianne013 as you are starting to use the TS, you may have an opinion too?

@fstagni
Contributor

fstagni commented Jul 24, 2023

I don't think that a file in (RMS) status Waiting should be set as TransformationFilesStatus.PROBLEMATIC: it just feels wrong.

I think we should instead check the status of the request first and, if it is Failed, set all files to TransformationFilesStatus.PROBLEMATIC.

@chaen
Contributor Author

chaen commented Jul 24, 2023

That's not correct either. Of course, we should take the request status into consideration.
But some files can be properly finished, and some not. So setting everything to problematic is not good.
The real question is what to do with the files that are only half way through.

@fstagni
Contributor

fstagni commented Jul 24, 2023

I don't see any other option but to set to TransformationFilesStatus.PROBLEMATIC those files that are in RMS status Waiting and whose request is in status Failed.
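
A minimal sketch of that rule, under stated assumptions: `build_update_dict` is a hypothetical helper, and the two string constants are local stand-ins for TransformationFilesStatus.PROCESSED and TransformationFilesStatus.PROBLEMATIC so that the snippet runs without DIRAC installed.

```python
# Local stand-ins (assumed values) for the TransformationFilesStatus constants.
PROCESSED = "Processed"
PROBLEMATIC = "Problematic"


def build_update_dict(request_status, file_statuses):
    """Map RMS file statuses to transformation file statuses.

    Hypothetical helper: files that are neither Done nor Failed are flagged
    as problematic only when the request itself is Failed, since the request
    will not make further progress on them.
    """
    update = {}
    for lfn, status in file_statuses.items():
        if status == "Done":
            update[lfn] = PROCESSED
        elif status == "Failed":
            update[lfn] = PROBLEMATIC
        elif request_status == "Failed":
            # File stuck part-way through a request that is in a final state.
            update[lfn] = PROBLEMATIC
    return update


stuck = build_update_dict("Failed", {"LFN1": "Failed", "LFN2": "Waiting"})
print(stuck)  # {'LFN1': 'Problematic', 'LFN2': 'Problematic'}

# While the request is still Waiting, a Waiting file is left untouched.
pending = build_update_dict("Waiting", {"LFN1": "Done", "LFN2": "Waiting"})
print(pending)  # {'LFN1': 'Processed'}
```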

@chaen
Contributor Author

chaen commented Jul 25, 2023

I think this is the correct thing to do in general.
The problem is that the requests might not be reentrant. In that case, we need to fix the request itself, not just reset the task. That's why I am thinking of adding this safety flag.
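
As a rough illustration of what such a safety flag could look like, here is a sketch in which resetting a task is refused unless its request is declared reentrant. Everything in it is hypothetical: `TaskInfo`, its `reentrant` field, `reset_task` and `NonReentrantRequestError` do not exist in DIRAC and only model the idea.

```python
from dataclasses import dataclass


class NonReentrantRequestError(Exception):
    """Raised when a reset is attempted on a request that cannot be re-run."""


@dataclass
class TaskInfo:
    # Hypothetical container; the real task and request objects differ.
    task_id: int
    request_status: str
    reentrant: bool  # assumed flag: can the request be re-run from the start?


def reset_task(task):
    """Put the request back to Waiting, but only if it is safe to re-run it."""
    if not task.reentrant:
        raise NonReentrantRequestError(
            f"task {task.task_id}: request is not reentrant, fix it by hand"
        )
    task.request_status = "Waiting"
    return task


safe = reset_task(TaskInfo(task_id=1, request_status="Failed", reentrant=True))
print(safe.request_status)  # Waiting

try:
    reset_task(TaskInfo(task_id=2, request_status="Failed", reentrant=False))
except NonReentrantRequestError as err:
    print(f"refused: {err}")
```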

@chaen
Contributor Author

chaen commented Nov 2, 2023

@andresailer another ping

@andresailer
Contributor

Sorry, I don't really have an opinion here. Anything that makes things more fault tolerant sounds good!

@chaen
Contributor Author

chaen commented Jun 25, 2024

Adding an extra case here. Hopefully this issue will (soon) be consolidated into unit tests, and possibly a fix.

- req1: Waiting
   - op1: Waiting
     - LFN1: Done
     - LFN2: Waiting
   - op2: Done
     - LFN1: Done
     - LFN2: Done

Both LFN1 and LFN2 will be considered Done, while in fact only LFN1 is.
This is the sort of thing that happens constantly with replicateAndRegister. It went unnoticed until now because the registration always worked, but it is absolutely possible that a file was marked as processed while it was not yet registered.
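
For the unit tests, one possible stricter check is to treat a file as Done only when it is Done in every operation that lists it. The sketch below is an assumption about what such a check could look like, not an existing function; `done_in_all_operations` and the dictionary layout are hypothetical.

```python
def done_in_all_operations(operations):
    """Return, per LFN, whether it is Done in every operation that lists it.

    Hypothetical helper: `operations` is a list of {lfn: status} dicts,
    one per operation of the request.
    """
    lfns = {lfn for op_files in operations for lfn in op_files}
    return {
        lfn: all(
            op_files[lfn] == "Done" for op_files in operations if lfn in op_files
        )
        for lfn in lfns
    }


# The extra case above: op2 finished before op1 completed for LFN2.
op1 = {"LFN1": "Done", "LFN2": "Waiting"}
op2 = {"LFN1": "Done", "LFN2": "Done"}

really_done = done_in_all_operations([op1, op2])
print(really_done["LFN1"], really_done["LFN2"])  # True False
```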
