Refinement errors (Stale file handle) #68
Hi all, I am trying to track the source of this error. Some, but not all, parts of the split job give this error:
I am getting this error in many different projects and on two different HPC platforms. It is strange because it happens in some parts of the split job, but not all.
One possible explanation is that jobs are somehow erasing each other's scratch folders, although that is not supposed to happen. What is the path to the scratch folder printed at the beginning of each job?
This is printed at the beginning of one failed job
An input file for this failed split job is read from a scratch directory as follows:
Another failed split job of the same run is reading its inputs from a different scratch directory:
I noticed that there are lines like this printed
Edit: If I empty the scratch dir of all previous jobs (some of which had failed), the new jobs seem to run OK now. So perhaps the error is related to these "zombie runs"?
Yes, that's possible. Any run that hasn't produced any output for an hour is considered a zombie, and the local scratch for that run is cleared out by new jobs. We haven't seen any problems with this so far, but we can make the timeout a user-adjustable parameter in case a longer one is needed in some setups. This change will be included in the next release.
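For anyone trying to reason about this behaviour, here is a minimal Python sketch of the kind of cleanup described above. It is not the project's actual code: the function names (`newest_mtime`, `find_zombie_runs`, `clear_zombie_runs`), the one-subdirectory-per-run layout, and the use of file modification times as the "last output" signal are all assumptions made for illustration. Only the one-hour default comes from this thread.

```python
import os
import shutil
import time
from pathlib import Path


def newest_mtime(run_dir):
    """Return the most recent modification time found anywhere under run_dir."""
    latest = os.stat(run_dir).st_mtime
    for root, dirs, files in os.walk(run_dir):
        for name in dirs + files:
            try:
                latest = max(latest, os.stat(os.path.join(root, name)).st_mtime)
            except OSError:
                # Entry vanished or became unreadable while scanning; skip it.
                continue
    return latest


def find_zombie_runs(scratch_root, timeout_seconds=3600.0, now=None):
    """List run directories with no modification in the last timeout_seconds.

    Hypothetical layout: each run owns one subdirectory of scratch_root.
    """
    now = time.time() if now is None else now
    zombies = []
    for entry in sorted(Path(scratch_root).iterdir()):
        if entry.is_dir() and now - newest_mtime(entry) > timeout_seconds:
            zombies.append(entry)
    return zombies


def clear_zombie_runs(scratch_root, timeout_seconds=3600.0, now=None):
    """Delete the scratch of every run judged to be a zombie and return the list."""
    removed = find_zombie_runs(scratch_root, timeout_seconds, now)
    for run_dir in removed:
        shutil.rmtree(run_dir, ignore_errors=True)
    return removed
```

Under these assumptions, a run that is still alive but writes nothing to its scratch for longer than the timeout would be indistinguishable from a dead one. Its directory would be removed underneath it, which could plausibly surface as a stale file handle on some filesystems. Passing a larger `timeout_seconds` corresponds to the user-adjustable timeout mentioned above.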