Refinement errors (Stale file handle) #68

JuhaHuiskonen · 2024-02-14T17:32:30Z

Hi all, I am trying to track down the source of this error. Some, but not all, parts of the split job fail with a stale file handle error on frealign/maps/tomo-coarse-refinement-eyBDuFHF9AKLv3oz_frames_CSP_01.parx:

Progress: 100%|##########| 11/11 [00:00<00:00, 17.84it/s]
2024-02-14 08:08:09 [INFO] pyp/system/mpi.py:111 | 11 command(s) finished
2024-02-14 08:08:13 [INFO] pyp/align/core.py:57 | Particle extraction (mode -2) took: 00h 00m 04s
2024-02-14 08:08:13 [INFO] pyp/align/core.py:1057 | Running CSPT (mode 2) using exposures 0 to 40
2024-02-14 08:08:13 [INFO] pyp/system/db_comm.py:35 | Parameters entered into database successfully
2024-02-14 08:08:13 [INFO] pyp/system/mpi.py:97 | Running 11 command(s)
2024-02-14 08:08:13 [INFO] pyp/system/mpi.py:99 | First command is: /opt/pyp/external/CSP/csp tomo-coarse-refinement-eyBDuFHF9AKLv3oz_frames_CSP_01.parx 5 0 1 1 frealign/GS4_TS9_ts_007.mrc GS4_TS9_ts_007.allboxes frealign/GS4_TS9_ts_007_stack.mrc > GS4_TS9_ts_007_csp_000000_000001.log

Progress:   0%|          | 0/11 [00:00<?, ?it/s]
Progress: 100%|##########| 11/11 [9:09:01<00:00, 519.28s/it]

Progress: 100%|##########| 11/11 [9:09:01<00:00, 2994.65s/it]
2024-02-14 17:17:14 [INFO] pyp/system/mpi.py:111 | 11 command(s) finished
2024-02-14 17:17:17 [INFO] pyp/align/core.py:57 | CSP Total time elapsed: 09h 09m 09s
2024-02-14 17:17:17 [INFO] /opt/pyp/bin/run/pyp:57 | Total time elapsed (csp_swarm): 09h 09m 28s
Traceback (most recent call last):
  File "/opt/pyp/bin/run/pyp", line 3914, in <module>
    csp_swarm(args.file, parameters, int(args.iter), args.skip, args.debug)
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/bin/run/pyp", line 2456, in csp_swarm
    align.csp_refinement(
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/src/pyp/align/core.py", line 1676, in csp_refinement
    new_par_file = csp_run_refinement(
  File "/opt/pyp/src/pyp/align/core.py", line 1218, in csp_run_refinement
    shutil.copy2( new_par_file, prev_par_file )
  File "/usr/local/envs/pyp/lib/python3.8/shutil.py", line 435, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/local/envs/pyp/lib/python3.8/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
OSError: [Errno 116] Stale file handle: 'frealign/maps/tomo-coarse-refinement-eyBDuFHF9AKLv3oz_frames_CSP_01.parx'
2024-02-14 17:17:17 [ERROR] /opt/pyp/bin/run/pyp:3927 | PYP (cspswarm) failed
2024-02-14 17:17:22 [INFO] /opt/pyp/bin/run/pyp:3273 | Job 3214652_34 (v0.6.2) launching on c6109.mahti.csc.fi using 16 task(s) and 220 GB of RAM
2024-02-14 17:17:22 [INFO] pyp/refine/csp/particle_cspt.py:57 | Total time elapsed (csp_local_merge): 00h 00m 00s
Traceback (most recent call last):
  File "/opt/pyp/bin/run/pyp", line 3992, in <module>
    particle_cspt.merge_movie_files_in_job_arr(
  File "/opt/pyp/src/pyp/utils/timer.py", line 78, in wrapper_timer
    return func(*args, **kwargs)
  File "/opt/pyp/src/pyp/refine/csp/particle_cspt.py", line 380, in merge_movie_files_in_job_arr
    with open(movie_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'stacks.txt'
2024-02-14 17:17:22 [ERROR] /opt/pyp/bin/run/pyp:4012 | PYP (csp_local_merge) failed
2024-02-14 17:17:22 [INFO] /opt/pyp/bin/run/pyp:4018 | Deleted temporary files from /scratch/project_2009485/nextPYP/tmp/huiskone/3214652_34
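
For readers less familiar with the two error codes in this log: on Linux, errno 116 is ESTALE, which typically appears on shared or network filesystems when a file or directory that a process still refers to has been removed or replaced by another client, while errno 2 (ENOENT) simply means the path does not exist. A minimal sketch of how a copy step could report the difference more explicitly follows. The helper names `explain_os_error` and `copy_with_diagnosis` are hypothetical and are not part of nextPYP; only `shutil.copy2` and the `errno` constants are real.

```python
import errno
import shutil


def explain_os_error(exc: OSError) -> str:
    """Turn an OSError into a short hint (hypothetical helper, not nextPYP code)."""
    if exc.errno == errno.ESTALE:
        return (
            f"{exc.filename}: stale file handle; the file or its directory was "
            "probably removed or replaced by another process on the shared filesystem"
        )
    if exc.errno == errno.ENOENT:
        return f"{exc.filename}: path does not exist (never written, or already deleted)"
    return f"{exc.filename}: {exc.strerror}"


def copy_with_diagnosis(src, dst):
    """Copy src to dst, re-raising filesystem errors with a clearer message."""
    try:
        shutil.copy2(src, dst)
    except OSError as exc:
        raise RuntimeError(explain_os_error(exc)) from exc
```

Retrying would not help if the scratch directory has really been deleted, so this sketch only improves the error message; it does not try to recover.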

JuhaHuiskonen · 2024-02-27T19:19:42Z · Author

I am getting this error in many different projects and on two different HPC platforms. It is strange because it happens in some parts of the split job, but not all.

abartesaghi · 2024-02-28T03:26:57Z · Maintainer

One possible explanation is that jobs are somehow erasing each other's scratch folders, although that is not supposed to happen. What is the path to the scratch folder printed at the beginning of each job?

JuhaHuiskonen · 2024-02-28T06:22:17Z · Author

This is printed at the beginning of one failed job:

2024-02-27 09:57:15 [INFO] /opt/pyp/bin/run/pyp:3273 | Job 3261043_6 (v0.6.2) launching on c2374.mahti.csc.fi using 8 task(s) and 220 GB of RAM
2024-02-27 09:57:15 [WARNING] /opt/pyp/bin/run/pyp:3238 | Detected zombie run at 3260403_29, clearing up files
2024-02-27 09:57:15 [WARNING] /opt/pyp/bin/run/pyp:3238 | Detected zombie run at 3260458, clearing up files
2024-02-27 09:57:16 [INFO] /opt/pyp/bin/run/pyp:3200 | Filesystem                                  Size  Used Avail Use% Mounted on
2024-02-27 09:57:16 [INFO] /opt/pyp/bin/run/pyp:3200 | 10.141.0.14@o2ib:10.141.0.13@o2ib:/scratch   40T  2.0T   39T   5% /scratch/project_2009485/nextPYP/tmp

An input file for this failed split job is read from a scratch directory as follows:

2024-02-27 11:57:08 [INFO] pyp/align/core.py:1623 | Running refinement for class 1 of 1
2024-02-27 11:57:08 [INFO] pyp/system/local_run.py:65 | 
/opt/pyp/external/frealign_v9.11/bin/apply_mask.exe << eot
M
/scratch/project_2009485/nextPYP/tmp/huiskone/3261043_6/GS4_TS8_ts_003/frealign/scratch/GS4_TS8_ts_003_r01_01.mrc

Another failed split job of the same run is reading its inputs from a different scratch directory:

2024-02-27 11:22:00 [INFO] pyp/align/core.py:1623 | Running refinement for class 1 of 1
2024-02-27 11:22:00 [INFO] pyp/system/local_run.py:65 | 
/opt/pyp/external/frealign_v9.11/bin/apply_mask.exe << eot
M
/scratch/project_2009485/nextPYP/tmp/huiskone/3261043_8/GS4_TS7_ts_004/frealign/scratch/GS4_TS7_ts_004_r01_01.mrc

I noticed that lines like this are printed:

2024-02-27 09:57:15 [WARNING] /opt/pyp/bin/run/pyp:3238 | Detected zombie run at 3260403_29, clearing up files

Is it possible that nextPYP spawns "remove files" jobs to delete old directories, and that this removal runs while a new job is running, deleting files that the new job needs?

Edit: If I empty the scratch dir of all previous jobs (some of which had failed), the new jobs seem to run OK now. So perhaps the error is related to these "zombie runs"?

abartesaghi Feb 29, 2024
Maintainer

Yes, that's possible. Any run that hasn't produced any output for an hour is considered a zombie, and the local scratch for that run will be cleared out by new jobs. We haven't seen any problems with this so far, but we can make this a user-adjustable parameter in case a longer timeout is needed in some setups. This change will be included in the next release.
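
The timeout rule described in this reply can be sketched roughly as follows. This is an illustration only, not nextPYP's actual code: the function names, the one-directory-per-run scratch layout, and the use of file modification times as the "last output" signal are all assumptions.

```python
import os
import time
from pathlib import Path


def newest_mtime(run_dir: Path) -> float:
    """Most recent modification time of run_dir or anything inside it."""
    latest = run_dir.stat().st_mtime
    for root, dirs, files in os.walk(run_dir):
        for name in dirs + files:
            try:
                latest = max(latest, os.stat(os.path.join(root, name)).st_mtime)
            except OSError:
                # Entry vanished (or is a broken link) while scanning; skip it.
                continue
    return latest


def find_zombie_runs(scratch_root, timeout_s=3600.0, now=None, skip=()):
    """List run directories with no file activity for more than timeout_s seconds.

    Hypothetical helper: `skip` holds directory names that must never be
    flagged, for example the current job's own scratch directory.
    """
    now = time.time() if now is None else now
    zombies = []
    for entry in sorted(Path(scratch_root).iterdir()):
        if not entry.is_dir() or entry.name in skip:
            continue
        if now - newest_mtime(entry) > timeout_s:
            zombies.append(entry.name)
    return zombies
```

Under this kind of rule, a long step that writes nothing for more than `timeout_s` is flagged even though it is still running, which is why the threshold needs to exceed the longest quiet phase of a healthy job.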

Answer selected by JuhaHuiskonen

JuhaHuiskonen Feb 29, 2024
Author

What if some parts of a split job (say the first 10 of 30) finish running, but the remaining 20 stay in the queue for over an hour? Will those jobs be considered zombies, or is the check done separately for each part of a split job? In my experience, log file updates can lag considerably: even when a job is running successfully, the log is not updated continuously but often with a delay.

JuhaHuiskonen Mar 1, 2024
Author

OK, it looks like the reason is what I suspected. Here I am launching 30 jobs. They start running at different times. When job 18 starts, it marks job 1 as a zombie:

2024-02-29 22:09:54 [INFO] /opt/pyp/bin/run/pyp:3273 | Job 3268159_18 (v0.6.2) launching on c3274.mahti.csc.fi using 128 task(s) and 220 GB of RAM
2024-02-29 22:09:55 [WARNING] /opt/pyp/bin/run/pyp:3238 | Detected zombie run at 3268159_1, clearing up files

Then job 1 fails:

OSError: [Errno 116] Stale file handle: 'frealign/maps/tomo-coarse-refinement-f0Dpk4FLJAMRI077_frames_CSP_01.parx'

I am hoping for a quick fix, as this is affecting nearly all (attempted) runs on my system, so using nextPYP is not possible at the moment.

abartesaghi Mar 1, 2024
Maintainer

Yes, this will be fixed in tomorrow's release.

abartesaghi Mar 2, 2024
Maintainer

This issue should be fixed in v0.6.3. The default timeout was increased to 10 hours and you can now change this value in the Resources tab.
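
To judge whether a given timeout is long enough for a particular setup, one option is to measure the longest quiet period in an existing job log. The sketch below assumes log lines start with a `YYYY-MM-DD HH:MM:SS [LEVEL]` prefix, as in the logs quoted in this thread; the function name `longest_silence` is hypothetical. Progress-bar lines carry no timestamp, so they are ignored here and the result is an upper bound on how long the log itself went without updates.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix seen in the logs above, e.g. "2024-02-14 08:08:13 [INFO]".
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[")


def longest_silence(log_lines):
    """Return (gap, start, end) for the longest pause between timestamped lines."""
    stamps = []
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    best = (timedelta(0), None, None)
    for earlier, later in zip(stamps, stamps[1:]):
        if later - earlier > best[0]:
            best = (later - earlier, earlier, later)
    return best


sample = [
    "2024-02-14 08:08:13 [INFO] pyp/system/mpi.py:97 | Running 11 command(s)",
    "Progress:   0%|          | 0/11 [00:00<?, ?it/s]",
    "2024-02-14 17:17:14 [INFO] pyp/system/mpi.py:111 | 11 command(s) finished",
]
gap, start, end = longest_silence(sample)
print(gap)  # 9:09:01
```

For the CSP step in the first log of this thread, the gap between timestamped lines is just over nine hours, longer than the original one-hour limit and shorter than the new ten-hour default.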
