level_3 cal collisions causing missing intermediate files #8729
Comments
Comment by Tyler Pauly on JIRA: One solution could be to alter the intermediate filenames to include an association or product name string, so that an exposure residing in multiple level 3 associations would have unique intermediate filenames when those associations are processed simultaneously.
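A minimal sketch of the renaming idea, not taken from the jwst code base: the helper `intermediate_name` and the exposure name are hypothetical, and the `outlier_id2` suffix is copied from the report in this issue. It only illustrates that embedding the association ID keeps the two paths distinct.

```python
def intermediate_name(exposure: str, asn_id: str, suffix: str = "outlier_id2") -> str:
    """Hypothetical helper: build an intermediate filename that embeds the association ID."""
    return f"{exposure}_{asn_id}_{suffix}.fits"


# Illustrative exposure name (an assumption, not a real file from the affected runs).
exposure = "jw01568001001_example"

# The same exposure processed under two associations now maps to two different files,
# so cleanup by one process cannot remove the other's intermediates.
name_c1000 = intermediate_name(exposure, "c1000")
name_c1004 = intermediate_name(exposure, "c1004")
print(name_c1000)
print(name_c1004)
```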
Comment by Brett Graham on JIRA: What version of jwst was used for these runs?
Comment by Melanie Clarke on JIRA: Another possible solution, discussed elsewhere, might be to save the necessary intermediate data to temp files instead of to named files in the output directory.
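A rough sketch of the temp-file approach using only the Python standard library. The function `process_with_private_scratch`, the product filename, and the empty placeholder files are assumptions for illustration; this is not how the jwst pipeline is implemented. Each process writes its intermediates to a private scratch directory that is removed when the process finishes, so nothing shared in the output directory is ever deleted.

```python
import tempfile
from pathlib import Path


def process_with_private_scratch(asn_id, exposures, output_dir):
    """Hypothetical stand-in for one level 3 run that keeps intermediates private."""
    output_dir = Path(output_dir)
    # The scratch directory is unique to this call and deleted on exit.
    with tempfile.TemporaryDirectory(prefix=f"{asn_id}_") as scratch:
        scratch = Path(scratch)
        intermediates = []
        for exp in exposures:
            path = scratch / f"{exp}_outlier_id2.fits"
            path.write_bytes(b"")  # placeholder for real intermediate data
            intermediates.append(path)
        # Only the final product lands in the shared output directory.
        product = output_dir / f"{asn_id}_product.txt"
        product.write_text("\n".join(p.name for p in intermediates))
    return product
```

With this layout, running c1000 and c1004 against the same output directory leaves no shared intermediate files for either process to clean up.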
Comment by Katie Kaleida on JIRA: If we are removing group associations (which it looks like we are likely going to do, per JSSET-236), does this problem go away?
We're currently seeing this issue affect the reprocessing of program 01207 with b11.1.1: jw01207-o002_20250102t121019_spec3_00001, jw01207-o004_20250102t121019_spec3_00002, jw01207-c1000_20250102t121019_spec3_00001 (consisting of o002, 003, 004), and jw01207-c1000_20250102t121019_spec3_00002 (consisting of o002, 003, 004) are stepping on each other.
Issue JP-3717 was created on JIRA by Hien Tran:
Ops has seen evidence that concurrent level 3 pipeline processes for associations with common input members can step on each other, causing missing intermediate files (i.e., *outlier_id2.fits) and crashes.
A recent example is
jw01568-c1000_20240819t100727_image3_00001
and
jw01568-c1004_20240819t100727_image3_00001.
The c1000 asn consists of observations o001 and o002, while the c1004 asn contains o001, o002, and o003. All of the members in c1000 are also in c1004. Therefore, when the intermediate files for c1000 were produced and cleaned up afterwards, the same intermediate files produced by the c1004 process were removed along with them and were no longer available when the second process needed them.
The ALOG.out logs for the two processes are attached, along with an sdiff between the listing of the *outlier_id2.fits files generated in the alog for the failed c1004 and those available on disk. Note that all the missing files are for o001 and o002, exactly those that were wiped out by the c1000 process.
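The sequence described above can be reproduced in miniature. This is a toy simulation under stated assumptions, not pipeline code: `write_intermediates`, `cleanup`, and `simulate_collision` are hypothetical names, the files are empty placeholders named only by observation, and the two "processes" run one after the other in a shared directory.

```python
import tempfile
from pathlib import Path


def write_intermediates(outdir, observations):
    """Create placeholder intermediate files in the shared output directory."""
    paths = [outdir / f"{obs}_outlier_id2.fits" for obs in observations]
    for p in paths:
        p.touch()
    return paths


def cleanup(paths):
    """Remove a process's intermediates, as happens when that process finishes."""
    for p in paths:
        p.unlink(missing_ok=True)


def simulate_collision():
    """Return the intermediates that c1004 expects but can no longer find."""
    with tempfile.TemporaryDirectory() as d:
        outdir = Path(d)
        c1000 = write_intermediates(outdir, ["o001", "o002"])
        c1004 = write_intermediates(outdir, ["o001", "o002", "o003"])
        cleanup(c1000)  # c1000 finishes first and removes the shared files
        return sorted(p.name for p in c1004 if not p.exists())


print(simulate_collision())
```

The files reported missing are the o001 and o002 intermediates, matching the pattern seen in the sdiff.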