Race condition 500 Internal Server Error when submitting multiple builds to a directory that has never been used #3358

hroncok · 2024-08-06T23:29:21Z

This happens to me fairly regularly when I run Copr impact checks to see whether an upgrade of some Fedora package breaks anything. I decided to create a smaller reproducer and report it.

Using the copr CLI:

1. Create a new Copr project.
2. Add packages from Fedora distgit (other sources may also be impacted).
3. Submit several builds at the same time to a custom directory that has never been used.

Some of the builds will fail with:

Something went wrong:
Error: Response is not in JSON format, there is probably a bug in the API code.
Try 'copr-cli --debug' for more info.

Adding --debug does not reveal much:

Server response:
----------------


500 Internal Server Error

Internal Server Error
The server encountered an internal error or
misconfiguration and was unable to complete
your request.
Please contact the server administrator at 
 root@localhost to inform them of the time this error occurred,
 and the actions you performed just before this error.
More information about this error may be available
in the server error log.

Reproducer (uses moreutils-parallel):

COPR=reproducer-race
copr create $COPR --chroot fedora-rawhide-x86_64 --delete-after-days 30
copr add-package-distgit $COPR --webhook-rebuild off --commit rawhide --name dummy-test-package-gloster
parallel -j8 copr build-package $COPR:custom:1 --nowait --background --name -- dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster

Often some of the first builds error out:

Build was added to reproducer-race:
  https://copr.fedorainfracloud.org/coprs/build/...
Created builds: ...

Something went wrong:
Error: Response is not in JSON format, there is probably a bug in the API code.
Try 'copr-cli --debug' for more info.
Build was added to reproducer-race:
  https://copr.fedorainfracloud.org/coprs/build/...
Created builds: ...

If it does not happen to you, repeat with a new directory name ($COPR:custom:2, $COPR:custom:3...) until it does.

Use this to cancel the running/pending builds after you run the above in case you want to preserve resources for others:

parallel copr cancel -- $(copr list-builds --output-format text-row $COPR | cut -f1)

I hypothesize that a first build in the custom directory does something special (wrt creating the directory) and when multiple builds think they are first, they all attempt to do the special thing at the same time and some of them get an unhandled exception because of a race condition.
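
The hypothesized pattern can be illustrated with a filesystem analogy. This is only a sketch of a check-then-create race in general, not Copr's code; the function names are hypothetical, and where Copr actually creates the directory is not known from this report.

```shell
# Racy: check first, then create. Two callers can both pass the check
# before either has created the directory; the slower one then fails in mkdir.
create_racy() {
    [ -d "$1" ] || mkdir "$1"
}

# Tolerant: mkdir -p succeeds whether or not the directory already exists,
# so concurrent "first" callers do not fail.
create_idempotent() {
    mkdir -p "$1"
}
```

If something like this is happening, the fix would be for the "first build" path to treat "already exists" as success rather than letting it surface as an unhandled exception.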


FrostyX · 2024-08-07T12:21:24Z

Triage: two issues to solve: 1. Why does the server return 500? 2. Return something reasonable when a 500 does occur.

hroncok · 2024-08-07T12:43:59Z

In my experience, 500 happens when there is an unhandled Python exception. If the webserver runs in debug mode, the exception is shown, but if it is in production mode, it is hidden. If you have a development copr server with debug mode enabled, we could try reproducing there.

hroncok · 2024-08-07T12:47:40Z

I am looking through the code for where this could have happened, and I found c1fa04b. If that wasn't deployed yet, perhaps it fixed the issue.

FrostyX · 2024-08-07T13:04:52Z

Hello @hroncok,
thank you for the report. The step-by-step reproducer is very much appreciated.

We decided not to prioritize this issue for the next 3 months because, although it is annoying, there seems to be an easy workaround. I suppose only the reproducer uses parallel, to hit the issue more easily, and your actual script submits builds one by one? Then something like sleep 1 between calls should work around this? If I am wrong and there isn't an easy workaround, please let us know and we will prioritize this more.

hroncok · 2024-08-07T13:14:03Z

No, I use parallel to submit thousands of builds.

The workaround I use is to resubmit the failed ones later (a bit tricky to figure out which failed, but I can manage).
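
A rough sketch of that resubmit workaround, for illustration only. It assumes copr exits with a non-zero status when it prints "Something went wrong", which should be verified first. SUBMIT, RETRY_DELAY, submit_round and submit_with_retry are hypothetical names, the default target is the reproducer's, and xargs -P stands in for moreutils parallel so that each failure can be recorded per package.

```shell
# Command prefix that takes a package name as its last argument (placeholder target).
: "${SUBMIT:=copr build-package reproducer-race:custom:1 --nowait --background --name}"

# submit_round FAILED_FILE PKG...
# Submit all packages, up to 8 at a time, recording each failed name in FAILED_FILE.
submit_round() {
    FAILED_LIST=$1
    shift
    export SUBMIT FAILED_LIST
    : > "$FAILED_LIST"
    printf '%s\n' "$@" |
        xargs -P 8 -n 1 sh -c '$SUBMIT "$1" || echo "$1" >> "$FAILED_LIST"' _
}

# submit_with_retry MAX_ROUNDS PKG...
# Resubmit only the packages that failed, for at most MAX_ROUNDS rounds.
submit_with_retry() {
    rounds=$1
    shift
    failed_file=$(mktemp)
    while [ "$#" -gt 0 ] && [ "$rounds" -gt 0 ]; do
        submit_round "$failed_file" "$@"
        # The next round sees only the names recorded as failed
        # (assumes package names contain no whitespace or glob characters).
        set -- $(cat "$failed_file")
        rounds=$((rounds - 1))
        if [ "$#" -gt 0 ] && [ "$rounds" -gt 0 ]; then
            sleep "${RETRY_DELAY:-10}"
        fi
    done
    rm -f "$failed_file"
    # Success only if nothing is left to resubmit.
    [ "$#" -eq 0 ]
}
```

Usage would be along the lines of submit_with_retry 3 pkg-a pkg-b pkg-c. Recording failures per package avoids having to work out afterwards which submissions hit the 500.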

Another workaround is to submit the first one manually and use parallel to submit the rest after.
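
That second workaround could be scripted roughly as follows. This is a sketch, not a tested recipe against the production server: SUBMIT and warm_then_parallel are hypothetical names, the default target is the reproducer's, and it assumes that the directory exists by the time the first submission returns.

```shell
# Command prefix that takes a package name as its last argument (placeholder target).
: "${SUBMIT:=copr build-package reproducer-race:custom:1 --nowait --background --name}"

# warm_then_parallel FIRST_PKG [PKG...]
warm_then_parallel() {
    first=$1
    shift
    # Submit the first build alone, so only one request targets the never-used directory.
    $SUBMIT "$first" || return 1
    # Nothing left to fan out.
    [ "$#" -gt 0 ] || return 0
    # Submit the remaining builds concurrently, at most 8 at a time.
    printf '%s\n' "$@" | xargs -P 8 -n 1 $SUBMIT
}
```

For example: warm_then_parallel dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster. The serial first call costs one round trip, which is negligible next to thousands of parallel submissions.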

praiskup · 2024-09-24T11:39:06Z

Probably related to #3372

fedora-copr-github-bot added this to CPT Kanban Aug 6, 2024

github-project-automation bot moved this to Needs triage in CPT Kanban Aug 6, 2024

FrostyX moved this from Needs triage to In 2 years in CPT Kanban Aug 7, 2024
