Unexpected downtime on copr-frontend service #2959

Closed
FrostyX opened this issue Oct 18, 2023 · 7 comments
Labels
fedora-copr-admin Tasks that need to be done by Fedora Copr administrator

Comments

@FrostyX
Member

FrostyX commented Oct 18, 2023

We have several emails from UptimeRobot that copr-fe is down. Also confirmed by @tuliom.

The first thing that came to mind was OOM issues again, but that doesn't seem to be the case:

 ● httpd.service - The Apache HTTP Server
     Active: active (running) since Sat 2023-09-23 20:19:55 UTC; 3 weeks 4 days ago
@FrostyX FrostyX added the fedora-copr-admin Tasks that need to be done by Fedora Copr administrator label Oct 18, 2023
@github-project-automation github-project-automation bot moved this to Needs triage in CPT Kanban Oct 18, 2023
@FrostyX
Member Author

FrostyX commented Oct 19, 2023

This may or may not be related; I don't want to lead us in the wrong direction.

A build failure was reported today by @penguinpee. Build 06547142 failed during the SRPM phase with the following error:

Network error: Request client error on https://download.copr.fedorainfracloud.org/results/gui1ty/extract-msg/fedora-rawhide-x86_64/06547141-python-red-black-tree-mod/python-red-black-tree-mod-1.21-3.fc40.src.rpm: 404 Not Found

The URL is correct, I can download it now.

Is it possible that there is some underlying infrastructure issue causing downtimes of both frontend and backend?

@praiskup praiskup moved this from Needs triage to In 3 months in CPT Kanban Oct 19, 2023
@praiskup
Member

Triage time: @nirik didn't we have some DNS outage recently?

Another option is that we could have the disks unmounted, and lighttpd would be serving an empty directory? But that wouldn't explain the FE outage.

Another option is that the CloudFront distribution was modified during that time. But again, that wouldn't be related to the FE outage.

@tuliom

tuliom commented Oct 19, 2023

> Another option is that we could have the disks unmounted, and lighttpd would be serving an empty directory?

At least when the issue happened to me, the browser's requests timed out.
I don't think an empty directory would have caused this.

@FrostyX
Member Author

FrostyX commented Nov 14, 2023

Last night I stumbled upon this issue and had around 2 minutes to investigate.

  • I don't think it is caused by DNS because I couldn't load the page even when trying to access it via an IP address
  • I don't think it is an issue with httpd because SSH couldn't connect either
  • As @praiskup said, it seems that PostgreSQL is eating too much memory, and we are swapping too much
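
The elimination steps in the list above (bypass DNS by using the IP address, then compare HTTPS with SSH) can be sketched as a small probe. This is only an illustration under assumptions: the helper names `tcp_reachable` and `probe` are made up here, and the ports 22 and 443 are the usual SSH and HTTPS defaults rather than anything confirmed for copr-fe.

```python
import socket

def tcp_reachable(address, port, timeout=3.0):
    """Return True if a TCP connection to address:port succeeds in time."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

def probe(ip_address, ports=(22, 443)):
    """Check several ports on a literal IP address, so DNS is not involved.

    If every port fails, the host itself (or the network path to it) is
    the likely problem rather than DNS or a single service such as httpd.
    """
    return {port: tcp_reachable(ip_address, port) for port in ports}
```

Calling `probe()` with the frontend's IP address during an outage would show whether SSH and HTTPS fail together, which is what the observations above suggest.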

Attaching screenshots taken after I was able to connect (top CPU and top memory):

Screenshot_2023-11-13_23-32-57
Screenshot_2023-11-13_23-32-39
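
As a text-based complement to the screenshots, swap pressure can also be read from `/proc/meminfo`. A rough sketch, assuming a Linux host; the helper names `parse_meminfo`, `swap_used_fraction` and `read_meminfo` are made up for illustration and are not part of any Copr tooling.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Most values are reported in kB; a few (e.g. HugePages_Total) are
    plain counts. Only the leading integer of each line is kept.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def swap_used_fraction(fields):
    """Fraction of swap in use, or 0.0 when no swap is configured."""
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - fields.get("SwapFree", 0)) / total

def read_meminfo(path="/proc/meminfo"):
    """Convenience wrapper that reads the live file on a Linux host."""
    with open(path) as fh:
        return parse_meminfo(fh.read())
```

Logging `swap_used_fraction(read_meminfo())` periodically would give a timeline to compare against the outage times, instead of relying on catching the machine while it is still reachable.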

@nirik

nirik commented Nov 16, 2023

FYI, there was a DNS issue on 2023-10-26 for about an hour. Our DNSSEC signature expired.

Nagios saw some issues on 2023-11-13:

fedora-noc.log:Nov 13 14:13:15 PROBLEM - copr-fe.aws.fedoraproject.org is DOWN: CRITICAL - Socket timeout (noc01) $
fedora-noc.log:Nov 13 14:13:36 PROBLEM - copr-fe.aws.fedoraproject.org is DOWN: CRITICAL - Socket timeout (noc02) $
fedora-noc.log:Nov 13 14:25:15 RECOVERY - copr-fe.aws.fedoraproject.org is UP: SSH OK - OpenSSH_8.8 (protocol 2.0) (noc01) $
fedora-noc.log:Nov 13 14:28:37 RECOVERY - copr-fe.aws.fedoraproject.org is UP: SSH OK - OpenSSH_8.8 (protocol 2.0) (noc02) $
fedora-noc.log:Nov 13 19:33:35 PROBLEM - copr-fe.aws.fedoraproject.org is DOWN: CRITICAL - Socket timeout (noc01) $
fedora-noc.log:Nov 13 19:36:56 PROBLEM - copr-fe.aws.fedoraproject.org is DOWN: CRITICAL - Socket timeout (noc02) $
fedora-noc.log:Nov 13 19:46:12 RECOVERY - copr-fe.aws.fedoraproject.org is UP: SSH OK - OpenSSH_8.8 (protocol 2.0) (noc01) $
fedora-noc.log:Nov 13 19:46:56 RECOVERY - copr-fe.aws.fedoraproject.org is UP: SSH OK - OpenSSH_8.8 (protocol 2.0) (noc02) $

Each outage lasted just about 10 minutes or so, but it's unclear what happened.
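
To put exact numbers on those windows, each PROBLEM line can be paired with the next RECOVERY line from the same monitoring host. A minimal sketch, assuming the line layout shown in the log excerpt above; the function name `outage_windows` and the regular expression are illustrative and not part of Nagios. The log lines carry no year, so it is passed in as a parameter.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above:
# "<file>:<Mon> <day> <HH:MM:SS> PROBLEM|RECOVERY - <host> is ... (<noc>) $"
LINE_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<kind>PROBLEM|RECOVERY) - (?P<host>\S+) is .*\((?P<noc>noc\d+)\)"
)

def outage_windows(lines, year=2023):
    """Return (monitor, start, duration) for each PROBLEM/RECOVERY pair."""
    open_since = {}  # (host, monitor) -> time of the unresolved PROBLEM
    windows = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip anything that does not look like an alert line
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["noc"])
        if m["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM if several arrive before a RECOVERY.
            open_since.setdefault(key, ts)
        elif key in open_since:
            start = open_since.pop(key)
            windows.append((m["noc"], start, ts - start))
    return windows

if __name__ == "__main__":
    sample = [
        "fedora-noc.log:Nov 13 14:13:15 PROBLEM - copr-fe.aws.fedoraproject.org is DOWN: CRITICAL - Socket timeout (noc01) $",
        "fedora-noc.log:Nov 13 14:25:15 RECOVERY - copr-fe.aws.fedoraproject.org is UP: SSH OK - OpenSSH_8.8 (protocol 2.0) (noc01) $",
    ]
    for monitor, start, duration in outage_windows(sample):
        print(monitor, start, duration)
```

Run over all eight lines above, this gives four windows of roughly 10 to 15 minutes each, depending on which monitor saw the outage.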

@FrostyX
Member Author

FrostyX commented Dec 20, 2023

Is this the same thing as #3026?

@praiskup
Member

Yes, things seem to be resolved in #3026. Closing.

@praiskup praiskup moved this from In 3 months to Done in CPT Kanban Jan 15, 2024