blog and wiki keep going down #163

Open
domenic opened this issue Sep 26, 2021 · 10 comments

Comments

@domenic
Member

domenic commented Sep 26, 2021

According to StatusCake these are constantly going down. Blog seems a bit worse than wiki.

According to the DigitalOcean logs everything is fine. They're getting a bit more traffic than normal, maybe 10-20 requests per minute, but all responses are 200s supposedly. Blog is at about 40% RAM usage and 20% CPU usage; wiki is at about 30% RAM and 60% CPU usage; and the shared database server is at about 68% RAM and 12% CPU usage.

There was a major spike in incoming connections and CPU/RAM usage last night around 19:11 Eastern Time, but the outages started getting bad around 17:26 Eastern Time so I'm not sure if it's related.

My best hypothesis is that either DigitalOcean sucks, or something about our setup sucks, and can't handle this much traffic.

Potential ideas:

  • Bump up the server resources even more. Seems unlikely to help given that our RAM/CPU usage is not that high. Although maybe upgrading from the "basic" tier to "pro" tier gives us access to some less-flaky type of server. If we pay enough money we could even run two containers per service, load-balanced by DigitalOcean. This might be worth trying as a first attempt just to see if it makes a difference.
  • Bump up the database server resources.
  • Investigate more complicated in-container caching architectures to reduce the number of times we hit the database. My understanding was that since DigitalOcean puts a CDN in front of us, sending the right caching expiry headers would cause the CDN to cache the appropriate resources and not hit our source server as much. It seems like this should be enough for relatively-low-traffic sites like ours. But maybe we need to go beyond that somehow and do WordPress/MediaWiki-specific caching stuff.
  • Try AWS instead of DigitalOcean.
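
For the caching-headers idea, here is a minimal sketch of checking whether a response is something a shared cache such as a CDN would be allowed to reuse, and for how long. It only reads standard `Cache-Control` directives; how DigitalOcean's CDN actually treats them is not confirmed anywhere in this thread, and the function name `shared_cache_ttl` is made up for illustration.

```python
def shared_cache_ttl(cache_control: str) -> int:
    """Return the seconds a shared cache (e.g. a CDN) may reuse a response, or 0.

    Simplified reading of Cache-Control: it ignores Expires, Vary, cookies,
    heuristic freshness and any CDN-specific rules (all left out as assumptions).
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')

    # These directives stop a shared cache from storing or freely reusing it.
    if any(d in directives for d in ("no-store", "private", "no-cache")):
        return 0

    # s-maxage applies only to shared caches and takes priority over max-age.
    for key in ("s-maxage", "max-age"):
        value = directives.get(key, "")
        if value.isdigit():
            return int(value)
    return 0
```

A result of 0 for the blog's or wiki's HTML pages would suggest the CDN is passing most requests through to the container, though that would still need to be checked against what the CDN really does.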
domenic added a commit to whatwg/blog.whatwg.org that referenced this issue Sep 26, 2021
@foolip
Member

foolip commented Sep 26, 2021

Could it be a StatusCake issue? Is it easy to catch them being down?

I just got an email about wiki being down and tested it immediately, but it loaded for me. Maybe the response time is too long or something?
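
One way to tell "actually down" apart from "too slow for the monitor" is to record both the status and the elapsed time for each check. A rough sketch, assuming a 5-second "slow" threshold and a 30-second timeout (both made up here; StatusCake's real settings are not stated in this thread). `fetch_status` and `probe` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float) -> int:
    """Return the HTTP status code for url; network errors and timeouts raise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx still count as a response from the server


def probe(url, slow_after=5.0, timeout=30.0, fetch=fetch_status, clock=time.monotonic):
    """Classify one check as 'up', 'slow', 'error' or 'unreachable'."""
    start = clock()
    try:
        status = fetch(url, timeout)
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return "unreachable"
    elapsed = clock() - start
    if status >= 400:
        return "error"
    # Assumed threshold: a monitor with a short timeout could report this as down.
    return "slow" if elapsed > slow_after else "up"
```

Running `probe("https://wiki.whatwg.org/")` right after an alert email would show whether the page is erroring, unreachable, or just slow at that moment.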

@domenic
Member Author

domenic commented Sep 26, 2021

I've caught them down a few times, but it's possible the problem is less serious than StatusCake makes it appear, hmm.

@foolip
Member

foolip commented Sep 27, 2021

When you've seen them down, has trying to reload fixed the problem, or has it been down for minutes at a time?

I'm thinking that maybe this is a warmup problem. Maybe instances are killed or somehow frozen when there hasn't been traffic for a while. This is how AppEngine behaves at least, although it's not the same kind of architecture so it's not exactly the same for sure.

@foolip
Member

foolip commented Nov 26, 2021

Hmm, it was reported down in whatwg/blog.whatwg.org#12, but works for me now.

@domenic when we last met you said the blog and wiki have stopped going down, but I wonder if really it's just the monitoring behavior that has changed...?

@mathiasbynens
Member

https://blog.whatwg.org/ has been down since (at least) yesterday, FWIW.

@domenic
Member Author

domenic commented Dec 9, 2021

I've kicked the control panel again :(

I don't think this is a warmup or monitoring problem. I think this is either:

  1. DigitalOcean is bad at keeping uptime for its app platform; or
  2. The very-simple Docker image we have for the blog (basically just the WordPress base Docker image + its themes) is not production-quality in some way, and falls down very easily.

I think the next step here would be to try setting up the same simple Docker image on another service provider (e.g. AWS), and pointing blog.whatwg.org to that deployment for a few months, and seeing if it's better. That would narrow down whether it's (1) or (2).

@mathiasbynens
Member

It’s down again.

@mathiasbynens
Member

FYI, it's down again. (Let me know if this is not the best place to post alerts.)

@domenic
Member Author

domenic commented Mar 15, 2022

So I narrowed down this problem to something about the CDN fronting the blog and wiki. Right now the blog at https://blog.whatwg.org/ is down. However the deployment URL https://blog-6tqz3.ondigitalocean.app/ is up. And all the logs show 200 responses for requests to the internal URL.

I am going to try contacting DigitalOcean support since this seems like it is not our problem. (I.e., it is not the server being overloaded because we don't have enough caching, or something like that.)
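
The reasoning here (public URL down, deployment URL up, so the fault sits in front of the app) can be written as a small decision table. This is only a sketch of that logic; the function name and the labels it returns are made up for illustration, and it assumes the two checks were taken at roughly the same moment.

```python
def locate_fault(public_ok: bool, origin_ok: bool) -> str:
    """Guess where an outage sits from two simultaneous checks.

    public_ok: the public hostname (e.g. blog.whatwg.org) responded normally.
    origin_ok: the direct deployment URL responded normally.
    """
    if public_ok and origin_ok:
        return "healthy"
    if origin_ok:
        # App answers directly but not via the public name: CDN, DNS or routing.
        return "edge"
    if public_ok:
        # Public name works while the app does not: possibly cached responses.
        return "origin-masked"
    # Neither responds: the app, its database, or the whole platform.
    return "origin-or-platform"
```

The situation described above corresponds to `locate_fault(public_ok=False, origin_ok=True)`, which returns `"edge"`.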

@domenic
Member Author

domenic commented Jul 10, 2022

DigitalOcean support has been unhelpful both times I tried pointing them at live outages.

Today I was pointed to https://fly.io/ which seems really promising?? Maybe we should try switching to that. We can switch just the sites at first, having them connect to the existing DigitalOcean database, and if that works then switch the database too.

3 participants