blog and wiki keep going down #163

domenic · 2021-09-26T17:04:18Z

According to StatusCake these are constantly going down. Blog seems a bit worse than wiki.

According to the DigitalOcean logs everything is fine. They're getting a bit more traffic than normal, maybe 10-20 requests per minute, but all responses are 200s supposedly. Blog is at about 40% RAM usage and 20% CPU usage; wiki is at about 30% RAM and 60% CPU usage; and the shared database server is at about 68% RAM and 12% CPU usage.

There was a major spike in incoming connections and CPU/RAM usage last night around 19:11 Eastern Time, but the outages started getting bad around 17:26 Eastern Time so I'm not sure if it's related.

My best hypothesis is that either DigitalOcean sucks, or something about our setup sucks, and can't handle this much traffic.

Potential ideas:

- Bump up the server resources even more. Seems unlikely to help given that our RAM/CPU usage is not that high. Although maybe upgrading from the "basic" tier to "pro" tier gives us access to some less-flaky type of server. If we pay enough money we could even run two containers per service, load-balanced by DigitalOcean. This might be worth trying as a first attempt just to see if it makes a difference.
- Bump up the database server resources.
- Investigate more complicated in-container caching architectures to reduce the number of times we hit the database. My understanding was that since DigitalOcean puts a CDN in front of us, sending the right caching expiry headers would cause the CDN to cache the appropriate resources and not hit our source server as much. It seems like this should be enough for relatively-low-traffic sites like ours. But maybe we need to go beyond that somehow and do WordPress/MediaWiki-specific caching stuff.
- Try AWS instead of DigitalOcean.
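
To make the caching-headers idea concrete, here is a minimal sketch of the kind of `Cache-Control` value I mean. The helper name `cache_control` and the lifetimes are made up for illustration, and whether DigitalOcean's CDN actually honors `s-maxage` in front of our apps is an assumption that would need checking.

```python
def cache_control(browser_seconds: int, cdn_seconds: int) -> str:
    """Build a Cache-Control value with separate browser and shared-cache lifetimes."""
    if browser_seconds < 0 or cdn_seconds < 0:
        raise ValueError("lifetimes must be non-negative")
    # max-age applies to any cache; s-maxage overrides it for shared caches
    # such as a CDN, so the origin is hit less often.
    return f"public, max-age={browser_seconds}, s-maxage={cdn_seconds}"


# Hypothetical policy: browsers revalidate after 1 minute, the CDN after 10.
headers = {"Cache-Control": cache_control(60, 600)}
print(headers)
```

Pages that vary per user (e.g. logged-in views) would need `private` or `no-store` instead, so this only fits the anonymous read path.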

foolip · 2021-09-26T19:33:30Z

Could it be a StatusCake issue? Is it easy to catch them being down?

I just got an email about wiki being down and tested it immediately, but it loaded for me. Maybe the response time is too long or something?
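
One way to tell "down" apart from "just slow" would be a small probe that records both the status and the response time. This is only a sketch: the thresholds below are guesses, not StatusCake's actual settings, and `classify` and `probe` are names I made up.

```python
import time
import urllib.error
import urllib.request

SLOW_SECONDS = 5.0      # hypothetical "too slow" threshold
TIMEOUT_SECONDS = 30.0  # hypothetical hard timeout


def classify(status, elapsed):
    """Label one probe. `status` is the HTTP status code, or None if no response arrived."""
    if status is None or status >= 400:
        return "down"
    if elapsed > SLOW_SECONDS:
        return "slow"
    return "up"


def probe(url):
    """Fetch `url` once and return (label, status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code  # the server answered, but with an error status
    except OSError:
        status = None  # DNS failure, refused connection, or timeout
    elapsed = time.monotonic() - start
    return classify(status, elapsed), status, elapsed
```

Running `probe("https://blog.whatwg.org/")` right when an alert email arrives would show whether the monitor is seeing errors or only long response times.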

domenic · 2021-09-26T19:45:31Z

I've caught them down a few times, but it's possible the problem is less serious than StatusCake makes it appear, hmm.

foolip · 2021-09-27T07:18:47Z

When you've seen them down, has trying to reload fixed the problem, or has it been down for minutes at a time?

I'm thinking that maybe this is a warmup problem. Maybe instances are killed or somehow frozen when there hasn't been traffic for a while. This is how AppEngine behaves at least, although the architecture is different, so the behavior probably isn't identical.

foolip · 2021-11-26T08:22:27Z

Hmm, it was reported down in whatwg/blog.whatwg.org#12, but works for me now.

@domenic when we last met you said the blog and wiki have stopped going down, but I wonder if really it's just the monitoring behavior that has changed...?

mathiasbynens · 2021-12-09T11:37:04Z

https://blog.whatwg.org/ has been down since (at least) yesterday, FWIW.

domenic · 2021-12-09T12:10:21Z

I've kicked the control panel again :(

I don't think this is a warmup or monitoring problem. I think this is either:

1. DigitalOcean is bad at keeping uptime for its app platform; or
2. The very-simple Docker image we have for the blog (basically just the WordPress base Docker image + its themes) is not production-quality in some way, and falls down very easily.

I think the next step here would be to try setting up the same simple Docker image on another service provider (e.g. AWS), and pointing blog.whatwg.org to that deployment for a few months, and seeing if it's better. That would narrow down whether it's (1) or (2).

mathiasbynens · 2021-12-24T22:06:25Z

It’s down again.

mathiasbynens · 2022-02-14T09:37:06Z

FYI, it's down again. (Let me know if this is not the best place to post alerts.)

domenic · 2022-03-15T14:22:45Z

So I narrowed down this problem to something about the CDN fronting the blog and wiki. Right now the blog at https://blog.whatwg.org/ is down. However the deployment URL https://blog-6tqz3.ondigitalocean.app/ is up. And all the logs show 200 responses for requests to the internal URL.

I am going to try contacting DigitalOcean support since this seems like it is not our problem. (I.e., it is not the server being overloaded because we don't have enough caching, or something like that.)
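
For anyone who wants to repeat the check during the next outage, the comparison can be sketched roughly like this. The function names and the wording of the verdicts are mine; the two URLs are the ones above.

```python
import urllib.request

PUBLIC_URL = "https://blog.whatwg.org/"
DEPLOYMENT_URL = "https://blog-6tqz3.ondigitalocean.app/"


def responds_ok(url, timeout=15.0):
    """Return True if `url` answers with a non-error status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False


def locate_fault(public_ok, deployment_ok):
    """Turn the two probe results into a rough guess at where the fault is."""
    if public_ok and deployment_ok:
        return "no outage"
    if deployment_ok:
        return "in front of the app"  # CDN, DNS, or routing
    if public_ok:
        return "app down, public copy still served"  # possibly from a cache
    return "app or platform"
```

Calling `locate_fault(responds_ok(PUBLIC_URL), responds_ok(DEPLOYMENT_URL))` during an outage like today's would give "in front of the app", which is the case that points at the CDN layer rather than at our container.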

domenic · 2022-07-10T07:00:16Z

DigitalOcean support has been unhelpful both times I tried pointing them at live outages.

Today I was pointed to https://fly.io/ which seems really promising?? Maybe we should try switching to that. We can switch just the sites at first, having them connect to the existing DigitalOcean database, and if that works then switch the database too.

domenic added a commit to whatwg/blog.whatwg.org that referenced this issue Sep 26, 2021

Add wp-super-cache plugin

63abc9a

Might help with whatwg/misc-server#163.

annevk mentioned this issue Nov 26, 2021

Blog is down whatwg/blog.whatwg.org#12

Closed

annevk mentioned this issue Mar 16, 2022

Tweak colour contrast whatwg/whatwg.org#392

Merged
