Upgrade EC2 instance type when high load is expected #3477

DeD1rk · 2023-11-09T15:31:02Z

What?

We can change the EC2 instance type of our server with just a simple reboot. We can use this to upgrade from a t3a.small to e.g. a t3a.2xlarge, so that we can better handle short periods of high load (such as when event registrations open) without permanently paying for big servers.

Why?

The server must not crash when, for example, registration for the weekend opens. The most extreme event registration openings are easy to predict.

How?

When event registration is about to open, temporarily change the instance type in Terraform. With terraform apply we can be back online in about 1-2 minutes if we skip the currently slow collectstatic step (see #3290). Doing this manually is probably fine.
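If we ever want to wrap the manual step in a small helper, a minimal sketch could look like the following. It assumes the Terraform config exposes the instance type as an input variable named instance_type, which is a hypothetical name; if the type is hardcoded in the config, this does not apply as-is. The command is run without auto-approval, so Terraform still shows the plan and asks for confirmation.

```python
import subprocess

# Hypothetical variable name: assumes the Terraform config declares
# an input variable called "instance_type".
INSTANCE_TYPE_VAR = "instance_type"


def build_apply_command(instance_type: str) -> list[str]:
    """Build a `terraform apply` invocation that overrides the instance type."""
    return ["terraform", "apply", "-var", f"{INSTANCE_TYPE_VAR}={instance_type}"]


def apply_instance_type(instance_type: str, workdir: str) -> None:
    """Run Terraform interactively in workdir so the plan can be reviewed first."""
    subprocess.run(build_apply_command(instance_type), cwd=workdir, check=True)
```

Calling apply_instance_type("t3a.2xlarge", workdir) before the opening and apply_instance_type("t3a.small", workdir) afterwards would cover both directions, where workdir is wherever the Terraform config lives.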

I tried this out on staging. As a t3a.small, it handles roughly 16 requests per second to the homepage (unauthenticated). After upgrading to a t3a.2xlarge (8 vCPUs instead of 2), it handles 64. So throughput scales pretty much perfectly linearly with the number of vCPUs: 4x the vCPUs gives 4x the requests per second.
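To make the "linear" claim concrete, here is the arithmetic as a small sketch. The function name is made up for illustration; the numbers are the staging measurements above.

```python
def scaling_efficiency(base_vcpus: int, base_rps: float,
                       big_vcpus: int, big_rps: float) -> float:
    """Return measured speedup divided by ideal (linear) speedup."""
    speedup = big_rps / base_rps    # how much more traffic we handle
    ideal = big_vcpus / base_vcpus  # how many more vCPUs we pay for
    return speedup / ideal


# t3a.small: 2 vCPUs, ~16 req/s; t3a.2xlarge: 8 vCPUs, 64 req/s
efficiency = scaling_efficiency(2, 16, 8, 64)
print(efficiency)  # 1.0, i.e. perfectly linear for this measurement
```

Keep in mind this is a single endpoint (unauthenticated homepage) on staging, so other pages may scale differently.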

We should think about how we can be sure to apply this in time. For example, it would be nice to upgrade early in the morning and downgrade again in the evening, so we should be aware of upcoming registration openings a few days in advance to plan it.
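As a rough sketch of that planning, the snippet below derives a reminder, upgrade and downgrade time from a registration opening time. The concrete hours (06:00 and 22:00) and the 3-day reminder are assumptions picked for illustration, not something we have agreed on.

```python
from datetime import datetime, time, timedelta

# All three values are assumptions for illustration.
UPGRADE_AT = time(6, 0)     # "early in the morning"
DOWNGRADE_AT = time(22, 0)  # "in the evening"
REMINDER_DAYS = 3           # "a few days in advance"


def plan_window(opens_at: datetime) -> dict[str, datetime]:
    """Return when to remind, upgrade and downgrade around an opening."""
    day = opens_at.date()
    upgrade = datetime.combine(day, UPGRADE_AT)
    downgrade = datetime.combine(day, DOWNGRADE_AT)
    if not upgrade < opens_at < downgrade:
        # An opening outside the window needs a manually chosen plan.
        raise ValueError("registration opens outside the upgrade window")
    remind = datetime.combine(day - timedelta(days=REMINDER_DAYS), UPGRADE_AT)
    return {"remind": remind, "upgrade": upgrade, "downgrade": downgrade}
```

For an opening at noon, this gives a reminder three days earlier at 06:00, an upgrade at 06:00 on the day itself and a downgrade at 22:00 that evening.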


DeD1rk · 2024-03-13T19:36:49Z

Update: this (together with the other optimizations we've done) worked flawlessly for the weekend registrations.

ColonelPhantom · 2024-10-31T19:03:15Z

Speaking of optimization, does the throughput scale with the number of vCPUs or with the number of workers? I ask because we are probably spending most of our time blocked on the database rather than processing things in Python on the CPU.

Of course, increasing the worker count is probably its own can of worms since each one consumes quite a bit of memory. So I'm not sure we can add more workers on a smaller VM.

ColonelPhantom · 2024-10-31T19:34:08Z

I just checked the statistics of the server with htop, and we're of course pretty idle at this time. Memory-wise, there seems to be a bit of room left (2.5G / 3.8G). There are 5 gunicorn workers at 226M to 288M each, and 4 celery workers at 115M to 217M each. So I think there's room to add a few more gunicorn processes. Additionally, do we need 4 Celery workers? I'd expect most Celery tasks to not be that latency-sensitive.
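A back-of-the-envelope sketch of how many extra gunicorn workers would fit, using the numbers above. It assumes that "2.5G / 3.8G" means used / total, that htop's G and M are GiB and MiB, and that each new worker costs as much as the largest current one (288M). The reserve parameter is a made-up safety margin, since workers may well grow under load.

```python
MIB_PER_GIB = 1024


def extra_workers_that_fit(total_gib: float, used_gib: float,
                           worker_mib: float, reserve_mib: float = 0) -> int:
    """Estimate how many more workers fit in the currently free memory."""
    free_mib = (total_gib - used_gib) * MIB_PER_GIB - reserve_mib
    return max(0, int(free_mib // worker_mib))


# Worst-case worker size, no safety margin
no_reserve = extra_workers_that_fit(3.8, 2.5, 288)
# Same, but keeping 512 MiB free (assumed margin)
with_reserve = extra_workers_that_fit(3.8, 2.5, 288, reserve_mib=512)
print(no_reserve, with_reserve)  # 4 2
```

So somewhere between 2 and 4 extra workers looks plausible on the current instance, depending on how much headroom we want to keep.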

DeD1rk added the labels priority: medium, feature, server, optimization on Nov 9, 2023

DeD1rk mentioned this issue Nov 9, 2023

Collectstatic to S3 is slow #3290

Closed
