Upgrade EC2 instance type when high load is expected #3477

Open
DeD1rk opened this issue Nov 9, 2023 · 3 comments
Labels
feature (Issues regarding a complete new feature), optimization (Issues regarding slowness), priority: medium (A new feature or a bugfix that is non-critical.), server (Server related issues)

Comments

@DeD1rk
Member

DeD1rk commented Nov 9, 2023

What?

We can change the EC2 instance type of our server with just a simple reboot. We can use this to upgrade from a t3a.small to e.g. a t3a.2xlarge, so that we can handle short periods of high load (such as when event registrations open) better, without permanently paying for big servers.

Why?

We must not crash when a popular registration opens, e.g. the one for the weekend. The most extreme event registration openings are easy to predict.

How?

When event registration is about to open, temporarily change the instance type in Terraform. With terraform apply we can be back online in about 1-2 minutes if we skip the currently slow collectstatic step (see #3290). Doing this manually is probably fine.
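For reference, the same stop, resize, start cycle can also be driven through the EC2 API instead of Terraform. The sketch below is not what this issue proposes; it assumes a boto3-style EC2 client is passed in, and `resize_instance` is a hypothetical helper name. Note that a resize done outside Terraform would show up as drift on the next terraform apply.

```python
def resize_instance(ec2, instance_id: str, new_type: str) -> None:
    """Stop an EC2 instance, change its type, and start it again.

    `ec2` is assumed to behave like a boto3 EC2 client,
    e.g. the object returned by boto3.client("ec2").
    """
    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Switch to the new type, e.g. "t3a.2xlarge" before the opening
    # and back to "t3a.small" afterwards.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # Boot the instance again and wait until EC2 reports it as running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

The waiter only confirms that the instance is running, not that the application is serving requests, so a health check on the site itself would still be needed.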

I tried this out on staging. As a t3a.small, it handles roughly 16 requests per second to the homepage (unauthenticated). Upgraded to a t3a.2xlarge (8 vCPUs instead of 2), it handles 64. So throughput scales pretty much perfectly linearly with the number of vCPUs: 4x the vCPUs gives 4x the requests per second.
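The "linear scaling" claim can be checked with a few lines of arithmetic on the staging numbers quoted above. `scaling_efficiency` is a hypothetical helper; a value of 1.0 means throughput grew exactly as fast as the vCPU count.

```python
def scaling_efficiency(base_rps, base_vcpus, big_rps, big_vcpus):
    """Observed speedup divided by the increase in vCPUs (1.0 = perfectly linear)."""
    speedup = big_rps / base_rps
    vcpu_ratio = big_vcpus / base_vcpus
    return speedup / vcpu_ratio


# Staging measurements from this issue:
# about 16 req/s on 2 vCPUs (t3a.small), 64 req/s on 8 vCPUs (t3a.2xlarge).
efficiency = scaling_efficiency(16, 2, 64, 8)
print(f"speedup: {64 / 16:.1f}x, efficiency: {efficiency:.2f}")
```

Both measurements are for unauthenticated homepage requests only, so the ratio may look different for pages that hit the database harder.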

We should think about how we can be sure to apply this in time. For example, it would be nice to upgrade early in the morning and downgrade again in the evening, so we should be aware of upcoming registration openings a few days in advance to plan it.
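As a rough sketch of that planning step, the upgrade and downgrade times could be derived from the registration opening time. Everything specific here is an assumption for illustration: the 06:00 and 22:00 defaults, the 3-day notice period, and the function names are not from this issue.

```python
from datetime import date, datetime, time, timedelta


def resize_window(opens_at, upgrade_at=time(6, 0), downgrade_at=time(22, 0)):
    """Return (upgrade, downgrade) datetimes on the day registration opens."""
    day = opens_at.date()
    # Reuse the opening's tzinfo so naive and aware datetimes are never mixed.
    upgrade = datetime.combine(day, upgrade_at, tzinfo=opens_at.tzinfo)
    downgrade = datetime.combine(day, downgrade_at, tzinfo=opens_at.tzinfo)
    if not upgrade < opens_at < downgrade:
        raise ValueError("registration opens outside the planned window")
    return upgrade, downgrade


def needs_planning(opens_at, today, notice_days=3):
    """True when the opening is today or within the next `notice_days` days."""
    return timedelta(0) <= opens_at.date() - today <= timedelta(days=notice_days)


# Hypothetical opening at noon on 2024-03-09.
opens = datetime(2024, 3, 9, 12, 0)
print(resize_window(opens))
print(needs_planning(opens, today=date(2024, 3, 7)))
```

A check like `needs_planning` could run daily and remind whoever applies the change, which keeps the resize itself manual as suggested above.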

@DeD1rk added the feature, optimization, priority: medium, and server labels Nov 9, 2023
@DeD1rk
Member Author

DeD1rk commented Mar 13, 2024

Update: this (together with the other optimizations we've done) worked flawlessly for the weekend registrations.

@ColonelPhantom
Contributor

Speaking of optimization, does throughput scale with the number of vCPUs, or with the number of workers? I ask because we are probably spending most of our time blocked on the database rather than processing things in Python on the CPU.

Of course, increasing the worker count is probably its own can of worms, since each worker consumes quite a bit of memory. So I'm not sure we can add more workers on a smaller VM.
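One reason the two may be hard to tell apart: gunicorn's documentation suggests (2 x cores) + 1 workers as a starting point, and if the worker count is derived from the CPU count that way, a bigger instance gets more workers automatically. Whether this project's configuration does so is an assumption, not something stated in this thread. A minimal sketch in the style of a gunicorn.conf.py:

```python
import multiprocessing


def recommended_workers(vcpus: int) -> int:
    """Gunicorn's documented rule of thumb: (2 x cores) + 1."""
    return 2 * vcpus + 1


# In a gunicorn.conf.py, `workers` is the setting gunicorn reads.
# With this formula, 2 vCPUs give 5 workers and 8 vCPUs give 17.
workers = recommended_workers(multiprocessing.cpu_count())
```

Pinning `workers` to a fixed number and rerunning the load test on both instance types would separate the effect of vCPUs from the effect of worker count.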

@ColonelPhantom
Contributor

I just checked the statistics of the server with htop, and we're of course pretty idle at this time. Memory-wise, there seems to be a bit of room left (2.5G used of 3.8G). There are 5 gunicorn workers at 226M to 288M each, plus 4 Celery workers at 115M to 217M each. So I think there's room to add a few more gunicorn processes. Also, do we need 4 Celery workers? I'd expect most Celery tasks to not be that latency-sensitive.
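A back-of-the-envelope estimate of how many extra gunicorn workers would fit, using the htop figures above. This is a sketch under assumptions: the numbers were taken while idle, per-process figures in htop can count shared memory more than once, and the 512 MB reserve is an arbitrary illustrative value.

```python
def extra_workers_that_fit(total_mb, used_mb, per_worker_mb, reserve_mb=0):
    """Number of additional workers that fit in the currently free memory."""
    free_mb = total_mb - used_mb - reserve_mb
    return max(0, int(free_mb // per_worker_mb))


MB_PER_GB = 1024
total = 3.8 * MB_PER_GB  # total memory reported by htop
used = 2.5 * MB_PER_GB   # memory in use while idle

# Worst case (288M per worker) and best case (226M per worker).
print(extra_workers_that_fit(total, used, 288))
print(extra_workers_that_fit(total, used, 226))
# Worst case again, keeping a hypothetical 512 MB in reserve for load spikes.
print(extra_workers_that_fit(total, used, 288, reserve_mb=512))
```

Worker memory tends to grow under load, so the estimate with a reserve is probably the safer one to plan with.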
