Limited scalability at organization level #4227

Open
ay0o opened this issue Oct 31, 2024 · 2 comments

ay0o commented Oct 31, 2024

I have found that using an SSM parameter to store the multi-runner configuration is quite limiting in terms of scalability. Even with the advanced tier, it allows roughly 30 different configurations at most. You might think this is enough, but consider the following.

Assume an organization with dozens of projects, each with several repositories. If the runners are shared by all repositories in the organization, everything is good.

However, let's say we want each project to have its own runners. This could be as easy as creating new runner configs within the multi-runner config, using different labels per project to select a runner (e.g. self-hosted, project_1 and self-hosted, project_2). The problem is that, as mentioned above, because the module stores the entire multi-runner config in a single SSM parameter, we hit the maximum size at about 30 configurations.
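
For illustration, the per-project entries might look roughly like this (attribute names follow the multi-runner examples and may differ per module version; project names and settings are placeholders):

```hcl
# Sketch only: one multi_runner_config entry per project, matched by project-specific labels.
multi_runner_config = {
  "project_1" = {
    matcherConfig = {
      labelMatchers = [["self-hosted", "project_1"]]
      exactMatch    = true
    }
    runner_config = {
      runner_os           = "linux"
      runner_architecture = "x64"
      # ... other per-project settings (instance types, runner group, cost-allocation tags, ...)
    }
  }
  "project_2" = {
    matcherConfig = {
      labelMatchers = [["self-hosted", "project_2"]]
      exactMatch    = true
    }
    runner_config = {
      runner_os           = "linux"
      runner_architecture = "x64"
    }
  }
  # ... one entry per project; the whole map ends up serialized into a single SSM parameter.
}
```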

So the alternative is to deploy an instance of this module per project, but this leads to another issue: if the GitHub Apps (one per instance) are installed at the organization level, the module breaks due to cross-project usage.

For example, let's say a job from a repository that belongs to project_1 is triggered. The event is sent to the webhook, but since the GitHub Apps are installed at the organization level, any of them might receive it. This means that perhaps the webhook for project_2 is the one that receives the event for the job with labels self-hosted, project_1.

Depending on whether repository_whitelist is used or not, the message logged by the webhook will differ (not authorized or unexpected labels), but the ultimate outcome is the same: the webhook will not publish a message to SQS, and therefore the EC2 instance will not be created.

The only working solution is to install the apps on specific repositories. For every new repository, the GitHub App needs to be installed on it and the repository also has to be added to the project's runner group. Depending on the size and activity of the organization, this may be manageable. For me, sitting at over 1k repositories, I can tell you it's not.

So, the bottom line is that the tool should use a different approach than a single SSM parameter to store the multi-runner config, so that one instance of the module can scale to hundreds of configurations if needed.


npalm (Member) commented Nov 1, 2024

Thanks for taking the time to create the issue. We also use this setup with a large org. For onboarding repos to runners (app and groups) we have self-service automation in place. But I agree that without automation this won't scale.

What we do differently is run just about 10 different runner groups for the full org in ephemeral mode, which means nothing is shared between repositories. The multi-runner setup was created with the scope of making it easy to support several fleets; it is amazing that you are already at 30 and up. At that time (and with time constraints) we chose not to make the lambda logic more complex, but to simply deploy the control plane once per configuration, because that way we could reuse all the existing logic.

At that time the webhook was still storing the configuration in the lambda environment variable, which stops working at some point (about 6 - 10 groups). For that reason we moved the configuration to SSM, but indeed again with a limitation due to scaling.

The question now is what is the best way to move forward. Would it make sense to move the configuration again? And what are the valid options? The good news is that the configuration is now managed in one place (ConfigLoader), which means that adding or changing the approach is relatively simple.

I am also wondering: with already over 30 configurations for the multi-runners, are you not hitting other limitations?


ay0o commented Nov 1, 2024

By " If the runners are shared by all repositories in the organization" I meant, if the runners were deployed at the organization, just like you're doing. In fact, it seems we were doing the same, different instances of the module (because we can't deploy to different VPCs within a single instance, this would be a nice feature) with different pool of runners, all of them ephemeral. The largest had 9 runner configurations.

As said, that just works. However, the company is now demanding more visibility into how much each project is spending, and this includes the GitHub runners. To provide this, we need a different runner configuration for each project (different matcher labels, runner group, and a Project tag for Cost Explorer). And here is where we hit the wall: we started adding the configurations per project and at about 15 we got the error that we had exceeded the allowed size for the parameter in Parameter Store. The advanced tier doubles the size from 4KB to 8KB, so I'm assuming I could reach about 30 configurations, but I didn't actually test it because that wouldn't work for us either.

Could it be possible to store each config independently instead of storing the full multi_runner_config map in a single parameter?
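
For example (just a sketch of one possible direction, not how the module currently works; names and paths are hypothetical):

```hcl
# Hypothetical sketch: one SSM parameter per runner configuration instead of a single
# parameter holding the entire multi_runner_config map.
resource "aws_ssm_parameter" "runner_config" {
  for_each = var.multi_runner_config

  name  = "/github-runners/${var.prefix}/runner-config/${each.key}"
  type  = "String"
  value = jsonencode(each.value)
}
```

The lambdas (ConfigLoader) could then fetch all parameters under the shared prefix (e.g. via GetParametersByPath) and merge them, so each configuration only has to fit within the per-parameter size limit rather than all of them together.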
