Automatic SSH pooling/multiplexing/... configuration? #215

maxfischer2781 · 2021-10-19T09:16:21Z

maxfischer2781
Oct 19, 2021
Maintainer

SSH commands currently re-use the same connection, and we just hope that one connection with multiplexing is enough. Depending on the situation, we might need to pool several connections, limit multiplexing, or even limit command frequency though. Since proper configuration of all this is likely complex and limited by knowledge of the setup, having the SSH executor configure itself automatically would be useful.

Which parameters would be useful for us? Which information can we query from ssh itself and which (how?) must we discover ourselves?

See also #144 on pooling of multiple connections and #145 on multiplexing over one connection.

olifre · 2021-10-20T20:13:54Z

olifre
Oct 20, 2021

I am not (yet) sure we can really query SSH internals, we may need to scale this empirically — probably, we need two dynamic variables:

Number of channels per connection
Number of connections

I have tried to study the actual observation that opening many connections via SSH via asyncssh indeed causes issues — using the following program:

import logging
import asyncssh.logging
import asyncio, asyncssh, sys

root_logger = logging.getLogger()
asyncssh.logging.set_log_level(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root_logger.addHandler(handler)

async def run_client():
    async with asyncssh.connect(host='hpc.login.node',username='service-account', client_keys='/var/lib/cobald/.ssh/id_ed25519') as conn:
        try:
            result = await conn.run('scancel 9999999', check=True, input=None)
        except asyncssh.ChannelOpenError as coe:
            print("Something is bad...")
            return
        print(result.stdout, end='')

async def main():
    loop = asyncio.get_event_loop()
    tasks = []
    for i in range(1,30):
        tasks.append(loop.create_task(run_client()))
    await asyncio.gather(*tasks)

try:
    asyncio.get_event_loop().run_until_complete(main())
except (OSError, asyncssh.Error) as exc:
    sys.exit('SSH connection failed: ' + str(exc))

Using 30 in the inner range of parallel tasks, I reliably get connection aborts and finally:

SSH connection failed: Connection lost

Trying the same thing with channels, I can use:

import logging
import asyncssh.logging
import asyncio, asyncssh, sys

root_logger = logging.getLogger()
asyncssh.logging.set_log_level(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root_logger.addHandler(handler)

async def run_cmd(conn):
    try:
        result = await conn.run('scancel 9999999', check=True, input=None)
    except asyncssh.ChannelOpenError as coe:
        print("Something is bad...")
        return
    print(result.stdout, end='')

async def main():
    loop = asyncio.get_event_loop()
    conn = await asyncssh.connect(host='hpc.login.node',username='service-account', client_keys='/var/lib/cobald/.ssh/id_ed25519')
    tasks = []
    for i in range(1,30):
        tasks.append(run_cmd(conn))
    await asyncio.gather(*tasks)

try:
    asyncio.get_event_loop().run_until_complete(main())
except (OSError, asyncssh.Error) as exc:
    sys.exit('SSH connection failed: ' + str(exc))

This reliably breaks after about 10 channels, and it fails with the message Something is bad, as expected.

So at least we have two "signals" which may show resource exhaustion — or also just connection errors in general, at least in the first case. We may need to dig deeper to find potential internals of SSH which may help us to judge on this.

3 replies

maxfischer2781 Oct 21, 2021
Maintainer Author

Many thanks for the probing setup! Aiming for "Number of channels per connection" and "Number of connections" seems indeed like something to look into.
I've played with your setup a bit to do some testing and understand the sshd config options – there is a lot of bogus information floating around the web, sadly.

"Trying the same thing with channels"..."This reliably breaks after about 10 channels": That seems to directly match the MaxSessions option:

MaxSessions

Specifies the maximum number of open sessions permitted per network connection. The default is 10.

As far as I can tell that is always a constant per sshd instance, i.e. per remote machine. In other words, we could directly use a probing/counting loop like yours once when we start the first connection. Once we have that we then know that we need a new connection per every N sessions/channels.

I think we can work with a fixed number of channels per connection.

"I have tried to study the actual observation that opening many connections": I would love to get your input on how reliable the results were. Unlike the sessions/channel (where I always got 10) the number of connections was pretty unreliable – I got results in the range 40 to 60 with even some more extreme outliers.

Digging through the ssh docs, the only thing that affects this seems to be MaxStartups:

MaxStartups

Specifies the maximum number of concurrent unauthenticated connections to the SSH daemon.

So the limit is on the number of opening connections, not total connections. That would fit the unreliable "limit" I have seen on total number of connections – basically the faster/slower my code was the more/less connections got done authenticating in time.

There might still be a total number imposed by the system but that is apparently difficult.

olifre Oct 24, 2021

I think we can work with a fixed number of channels per connection.

I fully agree. Probing on the first connection seems reasonable, this should only change when the server is reconfigured — your observation that MaxSessions is the correct variable, and the limit of 10 by default also matches my observation that in some cases SSH multiplexing breaks down when using too many sessions in parallel, at the same limit.

Unlike the sessions/channel (where I always got 10) the number of connections was pretty unreliable – I got results in the range 40 to 60 with even some more extreme outliers.

Same here. On well-used login nodes, I got even lower numbers, especially when there is higher latency, which all matches the explanations you have found. Going by the numbers we get via fail2ban for bad logins on our login nodes, one can also assume a continuous variation here.

Given that this number is quite unreliable, we probably need some dynamics here. However, since the number of sessions is commonly 10, maybe another option would be to use only a small number of connections (or even 1?) and serialize things once the maximum number of sessions is reached?
Alternatively, connections could be pre-created slowly with a delay, such that the number of opening connections is kept low.

maxfischer2781 Oct 25, 2021
Maintainer Author

Given that this number is quite unreliable, we probably need some dynamics here. However, since the number of sessions is commonly 10, maybe another option would be to use only a small number of connections (or even 1?) and serialize things once the maximum number of sessions is reached?
Alternatively, connections could be pre-created slowly with a delay, such that the number of opening connections is kept low.

Hm, we could certainly go with just queueing tasks exceeding MaxSessions as a first bandaid. That's reasonably simple to do with async code. As a next refactoring step, we could start spinning up new connections whenever tasks are queued.

maxfischer2781 · 2021-11-06T07:28:57Z

maxfischer2781
Nov 6, 2021
Maintainer Author

Based on @olifre and my tests I propose the following two-step roadmap:

Queue requests whenever we hit the MaxSessions limit.
Testing shows that we reliably a) get ChannelOpenError when hitting the limit and b) the underlying connection remains useable. We currently already handle a but disregard b, by discarding the connection and request. Instead, we should queue the request and handle it once other sessions are done.
The easiest approach right now would be to just have a loop that retries the requests. Alternatively we could use a Semaphore to count the sessions, or a Queue to serialise the requests.
This should prevent spurious failures from spiking request counts. I would expect long-term throughput to actually be enough for many cases, so it would be a safe refactoring target.
Spin up new connections whenever we hit the MaxSessions limit.
We could combine the current multiplexing with pooling, related to how @giffels initially implemented pooling. Whenever requests queue up, we start a single new connection and add it to a pool which actually serves the requests. That avoids hitting MaxStartups and should smooth out the pool size – requests are serviced while we scale up the pool, so it will naturally converge to some optimal size based on request vs connection duration.
This will be more complicated than what we have right now, since we don't have a direct relation between pool, connections and sessions anymore. At a minimum we would track how many sessions there are per connection. We probably also want to have cleanup of unused connection, and might need configurable limits/rates (YAGNI?).
(This is inherently similar to the feedback loop that cobald/tardis is good at anyways, but I would expect a much more spiky request pattern at smaller scale. So a custom made, private feedback implementation seems appropriate to me.)

2 replies

olifre Nov 7, 2021

I think that's a great plan 👍. Presumably, step 1 might be completely sufficient — since we are looking at mostly spiky use cases indeed, I would guess a queue works best, also to ensure the order in which things are retried will not change unexpectedly.

giffels Nov 10, 2021
Maintainer

+1 from my side!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatic SSH pooling/multiplexing/... configuration? #215

{{title}}

Replies: 2 comments 5 replies

{{title}}

{{title}}

MaxSessions

MaxStartups

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Automatic SSH pooling/multiplexing/... configuration? #215

maxfischer2781 Oct 19, 2021 Maintainer

Replies: 2 comments · 5 replies

olifre Oct 20, 2021

maxfischer2781 Oct 21, 2021 Maintainer Author

MaxSessions

MaxStartups

olifre Oct 24, 2021

maxfischer2781 Oct 25, 2021 Maintainer Author

maxfischer2781 Nov 6, 2021 Maintainer Author

olifre Nov 7, 2021

giffels Nov 10, 2021 Maintainer

maxfischer2781
Oct 19, 2021
Maintainer

Replies: 2 comments 5 replies

olifre
Oct 20, 2021

maxfischer2781 Oct 21, 2021
Maintainer Author

maxfischer2781 Oct 25, 2021
Maintainer Author

maxfischer2781
Nov 6, 2021
Maintainer Author

giffels Nov 10, 2021
Maintainer