Overview
The problem of maintaining a robust connection between agents and their corresponding devices is one we haven't really addressed across all agents. Currently, most agents do not handle a network interruption well. In most cases, this breaks the TCP connection, causing a process in the agent to crash, but the agent remains online.
The current solution is to restart the process, either directly or by restarting the entire agent.
We haven't made any general recommendations for how to handle connection dropouts; as a result, there are various implementations for recovering the connection across a handful of agents.
Two solutions come to mind:
1. If the connection is broken, the Agent should be shut down. This allows the HostManager to recover the Agent.
2. Keep the Agent online through network outages, recovering the connection automatically. Effectively, this makes running the Agent independent of the presence of the device.
My preference is to implement option 2, but I will first briefly describe a partial implementation of option 1. (Feel free to skip the next section.)
HostManager Reconnection
The best example I have for how this can be accomplished is in the Lakeshore 372 Agent. This Agent will attempt to establish a connection to the 372 in the init_lakeshore task (socs/socs/agents/lakeshore372/agent.py, lines 204 to 213 at 147149e). If the connection cannot be made (the 372 is offline) it will stop the reactor, bringing down the agent.
This does not handle dropouts post-initialization, but you could imagine doing a similar thing when the connection breaks during normal operation.
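For illustration, the stop-the-reactor pattern might be sketched as below. This is a hypothetical sketch, not the actual Lakeshore 372 code: the function name, the injected `connect` callable, and the `reactor` parameter (standing in for Twisted's global reactor) are all assumptions made for the example.

```python
def init_task_sketch(reactor, connect):
    """Hypothetical sketch: try to connect to the device, and stop the
    reactor (bringing the whole agent down) if it is unreachable."""
    try:
        connect()
    except OSError:  # ConnectionError and TimeoutError are subclasses
        reactor.stop()  # the HostManager can then restart the agent
        return False, "Device unreachable; stopping agent."
    return True, "Connected to device."
```

The same check could run on a connection failure mid-operation, which would extend this behavior past initialization.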
This solution has historically been requested by users, as it brings down the agent and Docker container, which is perhaps the most obvious sign that something is wrong. If the agent was online with a crashed process, it is perhaps less clear data isn't being collected.
Dynamically Reconnecting
Rather than performing a hard shutdown of the agent and letting the HostManager restart things, this option recognizes a connection drop, and dynamically reconnects to the device if possible. This can be handled at the agent level or the driver level. Most existing implementations are at the agent level, but I'd like to advocate for moving this lower -- to the drivers.
For completeness, we'll describe the two implementations.
Reconnecting in the Agent
Reconnecting in the Agent is the most common currently implemented solution. It involves catching exceptions from any command interacting with the device, then rerunning an initialization function or task before letting the process loop again. This ends up looking something like:
```python
def acq(self, session, params):
    while True:
        if not self.initialized:
            self.agent.start('initialize')
        if self.initialized:
            try:
                get_data_from_device()
            except Exception:
                # mark the connection as broken so we re-initialize
                self.initialized = False
        sleep(1)  # wait a second before trying again
```
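The 'initialize' task that the loop above restarts might look something like the sketch below. This is an assumption-laden illustration: `self.device` and its `connect()` method are hypothetical names, and the real task varies per agent.

```python
def initialize(self, session, params):
    """Hypothetical 'initialize' task: (re)connect to the device and set
    the flag that the acq loop checks. self.device and connect() are
    assumed names, not a real agent's API."""
    try:
        self.device.connect()
    except OSError:  # covers ConnectionError and TimeoutError
        self.initialized = False
        return False, "Could not connect to device."
    self.initialized = True
    return True, "Device initialized."
```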
Reconnecting in the Drivers
Reconnecting in the drivers avoids including reconnect logic (and especially the try/catches) in the Agent code. This is especially relevant for longer tasks/processes that have many function calls that interact with the hardware.
I'm going to get into a more detailed example of how I think this should work, but wanted to point out part of the inspiration comes from this open PR.
With the caveat that this isn't functional, nor very elegant, and is partly pseudo-code -- this is what I imagine it looking like. Starting with the agent:
```python
# Agent
from time import sleep

from drivers import Device


class ExampleAgent:
    def __init__(self, arguments):
        self.device = Device(arguments['address'])

    def acq(self, session, params):
        while True:
            try:
                # could generally be a function with many underlying
                # get_data_from_device() calls
                data = self.device.get_data_from_device("acq command")
            except ConnectionResetError:  # raised by driver
                self.log.error("Failed to get data from device. "
                               "Device seems unavailable.")
                sleep(1)  # wait a second before trying again
                continue
            publish_to_feed(data)

    def task(self, session, params):
        try:
            resp = self.device.unique_function1(arg1, arg2)
        except ConnectionResetError:
            self.log.error("Communication error.")
            return False, "Communication error."
        return True, "Accomplished task"
```
The drivers then have the reconnect logic. I think we could feasibly provide some base class that handles the reconnection logic, but need to try it out to know how well that'll work in practice.
```python
# Drivers
import selectors
import socket

TIMEOUT = 60  # seconds
BUFFER_SIZE = 4096


class Device:
    def __init__(self, address):
        self.com = self._connect(address)
        # re-running __init__ re-establishes the connection
        self.reset = lambda: self.__init__(address)

    def _connect(self, address):
        com = socket.socket()
        # important to set a timeout in case the device is offline on the
        # connection attempt, else we just block
        com.settimeout(TIMEOUT)
        try:
            com.connect(address)
        except TimeoutError:
            print(f"Connection not established within {TIMEOUT} seconds.")
        return com

    def _write(self, msg):
        # handle any message prep here, adding terminating characters,
        # encoding, etc.? That maybe makes it hard to generalize this to a
        # base class though, since that can be unique per device.

        # Reconnect after first failure
        try:
            self.com.sendall(msg)  # could also wrap some 3rd party module
                                   # similarly, though likely at a higher level
        except socket.error:
            print("Failed to send message. Reconnecting and trying once more.")
            self.reset()
            # Raise error after second failure
            try:
                self.com.sendall(msg)
            except socket.error:
                raise ConnectionResetError

    def _read(self):
        # check if the socket is ready to read from
        self._check_ready()
        # handle any decoding to str, stripping, etc. here
        data = self.com.recv(BUFFER_SIZE)
        return data

    def _check_ready(self):
        """Check socket is ready to read from."""
        sel = selectors.DefaultSelector()
        sel.register(self.com, selectors.EVENT_READ)
        ready = sel.select(TIMEOUT)
        sel.close()
        if not ready:
            raise ConnectionResetError

    def get_data_from_device(self, command):
        # combination write/read to return data from a given command
        self._write(command)  # raises ConnectionResetError on repeated failure
        data = self._read()
        return data

    def unique_function1(self, arg1, arg2):
        return self.get_data_from_device('command')
```
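As mentioned above, a base class could feasibly absorb the retry logic so each driver only supplies device-specific setup. Here is a minimal sketch of that idea, with the caveat that every name is hypothetical and each subclass is assumed to provide its own `_connect()`:

```python
import socket


class ReconnectingDevice:
    """Hypothetical base class: retries a broken connection once before
    raising ConnectionResetError. Subclasses implement _connect()."""

    def __init__(self, address):
        self.address = address
        self.com = self._connect(address)

    def _connect(self, address):
        # device-specific socket setup lives in the subclass
        raise NotImplementedError

    def reset(self):
        """Drop the old connection and establish a fresh one."""
        try:
            self.com.close()
        except OSError:
            pass
        self.com = self._connect(self.address)

    def send(self, msg):
        """Send msg, reconnecting and retrying once on failure."""
        try:
            self.com.sendall(msg)
        except socket.error:
            self.reset()
            try:
                self.com.sendall(msg)
            except socket.error:
                raise ConnectionResetError
```

Whether the message-prep differences between devices make such a base class practical is exactly the open question noted above; this only shows where the shared retry path could live.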
Other Connection Types
This discussion very much focuses on TCP. I have to give serial some more thought, but it can probably be handled fairly similarly. Then we need to dive into the other types like SNMP, though I believe we might already have some sort of reconnection there.
Thoughts on this design welcome! It sounded like the right direction when talking on the phone the other day. This'll need to happen in coordination with simonsobs/ocs#357 so that we can tell when the connection has degraded and alert against it. It'll be great to start implementing this, though, and to form the general recommendation for how these connections can be made robust.