Overview
The problem of maintaining a robust connection between agents and their corresponding devices is one we haven't really addressed across all agents. Currently, most agents do not handle a network interruption well. In most cases, this breaks the TCP connection, causing a process in the agent to crash, but the agent remains online.
The current solution is to restart the process, either directly or by restarting the entire agent.
We haven't made any general recommendations for how to handle connection dropouts; as a result, there are various implementations for recovering the connection across a handful of agents.
Two solutions come to mind:
1. If the connection is broken, the Agent should be shut down. This allows the HostManager to recover the Agent.
2. Keep the Agent online through network outages, recovering the connection automatically. Effectively, this makes running the Agent independent of the presence of the device.
My preference is to implement option 2, but I will first briefly describe a partial implementation of option 1. (Feel free to skip the next section.)
HostManager Reconnection
The best example I have for how this can be accomplished is in the Lakeshore 372 Agent. This Agent will attempt to establish a connection to the 372 in the init_lakeshore task (socs/socs/agents/lakeshore372/agent.py, lines 204 to 213 at 147149e). If the connection cannot be made (the 372 is offline) it will stop the reactor, bringing down the agent.
This does not handle dropouts post-initialization, but you could imagine doing a similar thing when the connection breaks during normal operation.
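For illustration, the stop-the-reactor pattern might be sketched as below. This is a hypothetical sketch, not the actual Lakeshore 372 code: the function name, the injected `connect` callable, and the `reactor` parameter (standing in for Twisted's global reactor) are all assumptions made for the example.

```python
def init_task_sketch(reactor, connect):
    """Hypothetical sketch: try to connect to the device, and stop the
    reactor (bringing the whole agent down) if it is unreachable."""
    try:
        connect()
    except OSError:  # ConnectionError and TimeoutError are subclasses
        reactor.stop()  # the HostManager can then restart the agent
        return False, "Device unreachable; stopping agent."
    return True, "Connected to device."
```

The same check could run on a connection failure mid-operation, which would extend this behavior past initialization.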
This solution has historically been requested by users, as it brings down the agent and Docker container, which is perhaps the most obvious sign that something is wrong. If the agent was online with a crashed process, it is perhaps less clear data isn't being collected.
Dynamically Reconnecting
Rather than performing a hard shutdown of the agent and letting the HostManager restart things, this option recognizes a connection drop, and dynamically reconnects to the device if possible. This can be handled at the agent level or the driver level. Most existing implementations are at the agent level, but I'd like to advocate for moving this lower -- to the drivers.
For completeness, we'll describe the two implementations.
Reconnecting in the Agent
Reconnecting in the Agent is the most common currently implemented solution. It involves catching exceptions from any command interacting with the device, then rerunning an initialization function or task before letting the process loop again. This ends up looking something like:
```python
def acq(self, session, params):
    while True:
        if not self.initialized:
            self.agent.start('initialize')
        if self.initialized:
            try:
                get_data_from_device()
            except Exception:
                # mark the connection as broken so we re-initialize
                self.initialized = False
        sleep(1)  # wait a second before trying again
```
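The 'initialize' task that the loop above restarts might look something like the sketch below. This is an assumption-laden illustration: `self.device` and its `connect()` method are hypothetical names, and the real task varies per agent.

```python
def initialize(self, session, params):
    """Hypothetical 'initialize' task: (re)connect to the device and set
    the flag that the acq loop checks. self.device and connect() are
    assumed names, not a real agent's API."""
    try:
        self.device.connect()
    except OSError:  # covers ConnectionError and TimeoutError
        self.initialized = False
        return False, "Could not connect to device."
    self.initialized = True
    return True, "Device initialized."
```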
Reconnecting in the Drivers
Reconnecting in the drivers avoids including reconnect logic (and especially the try/catches) in the Agent code. This is especially relevant for longer tasks/processes that have many function calls that interact with the hardware.
I'm going to get into a more detailed example of how I think this should work, but wanted to point out part of the inspiration comes from this open PR.
With the caveat that this isn't functional, nor very elegant, and is partly pseudo-code -- this is what I imagine it looking like. Starting with the agent:
```python
# Agent
from time import sleep

from drivers import Device


class ExampleAgent:
    def __init__(self, arguments):
        self.device = Device(arguments['address'])

    def acq(self, session, params):
        while True:
            try:
                # could generally be a function with many underlying
                # get_data_from_device() calls
                data = self.device.get_data_from_device("acq command")
            except ConnectionResetError:  # raised by driver
                self.log.error("Failed to get data from device. "
                               "Device seems unavailable.")
                sleep(1)  # wait a second before trying again
                continue
            publish_to_feed(data)

    def task(self, session, params):
        try:
            resp = self.device.unique_function1(arg1, arg2)
        except ConnectionResetError:
            self.log.error("Communication error.")
            return False, "Communication error."
        return True, "Accomplished task"
```
The drivers then have the reconnect logic. I think we could feasibly provide some base class that handles the reconnection logic, but need to try it out to know how well that'll work in practice.
```python
# Drivers
import selectors
import socket

TIMEOUT = 60  # seconds
BUFFER_SIZE = 4096


class Device:
    def __init__(self, address):
        self.com = self._connect(address)
        # re-running __init__ re-establishes the connection
        self.reset = lambda: self.__init__(address)

    def _connect(self, address):
        com = socket.socket()
        # important to set a timeout in case the device is offline on the
        # connection attempt, else we just block
        com.settimeout(TIMEOUT)
        try:
            com.connect(address)
        except TimeoutError:
            print(f"Connection not established within {TIMEOUT} seconds.")
        return com

    def _write(self, msg):
        # handle any message prep here, adding terminating characters,
        # encoding, etc.? That maybe makes it hard to generalize this to a
        # base class though, since that can be unique per device.

        # Reconnect after first failure
        try:
            self.com.sendall(msg)  # could also wrap some 3rd party module
                                   # similarly, though likely at a higher level
        except socket.error:
            print("Failed to send message. Reconnecting and trying once more.")
            self.reset()
            # Raise error after second failure
            try:
                self.com.sendall(msg)
            except socket.error:
                raise ConnectionResetError

    def _read(self):
        # check if the socket is ready to read from
        self._check_ready()
        # handle any decoding to str, stripping, etc. here
        data = self.com.recv(BUFFER_SIZE)
        return data

    def _check_ready(self):
        """Check socket is ready to read from."""
        sel = selectors.DefaultSelector()
        sel.register(self.com, selectors.EVENT_READ)
        ready = sel.select(TIMEOUT)
        sel.close()
        if not ready:
            raise ConnectionResetError

    def get_data_from_device(self, command):
        # combination write/read to return data from a given command
        self._write(command)  # raises ConnectionResetError on repeated failure
        data = self._read()
        return data

    def unique_function1(self, arg1, arg2):
        return self.get_data_from_device('command')
```
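As mentioned above, a base class could feasibly absorb the retry logic so each driver only supplies device-specific setup. Here is a minimal sketch of that idea, with the caveat that every name is hypothetical and each subclass is assumed to provide its own `_connect()`:

```python
import socket


class ReconnectingDevice:
    """Hypothetical base class: retries a broken connection once before
    raising ConnectionResetError. Subclasses implement _connect()."""

    def __init__(self, address):
        self.address = address
        self.com = self._connect(address)

    def _connect(self, address):
        # device-specific socket setup lives in the subclass
        raise NotImplementedError

    def reset(self):
        """Drop the old connection and establish a fresh one."""
        try:
            self.com.close()
        except OSError:
            pass
        self.com = self._connect(self.address)

    def send(self, msg):
        """Send msg, reconnecting and retrying once on failure."""
        try:
            self.com.sendall(msg)
        except socket.error:
            self.reset()
            try:
                self.com.sendall(msg)
            except socket.error:
                raise ConnectionResetError
```

Whether the message-prep differences between devices make such a base class practical is exactly the open question noted above; this only shows where the shared retry path could live.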
Other Connection Types
This discussion very much focuses on TCP. I have to give serial some more thought, but it can probably be handled fairly similarly. Then we need to dive into the other types like SNMP, though I believe we might already have some sort of reconnection there.
Thoughts on this design welcome! It sounded like the right direction when talking on the phone the other day. This'll need to happen in coordination with simonsobs/ocs#357 so that we can tell when the connection has degraded and alert against it. It'll be great to start implementing this, though, and to form the general recommendation for how these connections can be made robust.