implement backoff for reconnect #112
base: master
Conversation
The current reconnect feature doesn't support backoff, so it's not really useful since spamming reconnects is not good behaviour. This introduces a backoff feature, taking inspiration from https://hex.pm/packages/connection. This was meant to be a small change; however, with the existing `async` and `handle_initial_conn_failure` options, the solution turned out to be incoherent and also uncovered existing bugs with reconnect during init that could indefinitely block the supervisor startup process. As a result, this change makes breaking changes to the API. In particular, async init is now the default and only behaviour, i.e. a network connection is only made after finalizing the special process init.
Force-pushed from 28caf5c to 6c3b831
The current way of sending messages via `handle_cast`, `handle_info` callbacks doesn't allow the user to know whether the message was sent successfully or not. This is particularly important now that a connection might be in a backoff state for a long time. This adds a new callback, `handle_send_result` which gets called on success or error of processing a message to be sent. Note that this doesn't apply to `send_frame` since that is synchronous and already provides feedback. This isn't named `handle_reply_result` because I believe `:reply` should be renamed since it only makes sense in `handle_frame`, but not in `handle_cast` and `handle_info`. This also allows passing through a key to correlate a result with a send request.
Force-pushed from 6c3b831 to 0b5c9a1
Currently Stargate prevents the application from starting up, or causes it to crash, if Pulsar is or becomes unavailable. This is a result of how WebSockex is implemented. This builds on a refactor of WebSockex at Azolo/websockex#112 which implements connection management async to process startup. This makes Stargate more production-ready since it will allow an application to degrade gracefully when Pulsar isn't available temporarily. Main changes as a result of this:

1. A reconnect backoff feature is added that can be customized by the user.
2. Since Stargate will continue running even when not connected to Pulsar, the user needs to know when messages aren't being produced successfully. A breaking change was made to the async ACK MFA where the first argument is now the result of the produce, which may be either `:ok` or `{:error, reason}`.
```markdown
## Next
### Enhancements
- Added `backoff` feature for reconnecting
- `send_frame/2` is no longer processed during `opening`, i.e. it is queued until `connected` or failure to connect.
```
This is actually a breaking change, and potentially a large one in my opinion.
I'm torn on this. I understand that it could make the library more convenient, but it could also cause a message to be sent prematurely before some kind of required connection setup was finished.
Adding another method or keyword option would be my preference, but that means more stuff to maintain. It could be useful though. I remain somewhat on the fence.
> I'm torn on this. I understand that it could make the library more convenient, but it could also cause a message to be sent prematurely before some kind of required connection setup was finished.
Can you elaborate on this? Maybe I'm missing something but I'm not sure when a message could be sent prematurely because of this?
Initially I had failing tests since they currently assume that when the process is started, the connection is connected and able to send messages. Before I realized that I could make this change with selective receive, so that the `send_frame` message waits in the inbox until the connection is connected, I added a feature `WebSockex.await_status(pid, :connected)` so that you could do the following:
```elixir
{:ok, pid} = WebSockex.start_link(url)
WebSockex.await_status(pid, :connected)
WebSockex.send_frame(pid, message)
```
However I removed this and instead relied on selective receive because it means less work for the user and one less change in this PR.
`handle_connect` is used to queue up initial messages. By casting to itself, a process can queue up messages to send in a specific order after startup.

By handling `send_frame` after the connection is established, we actually put ourselves at much higher risk of a race condition. A frame could be sent before the initial connection messages were sent.
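The cast-to-self pattern described above might look something like this sketch (all frame contents and state fields here are illustrative, not from the PR):

```elixir
defmodule AuthedClient do
  use WebSockex

  # On connect, cast setup frames to ourselves so they are queued in the
  # mailbox, in order, ahead of any send_frame calls made by other
  # processes after startup.
  def handle_connect(_conn, state) do
    WebSockex.cast(self(), {:send, {:text, "auth:" <> state.token}})
    WebSockex.cast(self(), {:send, {:text, "subscribe:" <> state.topic}})
    {:ok, state}
  end

  # Each queued cast is replied to the server as a frame.
  def handle_cast({:send, frame}, state) do
    {:reply, frame, state}
  end
end
```

If `send_frame` is instead held back and processed only once connected, a frame from another process could interleave with (or precede) these setup casts, which is the race being pointed out.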
ok, I have 2 thoughts on this.

1. I'm just exploring how this could work better ignoring breaking changes. Maybe first read my comment on `handle_send_result`. If instead of `send_frame` we have `handle_send_result` and `handle_call` (which would allow a user to implement their own `send_frame`), and we have `init`, then the user could create a message queue in their state during `init` for startup messages, and then `handle_call` could add messages to the queue as well, and the queue can be processed in the right order starting from `handle_connect`. I also think that adding `handle_continue` support would be useful here. In fact, if we just add `handle_continue` then this can be used from `handle_connect` to process initial messages before `send_frame` messages by calling `:continue` in `handle_connect`.

2. Without making breaking changes, we need a way to wait for the connection to be established before calling `send_frame`. This can either be done by adding `await_status` as above, or by adding `handle_call` which has access to the current status. The user can theoretically keep track of the current status with `handle_connect` and `handle_disconnect` in their own state, and implement `handle_call` on top of `handle_cast`, but it's quite a lot of extra stuff for each user to add which is already a solved problem in a plain `GenServer`. On the one hand I kinda like adding `await_status` natively because it requires no extra work from the user, but on the other hand I'm really seeing that everything could be handled much better with more flexibility if `WebSockex` is on par with `GenServer`, namely having `init`, `handle_call`, `handle_continue`.
I was trying to solve all the issues I ran into with adding backoff without adding a lot of other features, but it seems that we'd be better served by first adding the other features that we can then more easily build on here.
```markdown
- Connection is always established asynchronously to process init
- `async` option is removed since it's now redundant
- `handle_initial_conn_failure` option is removed since it's no longer applicable
```
This actually ruins one of the ways I'm using WebSockex myself. 😢

In my use case, there are a couple of other WebSockex processes that depend on the connection being open before they start; then, when things disconnect, the supervisor uses a `:rest_for_one` policy to reconnect.
Originally, reconnecting was actually more of a feature request that I challenged myself to do. I rarely used it myself. 😣
Hmm, the reason why I removed sync connect is because, as far as I know, it's bad practice to handle a potentially long task in the init stage of a special process, and in e.g. a `GenServer` this would normally be done by calling `:continue` in the `init` callback. The reason for this is that it can block the entire supervisor tree startup until `init` returns.

In the case of `WebSockex`, this can happen if:

- the server you're connecting to is blocked by a firewall which drops all packets, so it hangs either until TCP timeout or an application timeout (I think `WebSockex` has either a 5s or 30s timeout)
- you combine sync connect with infinite reconnect, which means `init` never returns.

You can also observe the result of this by making a simple `WebSockex` client with sync connect and infinite reconnect and just doing a `WebSockex.start` inside an `iex` session. The entire `iex` session locks up so that you can't even do anything.

I prefer not to give users (of this library) footguns and unexpected behaviour, which is why I decided to remove the ability to do this.

I can see that the biggest issue is actually the combination of sync connect and reconnect, and that these 2 features are actually at odds with co-existing at the same time.
> Hmm, the reason why I removed sync connect is because, as far as I know, it's bad practice to handle a potentially long task in the init stage of a special process
This can definitely be true in the Application supervisor, but sometimes even that is necessary. I don't think you can consider it a bad practice, but it can be annoying to debug. On the other hand, it can also fit really well into the failure-as-a-feature mentality.

But there are people who want "footguns" simply because it fits into their application flow easier. That being said, `handle_initial_conn_failure` is a long and slightly convoluted option that is disabled by default for that reason. It was a requested feature, by multiple people IIRC.

But my main argument will be that `async` unique startup is a feature. I don't expect `:gen_tcp` to start up and then give me an error that it can't connect after the incredibly long TCP timeout, when I've already done other things with the assumption that I have a valid connection. I wouldn't expect WebSockex to start up without having a WebSocket connection established.
```elixir
{:reconnect, new_state} ->
  {:reconnect, state.conn, new_state}

{:backoff, timeout, new_conn, new_state} ->
  {:backoff, timeout, new_conn, new_state}

{:reconnect, new_conn, new_state} ->
  {:reconnect, new_conn, new_state}

{:backoff, timeout, new_state} ->
  {:backoff, timeout, state.conn, new_state}
```
I actually don't understand why we would remove `:reconnect` instead of just adding `:backoff`.
TBH it doesn't necessarily even have to be called `:backoff`. I don't think that it would really be confusing to just have:

```elixir
{:reconnect, timeout, state}
```

Would you prefer to keep the `reconnect` word rather than `backoff`?

What I don't want though is to keep the existing behaviour of reconnect, which basically does a `while true: reconnect`. IMO this is one of those footguns that shouldn't exist. It results in spamming the server with reconnects, and I've found that if a library enables bad behaviour by default, then this behaviour will end up occurring in the wild and make some poor devs/SREs very unhappy.

Of course a user could opt into this bad behaviour by doing `{:reconnect, 0, state}`, but at least that's an explicit choice, and with an explicit choice a user might be more inclined to specify a timeout of 1s or something.

If we want to keep the existing result with no timeout, `{:reconnect, state}`, then IMO we should change the behaviour to add a default backoff of at least 1s or something. Of course this could also be considered a breaking change, and I would personally prefer making the explicit breaking change of requiring an explicit timeout, so that users opt into the change of behaviour by making the small refactor of `{:reconnect, state}` to `{:reconnect, timeout, state}`.
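For concreteness, here is a sketch of the kind of user-chosen strategy an explicit timeout would enable. It assumes the `{:reconnect, timeout, state}` return shape proposed above; the `delay/3` helper and the `attempt` state field are hypothetical, not part of the PR:

```elixir
defmodule Backoff do
  # Exponential backoff with a cap: doubles a base delay per attempt,
  # never exceeding max_ms. Pure function, so it is easy to test.
  def delay(attempt, base_ms \\ 1_000, max_ms \\ 30_000) do
    min(base_ms * Integer.pow(2, attempt), max_ms)
  end
end

# Hypothetical usage inside a handle_disconnect callback, tracking the
# attempt count in the user's own state:
#
#   def handle_disconnect(_status, %{attempt: n} = state) do
#     {:reconnect, Backoff.delay(n), %{state | attempt: n + 1}}
#   end
```

Because the timeout is computed by the user, any policy (constant, exponential, jittered) fits the same return tuple.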
I can buy that.

I would say keep `:reconnect` and default `{:reconnect, state}` to `{:reconnect, default, state}`. That way the breaking change is the handling of messages.

I also think that we may need to nest `receive` blocks to catch the reconnect message as a priority.
```elixir
defp handle_send_result(result, frame, key, state) do
  result = try_callback(state.module, :handle_send_result, [result, frame, key, state.module_state])

  case result do
    {:ok, new_state} ->
      {:ok, new_state}

    {:close, new_state} ->
      {:close, new_state}

    {:close, close_frame, new_state} ->
      {:close, close_frame, new_state}

    {:"$EXIT", _} = res ->
      res

    badreply ->
      {:"$EXIT",
       %WebSockex.BadResponseError{
         module: state.module,
         function: :handle_send_result,
         args: [result, frame, state.module_state],
         response: badreply
       }}
  end
end
```
I understand the desire for this, but I think that this is actually not a great idea.

TL;DR: Just because a message is sent doesn't mean that it was received. Rebuilding messages that need to be sent should depend on what the server has processed.
The reason why is that the WebSocket protocol actually doesn't have an acknowledgement mechanism built into it. Most people build those mechanisms around the protocol, erroring out if processing a message out of order, etc. We have a guarantee that our message was sent because of TCP guarantees, but we have no idea if our message was processed without an acknowledgement.
Some servers do this with custom ping messages wrapped in data frames, some servers require unsolicited pong messages with data payloads, some servers only use WebSockets for unidirectional data. There's a whole world out there of ways of handling messages that were sent.
But because the underlying mechanism is TCP, you could send 5 additional frames before the server responds that it is unable to process a frame. We may be able to say that this frame "failed to send", but the current frame wasn't the problem. Likewise, your last frame could have succeeded in getting to the proxy, but it never made it to the server.
So, I think that offering this kind of functionality would actually lead to bad practices in the design of the connection itself. Deciding on what needs to be sent or resent on a connection should be determined after getting some state from the server.
Ye, I understand there are many network failures that could occur. I actually added this because adding a backoff state changes the behaviour quite a bit. For example, if exponential backoff is implemented, it's not unrealistic that someone could set a max backoff to 10s depending on their application/use case. But IMO this still applies even for a backoff of 1s.

On the one hand, we don't want to prevent `handle_cast` messages from being processed while in the backoff state (i.e. excluding `handle_cast` messages from the selective receive in backoff state).

On the other hand, doing a `:reply` while in the backoff state results in a 100% chance of failure, which can be acted upon immediately by e.g. stopping sending more messages until connected again, or in my use case, NACKing the message immediately (see jeffgrunewald/stargate#26). Stopping sending messages until connected can of course be implemented via the `handle_connect` and `handle_disconnect` callbacks, but there's still the chance of messages getting lost because of race conditions, which can then only be NACK'ed on a timeout.

So yes, this isn't a replacement for application ACKs, but rather a way to give fast feedback on messages guaranteed NOT to be delivered during a backoff state.
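The fast-NACK use case might look roughly like this sketch. It assumes the `handle_send_result` callback proposed in this PR (called with the send result, the frame, the correlation key, and the user state); the `{:msg_id, id}` key shape and the `nack/1` helper are illustrative, not part of the PR:

```elixir
# Success: nothing to do beyond keeping state.
def handle_send_result(:ok, _frame, _key, state) do
  {:ok, state}
end

# Error: the message is guaranteed not to have been delivered (e.g. the
# connection was in a backoff state), so NACK it immediately instead of
# waiting for an application-level timeout.
def handle_send_result({:error, _reason}, _frame, {:msg_id, id}, state) do
  nack(id)
  {:ok, state}
end
```

The correlation key is what lets the user map a failed `:reply` back to the original application message.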
You're not wrong, but after reading your explanation I feel the deficiency in the WebSockex API may be that we need to have some kind of connection state passed to `handle_cast` and `handle_info`.

What are your thoughts on that?
I thought about this and decided against it because that solution still allows for race conditions.

If in `handle_cast` you check that the connection is connected and then do a `reply`, and the connection drops before the `reply` is processed, then you still lose a message. Having a callback `handle_send_result` puts `reply` in the same position as `send_frame`, just that the handling is async. `send_frame` gives you feedback on whether the TCP packet was sent or if the connection wasn't established, e.g. by returning a `NotConnected` error.

IMO having these 2 APIs be equally powerful in terms of the guarantees/feedback they provide is good.

The other thing I thought is that people have also asked for allowing the same features as `GenServer`. If `WebSockex` adds `handle_call` support, then together with `handle_send_result`, a user is able to implement `send_frame` without it being a native API.

In general I feel that async APIs should be more powerful, not less powerful, than sync APIs, and in any case, in Elixir, sync APIs are always an abstraction over async APIs. So it just feels wrong that `reply` is less powerful/reliable than `send_frame` in terms of feedback/reliability.
I think we have some disagreements on the underlying reliability of the WebSocket protocol. I tried to explain my point of view. I also feel like you are gutting some important functionality for my personal use case, so I tried to point that out. On the other hand, I understand the desire to get rid of some of the complexity that results from some of the features implemented for ease of use.

I think that unfortunately, there aren't as many changes that you can make without removing behavior that someone like me depends upon. The other problem is that WebSocket servers are so diverse in their implementations that we can't actually assume that the protocol is implemented 100% correctly.

I do really appreciate this though, and hope we can continue having a conversation on how to get this functionality into WebSockex.
I wasn't planning on making so many changes when I set out to add backoff support.

However, I do prefer small PRs, so if we can agree on a subset of the changes then I can try to split it up into smaller PRs. Let's see what we can agree on.
related to #111.
The current reconnect feature doesn't support backoff, so it's not
really useful since spamming reconnects is not good behaviour.
This introduces a backoff feature, taking inspiration from
https://hex.pm/packages/connection.
This was meant to be a small change, however with the existing
async and handle_initial_conn_failure options, the solution turned out
to be incoherent and also uncovered existing bugs with reconnect during
init that could indefinitely block the supervisor startup process.
As a result, this change makes breaking changes to the api, and
introduces some new apis to recover some previous functionality.
In particular, async init is now the default and only behaviour, ie.
a network connection is only made after finalizing the special process
init.
Additionally, I've removed `send_frame` from the selective receive during `connecting` status, since it is not expected for a connection to be in this status for very long, and this removes the need for an extra API to wait for the connection being `connected`. You can now do the following:
I realised that currently `handle_cast` and `handle_info` messages are only handled when in the `websocket_loop`, i.e. when connected, because those can both return `{:reply, frame, state}`, which would fail if not connected at that point. This is probably currently fine since `open_loop` and `close_loop` aren't expected to last that long, so it's fine for a message to wait in the inbox until the connection is established.

However, since I introduced a `backoff_loop`, this is a problem because either `handle_cast` and `handle_info` messages must be handled in `backoff_loop` and `:reply` has to somehow be handled differently when not connected, or `handle_cast` and `handle_info` messages must NOT be handled in `backoff_loop`, which means messages could wait in the inbox for potentially a long time, especially since the backoff is given by the library user, which could conceivably be > 10 s if exponential backoff is implemented (this flexibility is provided precisely because we want the user to be able to choose their own backoff strategy for their use case).

Using `{:reply, frame, state}` already makes sending messages unreliable in the sense that the user isn't given feedback on whether the message was actually successfully sent, despite websocket itself being a reliable transport. This means that using this could result in messages unexpectedly getting lost, although currently if `:reply` results in any error then the connection is closed, and in practice this probably mostly happens because of connection errors.

I'm not sure how many people think about this when choosing between `send_frame` and `:reply`, but it is suggested to use `:reply` instead of `send_frame` (#34), which I don't think is a good thing if you want reliability.

I've added an additional callback, `handle_send_result`, which is called on success or error of processing a `:reply` action. Note that I've not named it `handle_reply_result` because I think that the `:reply` action should be renamed, since it only makes sense in `handle_frame`, but not in `handle_cast` and `handle_info`. This allows sending a message via `:reply` to be just as reliable as `send_frame`, since the user is given feedback on whether the message was sent or not.