-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unhandled server-side disconnects #61
Comments
There are two kinds of disconnects - the web socket disconnect event, and the socket read returning nil or throwing an exception (e.g. EPIPE). Are you detecting the later cases? Are you sure that's covering all cases? Is this returning nil? Are you handling that case? |
Thanks @ioquatix! I didn't realize the socket read can return Help me write the correct code here? What should it look like on I would have expected to never need this code (and the driver to emit this in all the right cases):
Are we doing it wrong? Will add logging to see if we're getting a |
Ruby sockets are a bit of a mess. A read of length 0 gives nil return. You also need to handle empty string, i.e. |
I kinda expected the driver to do this for me as it emits other kinds of disconnects. Am I asking for too much? |
Yes. Do I think it's annoying? Here is an implementation that works: |
This doesn't look like the problem :( I added this code to log all messages, handle and show I never see one.
That last down is from a ping thread that watches slack status and restarts a bot when slack reports it's down. I noticed in your implementation you implement |
I also have a piece of ping code that can identify a socket/driver stuck in this disconnected state. I can examine anything on the object when that happens, ie. an instance of |
I wonder if there are bugs in |
As you know if it's a bug in Celluloid, it's unlikely to ever get fixed. |
I am seeing more new problems, now with the DNS resolver ... sigh. What should I be replacing celluloid-io with? |
I already shower how to use async with slack: https://gist.github.com/ioquatix/c24f107e2cc7f48e571a37e8e93b0cda |
Yes thank you @ioquatix for being helpful. |
Let me know if I can be of further assistance. |
Well given all the money I am paying you why don't you just implement slack-ruby/slack-ruby-client#210 :))) I am KIDDING totally of course. I'll eventually get to it. Much much thanks for being responsive and pitching in your comments, suggestions and thoughts. Let me know how I can be helpful. |
Hehe. Well, if you want to pay me money to make it work, I won't stop you :) |
I want to resurrect this since we've implemented async support in slack-ruby/slack-ruby-client#219 and the problem persists. Can someone help me understand what code in this library handles the ping/pong frames per https://tools.ietf.org/html/rfc6455#section-5.5.2? What's the best way to observe these? What happens if the server decides to go away and how does a |
The next step to debugging this is to dump all socket read/write events. |
I also wouldn’t rely on a close event from the websocket driver at worst you won’t receive it. |
It seems that you're saying that the websocket driver implementation can't always notice that it got disconnected? What good is such implementation for? |
It’s for graceful shutdown of the connection. If the underlying TCP connection fails e.g. timeout or someone trips over the ethernet, there won’t be any high level close event. |
I don't get it. How's a client of the web socket supposed to know someone pulled the cable? How do other language libraries handle this? |
|
yep, its not the most usable API, @dblock. this is actually the entire purpose of sending "ping" messages. it gives both the server and client a mechanism to test whether the connection is alive, rather than depending on a as a tangent: it's not really the fault of the implementors or designers of the WebsSocket API. even TCP connections (lower level protocol) have a similar problem that when there's no traffic you can't really tell the difference between a dead connection and one where there's just nothing being said. that's why in general we call these ping/pong messages a keep-alive mechanism. |
First, it's nobody's fault, I am just trying to get to the bottom of an issue that's affecting hundreds if not thousands of bots. Everyone who's contributing and helping is amazing, including @aoberoi and @ioquatix.
I see, that makes sense. Not the "best" API, but if a disconnect causes a
You two @ioquatix and @aoberoi are contradicting each-other. It seems like @ioquatix is saying the implementation "handles" disconnects and will return |
Okay let me back up a step and confirm my understanding of what's happening with you both (and anyone else who finds this issue interesting). Apps built using Currently, we're determining that the connection is unresponsive by periodically sending a request to Slack's Web API method I don't think @ioquatix and myself are disagreeing (but please correct me if I'm wrong). We're both saying that returning Again, this is my current understanding, but I admit that I've not read all the code or followed every development along the way, so I'm happy to be corrected. |
I think that’s a reasonable assessment. If the TCP connection is still alive but the websocket session is dead, a periodic ping at the websocket layer should detect it. If TCP and websockets are still working but the bot is non-functional for some other high level reason, then checking if the bot is in the channel might be the only solution. |
I agree, this seems reasonable. I am going to first get our current situation wrapped up, ie. get to a mergeable state for slack-ruby/slack-ruby-client#219, slack-ruby/slack-ruby-bot#198 and slack-ruby/slack-ruby-bot-server#75 that replaces celluloid with async and force-closes the websocket when the bot goes offline according to slack status. Next I'll try (unless someone beats me to it :) to build a websocket-level ping monitoring worker into slack-ruby-client itself, slack-ruby/slack-ruby-client#223. |
Apologies for taking a while to get to this and thank you to @ioquatix for pitching in :) I think the root of this issue is a question of scope: this library does not handle any I/O or network concerns for you, it only handles the WebSocket protocol data flow. This is because Ruby has a huge variety of I/O and concurrency frameworks, and we wanted something that could be combined with any of them. Therefore, how you interact with the network socket is out of scope: you read from it, and feed the results into The driver is only capable of detecting events that result from data coming over the wire. Any network-related events like disconnections need to be handled by the caller -- if your TCP socket becomes disconnected, the driver cannot detect that and you need to handle it yourself. This library will emit a If you want to see how ping/pong is handled, that happens here: https://github.com/faye/websocket-driver-ruby/blob/0.7.0/lib/websocket/driver/hybi.rb#L369-L378 If this is still a problem and is in scope for us to fix in this library, can you leave further comments? Otherwise let us know this is out of scope and close the issue :) Happy to try and answer any other questions you might have. |
One thing I wanted to add is that we do provide an interface for sending a ping message and detecting the peer's response to it -- the block will execute if/when we receive a matching pong frame from the remote peer: driver.ping("my ping message") do
# this runs if/when we receive a pong response
end However as this library is agnostic about your concurrency framework, it does not support calling For example, faye-websocket combines this library with EventMachine to produce a complete client. It can be configured to call https://github.com/faye/faye-websocket-ruby/blob/0.10.7/lib/faye/websocket/api.rb#L52-L57 |
Is this an ongoing issue, on can I close it @dblock ? |
My team is still seeing at least 1 disconnect every day. Our workaround is some extra code to ping every minute to test that the connection is alive, and if not, close and reconnect as needed. That is working fine but the underlying issue is still there. |
We implemented ping in slack-ruby-client, read https://code.dblock.org/2019/03/04/solving-slack-side-disconnects-in-slack-ruby-client.html. If you think the driver should be doing this, then leave open. Otherwise close (and document)? |
Thanks, I didn't realize there was a higher-level fix here: slack-ruby/slack-ruby-client#226. Seems like this issue can be closed, then. |
This may not be what you want! I think an argument could be for the websocket driver to maintain the connection all on its own. After all, it's surprising to users that a driver would lose a connection pretty much guaranteed at some point. |
Hmm. I am not a consumer of this library directly; I'm using it via slack-ruby-client. So I don't think I am knowledgable enough to weigh in on this question. Sorry if I derailed this thread! |
@dblock Can you elaborate on why this is surprising, because it would help me understand how I might have mis-communicated the fact that the network (or other I/O) layer is out of scope for this library and it is not capable by design of handling events from the network layer. |
@dblock I notice in that blog post you have the following code: def run_loop
loop { read }
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE => e
driver.emit(:close, WebSocket::Driver::CloseEvent.new(1001, 'server closed connection'))
end I would not advise calling |
Because ping is part of the websocket protocol to maintain connections? I totally see your point, though, re:design. I guess someone has to maintain the connection via |
We spent so much time getting this to work in 100% of the cases that I am going to leave this as is until someone comes up with better :) In this case we need the websocket driver to abort itself (this is the scenario where the server went away but the driver is sitting there waiting). |
I did want to add a thank you to everyone here for keeping the thread alive and for writing this driver and maintaining it and more. I am not a websocket expert by any means, I am in the "i thought it would just work" camp :) |
@dblock We might be talking at cross purposes. What I was trying to say is that this library only handles the WebSocket protocol, which is to say everything related to the data that goes over the socket. It does not handle the socket itself, and the server going away or closing the connection is a socket-related event that is out of scope for this library. It also doesn't take a position on when or how often So the question I was asking is about how we could have communicated this better in the documentation, so that you would have known you needed to handle this yourself? |
Probably an FAQ? Frankly I don't know if I would have discovered it, but it would have been nice to read a detailed explanation on the responsibility of clients to maintain connections. I found https://django-websocket-redis.readthedocs.io/en/latest/heartbeats.html that seems to explain something similar in python, or https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers#Pings_and_Pongs_The_Heartbeat_of_WebSockets. I do think a library that is linked from this project that implements a connected, and maintained websocket, with ping, is a better answer, but I am not writing one ATM. |
That Django documentation highlights what I was trying to say about your WebSocket keep-alive mechanism depending on the application protocol you're running on top of it. For example, this library was developed as part of the Faye project, which implements a protocol called Bayeux. Faye used to implement heartbeats on the client side by sending an empty JSON list of messages ( In the case of Bayeux, that protocol has its own heartbeat mechanism that works over WebSocket, HTTP or any other transport -- it doesn't rely on the transport layer to do keep-alive itself. Nevertheless the Faye server can be configured to send WebSocket ping messages if desired. You'll also find that it doesn't necessarily make sense to make a WebSocket automatically reconnect on failure, because the application layer needs to know about that interruption in case it resulted in dropped messages. So, there's not really a neat solution to "just keeping a WebSocket connected" that doesn't involve the application layer, unfortunately. |
I have a sneaky suspicion that these experiences are all due to "keep alive" mechanisms. What I have noticed is that the client does not respond to ping messages with a pong message. In the api integration I am doing, the websocket server expects pongs to pings it sends. This might sound like an implementation detail, it does seem that as the ping pong is part of the protocol, that this is missing from the client. I do see that in the hybi driver, there is
The server I am joining is disconnecting me as if I didn't respond to its pong message - will do further investigation (wireshark) |
I see, my issue with ping | pong not enabled is that my ping setting was not set as per suggested answer here |
@tjad Just to confirm: |
@jcoglan Thanks a lot. Yeah makes sense. Now receiving and handling opcode 10s . Perhaps in this case server was expecting inital ping. All working as expected. |
Coming from slack-ruby/slack-ruby-client#208 and celluloid/celluloid-io#187, we're seeing the Slack server drop connections and the client not noticing and being stuck in a run loop.
There's some chance this is a bug here or misunderstanding in the way we use websocket-ruby-driver. Initially I thought this was celluloid-io, but I think this is the driver not handling something correctly.
This currently affects many thousands of users as the slack-ruby-client library is widely used in production and we haven't had much luck even when they looked at it from their side via @aoberoi.
Questions:
The text was updated successfully, but these errors were encountered: