How robust is the callback API when the RabbitMQ machine keeps changing? #122
Comments
What do you mean by "alternating"? |
It's a redundancy scheme: at any given time a single machine is running, out of two machines with RabbitMQ set up. Should one machine fail or otherwise be unavailable, the other is supposed to take over. Besides that, they also alternate from time to time (I'm not sure why; it must be due to load balancing). |
Do you mean that each (AMQP) connection made from amqplib will be to one of two RabbitMQs? In that case, amqplib should behave fine. Do you shut the RabbitMQs down? What happens to the connections when you do that? |
I'll be sure to return with the details; I'm not sure how the alternation works. The AMQP connection is to a single RabbitMQ though; the underlying instance is changed from time to time. |
If you mean that you swap out the VM while keeping the volumes and networking intact, that may work; anything else, I think you would cause any AMQP library (and many other bits of software) a conniption. |
Sorry, I had provided the wrong explanation. I understand it now. The two RabbitMQ instances are never shut down; what happens is that the AMQP connection is lost from time to time (it is a limitation that arises from a high-availability environment with redundant VMs, and we have to cope with it). If an attempt to reconnect is made, DNS chooses which instance to connect to (it is a cluster, so it doesn't matter which one). The problem is that the attempt to reconnect is never made; amqp.node apparently fails to notice that the connection is lost. We have a heartbeat timeout set in the host URL and are listening for 'error' and 'close' events, but they are apparently never emitted. The queues expect the consumed messages to be ack'ed. We want the node script to detect a lost connection and exit, so the environment will automatically start a new process and establish a connection again. |
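For anyone reading along, the "detect a lost connection and exit" approach described above might look roughly like the sketch below. It is not the poster's script: `watchConnection` is a hypothetical helper, and the host `rabbit.example` in the usage comment is a placeholder. It relies only on the connection being an event emitter that emits 'error' and 'close', which is what the comment above says the script listens for.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: call `onLost(reason, err)` at most once, on the first
// 'error' or 'close' event emitted by the connection.
function watchConnection(conn, onLost) {
  let fired = false;
  const handler = (reason) => (err) => {
    if (fired) return; // 'close' often follows 'error'; report only the first
    fired = true;
    onLost(reason, err);
  };
  conn.on('error', handler('error'));
  conn.on('close', handler('close'));
  return conn;
}

// Usage with the amqplib callback API (needs a reachable broker, so it is
// left as a comment; the host name is a placeholder):
//
// require('amqplib/callback_api').connect(
//   'amqp://rabbit.example?heartbeat=30',
//   (err, conn) => {
//     if (err) { console.error(err); process.exit(1); }
//     watchConnection(conn, (reason, e) => {
//       console.error('AMQP connection lost (%s)', reason, e);
//       process.exit(1); // let the supervisor start a fresh process
//     });
//   });
```

Exiting rather than reconnecting in-process keeps the script simple, at the cost of relying on the environment to restart it promptly.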
OK, that set-up makes sense. I'm surprised that amqplib doesn't notice missed heartbeats or connection drops (though apparently past me was unsure about TCP connection loss -- see #58). Let me check for myself whether heartbeats work. |
Ok, thank you very much. I'll be looking forward to your feedback. |
When I simulate a stalled connection using an SSH tunnel, I see that amqplib will report a heartbeat timeout after an appropriate duration (about two times the heartbeat interval). To what are you setting your heartbeat interval? Do you see the connections made in the RabbitMQ management UI, and do they have the heartbeat (timeout) you expect? |
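To make the "about two times the heartbeat interval" figure concrete, here is a small illustrative check. It is only a sketch of the rule of thumb mentioned above, not amqplib's actual implementation, and `heartbeatMissed` is a made-up name.

```javascript
// Illustrative only: treat the peer as gone when nothing has been received
// for more than two heartbeat intervals.
function heartbeatMissed(lastSeenMs, nowMs, heartbeatSeconds) {
  if (heartbeatSeconds <= 0) return false; // a heartbeat of 0 means "disabled"
  return nowMs - lastSeenMs > 2 * heartbeatSeconds * 1000;
}
```

With a 55 second heartbeat, this rule would only flag the connection after roughly 110 seconds of silence.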
I set the heartbeat to 55s and yes, the UI shows the expected timeout value. I had been instructed to test the connection drop with an iptables rule which blocked outgoing packets from the machine node.js was running on, and when I ran the test the timeout worked, so I was surprised as well when the issue took place in production after a while; maybe I can reproduce it with further testing (not before tomorrow), or maybe the issue is due to something else. Maybe I'm doing something wrong with the API in my node script? http://pastebin.com/qapFMq1m |
Apparently the issue has been solved. The near-60-second heartbeat was the cause. It conflicts with the RabbitMQ load balancer in AWS, which checks every minute or so whether data has passed through the connection (if none has, the balancer breaks the connection). The likely scenario is that when a heartbeat fails to be sent on time (usually while heavier message processing is taking place) and so cannot prevent the load balancer from breaking the connection, the client stops receiving messages and the library apparently doesn't react to that (should it?). A lower heartbeat (e.g. 30 seconds) is necessary to avoid this situation. |
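A hedged sketch of the fix described above: pick a heartbeat comfortably below the load balancer's idle timeout and put it in the connection URL. The helper names are hypothetical, and "half the idle timeout" is an assumption chosen to match the 60 s / 30 s numbers in this thread, not a documented rule.

```javascript
// Assumption: half the balancer's idle timeout leaves room for one late
// heartbeat before the balancer drops the connection.
function heartbeatFor(idleTimeoutSeconds) {
  return Math.max(1, Math.floor(idleTimeoutSeconds / 2));
}

// Set (or replace) the `heartbeat` query parameter on an AMQP URL.
function withHeartbeat(amqpUrl, heartbeatSeconds) {
  const url = new URL(amqpUrl);
  url.searchParams.set('heartbeat', String(heartbeatSeconds));
  return url.toString();
}

// Example with a placeholder host: replaces heartbeat=55 with heartbeat=30.
const exampleUrl = withHeartbeat(
  'amqp://rabbit.example/myvhost?heartbeat=55',
  heartbeatFor(60)
);
```

The resulting URL can be passed to `connect` as usual; it is worth confirming in the RabbitMQ management UI that the negotiated timeout is the one expected.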
Unclear -- it should at least detect that it's missed heartbeats, after a couple of minutes (two heartbeat intervals) or so. |
If it's genuinely under heavy processing, it's very difficult to predict what Node.js will do, to be honest. It's entirely possible it will miss heartbeats. Have you managed to reproduce this in "laboratory conditions"? |
We have a node script running a socket.io server whose clients consume messages from a RabbitMQ queue. We've recently migrated to Amazon AWS and there are now two instances of RabbitMQ which keep alternating from time to time. We suspect amqp.node (we're using the callback API 0.2.1) is not able to handle those changes; a while after the process has started the socket.io server simply stops accepting new connections and the consumers stop receiving messages. We didn't have this issue in the previous production environment. Any feedback about this issue will be really appreciated.