
How robust is the callback API when the RabbitMQ machine keeps changing? #122

Closed
piovezan opened this issue Jan 4, 2015 · 13 comments


piovezan commented Jan 4, 2015

We have a node script running a socket.io server whose clients consume messages from a RabbitMQ queue. We've recently migrated to Amazon AWS and there are now two instances of RabbitMQ which keep alternating from time to time. We suspect amqp.node (we're using the callback API, version 0.2.1) is not able to handle those changes; a while after the process has started, the socket.io server simply stops accepting new connections and the consumers stop receiving messages. We didn't have this issue in the previous production environment. Any feedback about this issue would be really appreciated.


squaremo commented Jan 4, 2015

What do you mean by "alternating"?


piovezan commented Jan 4, 2015

It's a redundancy scheme: out of two machines with RabbitMQ set up, only one is running at any given time. Should one machine fail or otherwise be unavailable, the other is supposed to take over. But besides that, they also alternate from time to time (I'm not sure why; it must be due to load balancing).


squaremo commented Jan 4, 2015

Do you mean that each (AMQP) connection made from amqplib will be to one of two RabbitMQs? In that case, amqplib should behave fine.

Do you shut the RabbitMQs down? What happens to the connections when you do that?


piovezan commented Jan 4, 2015

I'll be sure to return with the details; I'm not sure how the alternation works. The AMQP connection is to a single RabbitMQ though; the underlying instance is changed from time to time.


squaremo commented Jan 4, 2015

> The AMQP connection is to a single RabbitMQ though; the underlying instance is changed from time to time.

If you mean that you swap out the VM while keeping the volumes and networking intact, that may work; anything else, I think you would cause any AMQP library (and many other bits of software) a conniption.


piovezan commented Jan 4, 2015

Sorry, I had given the wrong explanation; I understand it now. The two RabbitMQ instances are never shut down. What happens is that the AMQP connection is lost from time to time (a limitation that comes with a high-availability environment with redundant VMs, and we have to cope with it), and if an attempt to reconnect is made, DNS chooses which instance to connect to (it's a cluster, so it doesn't matter which one). The problem is that the attempt to reconnect is never made: amqp.node apparently fails to notice that the connection is lost. We have a heartbeat timeout set in the host URL and are listening for 'error' and 'close' events, but they are apparently never emitted. The queues expect consumed messages to be ack'ed. We want the node script to detect a lost connection and exit, so the environment will automatically start a new process and establish a new connection.
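
A minimal sketch of the kind of handling described above, assuming the callback API; the URL, queue name and heartbeat value below are placeholders, not the actual production settings:

```js
// Sketch only: connect with a heartbeat, watch for 'error'/'close', and exit
// so the environment restarts the process and a fresh connection is made.
var amqp = require('amqplib/callback_api');

// Placeholder URL; the heartbeat is requested via the query string.
var AMQP_URL = 'amqp://user:pass@rabbit.example.com?heartbeat=55';

amqp.connect(AMQP_URL, function (err, conn) {
  if (err) { console.error('connect failed', err); return process.exit(1); }

  conn.on('error', function (e) { console.error('connection error', e); });
  conn.on('close', function () { process.exit(1); }); // let the supervisor restart us

  conn.createChannel(function (err, ch) {
    if (err) { console.error('channel failed', err); return process.exit(1); }
    ch.consume('some-queue', function (msg) {
      if (msg !== null) {
        // ... push the message out to the socket.io clients ...
        ch.ack(msg); // the queue expects explicit acks
      }
    });
  });
});
```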


squaremo commented Jan 4, 2015

OK, that set-up makes sense.

I'm surprised that amqplib doesn't notice missed heartbeats, or connection drops (though apparently past me was unsure about TCP connection loss -- see #58). Let me check for myself whether heartbeats work or not ..


piovezan commented Jan 4, 2015

Ok, thank you very much. I'll be looking forward to your feedback.


squaremo commented Jan 4, 2015

When I simulate a stalled connection using an SSH tunnel, I see that amqplib will report a heartbeat timeout after an appropriate duration (about two times the heartbeat interval).

To what are you setting your heartbeat interval? Do you see the connections made in the RabbitMQ management UI, and do they have the heartbeat (timeout) you expect?


piovezan commented Jan 4, 2015

I set the heartbeat to 55s and yes, the UI shows the expected timeout value. I had been instructed to test the connection drop with an iptables rule that blocked outgoing packets from the machine node.js was running on, and when I ran the test the timeout worked, so I was surprised as well when the issue took place in production after a while. Maybe I can reproduce it with further testing (not before tomorrow), or maybe the issue is due to something else. Maybe I'm doing something wrong with the API in my node script? http://pastebin.com/qapFMq1m


piovezan commented Jan 5, 2015

Apparently the issue has been solved. The heartbeat of nearly 60 seconds was the problem: it conflicts with the RabbitMQ load balancer in AWS, which checks roughly every minute whether any data has passed through the connection (if none has, the balancer breaks the connection). The likely scenario is that the heartbeat fails to be sent on time (usually when heavier message processing is taking place) and so doesn't prevent the load balancer from breaking the connection; the client then stops receiving messages and the library apparently doesn't react to that (should it?). A lower heartbeat (e.g. 30 seconds) is necessary to avoid this situation.
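
For example (the URL is a placeholder; only the heartbeat value matters), the connection would now be made with something like:

```js
// Heartbeat well under the load balancer's ~60s idle timeout.
amqp.connect('amqp://user:pass@rabbit.example.com?heartbeat=30', function (err, conn) {
  // ... same handling as before ...
});
```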


squaremo commented Jan 5, 2015

> the client stops receiving messages and the library apparently doesn't react to that (should it?)

Unclear -- it should at least detect that it's missed heartbeats, after a couple of minutes (two heartbeat intervals) or so.

squaremo commented

> The likely scenario is that if the heartbeat fails to be issued on time (usually in case of heavier message processing taking place) and isn't able to prevent the load balancer from breaking the connection

If it's genuinely under heavy processing, it's very difficult to predict what Node.JS will do to be honest. It's entirely possible it will miss setInterval deadlines, delay events, and so on.

Have you managed to reproduce this in "laboratory conditions"?
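
As a standalone illustration of that effect (not taken from the thread above), a long synchronous handler keeps the event loop busy and delays a timer, much as it would delay a heartbeat timer:

```js
// Illustration: a timer standing in for the heartbeat interval.
var last = Date.now();
setInterval(function () {
  var now = Date.now();
  console.log('tick after', now - last, 'ms'); // normally ~5000 ms
  last = now;
}, 5000);

// 20 seconds of synchronous "message processing" blocks the event loop,
// so the next tick arrives roughly 16 seconds late.
setTimeout(function () {
  var end = Date.now() + 20000;
  while (Date.now() < end) { /* busy loop */ }
}, 1000);
```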

squaremo self-assigned this Jan 10, 2015