
How robust is the callback API when the RabbitMQ machine keeps changing? #122

Closed
piovezan opened this issue Jan 4, 2015 · 13 comments


piovezan commented Jan 4, 2015

We have a node script running a socket.io server whose clients consume messages from a RabbitMQ queue. We've recently migrated to Amazon AWS and there are now two instances of RabbitMQ which keep alternating from time to time. We suspect amqp.node (we're using the callback API, version 0.2.1) is not able to handle those changes; a while after the process has started, the socket.io server simply stops accepting new connections and the consumers stop receiving messages. We didn't have this issue in the previous production environment. Any feedback about this issue would be really appreciated.


squaremo commented Jan 4, 2015

What do you mean by "alternating"?


piovezan commented Jan 4, 2015

It's a redundancy scheme: out of two machines with RabbitMQ set up, only one is running at any given time. Should one machine fail or otherwise be unavailable, the other is supposed to take over. But besides that, they also alternate from time to time (I'm not sure why; it must be due to load balancing).


squaremo commented Jan 4, 2015

Do you mean that each (AMQP) connection made from amqplib will be to one of two RabbitMQs? In that case, amqplib should behave fine.

Do you shut the RabbitMQs down? What happens to the connections when you do that?


piovezan commented Jan 4, 2015

I'll be sure to return with the details; I'm not sure how the alternation works. The AMQP connection is to a single RabbitMQ though; the underlying instance is changed from time to time.


squaremo commented Jan 4, 2015

> The AMQP connection is to a single RabbitMQ though; the underlying instance is changed from time to time.

If you mean that you swap out the VM while keeping the volumes and networking intact, that may work; anything else, I think you would cause any AMQP library (and many other bits of software) a conniption.


piovezan commented Jan 4, 2015

Sorry, I had given the wrong explanation; I understand it now. The two RabbitMQ instances are never shut down. What happens is that the AMQP connection is lost from time to time (a limitation that comes with a high-availability environment with redundant VMs, and we have to cope with it), and if an attempt to reconnect is made, DNS chooses which instance to connect to (it's a cluster, so it doesn't matter which one). The problem is that the attempt to reconnect is never made: amqp.node apparently fails to notice that the connection is lost. We have a heartbeat timeout set in the host URL and are listening for 'error' and 'close' events, but they are apparently never emitted. The queues expect consumed messages to be ack'ed. We want the node script to detect a lost connection and exit, so the environment will automatically start a new process and establish a new connection.
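
A minimal sketch of the kind of handling described above, assuming the callback API; the URL, queue name and heartbeat value below are placeholders, not the actual production settings:

```js
// Sketch only: connect with a heartbeat, watch for 'error'/'close', and exit
// so the environment restarts the process and a fresh connection is made.
var amqp = require('amqplib/callback_api');

// Placeholder URL; the heartbeat is requested via the query string.
var AMQP_URL = 'amqp://user:pass@rabbit.example.com?heartbeat=55';

amqp.connect(AMQP_URL, function (err, conn) {
  if (err) { console.error('connect failed', err); return process.exit(1); }

  conn.on('error', function (e) { console.error('connection error', e); });
  conn.on('close', function () { process.exit(1); }); // let the supervisor restart us

  conn.createChannel(function (err, ch) {
    if (err) { console.error('channel failed', err); return process.exit(1); }
    ch.consume('some-queue', function (msg) {
      if (msg !== null) {
        // ... push the message out to the socket.io clients ...
        ch.ack(msg); // the queue expects explicit acks
      }
    });
  });
});
```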


squaremo commented Jan 4, 2015

OK, that set-up makes sense.

I'm surprised that amqplib doesn't notice missed heartbeats, or connection drops (though apparently past me was unsure about TCP connection loss -- see #58). Let me check for myself whether heartbeats work or not ..


piovezan commented Jan 4, 2015

Ok, thank you very much. I'll be looking forward to your feedback.


squaremo commented Jan 4, 2015

When I simulate a stalled connection using an SSH tunnel, I see that amqplib will report a heartbeat timeout after an appropriate duration (about two times the heartbeat interval).

To what are you setting your heartbeat interval? Do you see the connections made in the RabbitMQ management UI, and do they have the heartbeat (timeout) you expect?


piovezan commented Jan 4, 2015

I set the heartbeat to 55s and yes, the UI shows the expected timeout value. I had been instructed to test the connection drop with an iptables rule that blocked outgoing packets from the machine node.js was running on, and when I ran the test the timeout worked, so I was surprised as well when the issue took place in production after a while. Maybe I can reproduce it with further testing (not before tomorrow), or maybe the issue is due to something else. Maybe I'm doing something wrong with the API in my node script? http://pastebin.com/qapFMq1m


piovezan commented Jan 5, 2015

Apparently the issue has been solved. The heartbeat of nearly 60 seconds was the problem: it conflicts with the RabbitMQ load balancer in AWS, which checks roughly every minute whether any data has passed through the connection (if none has, the balancer breaks the connection). The likely scenario is that the heartbeat fails to be sent on time (usually when heavier message processing is taking place) and so doesn't prevent the load balancer from breaking the connection; the client then stops receiving messages and the library apparently doesn't react to that (should it?). A lower heartbeat (e.g. 30 seconds) is necessary to avoid this situation.
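
For example (the URL is a placeholder; only the heartbeat value matters), the connection would now be made with something like:

```js
// Heartbeat well under the load balancer's ~60s idle timeout.
amqp.connect('amqp://user:pass@rabbit.example.com?heartbeat=30', function (err, conn) {
  // ... same handling as before ...
});
```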


squaremo commented Jan 5, 2015

> the client stops receiving messages and the library apparently doesn't react to that (should it?)

Unclear -- it should at least detect that it's missed heartbeats, after a couple of minutes (two heartbeat intervals) or so.

squaremo commented

> The likely scenario is that if the heartbeat fails to be issued on time (usually in case of heavier message processing taking place) and isn't able to prevent the load balancer from breaking the connection

If it's genuinely under heavy processing, it's very difficult to predict what Node.JS will do to be honest. It's entirely possible it will miss setInterval deadlines, delay events, and so on.

Have you managed to reproduce this in "laboratory conditions"?
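
As a standalone illustration of that effect (not taken from the thread above), a long synchronous handler keeps the event loop busy and delays a timer, much as it would delay a heartbeat timer:

```js
// Illustration: a timer standing in for the heartbeat interval.
var last = Date.now();
setInterval(function () {
  var now = Date.now();
  console.log('tick after', now - last, 'ms'); // normally ~5000 ms
  last = now;
}, 5000);

// 20 seconds of synchronous "message processing" blocks the event loop,
// so the next tick arrives roughly 16 seconds late.
setTimeout(function () {
  var end = Date.now() + 20000;
  while (Date.now() < end) { /* busy loop */ }
}, 1000);
```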

squaremo self-assigned this Jan 10, 2015