Enhance troubleshooting of rejected messages and gracefully handle false positives #34

Open
mverzilli opened this issue Jan 31, 2018 · 0 comments
mverzilli commented Jan 31, 2018

We looked into the issue in further detail.

  1. Metfone seems to be failing because Nuntium is trying to reroute AOs to Mobitel/Smart users through the Metfone channel, and it reports a reasonable PDU code (0x45: Message submit failed). So this is actually a configuration issue.
  2. Mobitel is sending invalid PDU status codes with some message rejections. These errors aren’t poisoning the connection, but since Nuntium can’t tell otherwise it assumes the worst, disables the channel, and sends an email alert. Kakada then simply goes into Nuntium and re-enables the channel.
  3. Smart sends a valid but unhelpful PDU status code (255 => 0xFF => “Unknown error”). Kakada handles it similarly to Mobitel.
  4. The “rejection event” doesn’t seem to be linked to the message log (at least not the one that’s accessible through the UI). In any case, these messages come from the Zero Reporting System; they are very similar to one another and only use ASCII characters.
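To make the three carrier behaviours above concrete, here is a minimal triage sketch (hypothetical names, not Nuntium’s actual code) for the status codes mentioned in this issue:

```python
# Hypothetical triage helper for the rejection codes discussed above.
# The carrier-to-code mapping is illustrative and taken from this issue only.

CONFIG_ISSUES = {0x45}        # "Message submit failed": routing/config problem
UNHELPFUL_BUT_VALID = {0xFF}  # "Unknown error": valid code, no diagnostic value

def triage(status_code, known_valid_codes=frozenset(range(0x100))):
    """Bucket a rejection so an operator knows where to look first.

    known_valid_codes is a simplification: here we just treat one-byte
    codes as "known" and anything else as invalid.
    """
    if status_code in CONFIG_ISSUES:
        return "configuration"  # e.g. AO misrouted to the wrong channel
    if status_code in UNHELPFUL_BUT_VALID:
        return "unknown"        # needs message-log correlation
    if status_code not in known_valid_codes:
        return "invalid"        # carrier sent a non-standard code
    return "other"
```

Under this sketch, Metfone’s 0x45 lands in the “configuration” bucket, Smart’s 0xFF in “unknown”, and Mobitel’s malformed codes in “invalid”.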

So I guess we’d need:

A. An “Ignored rejections with these PDU codes” setting per channel, so the admin can parameterize which codes to ignore. We should also be able to define “rule-specific” retry cycles that still disable the channel if failures turn out to be non-transient.
B. A way of filtering messages that got rejected, plus tracking the rejection codes in each message’s log history. This would let us analyze whether the frequency of those events is getting out of hand.
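Proposal A could be modeled roughly like this (a sketch with hypothetical names; the real setting would live in each channel’s configuration):

```python
import time
from collections import deque

class RejectionRule:
    """Hypothetical sketch of proposal A: ignore rejections with a given
    PDU code, but still disable the channel if more than max_failures
    occur within window_seconds (the "rule-specific" retry cycle)."""

    def __init__(self, code, max_failures=5, window_seconds=300):
        self.code = code
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def on_rejection(self, now=None):
        """Return "ignore" for transient bursts, "disable" when the
        failure rate suggests a non-transient problem."""
        now = time.time() if now is None else now
        self._timestamps.append(now)
        # Drop failures that fell outside the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) > self.max_failures:
            return "disable"
        return "ignore"
```

This mirrors today’s manual loop: isolated rejections are swallowed (no alert, no manual re-enable), while a sustained burst within the window still trips the channel, which is what we’d want in scenario 2.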

Rationale below.

The scenarios we have right now are:

  1. Channel level transient failures
  2. Channel level non-transient failures
  3. Failures caused by poison messages

We generally know how to deal with these scenarios, but the difficulty in this case is that we could be getting the same codes for all three of them, so we’re in the dark. All we can do, then, is provide tools that help us deal with this and make our best guess about which of the three scenarios we’re in:

- The rules to ignore codes let us handle scenario 1, reducing the need for manual intervention (which is happening anyway: Kakada does it every time he gets an alert).
- The retry cycle policy protects us in case we fall into scenario 2 (in today’s reality this would be Kakada re-enabling the channel and getting a new alert one or two minutes later, which would signal that something is actually wrong).
- Enhanced filtering and message logs would let us analyze whether there’s anything particular that could be fixed in the originating application. Right now (correct me if I’m wrong) there’s no way to do that without SSHing into the server (and even then I’m not sure the information is available).
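Proposal B’s filtering and per-message rejection history could look roughly like this (a hypothetical structure, not Nuntium’s actual schema):

```python
from collections import Counter

class MessageLog:
    """Hypothetical sketch of proposal B: keep rejection codes in each
    message's history so rejected messages can be filtered and the
    frequency of each code analyzed without SSHing into the server."""

    def __init__(self):
        self.entries = []

    def record_rejection(self, message_id, channel, code):
        self.entries.append({"id": message_id, "channel": channel,
                             "code": code, "state": "rejected"})

    def rejected(self, channel=None):
        """Filter rejected messages, optionally by channel."""
        return [e for e in self.entries
                if e["state"] == "rejected"
                and (channel is None or e["channel"] == channel)]

    def code_frequencies(self):
        """Count how often each rejection code occurs."""
        return Counter(e["code"] for e in self.rejected())
```

With something like this surfaced in the UI, we could check whether a given code spikes for one channel or one originating application, which is exactly the analysis we can’t do today.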
