Enhance troubleshooting of rejected messages and gracefully handle false positives #34

Open
mverzilli opened this issue Jan 31, 2018 · 0 comments
mverzilli commented Jan 31, 2018

We looked into the issue in further detail.

  1. Metfone seems to be failing because Nuntium is trying to reroute AOs to Mobitel/Smart users through the Metfone channel, and it reports a reasonable PDU code (0x45: Message submit failed). So this is actually a configuration issue.
  2. Mobitel is sending invalid PDU status codes with some message rejections. These errors aren’t poisoning the connection, but since Nuntium can’t tell otherwise it assumes the worst, disables the channel, and sends an email alert. Kakada then simply goes into Nuntium and re-enables the channel.
  3. Smart sends a valid but unhelpful PDU status code (255 => 0xFF => “Unknown error”). Kakada handles it similarly to Mobitel.
  4. The “rejection event” doesn’t seem to be linked to the message log (at least not the one that’s accessible through the UI). In any case, these messages come from the Zero Reporting System; they are very similar to one another and only use ASCII characters.
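To make the three carrier behaviours above concrete, here is a minimal triage sketch (hypothetical names, not Nuntium’s actual code) for the status codes mentioned in this issue:

```python
# Hypothetical triage helper for the rejection codes discussed above.
# The carrier-to-code mapping is illustrative and taken from this issue only.

CONFIG_ISSUES = {0x45}        # "Message submit failed": routing/config problem
UNHELPFUL_BUT_VALID = {0xFF}  # "Unknown error": valid code, no diagnostic value

def triage(status_code, known_valid_codes=frozenset(range(0x100))):
    """Bucket a rejection so an operator knows where to look first.

    known_valid_codes is a simplification: here we just treat one-byte
    codes as "known" and anything else as invalid.
    """
    if status_code in CONFIG_ISSUES:
        return "configuration"  # e.g. AO misrouted to the wrong channel
    if status_code in UNHELPFUL_BUT_VALID:
        return "unknown"        # needs message-log correlation
    if status_code not in known_valid_codes:
        return "invalid"        # carrier sent a non-standard code
    return "other"
```

Under this sketch, Metfone’s 0x45 lands in the “configuration” bucket, Smart’s 0xFF in “unknown”, and Mobitel’s malformed codes in “invalid”.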

So I guess we’d need:

A. An “Ignored rejections with these PDU codes” setting per channel, so the admin can parameterize which codes to ignore. We should also be able to define “rule-specific” retry cycles that still disable the channel if failures turn out to be non-transient.
B. A way of filtering messages that got rejected, plus tracking the rejection codes in each message’s log history. This would let us analyze whether the frequency of those events is getting out of hand.
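Proposal A could be modeled roughly like this (a sketch with hypothetical names; the real setting would live in each channel’s configuration):

```python
import time
from collections import deque

class RejectionRule:
    """Hypothetical sketch of proposal A: ignore rejections with a given
    PDU code, but still disable the channel if more than max_failures
    occur within window_seconds (the "rule-specific" retry cycle)."""

    def __init__(self, code, max_failures=5, window_seconds=300):
        self.code = code
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def on_rejection(self, now=None):
        """Return "ignore" for transient bursts, "disable" when the
        failure rate suggests a non-transient problem."""
        now = time.time() if now is None else now
        self._timestamps.append(now)
        # Drop failures that fell outside the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) > self.max_failures:
            return "disable"
        return "ignore"
```

This mirrors today’s manual loop: isolated rejections are swallowed (no alert, no manual re-enable), while a sustained burst within the window still trips the channel, which is what we’d want in scenario 2.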

Rationale below.

The scenarios we have right now are:

  1. Channel level transient failures
  2. Channel level non-transient failures
  3. Failures caused by poison messages

We generally know how to deal with these scenarios, but the difficulty in this case is that we could be getting the same codes for all three of them, so we’re in the dark. All we can do, then, is provide tools that help us deal with this and make our best guess about which of the three scenarios we’re in:

- The rules to ignore codes let us handle scenario 1, reducing the need for manual intervention (which is happening anyway: Kakada does it every time he gets an alert).
- The retry cycle policy protects us in case we fall into scenario 2 (in today’s reality this would be Kakada re-enabling the channel and getting a new alert one or two minutes later, which would signal that something is actually wrong).
- Enhanced filtering and message logs would let us analyze whether there’s anything particular that could be fixed in the originating application. Right now (correct me if I’m wrong) there’s no way to do that without SSHing into the server (and even then I’m not sure the information is available).
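Proposal B’s filtering and per-message rejection history could look roughly like this (a hypothetical structure, not Nuntium’s actual schema):

```python
from collections import Counter

class MessageLog:
    """Hypothetical sketch of proposal B: keep rejection codes in each
    message's history so rejected messages can be filtered and the
    frequency of each code analyzed without SSHing into the server."""

    def __init__(self):
        self.entries = []

    def record_rejection(self, message_id, channel, code):
        self.entries.append({"id": message_id, "channel": channel,
                             "code": code, "state": "rejected"})

    def rejected(self, channel=None):
        """Filter rejected messages, optionally by channel."""
        return [e for e in self.entries
                if e["state"] == "rejected"
                and (channel is None or e["channel"] == channel)]

    def code_frequencies(self):
        """Count how often each rejection code occurs."""
        return Counter(e["code"] for e in self.rejected())
```

With something like this surfaced in the UI, we could check whether a given code spikes for one channel or one originating application, which is exactly the analysis we can’t do today.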
