Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix handling of a response that comes after MRP resends end. #29640
Fix handling of a response that comes after MRP resends end. #29640
Changes from all commits
6ecfb46
d06fbf8
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't seem any difference between the flags WaitingForResponseOrAck and WaitingForAck in the way they are actually used. So, is the former required? We could use the latter for gotMRPAck instead, no?
Because once we receive the application response, we would be implicitly treating it as an Ack for the sent message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is absolutely a difference.
WaitingForResponseOrAck
gets set to false when we get a response (which might carry an ack... but we might be able to tell that) or an ack.WaitingForAck
gets set to false when we get an ack or stop listening for acks.The point is: we don't. And the reason we don't is that once
WaitingForAck
goes false we are no longer tracking what message id has an outstanding ack (if any), and hence can't tell whether to treat the application response as that ack...We could consider changing that, having a single
WaitingForAck
bit that stays true of MRP runs out of resends, and treating an app response, no matter whether it carries an ack or not (?), as an ack if we are no longer in the MRP table. That involves some more widespread logic changes than what we did here, but would be reasonably equivalent.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I read, it gets set to false in ReliableMessageContext.cpp:108, rt? That is when MRP is processing its Ack, be it either a StandaloneAck or a piggybacked Ack. I was treating them both as the same, i.e, an acknowledgment of delivery of the sent message, hence an Ack.
Yes, once WaitingForAck goes false, we have removed the message from RetransmitTable and do not care about retransmission anymore. However, Acks can still arrive after that, as you stated(maybe in a piggybacked form). These are duplicate Acks and are generally discarded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also in all the places where
SetResponseExpected(false)
is called.The point is: if we removed fthe message from RetransmitTable because we ran out of retransmits, they may not in fact be duplicate acks. But we just can't tell, because we removed our state that would enable us to tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were removing from the retransmit table only on all MRP retries timeout or an Ack was received. Are we evicting an outstanding retransmit to make room for a new Send? That should be a SendError because our RetransmitTable is full, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. The question is what happens when a message is received after the MRP retries timeout but before response timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From an MRP standpoint, this message means nothing as all state for Ack receipt is gone(although this message is proof of delivery of the sent message). However, since Exchange has not timed out, the application may be waiting for the response still, and the question is should this be delivered up to the app? Or thrown away. I guess, if the ExchangeContext exists, then it MAY be sent up to the app bypassing all MRP checks.
One option would have been to notify the App with OnSendError() upon MRP timeout, and that would have allowed the App to either close the exchange or wait. But in the absence of that, the App will wait until ResponseTimeout and if the Response comes in within the timeout and the Exchange is available, then it can be sent to the app. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that is what we implement.
The answer to this question is "yes". ;)
It being thrown away (due to recent changes to the MRP logic which this PR undoes) is exactly what this PR is fixing. Throwing it away leads to an unacceptably high rate of commissioning failures.
Sounds like we are in agreement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just set SetWaitingForAck(false) here, instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because what if ackMessageCounter is an ack for a totally different message counter value than the one we are waiting for an ack for? As in, it's some stale ack that was sent for a previous message, delayed in transit, and is now getting delivered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, isn't HandleRcvdAck actually processing the Ack for the message it is expecting one for? So, setting the flag to false is essentially acknowledging an MRP Ack, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it's processing an ack if we got an ack. It doesn't check what it's an ack for; that's checked down in
CheckAndRemRetransTable
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion. But I meant where it is being currently set inside HandleRcvdAck() in your change after checking the retransmit table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, that is indeed the "We got the ack we were expecting" case. But in that case, IsWaitingForAck has already been set false, no? So I am not sure I understand the question here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, precisely. IsWaitingForAck is set to false when the RetransEntry is destroyed. So, gotMRPAck in ExchangeContext.cpp:494 should be assigned to !IsWaitingForAck() instead, isn't it? If we need separate tracking of App response and MRP Ack, we can achieve that using the existing ResponseExpected(for app response) and WaitingForAck(for MRP), no? My point was whether there is a need for ResponseOrAck flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which happens if either we get the ack or we stop waiting for the ack.
In other words, IsWaitingForAck tells you nothing about whether we got an ack. At least after this PR, or before #29173
Fundamentally, we have the following states that we need to be able to tell apart in OnResponseTimeout and in "message received" handling:
OnResponseTimeout needs to handle state 3 different from states 1 and 2, right? It cannot use the "got response" bit for that, since that bit is false by assumption in OnResponseTimeout, so we need a bit to tell those cases apart.
"message received" handling needs to handle state 1 differently from state 2: In state 1 it needs to drop messages that are acking the wrong thing, while in state 2 it needs to process them because it has no way to tell that it's the wrong thing. This needs a second bit to tell the cases apart, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this flag? Because once we receive the Message response, we would be practically done with the Exchange and not care about MRP Acks. And there is only one timeout for the Message response(compared to multiple for MRP retries)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not at all true. Many exchanges have long trains of messages going back and forth.
For the rest, see #29640 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that we do. But we do not have multiple outstanding sent messages. So, an Exchange is a StopAndWait protocol and one and only one message is waiting for a delivery Ack at a given time.
I might have overlooked the case where there are multiple back-and-forth over a single exchange, but my point was about the single sent outstanding message. It is not a sliding window where a received response is not acking an outstanding sent message(since there is only one outstanding sent message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Again, we could treat a non-duplicate response with a piggyback ack as an implicit ack for "whatever the last thing we sent was, even if we have no way to check whether the piggyback is for the right thing", as #29640 (comment) describes. If you think that would be clearer, we can do that. We still need two flags to track the "no ack" and "we have no way to check" states, which are not identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it is technically true in
our stop-and-wait scenario
, that a received response is actually a true acknowledgment for the message sent earlier(even if all MRP retries have been exhausted), we can ignore the piggybacked Ack in the response(if MRP timeout has happened), but send the response up to the application if the response timeout has not happened and the exchange is still open. For the 2 states, I think WaitingForAck for MRPAck and IsResponseExpected for app response should cover the 2 scenarios, isn't it?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parenthetical is really important: There is nothing tracking whether this has happened right now. We could have a bit to track this, if we want, and then that bit plus the "waiting for ack" bit plus the "response expected" bit would let us detect the states we want.
There are more than 2 states. See #29640 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind-of raises a protocol question in the sense that, by not setting MessageNotAcked to true, we are stating that we are still open to receiving an acknowledgment of delivery via the response, although all our retransmits have been exhausted. This clearly is an indicator of misconfiguration of MRP timeouts or the application response timeout, isn't it? Ideally, if we have exhausted all retries, it should go up as an OnSendFailure() callback for the application to take a necessary action(of closing the exchange, etc) even if the ResponseTimeout is outstanding, isn't it?
Ideally, the ResponseTimeout should be some function of the MRP intervals and number of retries with some additional buffer time, and not entirely decoupled from the MRP values.
MRP, at its level has a window within which it is expected to process acknowledgments to its sent messages and if that window passes, should it still process late response messages as Acks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no more "message not acked" flag, precisely because the name did not accurately reflect what the flag did. What the flag did (and still does after this PR, under the new name) is track "We have not gotten an ack, and if we do get an ack for the relevant message counter value we will clear this flag". That last "if ..." happens to be false once we reach this code, which is why setting the flag to true here, which was added recently, was wrong.
Well... The application response timeout can be quite long for a long-running operation. MRP timeouts are quite short (4s for all resends) for devices that don't claim to be sleepy. Realistically, the use of sleepy params or MRP timeouts is kind of broken, but that's a brokenness required by the spec, unfortunately, which has been reaised as a spec issue.
Honestly: that would break the world as things stand, as long as we are following the spec-required algorithm here.
It's not, generally, decoupled. The default behavior is that the app provides how much time to allow on top of the MRP timeout. So the response timeout is nearly always the MRP timeout + some app-defined constant.
That is the big question, yes. The current implementation does not, because it has no way to tell whether they are in a sane way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. So, app response timeout is well correlated with MRP timeout which is good.
However, OnSendFailure() is not implemented in our SDK. Curious what happens in scenarios where there is a critical issue in sending the message out(say the key is wrong and the message cannot be encrypted) and so all MRP retransmits fail. In that case, the application is forced to wait out the full ExchangeContext timeout when it could have been notified of a Send failure.
Also, regarding the ResponseTimeout, especially in Exchanges where there are multiple back-and-forth messaging, there is a distinction between the ResponseTimeout and a timeout for the entire Exchange. The former should, ideally capture the timeout for the current outstanding sent app message, whereas the latter would be a sum of all the individual back-and-forth. So, for each new send on the same exchange, the response timeout should be re-set for each new send. Not sure if we do this, or the Response timeout is essentially a timeout for all message send and receives.
I think, we can ignore the MRPAcking part of it since that timeout has already passed, but still send the response to the application if it is still waiting and the Exchange is valid. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
That's a synchronous failure from SendMessage.
We have no concept of "timeout for the entire Exchange" right now.
Yep, that's what we do. Or rather you can configure it once, and the timer is restarted for each send. Or you can change it as you go (and CASE does that, in fact).