You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a follow-on to #1821, continuing my hunt for the Gimlet Disk Wumpus.
In that bug I pointed out that we're generating bus resets in the middle of perfectly good transactions for no apparent reason, and doing it in a way that scrambles the I2C controller's state. We should stop doing that, obvs.
However, while we're doing that, it caused me to notice a different bug that's compounding the issue.
tl;dr: we treat a mux as having been reset if it NACKs.
This is at the tail end of some Very Bad Things happening to the I2C neighborhood, and the Saleae protocol decoder is badly out of sync. What it reports as a weird post-NACK 0xC1 byte is actually us attempting to address the mux at 0x70 to reset it. (0x70 + NACK (1), shifted left 1 because the decoder is desync'd, gives you 0xC1.)
We don't generate a start condition because shrug emoji. We've totally scrambled the controller at this point, all bets are off. But without a start condition, the mux is certainly not going to ack its address -- it doesn't even know we're speaking to it!
Normally this would be fine -- the transaction fails, we try again -- but this is a mux we're talking to, so things get weird.
We're handling mux errors wrong
The problem is what happens in the driver when a mux nacks. This was being indicated in the ringbuf output during my debugging session by this repeating pattern:
From inspecting the code, I can clarify what this means:
ResetMux does not mean the mux has been reset. It means that, if the mux has a GPIO reset line (which it does not on this rev of Gimlet), we have toggled it. It also means we have recorded the mux's state as Unknown.
Next time we try to talk to anything that involves mux 0x70 (which means anything on the same segment, iirc), we see the Unknown and attempt to put the mux into a known state, signalling success at this with MuxUnknownRecover
...but there's a MuxMissing between the two, that's weird
The I2C error codes in the driver have names that bear no relation to the I2C standard, which makes them hard to follow. I eventually worked out the data flow, which is as follows:
At the lowest level, when the driver observes NACKF set in the status register, it produces an error code NoDevice, which means that the device nacked once.
In the stm32xx_i2c crate's I2cMux::error_code function, this gets translated -- for muxes specifically -- into something called MuxMissing. So MuxMissing also means that the device nacked once.
If this should happen during a call to configure_mux in the stm32xx_i2c_server crate, the code special-cases MuxMissing and treats it as success. (A comment says this should not happen in production. Technically I'm on a bench, so the comment is technically not false yet.)
Because the mux reconfiguration "succeeded," the driver removes the Unknown state from the mux (preventing further recovery attempts) and records MuxUnknownRecover to trumpet its success.
In this case, the net effect is that mux 0x70 is left with port 0 enabled (to a disk), and we then go and enable a port on the mux at 0x72, and now we have two hotplug controllers and two disks with the same addresses, wired together. Needless to say, the next transaction fails.
This appears to be happening in practice once every two minutes or so. It's not every time, because the controller-stomping behavior makes it delightfully non-deterministic. I've written a tool that reads Saleae Logic captures, models the mux state, and reports any violations of our "one mux channel active" invariant, and it's finding a lot of meat.
What to do?
This section of the code is complex, with a lot of indirections and functions that take closures and the like, so I haven't totally wrapped my head around the fix. But the goal is: don't take a mux out of Unknown state unless we get acknowledgement that we've been able to update it.
The text was updated successfully, but these errors were encountered:
This is a follow-on to #1821, continuing my hunt for the Gimlet Disk Wumpus.
In that bug I pointed out that we're generating bus resets in the middle of perfectly good transactions for no apparent reason, and doing it in a way that scrambles the I2C controller's state. We should stop doing that, obvs.
However, while we're doing that, it caused me to notice a different bug that's compounding the issue.
tl;dr: we treat a mux as having been reset if it NACKs.
Clues!
I posted this on #1821 but let's look at it anew:
This is at the tail end of some Very Bad Things happening to the I2C neighborhood, and the Saleae protocol decoder is badly out of sync. What it reports as a weird post-NACK 0xC1 byte is actually us attempting to address the mux at 0x70 to reset it. (0x70 + NACK (1), shifted left 1 because the decoder is desync'd, gives you 0xC1.)
We don't generate a start condition because shrug emoji. We've totally scrambled the controller at this point, all bets are off. But without a start condition, the mux is certainly not going to ack its address -- it doesn't even know we're speaking to it!
Normally this would be fine -- the transaction fails, we try again -- but this is a mux we're talking to, so things get weird.
We're handling mux errors wrong
The problem is what happens in the driver when a mux nacks. This was being indicated in the ringbuf output during my debugging session by this repeating pattern:
From inspecting the code, I can clarify what this means:
ResetMux
does not mean the mux has been reset. It means that, if the mux has a GPIO reset line (which it does not on this rev of Gimlet), we have toggled it. It also means we have recorded the mux's state as Unknown.MuxUnknownRecover
MuxMissing
between the two, that's weirdThe I2C error codes in the driver have names that bear no relation to the I2C standard, which makes them hard to follow. I eventually worked out the data flow, which is as follows:
NoDevice
, which means that the device nacked once.stm32xx_i2c
crate'sI2cMux::error_code
function, this gets translated -- for muxes specifically -- into something calledMuxMissing
. SoMuxMissing
also means that the device nacked once.configure_mux
in thestm32xx_i2c_server
crate, the code special-casesMuxMissing
and treats it as success. (A comment says this should not happen in production. Technically I'm on a bench, so the comment is technically not false yet.)MuxUnknownRecover
to trumpet its success.In this case, the net effect is that mux 0x70 is left with port 0 enabled (to a disk), and we then go and enable a port on the mux at 0x72, and now we have two hotplug controllers and two disks with the same addresses, wired together. Needless to say, the next transaction fails.
This appears to be happening in practice once every two minutes or so. It's not every time, because the controller-stomping behavior makes it delightfully non-deterministic. I've written a tool that reads Saleae Logic captures, models the mux state, and reports any violations of our "one mux channel active" invariant, and it's finding a lot of meat.
What to do?
This section of the code is complex, with a lot of indirections and functions that take closures and the like, so I haven't totally wrapped my head around the fix. But the goal is: don't take a mux out of Unknown state unless we get acknowledgement that we've been able to update it.
The text was updated successfully, but these errors were encountered: