Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sidecar: after being reset, the SP seems to struggle with I2C (at least to the VPD) #1955

Closed
Aaron-Hartwig opened this issue Dec 16, 2024 · 7 comments

Comments

@Aaron-Hartwig
Copy link
Contributor

We came across this in manufacturing and had discussed it in chat but were basically stumped without being able to get traces from the bus. Well, we've got traces now!

First off, I attempt to reproduce by just running the sidecar_fru.sh flow on the station:

<a bunch of redacted env stuff>
pfexec /opt/oxide/dvt/scripts/sidecar_fru.sh program

 * using "/usr/bin/humility", version: "humility 0.11.13"
 * mainboard VPD...
 * turning power on (C1)...
sending message Message { id: 1, payload: EnsurePowerState { key: "FSj6UAtuE69uMXup8NYXpSsOVxVhn3fMMW6U5mupK8LvsbqCMKE6NYcEbUBbSGaD", on: true, pin: Some(1) } }
pin 1 power is now on
 * SHA-1 digest of "/data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip": 677d8ce8a65294e1c6a7494cd49a08f6ea14e8cf
 * image "/data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip" is:
     version => hubris build archive v8
     git rev => 9144307e317f57e81394779c15d205c49bdd0964
    image id => [98, 39, f2, e6, 63, fa, b2, a8]
       board => sidecar-c
        name => sidecar-c-lab
       image => default
      target => thumbv7em-none-eabihf
    features => dump
  total size => 870K
 kernel size => 30K
       tasks => 27
   i2c buses => 4 controllers, 9 buses
 i2c devices => 40 devices
   aux flash => 33554432 bytes (32 MiB), 16 slots (2 MiB/slot)
     sensors => 106 sensors

[
    ("FRU0", [
        ("BARC", [
            "0XV2:913-0000006:010:BRM23230005",
        ]),
        ("MAC0", [
            [
                168,
                64,
                37,
                5,
                31,
                0,
                0,
                1,
                1,
            ],
        ]),
    ]),
]
 * programming "Mainboard" VPD...
 * resetting SP...
humility: Opened 0483:3754:0039001D4741500920383733 via ST-Link V3
humility: attached to 0483:3754:0039001D4741500920383733 via ST-Link V3
humility vpd failed: operation timed out

The capture is stored here. Despite what the output says, it actually seems to work? I'd also like to highlight some glitches:
Image
Image

Next, I captured what it looked like to reset the SP and try to write to its VPD immediately. This fails and we never see the write actually happen. The capture is here.

pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip reset && pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip vpd -d Mainboard -w /tmp/thing.4009.ron
humility: Opened 0483:3754:0039001D4741500920383733 via ST-Link V3
humility: attached to 0483:3754:0039001D4741500920383733 via ST-Link V3
humility vpd failed: operation timed out

Finally, I do a reset, sleep for 10 seconds, and then write to the VPD. This all happens successfully. Experimentally it takes a sleep of something like 5-7 seconds for this to work! The capture is here.

pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip reset && sleep 10 && pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip vpd -d Mainboard -w /tmp/thing.4009.ron
humility: Opened 0483:3754:0039001D4741500920383733 via ST-Link V3
humility: attached to 0483:3754:0039001D4741500920383733 via ST-Link V3
humility: successfully wrote 92 bytes of VPD

cc @bcantrill since you were interested in this

@cbiffle
Copy link
Collaborator

cbiffle commented Dec 16, 2024

The first wave is actually not a glitch, in the sense that it's intentional. That's the driver's "bus unstick" cargo cult sequence, where it attempts to clock out 9 bits to reset things. It's not really supposed to do that unless the bus is stuck, though, so I guess it thought it saw an error?

The second thing is definitely a glitch, and I suspect it's the existing problem #1824. Not managing pin state correctly when switching pads, basically.

The ten second delay is surprising. If we can't find a software interlock that's blocking access (and I can't think why there'd be one), I'm starting to suspect that the CPU is starved during the FPGA loads. i2c_driver in sidecar is at an appropriately high priority. When you see the problem, could you capture the output of tasks? I wonder if you're seeing the dump overhead from #1868. That'd starve the CPU.

@cbiffle
Copy link
Collaborator

cbiffle commented Dec 16, 2024

I suspect the vpd command is using Hiffy behind the scenes. Hiffy is priority 5 on sidecar, which in particular puts it behind the sequencer and the SPI drivers. (This is probably the correct decision.) This means that if those tasks aren't yielding, Hiffy won't get an opportunity to do anything. If that's the case, though, I would kind of expect it to emit the write once it gets CPU, unless Humility restarts it or something.

@Aaron-Hartwig
Copy link
Contributor Author

The ten second delay is surprising.

I explicitly have a sleep 10 in between humility reset and humility vpd in that final capture to illustrate how things works after a period of time.

Here I reset, do the vpd interaction, then dump tasks. I don't think this is interesting?

pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip reset && pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip vpd -d Mainboard -w /tmp/thing.4009.ron ; pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip tasks
humility: Opened 0483:3754:0039001D4741500920383733 via ST-Link V3
humility: attached to 0483:3754:0039001D4741500920383733 via ST-Link V3
humility vpd failed: operation timed out
humility: attached to 0483:3754:0039001D4741500920383733 via ST-Link V3
system time = 6944
ID TASK                       GEN PRI STATE    
 0 jefe                         0   0 recv, notif: fault timer(T+56)
 1 sys                          0   1 recv, notif: exti-wildcard-irq(irq6/irq7/irq8/irq9/irq10/irq23/irq40)
 2 rng_driver                   0   6 recv
 3 update_server                0   3 recv
 4 auxflash                     0   3 ready
 5 net                          0   5 wait: send to sequencer/gen0
 6 control_plane_agent          0   7 wait: send to net/gen0
 7 sprot                        0   4 recv
 8 udpecho                      0   6 wait: send to net/gen0
 9 udpbroadcast                 0   6 wait: send to net/gen0
10 monorail                     0   6 wait: send to sequencer/gen0
11 i2c_driver                   0   2 recv
12 hiffy                        0   5 ready
13 sensor                       0   4 recv
14 ecp5_mainboard               0   3 recv
15 ecp5_front_io                0   3 recv
16 transceivers                 0   6 wait: send to sequencer/gen0
17 packrat                      0   3 recv
18 sequencer                    0   4 RUNNING
19 thermal                      0   5 ready
20 power                        0   6 ready
21 validate                     0   5 recv
22 ignition                     0   5 wait: send to sequencer/gen0
23 vpd                          0   3 recv
24 dump_agent                   0   6 wait: send to net/gen0
25 idle                         0   8 ready
26 udprpc                       0   6 wait: send to net/gen0

@Aaron-Hartwig
Copy link
Contributor Author

This time I reset, waited 5 seconds, and then did a dump to understand what is happening in that strange period.

pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip reset && sleep 5 && pfexec humility -p 0483:3754:0039001D4741500920383733 -a /data/local/images/sidecar/c/sp/build-sidecar-c-lab-image-default-v1.0.29.zip dump
humility: Opened 0483:3754:0039001D4741500920383733 via ST-Link V3
humility: attached to 0483:3754:0039001D4741500920383733 via ST-Link V3
humility: core halted
humility: dumping to hubris.core.0
humility: dumped 1.10MB in 49 seconds
humility: core resumed

link to dump

@Aaron-Hartwig
Copy link
Contributor Author

Link to another dump with a T+4 second offset.

In both of these dumps auxflash is in scan_for_active_slot, so the Eye has refocused there.

@cbiffle
Copy link
Collaborator

cbiffle commented Dec 17, 2024

The auxflash code appears to have a design flaw that causes Sidecar boot to get slower and slower as more updates are applied to auxflash. We're testing a fix in #1959. I'm waiting to confirm logic analyzer traces before I get too excited about it.

@Aaron-Hartwig
Copy link
Contributor Author

Capture uploaded here

No more 11.62 second gap!

Image

cbiffle added a commit that referenced this issue Dec 17, 2024
The auxflash server was computing a SHA3 of every potentially occupied
slot in the QSPI flash, only to compare it to _the stored SHA3_ and then
compare it to _the expected SHA3_ and then throw it away. This has been
causing Sidecar startup to be linear in the number of valid images that
have ever been flashed to auxflash, up to 16.

This change rearranges the logic, at least for startup. For each chunk,
we now see if it even claims to have the right SHA. Only then do we
compute the actual SHA to validate. This reduces an 11.6s delay observed
on one Sidecar to just over 1s, and knocks 6s off the boot (something
else is still delaying for 5s).

I haven't changed the behavior of the externally exposed
`read_slot_checksum` operation, because it has no documentation and I
can't figure out what it's used for, so I'm not sure I would get the
semantics right or know how to test it.

Fixes #1955.
cbiffle added a commit that referenced this issue Dec 26, 2024
The auxflash server was computing a SHA3 of every potentially occupied
slot in the QSPI flash, only to compare it to _the stored SHA3_ and then
compare it to _the expected SHA3_ and then throw it away. This has been
causing Sidecar startup to be linear in the number of valid images that
have ever been flashed to auxflash, up to 16.

This change rearranges the logic, at least for startup. For each chunk,
we now see if it even claims to have the right SHA. Only then do we
compute the actual SHA to validate. This reduces an 11.6s delay observed
on one Sidecar to just over 1s, and knocks 6s off the boot (something
else is still delaying for 5s).

I haven't changed the behavior of the externally exposed
`read_slot_checksum` operation, because it has no documentation and I
can't figure out what it's used for, so I'm not sure I would get the
semantics right or know how to test it.

Fixes #1955.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants