Flash corruption on SAMD21 #170

Open
theacodes opened this issue Apr 20, 2021 · 14 comments

Comments

@theacodes

theacodes commented Apr 20, 2021

While stress testing Castor & Pollux, I noticed that a power cycle managed to "kill" one of the devices under test. Further investigation revealed that the device was fine; it was simply acting as if it didn't have anything loaded in flash. I confirmed this by reflashing the bootloader and firmware.

I was able to reproduce with another device under test, and I dumped the flash. It seems that the first page of flash has been erased. The rest of flash is fine.

EDIT by @dhalbert: writeup here: https://blog.thea.codes/sam-d21-brown-out-detector/

image

I'm not sure what is causing this to occur. The firmware does not modify NVM under normal circumstances (it only modifies NVM during factory setup and when in configuration mode), so I'm tentatively suspecting the bootloader.
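
For anyone checking their own flash dump for the same symptom, here is a minimal host-side sketch. It relies on erased NVM reading back as 0xFF. The function name `region_is_erased` is illustrative and not part of any bootloader or tool mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when every byte in dump[offset .. offset + len) is 0xFF,
 * i.e. the region looks erased. Returns false if the region does not fit
 * inside the dump. Illustrative helper, not taken from the bootloader. */
bool region_is_erased(const uint8_t *dump, size_t dump_len,
                      size_t offset, size_t len)
{
    assert(dump != NULL);
    if (offset > dump_len || len > dump_len - offset) {
        return false; /* requested region falls outside the dump */
    }
    for (size_t i = 0; i < len; i++) {
        if (dump[offset + i] != 0xFF) {
            return false;
        }
    }
    return true;
}
```

Calling it with `offset = 0` and a length of 256 bytes (one SAMD21 erase row of four 64-byte pages) on the dump, and then again on a later region, would show whether only the start of flash was wiped.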

@dhalbert

This was a big problem with SAMD51, and we added some code to the bootloader to make sure the voltage was stable before proceeding. We had not seen this problem on SAMD21, but we could add the same kind of code.
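
To illustrate the idea of waiting for a stable voltage before proceeding, here is a host-runnable sketch. It is not the bootloader's code, which relies on the chip's brown-out detector rather than software polling. The hook type `read_supply_mv_fn`, the function `wait_for_stable_supply`, and the fake supply are all hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: returns the present supply voltage in millivolts. */
typedef uint32_t (*read_supply_mv_fn)(void);

/* Poll until `required` consecutive samples are at or above threshold_mv.
 * Returns true once that happens, false if max_samples were taken first. */
bool wait_for_stable_supply(read_supply_mv_fn read_mv, uint32_t threshold_mv,
                            uint32_t required, uint32_t max_samples)
{
    assert(read_mv != NULL);
    assert(required > 0);
    uint32_t streak = 0;
    for (uint32_t i = 0; i < max_samples; i++) {
        if (read_mv() >= threshold_mv) {
            if (++streak >= required) {
                return true;
            }
        } else {
            streak = 0; /* any dip restarts the count */
        }
    }
    return false;
}

/* Host-side stand-in for real hardware: a supply that rises 100 mV per
 * sample and levels off at 3300 mV. */
uint32_t fake_supply_mv = 0;

uint32_t fake_ramp_mv(void)
{
    if (fake_supply_mv < 3300u) {
        fake_supply_mv += 100u;
    }
    return fake_supply_mv;
}
```

Requiring several consecutive good samples, rather than one, is the point: a single reading above the threshold during a slow ramp does not mean the rail has settled.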

When I asked Microchip about spurious writes/erases like this, they said they had seen it before on an M0+ chip when there was inadequate or no filtering on the VDDcore pin. I assume you have good filtering on your boards.

See #95 and #111 for lots of details.

@theacodes
Author

We have their recommended filtering on VDDCore.

Upon further research, this seems to be caused by my misunderstanding of the bootloader's behavior:

Previously, we assumed that the bootloader would always set the bootloader write-protection fuses. It turns out that it only sets them in two cases:

  1. The fuses have completely bogus values (0xFFFFFFFF)
  2. The bootloader self-updates via UF2

Since our boards have factory fuses set and we install the bootloader by flashing a .bin file, neither of these cases ever happened for our boards. This left the bootloader write protection disabled and led to this memory corruption.

I can close this, but this behavior was surprising to me. What do we think about the bootloader always checking and setting the fuse if needed? (example of this in our firmware)
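
As a rough illustration of what "checking the fuse" involves, the sketch below decodes the BOOTPROT field and reports whether it covers the bootloader. It assumes the SAMD21 layout, where BOOTPROT occupies bits 2:0 of the first user-row word and a value of 7 means no protection, and it assumes an 8 KiB bootloader. The function names are made up for this example, and this is not the firmware code referred to above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: BOOTPROT is bits 2:0 of the first NVM user-row word. */
#define BOOTPROT_MASK 0x7u

/* Bytes protected from the start of flash for a given BOOTPROT value.
 * Assumed SAMD21 encoding: 0 -> 32768, 1 -> 16384, ..., 6 -> 512, 7 -> 0. */
uint32_t bootprot_protected_bytes(uint32_t bootprot)
{
    assert(bootprot <= 7u);
    return (bootprot == 7u) ? 0u : (32768u >> bootprot);
}

/* True when the fuse word protects at least bootloader_bytes of flash. */
bool bootloader_is_protected(uint32_t fuse_word0, uint32_t bootloader_bytes)
{
    uint32_t bootprot = fuse_word0 & BOOTPROT_MASK;
    return bootprot_protected_bytes(bootprot) >= bootloader_bytes;
}
```

With these assumptions, a fuse word whose low three bits are all ones reports an unprotected bootloader, which matches the situation described above: valid factory fuses, but BOOTPROT never set.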

@dhalbert

dhalbert commented Feb 2, 2023

A user is having similar trouble here: https://forums.adafruit.com/viewtopic.php?p=958972. They are also having trouble setting BOOTPROT on QT Py M0 boards, which I am investigating.

The user has asked for BOD protection on SAMD21, as on SAMD51.

@theacodes, I never really answered you above. Our factory testers/bootloader-loaders load the bootloader and then set BOOTPROT "manually". I have some factory boards that are set this way. There is something unusual going on with the forum user's boards, where even the self-updater is not successfully updating the BOOTPROT flags. I was unable to reproduce this locally but have some more recent QT Py M0 boards on order to check.

@theacodes
Author

I'd say the bootloader should always check and enable bootprot.

@dhalbert

dhalbert commented Feb 2, 2023

> I'd say the bootloader should always check and enable bootprot.

I'm thinking this too, thanks.

@theacodes
Author

theacodes commented Feb 2, 2023 via email

@dhalbert

dhalbert commented Feb 2, 2023

Right, sorry, I meant it will reset all fuses if they're corrupted (as it already does), and reset BOOTPROT in any case.
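
A minimal sketch of that policy, for illustration only: if both fuse words read as 0xFFFFFFFF, replace them with defaults, and in every case force the BOOTPROT field to the wanted value. The struct, the function name, and the default words passed in are hypothetical. The real bootloader's defaults and corruption test may differ, and BOOTPROT is assumed to be bits 2:0 of the first word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: BOOTPROT is bits 2:0 of the first NVM user-row word. */
#define BOOTPROT_MASK 0x7u

typedef struct {
    uint32_t word0;
    uint32_t word1;
    bool needs_write; /* true when the planned words differ from the input */
} fuse_plan_t;

/* Decide what the two fuse words should become. default0/default1 are
 * caller-supplied replacement values, used only when both words look
 * corrupted (all ones). */
fuse_plan_t plan_fuse_update(uint32_t word0, uint32_t word1,
                             uint32_t default0, uint32_t default1,
                             uint32_t wanted_bootprot)
{
    assert(wanted_bootprot <= BOOTPROT_MASK);
    fuse_plan_t plan = { word0, word1, false };

    if (word0 == 0xFFFFFFFFu && word1 == 0xFFFFFFFFu) {
        plan.word0 = default0;
        plan.word1 = default1;
    }

    /* Set BOOTPROT in any case, leaving the other bits untouched. */
    plan.word0 = (plan.word0 & (uint32_t)~BOOTPROT_MASK) | wanted_bootprot;

    plan.needs_write = (plan.word0 != word0) || (plan.word1 != word1);
    return plan;
}
```

Returning a plan instead of writing directly keeps the decision testable on a host, and `needs_write` lets a caller skip the NVM write when the fuses are already correct.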

@theacodes
Author

theacodes commented Feb 2, 2023 via email

@nico202

nico202 commented Feb 1, 2024

@theacodes Hi, we are stress testing a few boards that suffered the same flash erase you experienced, but triggering the failure at will is proving really complicated (we'd like to reproduce it so that we know for sure that enabling BOD33 will fix it). Do you still remember which "various intervals" you tried, or what you did to break them?
That would really help us! Thanks!

> The last stress test I wanted to put the hardware through was power cycling. I connected five units up to my bench supply and switched the power on and off at various intervals.
> After about 50 cycles, one of the modules stopped working

@theacodes
Author

@nico202 it's been quite some time so beyond what's in my article about it, I'm not sure! I will say that our boards likely spend more time in the "brown-in" state due to a bunch of analog components and big caps coming up to voltage. You can try introducing a capacitive load on your power supply line to exaggerate the brown-in period and increase the probability of this happening.
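
To get a feel for how much extra capacitance stretches that period, here is a rough sketch that models the rail as a simple RC ramp and measures how long it sits between two voltages. The function name and every value used with it (series resistance, thresholds) are illustrative assumptions, not datasheet figures or measurements from these boards.

```c
#include <assert.h>

/* Time in seconds that an RC-limited rail, ramping from 0 V toward
 * v_supply, spends between v_lo and v_hi. Uses a forward-Euler simulation
 * of dv/dt = (v_supply - v) / (R * C) with a step of tau / 10000. */
double brown_in_window_s(double r_ohms, double c_farads,
                         double v_supply, double v_lo, double v_hi)
{
    assert(r_ohms > 0.0 && c_farads > 0.0);
    assert(0.0 <= v_lo && v_lo < v_hi && v_hi < v_supply);

    const double tau = r_ohms * c_farads;
    const double dt = tau / 10000.0;
    double v = 0.0;
    double window = 0.0;

    while (v < v_hi) {
        if (v >= v_lo) {
            window += dt; /* rail is inside the marginal band */
        }
        v += (v_supply - v) * dt / tau;
    }
    return window;
}
```

In this model the time spent in the band scales linearly with capacitance, so doubling the load capacitance doubles the exposure per power-up.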

@dhalbert

dhalbert commented Feb 1, 2024

I added BOD33 for SAMD21 in PR #198, which is in release 3.15.0 and later. I forgot that this issue existed. I will mark this closed. @nico202 are you testing with 3.15.0?

@nico202

nico202 commented Feb 1, 2024

@theacodes Yes, thanks! We indeed measured something like this when one broke (cyan and magenta curves)
image

but we're still having trouble reproducing the issue manually.

@dhalbert yes, that's our intended fix. We just wanted to be super sure that it will prevent our problems, but the difficulty of causing the failure makes it harder to test! We'll try to break the boards a bit more, and if we can't, we'll just apply the update with BOD33 enabled and hope it actually fixes our problem 🤞

Thanks both again!

@dhalbert

dhalbert commented Feb 1, 2024

Yes, it seems hard to reproduce in a consistent way. The user in https://forums.adafruit.com/viewtopic.php?p=958972#p958972 just described applying power. That thread is long because there was also a problem with BOOTPROT not being set properly on a batch of boards.

3 participants