Ignoring read errors in alloc_cluster() can cause an "almost infinite" loop with broken SD cards #15

FreddieChopin · 2023-04-14T13:32:11Z

In our application we are using your ufat library (via my RTOS). In one of the devices the SD card got damaged - it seems that it is able to correctly read (maybe also write) about 600-700 initial blocks, but for further blocks reads (possibly also writes) cause the SD card to fail with a timeout. This is the behaviour observed on the MCU, on a PC the card just fails to mount and trying to dump it allows you to read just the first ~700 blocks and about 350 kB of data, then it throws an I/O error each block.

The problem now seems to be that if the FAT file system mounts properly (and it does, because the initial part of card is fine), then trying to do some operation - in our case creating a folder - will result in a practically infinite loop due to ignored read errors in alloc_cluster(), exactly here:

ufat/ufat.c

Line 652 in e2a7466

if (!ufat_read_fat(uf, idx, &c) && c == UFAT_CLUSTER_FREE) {

This loop will try to iterate all existing clusters (in case of our 16 GB card - about 2 million...) to find a free one, ignoring read errors. Given that the mode of error of the SD card is "timeout", then each read attempt lasts about 100 ms, so it will take "forever" for the loop to end due to checking all the clusters.

Is ignoring read errors here intentional (e.g. to ignore single broken clusters, without completely failing the top-level operation) or just an omission? I'm not sure how to understand that part of code, so I'm starting with a question before potentially providing a fix.

The text was updated successfully, but these errors were encountered:

alloc_cluster() just ignores read errors, trying next cluster until it either succeeds (finds an empty one) or runs out of clusters (after checking all of them). A large volume may have quite a lot of clusters - eg. a 16 GB SD card with a standard format has about 2 million clusters. When a "persistent" read error happens during the alloc_cluster() (after a successful mount operation) - for example a volume is physically disconnected in a very inconvenient moment or the volume is/gets damaged and all further reads fail - then this loop becomes practically infinite. In one application we found a damaged SD card, for which reads of first 600-700 blocks work perfectly fine, but any read beyond that results in a SDIO interface timing-out (the card will not switch to expected state within specified time). As the timeout for the operation is ~100 ms, then the function would loop for over 2 days. The same card just fails to work in a PC, where any read beyond first ~350 kB (which is about 700 blocks) fails with an I/O error. Fix this by returning from alloc_cluster() with an error when any read operation fails. Fixes dlbeer#15

Updated version fixes a potential "infinite" loop when dealing with damaged devices (dlbeer/ufat#15) and also has some other fixes (dlbeer/ufat#12 & dlbeer/ufat#13).

FreddieChopin linked a pull request Apr 25, 2023 that will close this issue

Handle read errors in alloc_cluster() to avoid "infinite" loop #16

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ignoring read errors in alloc_cluster() can cause an "almost infinite" loop with broken SD cards #15

Ignoring read errors in alloc_cluster() can cause an "almost infinite" loop with broken SD cards #15

FreddieChopin commented Apr 14, 2023

Ignoring read errors in alloc_cluster() can cause an "almost infinite" loop with broken SD cards #15

Ignoring read errors in alloc_cluster() can cause an "almost infinite" loop with broken SD cards #15

Comments

FreddieChopin commented Apr 14, 2023