Linux Kernel – Defining State of Failed SD Cards by Kernel Tracing

debugging, linux-kernel, sd card

I'm having a series of failing/sometimes failing SD cards. They either give one of the following dmesg outputs:

The completely dead ones (they don't show up as /dev/mmcblk0):

[  +0,000010] mmc0: error -110 whilst initializing SD card 
[  +2,819983] mmc0: card never left busy state

The failing ones (can occasionally still be mounted):

[Jun16 06:28] mmc0: new high speed SDHC card at address 0001
[  +0,000339] mmcblk0: mmc0:0001 00000 3.68 GiB 
[  +0,002835]  mmcblk0: p1 p2 p3 p4
[ +10,256689] mmcblk0: timed out sending r/w cmd command, card status 0x900
[ +11,264358] mmcblk0: timed out sending r/w cmd command, card status 0x900
[  +0,000016] print_req_error: I/O error, dev mmcblk0, sector 7716736
[ +10,239972] mmcblk0: timed out sending r/w cmd command, card status 0x900
[  +0,000018] print_req_error: I/O error, dev mmcblk0, sector 7716736
[  +0,000008] Buffer I/O error on dev mmcblk0, logical block 964592, async page read
[ +10,239931] mmcblk0: timed out sending r/w cmd command, card status 0x900
[  +0,000009] print_req_error: I/O error, dev mmcblk0, sector 81792
[Jun16 06:29] mmcblk0: timed out sending r/w cmd command, card status 0x900
[  +0,000020] print_req_error: I/O error, dev mmcblk0, sector 1066880
[ +10,240219] mmcblk0: timed out sending r/w cmd command, card status 0x900
[  +0,000011] print_req_error: I/O error, dev mmcblk0, sector 2101120

The best I've found about error -110 is that it is a timeout of some sort, but that tells me very little about what actually happened to the SD card.

Background on how this came to be

The SD cards end up in those states on a seemingly random subset of the embedded devices I'm working on, and I'm trying to understand if it's a matter of bad SD cards or if there might be something wrong with the controller driver that is pushing the cards to corruption.

About 5% of the cards have died completely, and I'm trying to see whether that is to be expected for the other ones.

I have tried to force the SD cards to reproduce the issue, but the ones under test (same brand, same type of device with the same software) aren't showing any traces of wear after hundreds of GB of data written to them continuously as part of the test. I use stressdisk for that.

I don't have a record of how often the devices might have abruptly lost power. The power supply is a regular 2A AC-DC adapter that works fine for all the other needs of the device.

Update

The question has been suggested for closure, or answered in a way that helps me prevent failed SD cards in the future, as opposed to using Linux to diagnose the current state of the SD cards.

Let me try to rephrase then:

What is the most thorough way to analyze an SD card failure on Linux?

  • Is it possible to enable debug logs for the MMC subsystem to get more info?
  • What is a card status 0x900?
  • Is it possible to sniff SD-bus communication from userspace to get indications that the card is starting to fail?

Best Answer

This seems more a hardware/use case problem than something else. I bet common sense might be more important than Linux skills here. Are you doing heavy I/O on the cards, MySQL/Apache/compiling stuff...syslog/frequent system updates? -- comment by Rui F Ribeiro

I can expand on the above. But I agree with the first point, and I agree this was the first question to ask.


  • Should I enable some debug logs for the MMC subsystem?
  • Is there a userspace tool that can sniff what's going on?
  • How do I make the error codes make more sense?

The only confidence I've had in attributing failures has come from the "history" and the general results I get, not from the specific errors returned by low-level commands, which are likely to vary between implementations anyway.

Even with an SSD, from a reasonable brand, I believe I've had bad data returned in place of I/O errors. This has certainly been one of the known failure modes across many SSDs. [2013][2017]. (Possibly surprising to people familiar with the contemporary filesystems and database implementations which often hope for a more manageable set of failure modes). Notice that the papers I link here focused on the data returned; they did not make any more distinction in reported errors, except for the dead drive / bad sector distinction which you have already measured.

My SSD fault was on a "seller refurbished" laptop, which had already been "repaired" once, and was starting to show failures again - plausibly causing a power interruption to the drive just as in the linked papers. It might also have failed to provide stable voltage levels.

I'm trying to understand if it's a matter of bad SD cards or if there might be something wrong with the controller driver that is pushing the cards to corruption.

Good hardware with a good mains power supply doesn't tend to destroy a good SD card - unless you're putting too much load on it. The workload is a very important variable, which you did not [originally] mention. These memory cards are relatively small, usually cheap hardware that is designed for relatively un-demanding use storing media files (hence MMC, "MultiMediaCard"). Particularly cheaper ones won't necessarily be very good at "wear-levelling" (redistributing the load from hotspot logical blocks across a large number of physical blocks).

I have measured workload with a quick hack, scheduling a daily cron job to run tune2fs -l /dev/mmcblk0p4 | grep writes >> /var/log/writes.log.
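To turn that log into a rough write rate, something like the following could be used. It is only a sketch: it assumes the filesystem is ext2/3/4 and that each logged line looks like "Lifetime writes: 12 GB" with a kB/MB/GB/TB unit. The names parse_writes_log and daily_deltas are made up for illustration.

```python
import re

# Units as they are assumed to appear in the log, converted to kB.
UNIT_KB = {"kB": 1, "MB": 1024, "GB": 1024 ** 2, "TB": 1024 ** 3}
LINE = re.compile(r"Lifetime writes:\s+(\d+)\s+(kB|MB|GB|TB)")


def parse_writes_log(text):
    """Return one lifetime-writes sample (in kB) per matching log line."""
    return [int(n) * UNIT_KB[unit] for n, unit in LINE.findall(text)]


def daily_deltas(samples):
    """Difference between consecutive samples, i.e. kB written per cron interval."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Usage, assuming the cron job above has been running for a few days:
# with open("/var/log/writes.log") as f:
#     print(daily_deltas(parse_writes_log(f.read())))
```

The logged value is rounded to whole units, so once the counter reaches the GB range the deltas only resolve to about a gigabyte per sample.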

But if we set the workload aside, you'd be right to consider a possible controller-side issue from the information you gave so far. I've had repeated bad sectors on an SD card due to writes from a pocket device, possibly when its battery was low. This was a card from the one name brand. The sectors were recoverable and I'm still using the same card. I've also had some sort of transient initialization failure on this card, I think it was associated with bad sectors too (once I got past the initialization failure), but I could be mis-remembering.

I'm having a series of failing/sometimes failing SD cards.

The impression I get from your [original] question is that this is a small scale operation, and running a rigorous test matrix with different cards, controllers, and workloads would be overkill.

After the workload, the first variable you control is the card.

Writing in 2018, there is one global name brand which can be considered "canonical" for sd cards -

see results at: https://www.amazon.com/s/field-keywords=sd+card

- and you hopefully have a number of retail channels that can be considered... at least reliable enough for comparison purposes. (Remembering that various popular online retailers act as "marketplace" as well as selling their own goods).

Official Raspberry Pi hardware might also be acceptable, i.e. SD cards sold officially for running Linux on a small board computer, which have been reported to work well (this being a more demanding workload than media files).

As a broad brush, if you get a card that's faster than you strictly need, I also think of that as a potentially higher endurance rating. (Given that speed rating tends to be much more available than endurance).

If you control / measure these two variables, then you can focus your judgements on the rest of the relevant hardware.


The failing ones (can occasionally still be mounted)

Note, in the most general case, if you think a device has been badly written, you may attempt to clear this fault:

  1. recover what data you can if desired
  2. then stop trying to read bad blocks; simply recreate the entire formatting (partition table + filesystem).
  3. but if you're not sure and think the device might still be dying, you probably also want to test it.

If you have nice native MMC hardware like you do, you can use the Linux command blkdiscard as a more efficient way to test erasing all the blocks of the device, before you "reformat" it. But efficiency is the only advantage, compared to testing for errors when overwriting the whole drive with zeros i.e. dd bs=1M if=/dev/zero of=/dev/mmcblk0. (As well as avoiding any need to write the erased blocks, blkdiscard could also in theory provide a more "as-new" performance afterwards, and increase endurance, by giving the device a bit more freedom).
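As a rough illustration of "testing for errors when overwriting", here is a sketch that does the same job as the dd command but records which 1 MiB chunks failed instead of stopping at the first error. The function name zero_fill is made up, and opening with O_SYNC is my own choice so that a write error is reported against the chunk that caused it. Like dd, it destroys everything on the target.

```python
import os

CHUNK = 1024 * 1024  # 1 MiB, matching bs=1M in the dd command


def zero_fill(path, chunk=CHUNK):
    """Overwrite all of `path` with zeros.

    Returns a list of (offset, errno) pairs, one per chunk that failed to write.
    """
    failures = []
    # O_SYNC makes each write wait for the device, so errors are not deferred.
    fd = os.open(path, os.O_WRONLY | os.O_SYNC)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and block devices
        offset = 0
        while offset < size:
            n = min(chunk, size - offset)
            try:
                written = os.pwrite(fd, bytes(n), offset)
            except OSError as exc:
                failures.append((offset, exc.errno))
                written = n  # skip the failed chunk and carry on
            if written == 0:
                break  # no progress; avoid looping forever
            offset += written
    finally:
        os.close(fd)
    return failures


# Usage (DESTROYS ALL DATA on the card, needs root):
# print(zero_fill("/dev/mmcblk0"))
```

An empty list only means the card accepted every write; reading the data back afterwards is still the stronger check.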

(If this was a SATA drive - there is a dedicated "secure erase" command to discard the entire logical drive contents (see man hdparm). However I am not aware of any equivalent MMC command. Certain SSD vendors took advantage of this command to reset their block mapping tables, as a workaround for their failure to recover "as-new" performance with the equivalent blkdiscard sequence. Note this command does not necessarily test a full-drive erase. In some cases it will only erase an internal encryption key).

Since you asked what my errors looked like

My SanDisk micro-SD card played up again recently. It seems the specific errors below were due to flaky connection. It was resolved by removing & re-inserting the micro-SD into the micro-SD to SD adaptor, after superstitiously blowing on all the metal pads.

In the reader on my Dell Latitude E5450 laptop (sdhci-pci kernel driver, Fedora Linux kernel version circa v4.17), it was failing to initialize the card. On my SheevaPlug (same hardware and software details as this question), this card seems to have been able to initialize but it showed IO errors. Perhaps on the Dell the error-handling timeouts are not quite set up correctly.

Dell:

[    2.436566] mmc0: Unknown controller version (3). You may experience problems.
[    2.449019] mmc0: SDHCI controller on PCI [0000:01:00.0] using ADMA
...
[509227.374012] mmc0: error -84 whilst initialising SD card
[509227.621510] mmc0: error -84 whilst initialising SD card
[509227.865472] mmc0: error -84 whilst initialising SD card
[509228.142120] mmc0: error -84 whilst initialising SD card
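For what it's worth, the numbers in these "error -N" messages are ordinary negated Linux errno values, and the MMC layer gives a few of them a specific meaning (see the comment in include/linux/mmc/core.h). A small lookup sketch follows; the table is hard-coded with Linux numbering because Python's errno module follows whatever platform it runs on, and the wording of the meanings is my paraphrase.

```python
# Linux errno numbers and their meaning inside the MMC layer (paraphrased).
MMC_ERRNO = {
    5: ("EIO", "generic I/O error"),
    22: ("EINVAL", "request not possible due to hardware or driver restrictions"),
    84: ("EILSEQ", "bad data on the bus, e.g. CRC check failed or bad end bit"),
    110: ("ETIMEDOUT", "card took too long to respond"),
    123: ("ENOMEDIUM", "host has determined that the slot is empty"),
}


def explain(code):
    """Translate a kernel-style negative error code into a name and a hint."""
    name, meaning = MMC_ERRNO.get(abs(code), ("unknown", "not in this table"))
    return f"{name}: {meaning}"


print(explain(-84))
print(explain(-110))
```

So the -84 above points at corrupted transfers (consistent with a flaky contact), while the -110 in the question only says that something did not answer in time.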

Sheevaplug:

[6076613.118617] mmcblk0: mmc0:aaaa SC16G 14.8 GiB 
[6076613.295811] mmcblk0: error -110 transferring data, sector 0, nr 8, cmd response 0x900, card status 0x0
[6076613.545740] mmcblk0: error -110 transferring data, sector 0, nr 8, cmd response 0x900, card status 0x0
[6076613.555301] mmcblk0: retrying using single block read
[6076613.728413] mmcblk0: error -110 transferring data, sector 0, nr 8, cmd response 0x900, card status 0x0
[6076613.737965] blk_update_request: I/O error, dev mmcblk0, sector 0
[6076613.912043] mmcblk0: error -110 transferring data, sector 1, nr 7, cmd response 0x900, card status 0x0
[6076613.921599] blk_update_request: I/O error, dev mmcblk0, sector 1
...
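The 0x900 value in these messages (and in the question) can be picked apart by hand. Assuming it is an R1 card status word laid out as in Linux's include/linux/mmc/mmc.h (current state in bits 12:9, READY_FOR_DATA in bit 8, error flags in the high bits), a decoder could be sketched like this. Only a handful of the error bits are listed, and decode_status is a made-up name.

```python
# State names for CURRENT_STATE, bits 12:9 of the R1 card status.
STATES = ["idle", "ready", "ident", "stby", "tran", "data", "rcv", "prg", "dis"]

# A subset of the R1 error flags (bit position -> name).
ERROR_BITS = {
    31: "OUT_OF_RANGE",
    30: "ADDRESS_ERROR",
    23: "COM_CRC_ERROR",
    22: "ILLEGAL_COMMAND",
    21: "CARD_ECC_FAILED",
    20: "CC_ERROR",
    19: "ERROR",
}


def decode_status(status):
    """Split an R1 card status word into state, ready flag and error flags."""
    state = (status >> 9) & 0xF
    return {
        "state": STATES[state] if state < len(STATES) else "reserved",
        "ready_for_data": bool(status & (1 << 8)),
        "errors": [name for bit, name in ERROR_BITS.items() if status & (1 << bit)],
    }


print(decode_status(0x900))
```

Read this way, 0x900 is "transfer state, ready for data, no error flags set", so the card itself is not reporting a fault in that word. That fits with my view that the low-level codes tell you less than the overall history does.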