Linux Hardware – Stress Testing SD Cards

hardware · linux · sd card

I got into a little debate with someone yesterday regarding the logic and/or veracity of my answer here, viz., that logging and maintaining filesystem metadata on a decent-sized (GB+) SD card could never be significant enough to wear the card out in a reasonable amount of time (years and years). The gist of the counter-argument seemed to be that I must be wrong, since there are so many stories online of people wearing out SD cards.

Since I do have devices with SD cards in them containing rw root filesystems that are left on 24/7, I had tested the premise before to my own satisfaction. I've tweaked this test a bit, repeated it (using the same card, in fact) and am presenting it here. The two central questions I have are:

  1. Is the method I used to attempt to wreck the card viable, keeping in mind it's intended to reproduce the effects of continuously re-writing small amounts of data?
  2. Is the method I used to verify the card was still okay viable?

I'm putting the question here rather than S.O. or SuperUser because an objection to the first part would probably have to assert that my test didn't really write to the card the way I'm sure it does, and asserting that would require some special knowledge of Linux.

[It could also be that SD cards use some kind of smart buffering or cache, such that repeated writes to the same place would be buffered/cached somewhere less prone to wear. I haven't found any indication of this anywhere, but I am asking about that on S.U.]

The idea behind the test is to write to the same small block on the card millions of times. This is well beyond any claim of how many write cycles such devices can sustain, but presuming wear leveling is effective, if the card is of a decent size, millions of such writes still shouldn't matter much, as "the same block" would not literally be the same physical block. To do this, I needed to make sure every write was truly flushed to the hardware, and to the same apparent place.

For flushing to hardware, I relied on the POSIX library call fdatasync():

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>

// Compile with -std=gnu99

#define BLOCK (1 << 16)

int main (void) {
    int in = open ("/dev/urandom", O_RDONLY);
    if (in < 0) {
        fprintf(stderr,"open in: %s\n", strerror(errno));
        exit(1);
    }

    int out = open("/dev/sdb1", O_WRONLY);
    if (out < 0) {
        fprintf(stderr,"open out: %s\n", strerror(errno));
        exit(1);
    }

    fprintf(stderr,"BEGIN\n");

    char buffer[BLOCK];
    unsigned int count = 0;
    int thousands = 0;
    for (unsigned int i = 1; i != 0; i++) {
        ssize_t r = read(in, buffer, BLOCK);
        ssize_t w = write(out, buffer, BLOCK);
        if (r != w) {
            fprintf(stderr, "r %zd w %zd\n", r, w);
            if (errno) {
                fprintf(stderr,"%s\n", strerror(errno));
                break;
            }
        }
        if (fdatasync(out) != 0) {
            fprintf(stderr,"Sync failed: %s\n", strerror(errno));
            break;
        }
        count++;
        if (!(count % 1000)) {
            thousands++;
            fprintf(stderr,"%d000...\n", thousands);
        }
        lseek(out, 0, SEEK_SET);
    }
    fprintf(stderr,"TOTAL %u\n", count);
    close(in);
    close(out);

    return 0;
}                                 

I ran this for ~8 hours, until I had accumulated 2 million+ writes to the beginning of the /dev/sdb1 partition.1 I could just as easily have used /dev/sdb (the raw device and not the partition), but I cannot see what difference this would make.
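
For reference, building and launching the program amounts to something like the following (the source and log file names are assumptions; the target device is hard-coded as /dev/sdb1 in the code above and must not be mounted):

    # Build with GNU99 extensions, as noted in the source
    gcc -std=gnu99 -o sdwrite sdwrite.c

    # Run as root so /dev/sdb1 can be opened directly; progress goes to stderr
    sudo ./sdwrite 2> sdwrite.log &

    # Watch the running count of completed writes
    tail -f sdwrite.log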

I then checked the card by trying to create and mount a filesystem on /dev/sdb1. This worked, indicating the specific block I had been writing to all night was still usable. However, it does not rule out the possibility that some regions of the card had been worn out and remapped by wear levelling while remaining accessible.
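
Concretely, that check amounts to something like this (the filesystem type, mount point, and test file are assumptions, chosen only to illustrate the shape of the check):

    # Make a fresh filesystem on the partition that was hammered overnight
    sudo mkfs.ext2 /dev/sdb1

    # Mount it, write a small file, read it back, then unmount
    sudo mount /dev/sdb1 /mnt
    echo "still alive" | sudo tee /mnt/check.txt
    sudo cat /mnt/check.txt
    sudo umount /mnt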

To test for that, I used badblocks -v -w on the partition. This is a destructive read-write test, but wear levelling or not, it should be a strong indication of the viability of the card, since it must still provide space for each rolling write. In other words, it is the literal equivalent of filling the card completely and then checking that all of it was okay; several times over, in fact, since I let badblocks work through a few patterns.
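
The invocation was along these lines; the second form is a sketch of the -b/-c variant mentioned below, with block size and count values that are assumptions rather than the exact figures used:

    # Destructive read-write test of the whole partition, verbose output;
    # badblocks writes a pattern everywhere, reads it back, then repeats
    # with further patterns
    sudo badblocks -v -w /dev/sdb1

    # Same test with an explicit block size (-b, bytes) and blocks-per-pass (-c):
    # here 64 KiB x 64 = 4 MiB is written to hardware and verified at a time
    sudo badblocks -v -w -b 65536 -c 64 /dev/sdb1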

[Contra Jason C's comments below, there is nothing wrong or false about using badblocks this way. While it would not be useful for actually identifying bad blocks, due to the nature of SD cards, it is fine for doing destructive read-write tests of an arbitrary size using the -b and -c switches, which is where the revised test went (see my own answer). No amount of magic or caching by the card's controller can fool a test whereby several megabytes of data are written to hardware and read back again correctly. Jason's other comments seem based on a misreading (IMO an intentional one), which is why I have not bothered to argue. With that heads-up, I leave it to the reader to decide what makes sense and what does not.]

1 The card was an old 4 GB SanDisk card (it has no "class" number on it) which I've barely used. Once again, keep in mind that this is not 2 million writes to literally the same physical place; due to wear leveling, the "first block" will have been moved constantly by the controller during the test to, as the term states, level out the wear.

Best Answer

I think stress testing an SD card is problematic in general, given two things:

  1. Wear leveling. There are no guarantees that successive writes are actually exercising the same physical locations on the SD. Remember that most SD systems in place actively take a block as we know it and move the physical location that backs it around, based on the perceived "wear" that each location has been subjected to.

  2. Different technologies (MLC vs. SLC). The other issue that I see with this is the difference in technologies. I would expect SLC types of SSD to have a far longer life than the MLC variety. Also, MLC has to work within much tighter tolerances that SLCs just don't have to deal with, or at least SLCs are much more tolerant of failing in this way.

    • MLC - Multi Level Cell
    • SLC - Single Level Cell

The trouble with MLC is that a given cell can store multiple values: the bits are essentially stacked, encoded as intermediate voltage levels rather than simply a physical +5V or 0V, for example. This can lead to a much higher failure-rate potential than their SLC equivalent.

Life expectancy

I found this link that discusses a bit about how long the hardware can last. It's titled: Know Your SSDs - SLC vs. MLC.

SLC

SLC SSDs can be calculated, for the most part, to live anywhere between 49 and 149 years, on average, by the best estimates. The Memoright testing can validate the 128 GB SSD as having a write-endurance lifespan in excess of 200 years with an average write of 100 GB per day.

MLC

This is where the MLC design falls short. No figures have been released as of yet. Nobody has really examined what kind of life expectancy is assured with MLC, except that it will be considerably lower. I have received several different estimates which average out to a 10-to-1 lifespan in favour of the SLC design. A conservative guess is that most lifespan estimates will come in between 7 and 10 years, depending on the advancement of ‘wear leveling algorithms’ within the controllers of each manufacturer.

Comparisons

To draw a comparison by way of write cycles, an SLC would have a lifetime of 100,000 complete write cycles, in comparison to MLC, which has a lifetime of 10,000 write cycles. This could increase significantly depending on the design of the ‘wear leveling’ utilized.
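
To connect those cycle counts back to the test in the question, a rough back-of-the-envelope calculation (all figures assumed: a 4 GB card, 10,000-cycle MLC flash, and the 2 million 64 KiB writes from the test) might look like this:

    # Total write budget once wear leveling spreads writes across the card:
    # roughly capacity (GiB) times cycles per cell
    echo $(( 4 * 10000 ))                    # ~40,000 GiB of raw endurance

    # Data actually written by the test: 2 million blocks of 64 KiB
    echo $(( 2000000 * 64 / 1024 / 1024 ))   # ~122 GiB, well under 1% of the budget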