Linux – Why is dd with oflag=direct slower when writing to a disk than to a file?

coreutils, filesystems, hard-disk, linux, performance

I am trying to compare the aggregate write rate when writing to a file in a GPFS file system with the rate when writing directly to a disk, on a system running Red Hat Enterprise Linux Server release 6.4 (Santiago). For my application I need to measure the raw rate, i.e. without taking advantage of the cache. I do not understand the effect of the direct option used with dd to bypass the cache: when writing directly to a block device I get a drastically lower rate with oflag=direct than when writing to a file in the GPFS file system. Why does this happen?

To measure the aggregate rate I start p dd processes that write concurrently to the block device or to files, and then sum the p rates obtained to get the aggregate write rate (the summation step is sketched after the script below).

    #!/bin/bash
    directdiskrate=~/scratch/rate5
    syncdiskrate=~/scratch/rate4
    filerate=~/scratch/rate3
    numruns=1
    numthreads=30

    # to disk, use both conv=fsync and oflag=direct
    writetodiskdirect="dd if=/dev/zero of=/dev/sdac bs=256k count=4096 conv=fsync oflag=direct iflag=fullblock"
    for p in $(seq $numthreads)
    do
        # parse dd output: the rate is on the last line, fields separated by commas
        $writetodiskdirect 2>&1 | tail -n 1 | awk 'BEGIN { FS = "," } ; { print $3 }' | sed -e 's/MB\/s//g' >> $directdiskrate &
    done
    wait

    # to disk, use only conv=fsync
    writetodisksync="dd if=/dev/zero of=/dev/sdac bs=256k count=4096 conv=fsync iflag=fullblock"
    for p in $(seq $numthreads)
    do
        # parse dd output: the rate is on the last line, fields separated by commas
        $writetodisksync 2>&1 | tail -n 1 | awk 'BEGIN { FS = "," } ; { print $3 }' | sed -e 's/MB\/s//g' >> $syncdiskrate &
    done
    wait

    # to file, use both conv=fsync and oflag=direct
    for p in $(seq $numthreads)
    do
        writetofile="dd if=/dev/zero of=/gpfs1/fileset6/file$p bs=256k count=4096 conv=fsync oflag=direct"
        # parse dd output: the rate is on the last line, fields separated by commas
        $writetofile 2>&1 | tail -n 1 | awk 'BEGIN { FS = "," } ; { print $3 }' | sed -e 's/MB\/s//g' >> $filerate &
    done
    wait
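
The script above only records the per-process rates; a minimal sketch of the summation step, assuming each rate file ends up with one MB/s value per line, might look like this:

    # sum the per-process rates (one MB/s value per line) to get the aggregate
    # rate for each of the three cases; paths match the variables set above
    for f in ~/scratch/rate5 ~/scratch/rate4 ~/scratch/rate3
    do
        awk -v name="$f" '{ sum += $1 } END { printf "%s: %.1f MB/s aggregate\n", name, sum }' "$f"
    done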

Results: the per-process write rates for the 30 processes are as follows:

  1. Writing to the disk using only the conv=fsync option, each process gets a write rate of ~180 MB/s
  2. Writing to the disk using both conv=fsync and oflag=direct, each process gets a write rate of ~9 MB/s
  3. Writing to a file in the GPFS file system, using both conv=fsync and oflag=direct, each process gets a write rate of ~80 MB/s

Best Answer

This difference undoubtedly comes down to one thing: caching.

It will be really difficult to pin down exactly where, especially from userland, but all Linux kernels buffer (cache) ordinary filesystem writes unless you take special steps to get synchronous writes. That is, the kernel saves the data dd writes to a file somewhere in kernel memory, presumably going through the file system code to do so. Some time in the future, the kernel schedules the corresponding disk blocks to go out to the disk. That happens "asynchronously", some time after the kernel has already told dd that the write finished.
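
A quick way to see this buffering in action (a rough sketch, not part of the original benchmark; the /tmp path and sizes are just examples, and /tmp must be on a disk-backed file system rather than tmpfs) is to watch the Dirty counter in /proc/meminfo around a buffered write:

    # buffered write: dd returns almost immediately while Dirty grows,
    # then drains as the kernel writes the cached pages back to disk
    grep Dirty /proc/meminfo
    dd if=/dev/zero of=/tmp/buffered-test bs=256k count=4096
    grep Dirty /proc/meminfo    # typically much larger right after dd exits
    sync                        # force writeback to the disk
    grep Dirty /proc/meminfo    # back down once the data has hit the disk
    rm /tmp/buffered-test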

The reason for this is that moving bytes over a bus and into a disk drive, and then onto the disk platters, is much slower than even copying from user to kernel memory. Ordinarily, programs don't care too much that the data they just "wrote" won't make it to the disk for a while. Hardware reliability is high enough that the data almost always makes it to the platter.

That's the simple answer, but once reads/writes/deletes are all buffered up in the kernel, the file system code can take advantage of short file lifetimes by never writing out the data of files that get deleted before their data reaches the disk. It can also group small writes that fall within a single larger disk block and consolidate them into one write. There are plenty of optimizations like these in most file systems.
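
As a rough illustration of the short-lifetime point (a sketch only, assuming /tmp is on a disk-backed file system rather than tmpfs and that iostat from the sysstat package is available), a small buffered file that is deleted before writeback may generate little or no disk traffic at all:

    # watch per-device write activity in the background while a short-lived
    # file is created and removed before the kernel gets around to writing it
    iostat -d 1 &
    IOSTAT_PID=$!
    dd if=/dev/zero of=/tmp/short-lived bs=256k count=64   # buffered write
    rm /tmp/short-lived                                     # deleted before writeback
    sleep 5
    kill $IOSTAT_PID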
