Confused about dd write speed and VFS page cache

benchmark, dd, hard-disk, storage, vfs

After reading some articles on the Linux VFS page cache and tunable parameters like dirty_ratio, I was under the impression that the page cache would act as both a read and a write caching layer.
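For reference, these are the writeback tunables I mean (standard vm sysctls; the values shown are the usual kernel defaults, not necessarily my settings):

# sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500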

But in the simple test below, the page cache works well to improve read speed for files that are already resident in it, yet it doesn't seem to help writes at all.

e.g.

Clear the cache and write to file.

# swapoff -a
# echo 3 > /proc/sys/vm/drop_caches

# dd if=/dev/zero of=/home/flo/test bs=1M count=30
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 0.182474 s, 172 MB/s
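(Side note: as far as I understand, without a flag like conv=fdatasync, dd reports the speed of writing into the page cache rather than to disk. A variant like the following should force a flush before dd exits, so the reported rate would include the time to actually reach the disk:)

# dd if=/dev/zero of=/home/flo/test bs=1M count=30 conv=fdatasync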

Check that the file is actually in the page cache.

# vmtouch /home/flo/test 
           Files: 1
     Directories: 0
  Resident Pages: 7680/7680  30M/30M  100%
         Elapsed: 0.000673 seconds
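If writeback were being deferred the way I expected, another quick way to see it would be to watch the dirty page counters (standard /proc/meminfo fields); right after the dd above, Dirty should briefly sit at roughly 30 MB until writeback kicks in:

# grep -E '^(Dirty|Writeback):' /proc/meminfo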

Read from the file to confirm the data is actually coming from the cache.

# dd if=/home/flo/test of=/dev/null bs=1M count=30
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 0.00824169 s, 3.8 GB/s

Drop the cache and read again to show the speed difference.

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/home/flo/test of=/dev/null bs=1M count=30
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 0.132531 s, 237 MB/s

Since I'm not using O_DIRECT with dd, I was expecting the page cache to be used as a writeback-type cache, with the data eventually committed to disk based on dirty_ratio or dirty_expire_centisecs.
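(For comparison, this is what an explicit direct-I/O write, bypassing the page cache entirely, would look like; if the 172 MB/s above really is raw disk speed, I'd expect this to land in the same ballpark:)

# dd if=/dev/zero of=/home/flo/test bs=1M count=30 oflag=direct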

Can someone please explain how the VFS handles reads and writes differently, and in particular why writes see no speed gain?

Is there any way to make the VFS more aggressive about write caching, so that it behaves more like the writeback cache you might find on a RAID controller, for example?
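(What I had in mind was something along the lines of raising the dirty thresholds; these are the standard vm sysctls, with values chosen purely for illustration:)

# sysctl -w vm.dirty_ratio=80
# sysctl -w vm.dirty_background_ratio=50
# sysctl -w vm.dirty_expire_centisecs=30000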

Thank you

fLo

Best Answer

To see the fast behaviour, I have to do rm test first; then I see dd report 1 GB/s instead of 150 MB/s.
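In other words, the difference reproduces like this (same dd invocation as in the question):

dd if=/dev/zero of=test bs=1M count=30   # overwriting the existing file: slow, ~150 MB/s
rm test
dd if=/dev/zero of=test bs=1M count=30   # writing a fresh file: fast, ~1 GB/s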

References:

Although the references explain why I thought to try this, they don't actually explain why it causes the I/O to block.

On my computer, the blocking only seemed to happen inside the new WBT ("writeback throttling") code, which was added in 2016, after you asked your question. I haven't analyzed why it causes this, but it went away when WBT was disabled.
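(For the record, WBT can be toggled per device through sysfs; according to the kernel's queue-sysfs documentation, writing 0 to wb_lat_usec disables it and -1 restores the default. sda here is just my test device:)

cat /sys/block/sda/queue/wb_lat_usec
echo 0 | sudo tee /sys/block/sda/queue/wb_lat_usec   # disable writeback throttling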

My kernel version is 4.18.16-200.fc28.x86_64.

strace -T shows that all the time was spent in close(), which makes the most sense to me. I tried to use perf as well; it didn't work quite how it was supposed to, but it showed stack traces like

dd 17068 [003] 475165.381526:       sched:sched_switch: dd:17068 [120] T ==> kworker/3:1H:19326 [100]
    ffffffffa390c172 __sched_text_start+0x352 ([kernel.kallsyms])
    ffffffffa390c172 __sched_text_start+0x352 ([kernel.kallsyms])
    ffffffffa390c6a8 schedule+0x28 ([kernel.kallsyms])
    ffffffffa30def32 io_schedule+0x12 ([kernel.kallsyms])
    ffffffffa3461ed7 wbt_wait+0x337 ([kernel.kallsyms])
    ffffffffa342ee33 blk_queue_bio+0x123 ([kernel.kallsyms])
    ffffffffa342d114 generic_make_request+0x1a4 ([kernel.kallsyms])
    ffffffffa342d3c5 submit_bio+0x45 ([kernel.kallsyms])
    ffffffffa3377d78 ext4_io_submit+0x48 ([kernel.kallsyms])
    ffffffffa335da2c ext4_writepages+0x70c ([kernel.kallsyms])
    ffffffffa3209311 do_writepages+0x41 ([kernel.kallsyms])
    ffffffffa31f808e __filemap_fdatawrite_range+0xbe ([kernel.kallsyms])
    ffffffffa334b9ec ext4_release_file+0x6c ([kernel.kallsyms])
    ffffffffa32a9d4e __fput+0xae ([kernel.kallsyms])
    ffffffffa30cf474 task_work_run+0x84 ([kernel.kallsyms])
    ffffffffa3003e6e exit_to_usermode_loop+0xce ([kernel.kallsyms])
    ffffffffa300425d do_syscall_64+0x14d ([kernel.kallsyms])
    ffffffffa3a00088 entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
        7fcca3a60654 __close+0x14 (/usr/lib64/libc-2.27.so)

which reminded me that I was currently testing the deadline I/O scheduler, with WBT ("writeback throttling") enabled. Disabling WBT (including by switching to CFQ, which is incompatible with it) gave me the fast behaviour again!
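(Checking and switching the scheduler goes through the same sysfs directory; the bracketed entry is the active one, and again sda is just my device:)

cat /sys/block/sda/queue/scheduler                   # e.g. prints: noop [deadline] cfq
echo cfq | sudo tee /sys/block/sda/queue/scheduler   # switching to CFQ also turns off WBT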

The perf commands I used to see this were:

sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -gP -o ~/perf.data dd if=/dev/zero of=test bs=1M count=30
sudo perf script -i ~/perf.data | cat