Debian – Rsync hangs during file transfer to USB disk

debianrsync

I'm using rsync to backup my data to an external USB3 disk (encrypted with dmcrypt/luks) and the problem is that the transfer hangs on a file, for an amount of time that can go from seconds to minutes, and then resumes with no errors or issues. This happens to several files (apparently random) during the same rsync "session" making it very slow, even though some file transfers can reach speeds of 100MB/s.

I'm running Debian Jessie 8.5, rsync is at version 3.1.1, the source file system is formatted with btrfs (version 3.17) and the external disk was encrypted with crypsetup 1.6.6.
The encrypted partition was formatted with btrfs, but after noticing this issue and finding this apparently unrelated ubuntu bug, I reformatted the partition to ext4 and, although it seemed to make the issue less frequent, the problem was still there.

During these "hangs" no strange CPU or memory usage is detected but disk reads and writes drop to zero. This is an iotop output during a freeze:

Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
21879 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.00 % [kworker/6:1]
 1085 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.00 % [kworker/3:2]
31994 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.00 % [kworker/4:3]
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:0H]
    7 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_sched]

The kworker processes are always changing but keep the 99% IO.
This is an iostat output during one of the freezes (the external disk is sdg):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.25   99.75    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00  141.00     0.00 16920.00   240.00   135.94  868.20    0.00  868.20   7.09 100.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00 20343.88    0.00    0.00    0.00   0.00 100.00

I also ran ps aux | awk '$8 ~ /D/ { print $0 }' with watch and during the freeze it's this:

root      1080  0.1  0.0      0     0 ?        D    16:23   0:00 [kworker/0:0]
root      5851  0.0  0.0      0     0 ?        D    01:41   0:02 [btrfs-transacti]
root     17455  4.4  0.0 105028  5192 pts/3    D+   14:10   6:11 rsync -avr --stats --progress --inplace --delete /data/ /media/BKP-DISK/
root     24219  0.1  0.0      0     0 ?        D    15:16   0:08 [kworker/5:0]
root     31892  0.2  0.0      0     0 ?        D    Aug02   2:08 [usb-storage]
root     31956  0.1  0.0      0     0 ?        D    15:41   0:04 [kworker/7:0]
root     31994  0.0  0.0      0     0 ?        D    15:42   0:01 [kworker/4:3]
root     32100  0.1  0.0      0     0 ?        D    15:52   0:03 [kworker/u16:2]

When the transfer is running ok it's this:

root     17453  4.4  0.1 105020 33304 pts/3    D+   14:10   6:32 rsync -avr --stats --progress --inplace --delete /data/ /media/BKP-DISK/

I'm out of ideas and know-how so I need help to troubleshoot this further.


Edit

@derobert
I tested in an USB2 port but the issue continues to appear (found a gap of 11 seconds in the strace log and then stopped the test).
The last dmesg backtrace was when the external disk was still formatted with btrfs and here's the output (there were more but all the same):

INFO: task kworker/u16:21:12881 blocked for more than 120 seconds.
      Not tainted 3.16.0-4-amd64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:21  D ffff8807f72bfa48     0 12881      2 0x00000000
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
 ffff8807f72bf5f0 0000000000000046 0000000000012f00 ffff88022dcfbfd8
 0000000000012f00 ffff8807f72bf5f0 ffff880103b92c00 ffff88026de241f0
 ffff88026de241f0 0000000000000001 0000000000000000 ffff88010c4662d8
Call Trace:
 [<ffffffffa02843ef>] ? wait_current_trans.isra.20+0x9f/0xf0 [btrfs]
 [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffffa0285948>] ? start_transaction+0x298/0x570 [btrfs]
 [<ffffffffa028da90>] ? btrfs_finish_ordered_io+0x250/0x5c0 [btrfs]
 [<ffffffffa02b2f25>] ? normal_work_helper+0xb5/0x290 [btrfs]
 [<ffffffff810817c2>] ? process_one_work+0x172/0x420
 [<ffffffff81081e53>] ? worker_thread+0x113/0x4f0
 [<ffffffff81510d61>] ? __schedule+0x2b1/0x700
 [<ffffffff81081d40>] ? rescuer_thread+0x2d0/0x2d0
 [<ffffffff8108809d>] ? kthread+0xbd/0xe0
 [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
 [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
 [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180

@roaima
Since I run rsync with --progress I can see the current state of the transfer and that's how I first caught the issue. For example, for a file that is 1GB it could hang at 100MB and I would see all the transfer info (transfered bytes, speed, etc) stop updating (this is where iotop would show disk reads and write at 0), and when the info would start updating again iotop would show normal read and write values.

@activesheetd
Here is a section of the strace log (I added the timestamp option):

29253 03:47:18 <... select resumed> )   = 1 (in [0], left {59, 999999})
29253 03:47:18 read(0, "\355\1H\347?~\0\255", 8) = 8
29251 03:47:18 select(6, [5], [4], [5], {60, 0} <unfinished ...>
29253 03:47:18 write(1, "\235\356\374|\f\230\310u\330{\7\24\3169<\255\213>\347m\335kX\350\234\253\1\226M\6#\341"..., 262144) = 262144
29253 03:47:31 select(1, [0], [], [0], {60, 0}) = 1 (in [0], left {59, 999997})
29253 03:47:31 read(0,  <unfinished ...>
29251 03:47:31 <... select resumed> )   = 1 (out [4], left {47, 597230}) 

Between the 4th and the 5th lines we can see a gap of 13 seconds, which corresponded with a hang, and then it resumed.

@Fiximan
The log option doesn't give me more info on this problem. Since the freeze is in the middle of a file transfer for the logs it's like nothing happened (even the strace logs show timestamp gaps).

Best Answer

Try the following to understand what is going on:

Execute strace in order to analize the system calls and it's responses:

sudo strace -o strace-output.log -f <your rsync command here>

And then analize strace-output.log log file, look at "man strace" for more details or share the log file with us.

Execute lsof while running rsync, it will show you on which file it gets stuck

sudo lsof -p <rsync PID> | grep data
Related Question