Debian – Rsync hangs during file transfer to USB disk

debianrsync

I'm using rsync to backup my data to an external USB3 disk (encrypted with dmcrypt/luks) and the problem is that the transfer hangs on a file, for an amount of time that can go from seconds to minutes, and then resumes with no errors or issues. This happens to several files (apparently random) during the same rsync "session" making it very slow, even though some file transfers can reach speeds of 100MB/s.

I'm running Debian Jessie 8.5, rsync is at version 3.1.1, the source file system is formatted with btrfs (version 3.17) and the external disk was encrypted with crypsetup 1.6.6.
The encrypted partition was formatted with btrfs, but after noticing this issue and finding this apparently unrelated ubuntu bug, I reformatted the partition to ext4 and, although it seemed to make the issue less frequent, the problem was still there.

During these "hangs" no strange CPU or memory usage is detected but disk reads and writes drop to zero. This is an iotop output during a freeze:

Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
21879 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.00 % [kworker/6:1]
 1085 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.00 % [kworker/3:2]
31994 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.00 % [kworker/4:3]
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:0H]
    7 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_sched]

The kworker processes are always changing but keep the 99% IO.
This is an iostat output during one of the freezes (the external disk is sdg):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.25   99.75    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00  141.00     0.00 16920.00   240.00   135.94  868.20    0.00  868.20   7.09 100.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00 20343.88    0.00    0.00    0.00   0.00 100.00

I also ran ps aux | awk '$8 ~ /D/ { print $0 }' with watch and during the freeze it's this:

root      1080  0.1  0.0      0     0 ?        D    16:23   0:00 [kworker/0:0]
root      5851  0.0  0.0      0     0 ?        D    01:41   0:02 [btrfs-transacti]
root     17455  4.4  0.0 105028  5192 pts/3    D+   14:10   6:11 rsync -avr --stats --progress --inplace --delete /data/ /media/BKP-DISK/
root     24219  0.1  0.0      0     0 ?        D    15:16   0:08 [kworker/5:0]
root     31892  0.2  0.0      0     0 ?        D    Aug02   2:08 [usb-storage]
root     31956  0.1  0.0      0     0 ?        D    15:41   0:04 [kworker/7:0]
root     31994  0.0  0.0      0     0 ?        D    15:42   0:01 [kworker/4:3]
root     32100  0.1  0.0      0     0 ?        D    15:52   0:03 [kworker/u16:2]

When the transfer is running ok it's this:

root     17453  4.4  0.1 105020 33304 pts/3    D+   14:10   6:32 rsync -avr --stats --progress --inplace --delete /data/ /media/BKP-DISK/

I'm out of ideas and know-how so I need help to troubleshoot this further.

Edit

@derobert
I tested in an USB2 port but the issue continues to appear (found a gap of 11 seconds in the strace log and then stopped the test).
The last dmesg backtrace was when the external disk was still formatted with btrfs and here's the output (there were more but all the same):

INFO: task kworker/u16:21:12881 blocked for more than 120 seconds.
      Not tainted 3.16.0-4-amd64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:21  D ffff8807f72bfa48     0 12881      2 0x00000000
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
 ffff8807f72bf5f0 0000000000000046 0000000000012f00 ffff88022dcfbfd8
 0000000000012f00 ffff8807f72bf5f0 ffff880103b92c00 ffff88026de241f0
 ffff88026de241f0 0000000000000001 0000000000000000 ffff88010c4662d8
Call Trace:
 [<ffffffffa02843ef>] ? wait_current_trans.isra.20+0x9f/0xf0 [btrfs]
 [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffffa0285948>] ? start_transaction+0x298/0x570 [btrfs]
 [<ffffffffa028da90>] ? btrfs_finish_ordered_io+0x250/0x5c0 [btrfs]
 [<ffffffffa02b2f25>] ? normal_work_helper+0xb5/0x290 [btrfs]
 [<ffffffff810817c2>] ? process_one_work+0x172/0x420
 [<ffffffff81081e53>] ? worker_thread+0x113/0x4f0
 [<ffffffff81510d61>] ? __schedule+0x2b1/0x700
 [<ffffffff81081d40>] ? rescuer_thread+0x2d0/0x2d0
 [<ffffffff8108809d>] ? kthread+0xbd/0xe0
 [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
 [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
 [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180

@roaima
Since I run rsync with --progress I can see the current state of the transfer and that's how I first caught the issue. For example, for a file that is 1GB it could hang at 100MB and I would see all the transfer info (transfered bytes, speed, etc) stop updating (this is where iotop would show disk reads and write at 0), and when the info would start updating again iotop would show normal read and write values.

@activesheetd
Here is a section of the strace log (I added the timestamp option):

29253 03:47:18 <... select resumed> )   = 1 (in [0], left {59, 999999})
29253 03:47:18 read(0, "\355\1H\347?~\0\255", 8) = 8
29251 03:47:18 select(6, [5], [4], [5], {60, 0} <unfinished ...>
29253 03:47:18 write(1, "\235\356\374|\f\230\310u\330{\7\24\3169<\255\213>\347m\335kX\350\234\253\1\226M\6#\341"..., 262144) = 262144
29253 03:47:31 select(1, [0], [], [0], {60, 0}) = 1 (in [0], left {59, 999997})
29253 03:47:31 read(0,  <unfinished ...>
29251 03:47:31 <... select resumed> )   = 1 (out [4], left {47, 597230})

Between the 4th and the 5th lines we can see a gap of 13 seconds, which corresponded with a hang, and then it resumed.

@Fiximan
The log option doesn't give me more info on this problem. Since the freeze is in the middle of a file transfer for the logs it's like nothing happened (even the strace logs show timestamp gaps).

Best Answer

Try the following to understand what is going on:

Execute strace in order to analize the system calls and it's responses:

sudo strace -o strace-output.log -f <your rsync command here>

And then analize strace-output.log log file, look at "man strace" for more details or share the log file with us.

Execute lsof while running rsync, it will show you on which file it gets stuck

sudo lsof -p <rsync PID> | grep data

Related Solutions

Linux – Resume transfer of a single file by rsync

To resume an interrupted copy, you should use rsync --append. From the man page's explanation of --append:

This causes rsync to update a file by appending data onto the end of the file, which presumes that the data that already exists on the receiving side is identical with the start of the file on the sending side. [...] Implies --inplace, [...]

Option --inplace makes rsync (over)write the destination file contents directly; without --inplace, rsync would:

create a new file with a temporary name,
copy updated content into it,
swap it with the destination file, and finally
delete the old copy of the destination file.

The normal mode of operation mainly prevents conflicts with applications that might have the destination file open, and a few other mishaps which are duly listed in the rsync manpage.

Note that, if a copy/update operation fails in steps 1.-3. above, rsync will delete the temporary destination file; the --partial option disables this behavior and rsync will leave partially-transferred temporary files on the destination filesystem. Thus, resuming a single file copy operation will not gain much unless you called the first rsync with --partial or --partial-dir (same effect as --partial, in addition instructs rsync to create all temporary files in a specific directory).

How to skip special / socket files during rsync transfer

rsync -a --no-specials --no-devices would tell rsync to skip these files. It will still print an information message, but it would return 0 if no other error occurs.

If there's a set of known paths that you don't want to transfer, you could exclude them altogether. Also, do pass the -x option to skip all mounted filesystems (including /dev, which takes care of the biggest offender), and if there are multiple on-disk filesystems, list all the mount points (e.g. rsync -ax / /home /destination).

rsync -ax --exclude='/var/spool/postfix/dev/*' / /mnt/shared/backup

If none of that is satisfactory, make a list of files you want to skip. Beware that if some of the file names are under control of an adversary, they could cause some files to be omitted from a backup. For example, if they create a directory whose name is a newline and create a named socket called * inside it, then using the output of find -type s as an exclude list would result in /* being excluded. To prevent such problems, keep problematic names out of the exclude list.

{
  cd /path/to/source &&
  find . -name '*[\[?*
]*' -prune -o \
               \( -type b -o -type c -o -type p -o -type s \) -print |
  sed 's/^\.//'
} | rsync -a --exclude-from=- /path/to/source /destination

Best Answer

Related Solutions

Linux – Resume transfer of a single file by rsync

How to skip special / socket files during rsync transfer

Related Question