To answer the main question in short: rsync seems to write double the number of bytes because it spawns two processes to do the copy, so the data is written twice -- once into the stream between the processes, and once from the receiving process to the target file.
We can tell this by looking at the strace output in more detail: the process IDs at the beginning of each line, together with the file descriptor numbers in the write calls, let us tell the different write "streams" apart.
Presumably, this is so that a local transfer can work just like a remote transfer, only the source and destination are on the same system.
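The same double write happens in any two-process copy pipeline. As a crude analogy (not rsync's actual mechanism -- rsync uses a socket pair between its processes -- and the filenames here are made up):

```shell
# Two-process copy: the first cat writes every byte into the pipe,
# and the second cat writes it again into the destination file,
# so the data is written twice in total.
printf 'hello\n' > src
cat src | cat > dst
cmp src dst && echo "copies match"
```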
Using something like strace -e trace=process,socketpair,open,read,write would show the extra processes being spawned, the socket pair being created between them, and the different processes opening the input and output files.
A test run similar to yours:
$ rm test2
$ strace -f -e trace=process,socketpair,open,close,dup,dup2,read,write -o rsync.log rsync -avcz --progress test1 test2
$ ls -l test1 test2
-rw-r--r-- 1 itvirta itvirta 81920004 Jun 21 20:20 test1
-rw-r--r-- 1 itvirta itvirta 81920004 Jun 21 20:20 test2
Let's take a count of bytes written for each thread separately:
$ for x in 15007 15008 15009 ; do echo -en "$x: " ; grep -E "$x (<... )?write" rsync.log | awk 'BEGIN {FS=" = "} {sum += $2} END {print sum}' ; done
15007: 81967265
15008: 49
15009: 81920056
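The streams can also be separated directly by file descriptor. A small sketch of that accounting, run here on a few fabricated strace-style lines (the PIDs, fds, and sizes below are illustrative, not taken from the run above):

```shell
# Fabricated strace -f style output for demonstration
cat > sample.log <<'EOF'
15007 write(3, "\0\0\0"..., 262144) = 262144
15007 write(1, "progress"..., 40) = 40
15009 write(4, "\0\0\0"..., 262144) = 262144
15009 write(4, "\0\0\0"..., 56) = 56
EOF

# Sum the bytes written per (pid, fd) pair: each distinct pair
# corresponds to one write "stream"
awk '$2 ~ /^write\(/ {
        fd = $2; sub(/^write\(/, "", fd); sub(/,$/, "", fd)
        sum[$1 " fd" fd] += $NF
     } END { for (k in sum) print k, sum[k] }' sample.log
```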
Which matches the theory above pretty well. I didn't check what the extra ~47 kB written by the first process is, but I'll assume it's the progress output, plus whatever metadata about the synced file rsync needs to transfer to the other end.
I didn't check, but I'd expect that even with the delta-transfer algorithm enabled, the "remote" end of rsync still writes out (most of) the file in full, resulting in approximately the same amount of writes as with cp. The transfer between the rsync processes is smaller, but the final output is still the same.
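As a rough check of that claim, the same accounting can be applied to cp. A sketch (the filenames are made up, and strace must be available -- the guard line skips the example where it isn't; a modern cp may use copy_file_range or sendfile instead of plain write, so those are traced too):

```shell
# Requires strace; skip gracefully where it isn't available
command -v strace >/dev/null 2>&1 || exit 0

# Create a small source file and trace cp's data-moving syscalls
dd if=/dev/zero of=test1 bs=1024 count=64 2>/dev/null
strace -e trace=write,copy_file_range,sendfile -o cp.log cp test1 test2-cp

# Total bytes moved: roughly the file size once (65536 here),
# not twice as with rsync's sender/receiver pair
awk 'BEGIN {FS=" = "} /^(write|copy_file_range|sendfile)\(/ {sum += $2}
     END {print sum+0}' cp.log
```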
The sshfs FUSE filesystem is implemented by presenting a filesystem on top of sftp, the SSH file transfer protocol. As a result, any file access, such as editing with vi[m], first requires the sshfs subsystem to copy the file to a cache on the local filesystem. If the file is particularly large, or the network between your client and the server is particularly slow, it will take a measurable amount of time to transfer the file before it's accessible locally.
It's (very) broadly equivalent to the following (except that it uses sftp instead of scp):
# Copy the remote file to a temporary local cache
scp -p remote:/path/to/file /tmp/file.tmp
checksum=$(cksum /tmp/file.tmp)
# Action on remote file is implemented by performing the action locally
vi /tmp/file.tmp
# Simplified; we would also need to handle local rm/mv -> remote rm/mv, etc.
[[ "$(cksum /tmp/file.tmp)" != "$checksum" ]] && scp -p /tmp/file.tmp remote:/path/to/file
As a consequence, you'll find that running gcc locally against files on the sshfs mount is measurably slower than just logging in to the remote server and running it there. To be honest, I'm not overly surprised that "gcc crashes when trying to compile files on the remote fs". It shouldn't, of course, but then think about what's actually going on in the background...
The original problem (based on reading all the comments on the OP's question) was that the scp executable on the 64-bit system was a 32-bit application. A 32-bit application that isn't compiled with "large-file support" ends up with seek pointers that are limited to 2^32 =~ 4 GB.
You can tell whether scp is 32-bit by using the file command. On most modern systems it will be 64-bit, so no file truncation would occur.
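A sketch of that check (the path and exact output vary by system; the output lines in the comments are typical examples, not taken from any particular machine):

```shell
# Ask file(1) what kind of binary scp is; fall back to sh just so the
# example runs even where scp isn't installed
file "$(command -v scp || command -v sh)"
# Typical 64-bit result (illustrative):
#   /usr/bin/scp: ELF 64-bit LSB pie executable, x86-64, ...
# A problematic 32-bit build would instead report something like:
#   /usr/bin/scp: ELF 32-bit LSB executable, Intel 80386, ...
```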
A 32-bit application can still support "large files", but it has to be compiled with large-file support, which apparently wasn't the case here.
The recommended solution is probably to use a full standard 64-bit distribution, where applications are compiled as 64-bit by default.
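For reference, the arithmetic behind that limit (a shell sketch; the 2^31 figure applies when the seek pointer is a signed type):

```shell
# A 32-bit seek pointer without large-file support caps offsets at:
echo "2^32 = $(( 65536 * 65536 )) bytes ($(( 65536 * 65536 / 1024 / 1024 / 1024 )) GiB)"
# With a signed 32-bit off_t the practical limit is half that:
echo "2^31 = $(( 32768 * 65536 )) bytes ($(( 32768 * 65536 / 1024 / 1024 / 1024 )) GiB)"
```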