Filesystems – Append Huge Files Without Copying

files filesystems

There are 5 huge files (file1, file2, .. file5), about 10G each, and extremely little free space left on the disk, and I need to concatenate all these files into one.
There is no need to keep the original files, only the final one.

The usual way to concatenate is to run cat in sequence for files file2 .. file5:

cat file2 >> file1 ; rm file2
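
Spelled out over all five files, this approach might look like the sketch below; while each cat runs it still needs free space roughly equal to the size of the file currently being appended:

    for f in file2 file3 file4 file5; do
        cat "$f" >> file1 && rm "$f"
    done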

Unfortunately this way requires at least 10G of free space, which I don't have.
Is there a way to concatenate the files without actually copying them, by somehow telling the filesystem that file1 doesn't end at its original end but continues where file2 starts?

P.S. The filesystem is ext4, if that matters.

Best Answer

AFAIK it is (unfortunately) not possible to truncate a file from the beginning (this holds for the standard tools; for what is possible at the syscall level, see here). But by adding some complexity you can use normal truncation (together with sparse files): you can write to the end of the target file without having written all the data in between.
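
As a quick illustration of these two building blocks (sparse writes and dd-based truncation), here is a small sketch using a hypothetical scratch file named demo; the sizes are only for demonstration:

    # writing 1 MiB at an offset of 9 MiB into an empty file yields a 10 MiB
    # sparse file that only occupies about 1 MiB on disk
    dd if=/dev/urandom of=demo bs=1M count=1 seek=9
    du --apparent-size -h demo   # ~10M logical size
    du -h demo                   # ~1M actually allocated

    # with count=0, dd copies nothing but truncates (or extends) the file
    # to the seek offset
    dd if=/dev/zero of=demo bs=1M count=0 seek=5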

Let's assume first that both files are exactly 5 GiB (5120 MiB) in size and that you want to move 1 MiB at a time. You execute a loop which consists of

  1. copying one block from the end of the source file to the end of the target file (increasing the consumed disk space)
  2. truncating the source file by one block (freeing disk space)

    for ((i=5119; i>=0; i--)); do
      # copy 1 MiB from offset i of the source to offset 5120+i of the target;
      # conv=notrunc keeps dd from truncating away blocks written in earlier iterations
      dd if=sourcefile of=targetfile bs=1M skip="$i" seek="$((5120+i))" count=1 conv=notrunc
      # with count=0, dd copies nothing and just truncates the source file to i MiB
      dd if=/dev/zero of=sourcefile bs=1M count=0 seek="$i"
    done
    

But give it a try with smaller test files first, please...
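
For example, a rehearsal on small scratch files (hypothetical names, 16 MiB each) could look like this:

    # create two small test files and record the expected result
    dd if=/dev/urandom of=src.test bs=1M count=16
    dd if=/dev/urandom of=tgt.test bs=1M count=16
    cat tgt.test src.test > expected.test

    # move src.test onto the end of tgt.test, 1 MiB at a time
    for ((i=15; i>=0; i--)); do
        dd if=src.test of=tgt.test bs=1M skip="$i" seek="$((16+i))" count=1 conv=notrunc
        dd if=/dev/zero of=src.test bs=1M count=0 seek="$i"
    done

    cmp expected.test tgt.test && echo OK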

Probably the files are neither all the same size nor exact multiples of the block size. In that case the calculation of the offsets becomes more complicated, and the GNU dd flags oflag=seek_bytes and iflag=skip_bytes (which make seek= and skip= count bytes instead of blocks) should be used.
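
As a rough sketch of how those GNU dd flags fit together, the fragment below uses stat(1) to get exact byte sizes and copies only the trailing partial block of the source to its final position; the full-block loop above would still have to be adapted around it:

    tgt_size=$(stat -c %s targetfile)      # bytes already in the target
    src_size=$(stat -c %s sourcefile)      # bytes still to be moved
    rem=$(( src_size % (1024 * 1024) ))    # size of the partial tail block

    if (( rem > 0 )); then
        # copy the partial tail block, with skip= and seek= counted in bytes
        dd if=sourcefile of=targetfile bs="$rem" count=1 conv=notrunc \
           iflag=skip_bytes skip=$(( src_size - rem )) \
           oflag=seek_bytes seek=$(( tgt_size + src_size - rem ))
        # shrink the source to the last full 1 MiB boundary (bs=1 makes seek count bytes)
        dd if=/dev/zero of=sourcefile bs=1 count=0 seek=$(( src_size - rem ))
    fi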

If this is the way you want to go but you need help with the details, then ask again.

Warning

Depending on the dd block size, the resulting file will be a fragmentation nightmare.
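
If you want to see how bad it got, filefrag (from e2fsprogs) lists the extents of the result on ext4, and e4defrag can try to tidy the file up afterwards, assuming enough contiguous free space is available again by then:

    filefrag -v file1    # show the extent map of the combined file
    e4defrag file1       # optionally try to defragment it after the fact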
