(cp is to cat AS mv is to ?) mv multiple files into one file instead of cat * rm *


For everything in this question pretend the system only has one disk and filesystem. (we are not writing to different partitions, disks or filesystems)

I am working on a project that cats very large .MTS files into one huge .MTS file. This requires reading each small file and writing it to a new, bigger file, then deleting the small files. With files this big, that takes a very long time.
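The workflow above can be sketched as a short shell snippet (run here in a throwaway directory; the clip names are placeholders):

```shell
# Concatenate every small clip into one big file, then delete the originals.
# Every byte is read and rewritten, which is why this is slow for huge files.
tmp=$(mktemp -d) && mkdir "$tmp/foo" "$tmp/bar"
printf 'clip1' > "$tmp/foo/00001.MTS"
printf 'clip2' > "$tmp/foo/00002.MTS"
cat "$tmp"/foo/*.MTS > "$tmp/bar/bigfile.MTS" && rm "$tmp"/foo/*.MTS
cat "$tmp/bar/bigfile.MTS"   # clip1clip2
rm -r "$tmp"
```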

My understanding is that cp takes longer than mv because cp reads the file and writes a copy of it to a different place on the disk. mv, on the other hand, doesn't copy or move the file's data at all: within one filesystem, mv is just a rename. It removes the directory entry pointing at the file and creates a new one at the new location. For instance, mv /tmp/foo /tmp/bar leaves the file's data where it is on disk, removes the entry that directs /tmp/foo to that data, and adds a new entry that points /tmp/bar at it.
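A quick way to see this is to check that the file's inode number survives a rename (`ls -i` prints the inode; the filenames here are just for the demo):

```shell
# A rename within one filesystem keeps the same inode: no data is copied.
tmp=$(mktemp -d)
printf 'hello' > "$tmp/foo"
before=$(ls -i "$tmp/foo" | awk '{print $1}')
mv "$tmp/foo" "$tmp/bar"            # effectively instant, regardless of size
after=$(ls -i "$tmp/bar" | awk '{print $1}')
[ "$before" = "$after" ] && echo "same inode: $before"
rm -r "$tmp"
```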

The Question:

cat is like cp because it copies the file's data to a new location. With files this large, and no need for the smaller files once I am done, is there something similar to cat that works the way mv does instead of the way cp does?

Theory (I may have it wrong)

It is already common for files to be stored scattered about the drive. For instance, a 2GB file might have several smaller chunks stored in different parts of the drive, so that when a 5K file is deleted, its space can be overwritten with part of a 20MB file. If we left the files' data where it is and just made one big file reference all the parts, it seems like we could get the same effect as cat foo/* >> bar/bigfile.MTS; rm foo/* in a fraction of the time.

If nothing out there does this, and it is a bad idea, can anyone give me an example of why? Is it bad to encourage mucking up the disk with scattered file chunks?

Best Answer

The largest obstacle to such a tool existing is that, unless the size of each file being concatenated (except the last one) is an exact multiple of the filesystem's block size (I'm a little uncertain about the right terminology here), you'll end up with "gaps" of garbage data between the concatenated files in the final file.

This is because file data is typically stored in fixed-size blocks on the filesystem. A 618-byte file stored on a filesystem using 32-byte blocks (an unrealistically small block size, chosen for easy arithmetic; real block sizes are typically 512 bytes to 4 KiB) would take up 618 / 32 = 19.3125 blocks, i.e. 19 full blocks plus about 1/3 of an additional block.
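The arithmetic above, spelled out with the numbers from the example:

```shell
# 618-byte file on 32-byte blocks: how many blocks does it occupy?
size=618; block=32
full=$(( size / block ))                  # 19 full blocks
spill=$(( size % block ))                 # 10 bytes land in a partial 20th block
total=$(( (size + block - 1) / block ))   # 20 blocks allocated in all
echo "$full full blocks + $spill spill bytes = $total blocks"
```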

Assuming you wanted to combine two such files while ignoring that obstacle, you'd simply point the "new file" at the blocks of the first file, followed by the blocks of the second file, right?

With that naïve approach, you'd end up with a file of 40 blocks, whose block 20 is about 1/3 sensible data and 2/3 garbage, with the second file's data only starting at block 21.
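Under that naïve approach, the numbers work out as follows (two 618-byte files on 32-byte blocks, as in the example above):

```shell
# Block-level stitching of two 618-byte files on 32-byte blocks.
size=618; block=32
per_file=$(( (size + block - 1) / block ))  # 20 blocks per file
total=$(( 2 * per_file ))                   # 40 blocks in the stitched file
gap=$(( per_file * block - size ))          # 22 garbage bytes inside block 20
echo "$total blocks, with a $gap-byte gap before the second file's data"
```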

With some file formats, you might be able to do some clever calculations and manipulate file headers to tell the application reading the file to skip the garbage parts, but that's a band-aid rather than a proper solution.
