FreeBSD – How to merge pre-sorted files into a single big file, without excessive memory or temporary disk use

Tags: freebsd, merge, sort

I'm trying to sort (and ideally remove duplicate lines) from a 1.4TB file.

Splitting and sorting the individual chunks is not an issue, but reassembling them is turning out to be a challenge. From the man page, I expected that `sort -m` (under FreeBSD 11) would do a simple merge, producing a perfectly sorted aggregate output, optionally suppressing duplicate lines with the -u option.
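For reference, this is the kind of invocation described above, demonstrated on tiny stand-in chunks (the real chunk file names are hypothetical):

```shell
# Create two small, already-sorted stand-in chunks.
printf 'apple\ncherry\n' > chunk.00.sorted
printf 'banana\ncherry\n' > chunk.01.sorted

# -m merges already-sorted inputs; -u suppresses duplicate lines.
sort -m -u chunk.00.sorted chunk.01.sorted > merged.txt
cat merged.txt
```

With sorted inputs, the merge is a single sequential pass, so in principle it should need neither large amounts of memory nor temporary files.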

But after leaving it to run for a while, I discovered that sort had (so far) generated several hundred gigabytes of temporary files, just as if it were sorting the input normally.

I don't have enough disk space to store the same data three times over. Are there any utilities that can do a simple merge of already-sorted files, without requiring temporary disk space?

=== Outcome ===

I ended up using a "standard" sort. It took around 50 hours of high CPU and disk load to process, including the generation of several hundred temporary files, even though the input was already perfectly sorted. I'm still interested in learning whether there is a simple utility that neatly merges pre-sorted files.

Best Answer

Your constraints (no spare RAM, storage, or cloud capacity) are going to make this really slow, but it is possible, e.g. by writing your own file system driver. However, if you have the time and skill to do that, it would be faster and cheaper to rent (or buy and later resell/return) a $37 2 TB drive and use

https://en.m.wikipedia.org/wiki/External_sorting
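The classic external-sorting pipeline from the linked article can be sketched as follows (file names and the split size are illustrative; the asker's real workflow used far larger pieces):

```shell
# Generate a small unsorted stand-in for the big input file.
printf '3\n1\n2\n5\n4\n6\n' > huge.txt

# 1. Split the input into pieces that fit on the scratch drive.
split -l 2 huge.txt piece.

# 2. Sort each piece individually, in place.
for f in piece.*; do sort "$f" -o "$f"; done

# 3. Merge the sorted pieces in one streaming pass, dropping duplicates.
sort -m -u piece.* > huge.sorted.txt
```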

A workaround might be zram and/or 7z/file-system compression: if the file is compressible, you could make room for a second copy.

https://en.m.wikipedia.org/wiki/Zram

https://en.m.wikipedia.org/wiki/Category:Compression_file_systems
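One hedged sketch of that workaround, using plain gzip rather than zram or a compressing file system: compress the sorted chunks to free space, then stream-decompress them straight into the merge through named pipes, so no uncompressed temporary copy is ever written (chunk names are hypothetical):

```shell
# Small sorted stand-in chunks.
printf 'ant\ncat\n' > chunk.0.sorted
printf 'bee\ncat\n' > chunk.1.sorted

# Compress in place; gzip removes the originals, freeing space.
gzip -f chunk.0.sorted chunk.1.sorted

# Stream-decompress each chunk into a named pipe and merge from the pipes.
mkfifo pipe0 pipe1
gzip -dc chunk.0.sorted.gz > pipe0 &
gzip -dc chunk.1.sorted.gz > pipe1 &
sort -m -u pipe0 pipe1 > merged.txt
wait
rm -f pipe0 pipe1
```

This only helps if the data compresses well enough to make room for the merged output alongside the compressed chunks.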

If there is space for the output without removing the input, and the input is pre-sorted, then it's trivial.
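A sketch of that trivial case, again on tiny stand-in chunks: verify that each chunk really is sorted, merge them in one streaming pass into the free output space, then check the result.

```shell
# Small sorted stand-in chunks (real names are hypothetical).
printf 'a\nc\n' > chunk.0.sorted
printf 'b\nc\n' > chunk.1.sorted

# Verify every input is sorted before merging; -c only checks, it never sorts.
for f in chunk.*.sorted; do sort -c "$f" || exit 1; done

# Single streaming merge pass with duplicate suppression.
sort -m -u chunk.*.sorted > merged.txt

# Sanity check: result is sorted with no duplicate lines.
sort -c -u merged.txt && echo "merge OK"
```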
