How to Concatenate and Re-Sort Thousands of Files Quickly

Tags: memory, out-of-memory, sort

I have ~100000 files, each with unique rows, such as:

File1.txt

chr1_1_200  
chr1_600_800  
...

File2.txt

chr1_600_800  
chr1_1000_1200  
...

File3.txt

chr1_200_400    
chr1_600_800  
chr1_1000_1200  
...  

Every file has around 30 million rows, and when I run the command:

cat *txt | sort -u > Unique_Position.txt

the system runs out of memory. How can I handle this with standard command-line tools on Linux?

Best Answer

If the files are already sorted in an acceptable way, you could merge-sort them and then uniq them:

sort -t_ -k2,2n -k3,3n -m -- *.txt | uniq > Unique_Position.txt

... which merge-sorts (-m) the already-sorted inputs, comparing numerically on the second field (as delimited by underscores _) and, where those keys are equal, on the third field. The resulting output is then piped through uniq to drop duplicate lines before being redirected into the output file.

Given the (short) sample input above, the results are:

chr1_1_200
chr1_200_400
chr1_600_800
chr1_1000_1200
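
If the inputs are not already sorted on these keys, each file could be sorted individually first, so that sort handles one ~30-million-line file at a time and spills to disk rather than exhausting memory. A minimal sketch, assuming GNU sort and enough free disk space for its temporary files (note that -o overwrites each original file in place):

for f in ./*.txt; do
    # Sort on the same keys the merge step expects; -T redirects
    # sort's temporary files to the current directory if /tmp is small.
    sort -t_ -k2,2n -k3,3n -T . -o "$f" -- "$f"
done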

If you're able to fully specify the sort fields for the lines that you want to keep, you could do it all within sort by adding the -u option:

sort -t_ -k1,1 -k2,2n -k3,3n -m -u -- *.txt > Unique_Position.txt

This would preserve lines that are unique across the three listed fields without needing to pipe through uniq (notice the addition of the -u option). These sort fields need to match the way that the input files are sorted.
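
Separately, with ~100000 files the *.txt glob may exceed the kernel's argument-length limit ("Argument list too long"). A sketch assuming GNU find and GNU sort, whose --files0-from option reads a NUL-separated list of file names from standard input instead of taking them as arguments:

find . -maxdepth 1 -name '*.txt' -print0 |
    sort -t_ -k1,1 -k2,2n -k3,3n -m -u --files0-from=- > Unique_Position.txt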
