How to Concatenate and Re-Sort Thousands of Files Quickly

Tags: memory, out-of-memory, sort

I have ~100000 files, each with unique rows, such as:

File1.txt

chr1_1_200  
chr1_600_800  
...

File2.txt

chr1_600_800  
chr1_1000_1200  
...

File3.txt

chr1_200_400    
chr1_600_800  
chr1_1000_1200  
...  

Every file has around 30 million rows, and when I run the command:

cat *txt | sort -u > Unique_Position.txt

the system runs out of memory. How can I handle this with standard command-line tools on Linux?

Best Answer

If the files are already sorted in an acceptable way, you could merge-sort them and then uniq them:

sort -t_ -k2,2n -k3,3n -m -- *.txt | uniq > Unique_Position.txt

... which merge-sorts (-m) the already-sorted inputs, comparing numerically on the second field (as delimited by underscores _) and, where those keys are equal, on the third field. The resulting output is then piped through uniq to drop duplicate lines before being redirected into the output file.

Given the (short) sample input above, the results are:

chr1_1_200
chr1_200_400
chr1_600_800
chr1_1000_1200
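
If the inputs are not already sorted on these keys, each file could be sorted individually first, so that sort handles one ~30-million-line file at a time and spills to disk rather than exhausting memory. A minimal sketch, assuming GNU sort and enough free disk space for its temporary files (note that -o overwrites each original file in place):

for f in ./*.txt; do
    # Sort on the same keys the merge step expects; -T redirects
    # sort's temporary files to the current directory if /tmp is small.
    sort -t_ -k2,2n -k3,3n -T . -o "$f" -- "$f"
done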

If you're able to fully specify the sort fields for the lines that you want to keep, you could do it all within sort by adding the -u option:

sort -t_ -k1,1 -k2,2n -k3,3n -m -u -- *.txt > Unique_Position.txt

This would preserve lines that are unique across the three listed fields without needing to pipe through uniq (notice the addition of the -u option). These sort fields need to match the way that the input files are sorted.
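
Separately, with ~100000 files the *.txt glob may exceed the kernel's argument-length limit ("Argument list too long"). A sketch assuming GNU find and GNU sort, whose --files0-from option reads a NUL-separated list of file names from standard input instead of taking them as arguments:

find . -maxdepth 1 -name '*.txt' -print0 |
    sort -t_ -k1,1 -k2,2n -k3,3n -m -u --files0-from=- > Unique_Position.txt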
