Ubuntu – Why is the sorted file bigger

text processing

I have a 2958616 byte text file. When I run sort < file.txt | uniq > sorted-file.txt, I get a 3213965 byte text file. Why is my sorted text file bigger?

You can download the text files here.

Best Answer

The lines in your original file end with \n, but the lines in your sorted file end with \r\n. The extra \r adds one byte to every line, and that is what makes the sorted file larger.
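The size effect of an added \r can be reproduced on a throwaway file. This is only a minimal sketch, assuming a POSIX shell with printf, awk, and wc available; the scratch file names are made up for illustration and have nothing to do with your files.

```shell
# Build a small LF-terminated sample in a scratch directory (hypothetical names).
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/lf.txt"

# Rewrite every line with a trailing carriage return, as a CRLF-writing tool would.
awk '{ printf "%s\r\n", $0 }' "$tmp/lf.txt" > "$tmp/crlf.txt"

# Collect byte counts and the line count.
lf_bytes=$(wc -c < "$tmp/lf.txt")
crlf_bytes=$(wc -c < "$tmp/crlf.txt")
lines=$(wc -l < "$tmp/lf.txt")

# The CRLF copy should be larger by exactly one byte per line.
echo "LF: $lf_bytes bytes, CRLF: $crlf_bytes bytes, lines: $lines"
echo "difference: $(( crlf_bytes - lf_bytes ))"
```

The reported difference should equal the number of lines, since each line gained exactly one \r.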

To illustrate, here's what happens when I run your command on my Linux system:

$ sort < file.txt | uniq > sorted-file.linux.txt
$ ls -l file.txt sorted-file.linux.txt 
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
$ wc -l file.txt sorted-file.linux.txt 
273882 file.txt
271576 sorted-file.linux.txt

As you can see, the sorted, de-duplicated file is 2306 lines shorter and, consequently, 16227 bytes smaller. Your file, however, is different:

$ wc -l sorted-file.linux.txt sorted-file.txt 
271576 sorted-file.linux.txt
271576 sorted-file.txt

The two files have exactly the same number of lines, but:

$ ls -l file.txt sorted-file.linux.txt sorted-file.txt 
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
-rw-r--r-- 1 terdon terdon 3213965 Jul 10 12:11 sorted-file.txt

The sorted-file.txt, the one I downloaded from your link, is larger. If we now examine the first line, we can see the extra \r:

$ head -n1 sorted-file.txt | od -c
0000000   a  \r  \n
0000003

That \r isn't present in the file I created on Linux:

$ head -n1 sorted-file.linux.txt | od -c
0000000   a  \n
0000002
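Dumping one line shows the symptom; a count and a little arithmetic suggest it applies to every line. The following is a sketch, assuming a POSIX shell and grep; count_crlf_lines is a hypothetical helper name, and the three numbers are copied from the listings above.

```shell
# count_crlf_lines is a hypothetical helper, not a standard tool:
# it prints how many lines of the given file end with a carriage return.
count_crlf_lines() {
    cr=$(printf '\r')          # a literal CR character
    grep -c "${cr}\$" "$1"     # count lines whose last character is CR
}

# Sizes and line count taken from the ls and wc output above.
crlf_size=3213965    # sorted-file.txt
lf_size=2942389      # sorted-file.linux.txt
line_count=271576    # lines in either sorted file

# If every line carries exactly one extra byte, this prints 0.
echo $(( crlf_size - lf_size - line_count ))
```

The size difference between the two sorted files is exactly 271576 bytes, one per line, which is consistent with every line ending in \r\n. Running count_crlf_lines sorted-file.txt would be one way to check that directly.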

If we now remove the \r from your file:

$ tr -d '\r' < sorted-file.txt > new-sorted-file.txt

We get the expected result, a file that is smaller than the original, just like the one I created on my system:

$ ls -l sorted-file.linux.txt new-sorted-file.txt file.txt
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:19 new-sorted-file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
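If the tools on your system keep emitting \r\n, the tr step can be folded into the original pipeline. This is a sketch rather than a definitive fix, assuming a POSIX shell; sort_unique_lf is a hypothetical helper name, and note that tr -d '\r' removes every carriage return, including any that appear in the middle of a line.

```shell
# sort_unique_lf is a hypothetical helper: sort and de-duplicate a file,
# then drop any carriage returns so the output uses plain \n line endings.
sort_unique_lf() {
    sort < "$1" | uniq | tr -d '\r'
}

# Small self-contained demonstration on a scratch file.
sample=$(mktemp)
printf 'b\na\nb\n' > "$sample"

# Dump the result byte by byte to confirm that no \r is present.
sort_unique_lf "$sample" | od -c
```

With your data this would be called as sort_unique_lf file.txt > sorted-file.txt. On a system that already writes \n, the tr step simply passes the data through unchanged.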