Sort a file by lines, regardless of their content

sort

I have a very large file which I want to sort alphabetically. It is a tab-separated file, but I really need to be sure that the file is sorted starting from the first character in each line, regardless of whether it is a space or anything else.

Example of the input file:

2090802 V19 I must be the third in the group 
20908 02    V18 I must be the first in file, as col 1 is another value
2090802 V17 I must be the second in the group 
2090802 V16 I must be the first in the group of 2090802

With the command sort test.txt > test-s.txt I get this output:

2090802 V16 I must be the first in the group of 2090802
2090802 V17 I must be the second in the group 
20908 02    V18 I must be the first in file, as col 1 is another value
2090802 V19 I must be the third in the group 

It seems that the sort program sees the first column as having the same value in every row (ignoring the space in row 3), and so sorts the file using the next column (V16, V17, V18 and V19).

However, I want the value 20908 02 to be considered distinct, and my expected result should be this:

20908 02    V18 I must be the first in file, as col 1 is another value
2090802 V16 I must be the first in the group of 2090802
2090802 V17 I must be the second in the group 
2090802 V19 I must be the third in the group 

I tried with the -b argument, and also -t to give another separator, but still didn't get the desired result.

How can I sort the file by considering every character in the line, without ignoring whitespace?

Best Answer

The sort order depends on the locale. In most locales, spacing characters are ignored at the first level of comparison (see how the space (U+0020) and TAB (U+0009) have IGNORE as their first 3 weights in ISO 14651).

If you want a sort order where every character (actually byte) counts and the order is based on byte value (for UTF-8 encoded text, that coincides with sort based on Unicode codepoint value), use the C aka POSIX locale:

LC_ALL=C sort your-file
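
As a rough check, the sample from the question can be recreated and sorted in the C locale. This is only a sketch: the scratch file and the shortened third column are made up for illustration, and it assumes the columns are separated by real TAB characters, as the question describes.

```shell
# Hypothetical scratch file recreating the question's sample rows
# (TAB between columns; the third column is shortened for illustration).
tmp=$(mktemp)
printf '2090802\tV19\tthird in the group\n'  >  "$tmp"
printf '20908 02\tV18\tfirst in the file\n'  >> "$tmp"
printf '2090802\tV17\tsecond in the group\n' >> "$tmp"
printf '2090802\tV16\tfirst in the group\n'  >> "$tmp"

# Byte-wise comparison: the space (0x20) sorts before the digit 0 (0x30),
# so the "20908 02" row is no longer merged into the "2090802" group.
sorted=$(LC_ALL=C sort "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

With byte ordering, the 20908 02 row should come out first, followed by the 2090802 rows in V16, V17, V19 order.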

Setting LC_ALL affects all localisation categories. Sort order is determined by the LC_COLLATE category, but here, setting LC_CTYPE (which affects how characters are encoded/decoded to/from byte sequences) to C is likely a good idea as well, as it guarantees that any sequence of bytes can be decoded into characters and sorted (by byte value). Also, LC_COLLATE=C sort your-file would not work if LC_ALL were already set to something else, because LC_ALL takes precedence over the individual LC_* variables.
