I have a very large file that I want to sort alphabetically. It is a tab-separated file, but I need to be sure that it is sorted starting from the first character of each line, regardless of whether that character is a space or anything else.
Example of the input file:
2090802 V19 I must be the third in the group
20908 02 V18 I must be the first in file, as col 1 is another value
2090802 V17 I must be the second in the group
2090802 V16 I must be the first in the group of 2090802
With the command sort test.txt > test-s.txt
I get this output:
2090802 V16 I must be the first in the group of 2090802
2090802 V17 I must be the second in the group
20908 02 V18 I must be the first in file, as col 1 is another value
2090802 V19 I must be the third in the group
It seems that the sort program treats the first column as having the same value on every line (ignoring the space in row 3), and so sorts the file by the next column (V16, V17, V18 and V19).
However, I want the value 20908 02 to be considered distinct, and my expected result should be this:
20908 02 V18 I must be the first in file, as col 1 is another value
2090802 V16 I must be the first in the group of 2090802
2090802 V17 I must be the second in the group
2090802 V19 I must be the third in the group
I tried the -b argument, and also -t to give another separator, but still didn't get the desired result.
How can I sort the file so that every character in the line is considered, without ignoring whitespace?
Best Answer
The sort order depends on the locale. In most locales, spacing is ignored in first approximation (see how the space (U+0020) and TAB (U+0009) have IGNORE as the first 3 weights in ISO14651).
If you want a sort order where every character (actually byte) counts and the order is based on byte value (for UTF-8 encoded text, that coincides with sorting by Unicode codepoint value), use the C aka POSIX locale:
LC_ALL=C sort your-file
Setting LC_ALL affects all localisation categories. Sort order is affected by the LC_COLLATE category, but here, setting LC_CTYPE (which affects how characters are encoded/decoded to/from byte sequences) to C is likely a good idea as well, as it guarantees that any sequence of bytes can be decoded into characters and sorted (by byte value). Also, LC_COLLATE=C sort your-file would not work if LC_ALL was otherwise also set, since LC_ALL takes precedence over the individual categories.
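As a rough illustration of the above, the following sketch rebuilds a shortened version of the sample from the question and sorts it byte-wise. The temporary file and the variable names (tmp, by_all, by_category) are placeholders invented for this example, and the third column is abbreviated. The second variant assumes you would rather set only LC_COLLATE and LC_CTYPE, which requires clearing LC_ALL first.

```shell
# Recreate the sample: tab-separated, with a space inside column 1 of one line.
tmp=$(mktemp)
printf '2090802\tV19\tthird in the group\n'   > "$tmp"
printf '20908 02\tV18\tfirst in file\n'      >> "$tmp"
printf '2090802\tV17\tsecond in the group\n' >> "$tmp"
printf '2090802\tV16\tfirst in the group\n'  >> "$tmp"

# Variant 1: LC_ALL=C overrides every locale category for this one command.
# Bytes are compared by value, so space (0x20) sorts before '0' (0x30).
by_all=$(LC_ALL=C sort "$tmp")

# Variant 2: set only the two relevant categories. LC_ALL must be unset,
# otherwise it would take precedence. The command substitution runs in a
# subshell, so the unset does not leak into the calling shell.
by_category=$(unset LC_ALL; LC_COLLATE=C LC_CTYPE=C sort "$tmp")

rm -f "$tmp"
printf '%s\n' "$by_all"
```

Both variants should place the 20908 02 line first, followed by the 2090802 lines ordered by their second column.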