Python – Why is coreutils sort slower than Python

benchmark coreutils performance python sort

I wrote the following script to test the speed of Python's sort functionality:

from sys import stdin, stdout
lines = list(stdin)
lines.sort()
stdout.writelines(lines)

I then compared this to the coreutils sort command on a file containing 10 million lines:

$ time python sort.py <numbers.txt >s1.txt
real    0m16.707s
user    0m16.288s
sys     0m0.420s

$ time sort <numbers.txt >s2.txt 
real    0m45.141s
user    2m28.304s
sys     0m0.380s

The sort command used all four CPUs (Python used only one), yet it took almost 3 times as long to run! What gives?

I am using Ubuntu 12.04.5 (32-bit), Python 2.7.3, and sort (GNU coreutils) 8.13.

Best Answer

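Izkata's comment revealed the answer: locale-specific comparisons. The sort command collates according to the locale set in the environment (LANG, LC_COLLATE, LC_ALL), whereas Python defaults to a byte-order comparison. Collating strings under a UTF-8 locale's rules is considerably more expensive than comparing raw bytes. Setting LC_ALL=C makes sort compare bytes as well: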

$ time (LC_ALL=C sort <numbers.txt >s2.txt)
real    0m5.485s
user    0m14.028s
sys     0m0.404s

How about that.
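
To see the difference from the Python side, here is a minimal sketch (written for Python 3, not the 2.7.3 used in the question) that sorts the same lines two ways: in plain code-point order, as list.sort() does by default, and through the environment's collation rules, which is roughly what sort does when LC_ALL=C is not set. The helper names byte_order_sort and locale_sort are made up for this illustration, and the fallback to the C locale is an assumption to keep the sketch runnable on systems where the requested locale is not installed.

```python
import locale


def byte_order_sort(lines):
    """Sort by code point, which is what list.sort() does by default."""
    return sorted(lines)


def locale_sort(lines, locale_name=""):
    """Sort using the collation rules of a locale.

    locale_name="" means "take the locale from the environment",
    similar to what sort(1) does. If that locale is not available,
    fall back to the C locale (an assumption made for this sketch).
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        locale.setlocale(locale.LC_COLLATE, "C")
    # strxfrm transforms each string so that ordinary comparison of the
    # transformed keys follows the locale's collation order.
    return sorted(lines, key=locale.strxfrm)


words = ["b\n", "A\n", "a\n", "B\n"]
print(byte_order_sort(words))  # uppercase before lowercase: A, B, a, b
print(locale_sort(words))      # result depends on the active locale
```

Under a locale such as en_US.UTF-8 the second call typically interleaves upper and lower case, while the first always puts all uppercase letters first. Timing both on a large input should show the locale-aware version doing noticeably more work per comparison key.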
