I wrote the following script to test the speed of Python's sort functionality:
from sys import stdin, stdout
lines = list(stdin)
lines.sort()
stdout.writelines(lines)
I then compared this to the coreutils sort command on a file containing 10 million lines:
$ time python sort.py <numbers.txt >s1.txt
real 0m16.707s
user 0m16.288s
sys 0m0.420s
$ time sort <numbers.txt >s2.txt
real 0m45.141s
user 2m28.304s
sys 0m0.380s
The built-in command used all four CPUs (Python only used one) but took about 3 times as long to run! What gives?
I am using Ubuntu 12.04.5 (32-bit), Python 2.7.3, and sort 8.13.
Best Answer
Izkata's comment revealed the answer: locale-specific comparisons. The sort command uses the locale indicated by the environment, whereas Python defaults to a byte order comparison. Comparing strings under a UTF-8 locale's collation rules is more expensive than comparing raw bytes. How about that.
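To make the difference concrete, here is a minimal sketch of the two comparison modes in Python. It assumes Python 3 (the question used 2.7.3), where plain string comparison goes by code point rather than by byte; for this purpose that is the analogue of the byte order comparison described above. The helper names `byte_order_sort` and `locale_sort` are made up for illustration, and the locale-aware result depends on whatever locale your environment selects, so only the plain result is shown.

```python
import locale


def byte_order_sort(lines):
    # Plain comparison: code point order, no locale involved.
    return sorted(lines)


def locale_sort(lines):
    # Locale-aware comparison: strxfrm transforms each string into a key
    # that sorts according to the current LC_COLLATE setting.
    return sorted(lines, key=locale.strxfrm)


words = ["b", "A", "a", "B"]

try:
    # Adopt the locale named by the environment, as the sort command does.
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    # The environment names a locale that is not installed; stay in "C".
    pass

plain = byte_order_sort(words)
collated = locale_sort(words)

print(plain)     # ['A', 'B', 'a', 'b']
print(collated)  # order depends on the active locale
```

On the command line, the usual way to get the same byte order behaviour from sort is to run it with LC_ALL=C set in its environment.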