Linux – Parallel processing slower than sequential

imagemagick, linux, parallel processing, performance, terminal

EDIT: For anyone who stumbles upon this in the future:
ImageMagick uses a multithreading (OpenMP) library. Spreading a single job across all available cores is faster when those cores would otherwise be idle, but if you are already running jobs in parallel, the extra threads only get in the way.
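(Side note: you can see the thread limit ImageMagick will use from its resource listing; the exact output varies by version, but something like this works:)

convert -list resource

Once you cap the limit as described below, the Thread entry in that listing should read 1.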

Do one of the following:

  • do your jobs serially (with ImageMagick in parallel mode)
  • set MAGICK_THREAD_LIMIT=1 for your invocation of the ImageMagick binary in question.

Making ImageMagick use only one thread slows it down by 20-30% in my test cases, but it meant I could run one job per core without issues, for a significant net increase in performance.
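For reference, a minimal sketch of the single-thread-per-job invocation, reusing the same xargs pipeline as in the tests below (-P 4 assumes a quad core; adjust to your core count):

export MAGICK_THREAD_LIMIT=1
for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 4 convert -auto-level

Since xargs inherits the exported variable, every convert it spawns stays single-threaded.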

Original question:

While converting some images using ImageMagick, I noticed a somewhat strange effect. Using xargs was significantly slower than a standard for loop. Since xargs limited to a single process should act like a for loop, I tested that, and found it to be about the same.
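(The standard for loop version looked roughly like this; I didn't paste it above, so treat the exact flags as illustrative:)

for f in *.bmp; do convert -auto-level "$f" "${f%bmp}png"; done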

Thus, we have this demonstration.

  • Quad core (AMD Athlon X4, 2.6 GHz)
  • Working entirely on a tmpfs (16 GB RAM total; no swap)
  • No other major loads

Results:

/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level

real        0m3.784s
user        0m2.240s
sys         0m0.230s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 2 convert -auto-level

real        0m9.097s
user        0m28.020s
sys         0m0.910s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 10 convert -auto-level

real        0m9.844s
user        0m33.200s
sys         0m1.270s

Can anyone think of a reason why running two instances of this program takes more than twice as long in real time, and more than ten times as long in processor time, to complete the same task? After that initial hit, additional processes do not seem to have as significant an effect.

I thought it might have to do with disk seeking, so I ran the test entirely in RAM. Could it have something to do with how convert works, such that running more than one copy at once means it cannot use the processor cache as efficiently?

EDIT: When run on 1000 files of 769 KB each, performance scales as expected. Interesting.

/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level

real    3m37.679s
user    5m6.980s
sys 0m6.340s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level

real    3m37.152s
user    5m6.140s
sys 0m6.530s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 2 convert -auto-level

real    2m7.578s
user    5m35.410s
sys     0m6.050s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 4 convert -auto-level

real    1m36.959s
user    5m48.900s
sys     0m6.350s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 10 convert -auto-level

real    1m36.392s
user    5m54.840s
sys     0m5.650s

Best Answer

How big are the files you wish to convert when compared to your L1 cache? Your L2 cache?

Without a better look inside, I would suspect cache contention is causing your CPUs to idle while they wait for data to be re-cached because the other process(es) keep kicking important things out of the fast memory.
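If you want to check the numbers yourself, Linux will tell you the cache sizes; either of these should work, though the formatting differs between tools and distros:

lscpu | grep -i cache
getconf -a | grep -i CACHE

Compare those against the size of one decoded image (roughly width × height × bytes per pixel, which for an uncompressed BMP is about the file size on disk).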

See also this answer I gave on Stack Overflow.
