“Useless” use of ‘cat’ increases performance. Why

performance

The files 1..64 are each 160 MBytes and stored in a RAM disk.

Generated by:

seq 120 | parallel -k 'seq {}0000000 {}9999999 | fmt -30' | head -c 10G > 10G
parallel --pipepart --block -1 -a 10G 'cat > {#}'

nocat:

#!/bin/bash

export LC_ALL=C
sort -m \
     <(sort -m  \
            <(sort -m  \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 1; sort ) < 1) \
                                        <((rm 2; sort ) < 2) ) \
                                 <(sort -m  \
                                        <((rm 3; sort ) < 3) \
                                        <((rm 4; sort ) < 4) ) ) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 5; sort ) < 5) \
                                        <((rm 6; sort ) < 6) ) \
                                 <(sort -m  \
                                        <((rm 7; sort ) < 7) \
                                        <((rm 8; sort ) < 8) ) ) ) \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 9; sort ) < 9) \
                                        <((rm 10; sort ) < 10) ) \
                                 <(sort -m  \
                                        <((rm 11; sort ) < 11) \
                                        <((rm 12; sort ) < 12) ) ) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 13; sort ) < 13) \
                                        <((rm 14; sort ) < 14) ) \
                                 <(sort -m  \
                                        <((rm 15; sort ) < 15) \
                                        <((rm 16; sort ) < 16) ) ) ) ) \
            <(sort -m  \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 17; sort ) < 17) \
                                        <((rm 18; sort ) < 18) ) \
                                 <(sort -m  \
                                        <((rm 19; sort ) < 19) \
                                        <((rm 20; sort ) < 20) ) ) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 21; sort ) < 21) \
                                        <((rm 22; sort ) < 22) ) \
                                 <(sort -m  \
                                        <((rm 23; sort ) < 23) \
                                        <((rm 24; sort ) < 24) ) ) ) \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 25; sort ) < 25) \
                                        <((rm 26; sort ) < 26) ) \
                                 <(sort -m  \
                                        <((rm 27; sort ) < 27) \
                                        <((rm 28; sort ) < 28) ) ) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 29; sort ) < 29) \
                                        <((rm 30; sort ) < 30) ) \
                                 <(sort -m  \
                                        <((rm 31; sort ) < 31) \
                                        <((rm 32; sort ) < 32) ) ) ) ) ) \
     <(sort -m  \
            <(sort -m  \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 33; sort ) < 33) \
                                        <((rm 34; sort ) < 34) ) \
                                 <(sort -m  \
                                        <((rm 35; sort ) < 35) \
                                        <((rm 36; sort ) < 36) ) ) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 37; sort ) < 37) \
                                        <((rm 38; sort ) < 38) ) \
                                 <(sort -m  \
                                        <((rm 39; sort ) < 39) \
                                        <((rm 40; sort ) < 40) ) ) ) \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 41; sort ) < 41) \
                                        <((rm 42; sort ) < 42) ) \
                                 <(sort -m  \
                                        <((rm 43; sort ) < 43) \
                                        <((rm 44; sort ) < 44) ) ) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 45; sort ) < 45) \
                                        <((rm 46; sort ) < 46) ) \
                                 <(sort -m  \
                                        <((rm 47; sort ) < 47) \
                                        <((rm 48; sort ) < 48) ) ) ) ) \
            <(sort -m  \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 49; sort ) < 49) \
                                        <((rm 50; sort ) < 50) ) \
                                 <(sort -m  \
                                        <((rm 51; sort ) < 51) \
                                        <((rm 52; sort ) < 52) ) ) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 53; sort ) < 53) \
                                        <((rm 54; sort ) < 54) ) \
                                 <(sort -m  \
                                        <((rm 55; sort ) < 55) \
                                        <((rm 56; sort ) < 56) ) ) ) \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 57; sort ) < 57) \
                                        <((rm 58; sort ) < 58) ) \
                                 <(sort -m  \
                                        <((rm 59; sort ) < 59) \
                                        <((rm 60; sort ) < 60) ) ) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 61; sort ) < 61) \
                                        <((rm 62; sort ) < 62) ) \
                                 <(sort -m  \
                                        <((rm 63; sort ) < 63) \
                                        <((rm 64; sort ) < 64) ) ) ) ) ) |
    md5sum

withcat:

#!/bin/bash

export LC_ALL=C
sort -m \
     <(sort -m  \
            <(sort -m  \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 1; sort ) < 1) \
                                        <((rm 2; sort ) < 2) | cat) \
                                 <(sort -m  \
                                        <((rm 3; sort ) < 3) \
                                        <((rm 4; sort ) < 4) | cat) | cat) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 5; sort ) < 5) \
                                        <((rm 6; sort ) < 6) | cat) \
                                 <(sort -m  \
                                        <((rm 7; sort ) < 7) \
                                        <((rm 8; sort ) < 8) | cat) | cat) | cat) \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 9; sort ) < 9) \
                                        <((rm 10; sort ) < 10) | cat) \
                                 <(sort -m  \
                                        <((rm 11; sort ) < 11) \
                                        <((rm 12; sort ) < 12) | cat) | cat) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 13; sort ) < 13) \
                                        <((rm 14; sort ) < 14) | cat) \
                                 <(sort -m  \
                                        <((rm 15; sort ) < 15) \
                                        <((rm 16; sort ) < 16) | cat) | cat) | cat) | cat) \
            <(sort -m  \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 17; sort ) < 17) \
                                        <((rm 18; sort ) < 18) | cat) \
                                 <(sort -m  \
                                        <((rm 19; sort ) < 19) \
                                        <((rm 20; sort ) < 20) | cat) | cat) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 21; sort ) < 21) \
                                        <((rm 22; sort ) < 22) | cat) \
                                 <(sort -m  \
                                        <((rm 23; sort ) < 23) \
                                        <((rm 24; sort ) < 24) | cat) | cat) | cat) \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 25; sort ) < 25) \
                                        <((rm 26; sort ) < 26) | cat) \
                                 <(sort -m  \
                                        <((rm 27; sort ) < 27) \
                                        <((rm 28; sort ) < 28) | cat) | cat) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 29; sort ) < 29) \
                                        <((rm 30; sort ) < 30) | cat) \
                                 <(sort -m  \
                                        <((rm 31; sort ) < 31) \
                                        <((rm 32; sort ) < 32) | cat) | cat) | cat) | cat) | cat) \
     <(sort -m  \
            <(sort -m  \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 33; sort ) < 33) \
                                        <((rm 34; sort ) < 34) | cat) \
                                 <(sort -m  \
                                        <((rm 35; sort ) < 35) \
                                        <((rm 36; sort ) < 36) | cat) | cat) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 37; sort ) < 37) \
                                        <((rm 38; sort ) < 38) | cat) \
                                 <(sort -m  \
                                        <((rm 39; sort ) < 39) \
                                        <((rm 40; sort ) < 40) | cat) | cat) | cat) \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 41; sort ) < 41) \
                                        <((rm 42; sort ) < 42) | cat) \
                                 <(sort -m  \
                                        <((rm 43; sort ) < 43) \
                                        <((rm 44; sort ) < 44) | cat) | cat) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 45; sort ) < 45) \
                                        <((rm 46; sort ) < 46) | cat) \
                                 <(sort -m  \
                                        <((rm 47; sort ) < 47) \
                                        <((rm 48; sort ) < 48) | cat) | cat) | cat) | cat) \
            <(sort -m  \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 49; sort ) < 49) \
                                        <((rm 50; sort ) < 50) | cat) \
                                 <(sort -m  \
                                        <((rm 51; sort ) < 51) \
                                        <((rm 52; sort ) < 52) | cat) | cat) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 53; sort ) < 53) \
                                        <((rm 54; sort ) < 54) | cat) \
                                 <(sort -m  \
                                        <((rm 55; sort ) < 55) \
                                        <((rm 56; sort ) < 56) | cat) | cat) | cat) \
                   <(sort -m  \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 57; sort ) < 57) \
                                        <((rm 58; sort ) < 58) | cat) \
                                 <(sort -m  \
                                        <((rm 59; sort ) < 59) \
                                        <((rm 60; sort ) < 60) | cat) | cat) \
                          <(sort -m  \
                                 <(sort -m  \
                                        <((rm 61; sort ) < 61) \
                                        <((rm 62; sort ) < 62) | cat) \
                                 <(sort -m  \
                                        <((rm 63; sort ) < 63) \
                                        <((rm 64; sort ) < 64) | cat) | cat) | cat) | cat) | cat) | cat |
    md5sum

The only difference is that in withcat every sort -m is piped to cat.

The "useless use of cat"-people will make you believe withcat would be slower than nocat. However, the opposite is true by a wide margin:

$ time bash nocat
c933d81faea7b8dec8eb64ca0b044d74  -

real    3m40.854s
user    2m48.687s
sys     0m49.135s
$ time bash withcat
c933d81faea7b8dec8eb64ca0b044d74  -

real    2m21.812s
user    2m16.651s
sys     1m36.135s

The test is run on a 64-core machine, that does nothing else. Everything is in RAM (so this is not due to slow disks). Every test was run 3 times, and the best time is given above. All three tests completed within 5 seconds of the best time (so it is not a fluke).

Why is it faster to pipe the output to cat?

Edit

Does cat group input in bigger chunks? And/or does sort flush output for every line?

To test this I tried:

$ strace -ff sort -m <(sort 1) <(sort 2) 2>fromsort | cat >/dev/null
$ strace -ff sort -m <(sort 1 | cat ) <(sort 2 | cat) 2>fromcat | cat >/dev/null

If cat made it into bigger chunks, we would expect read to return larger chunks. But it does not:

$ grep -E 'read|write' fromsort |field 1,5|sort | uniq -c 
      1 openat(AT_FDCWD,        3
      8 pread64(3,      =
      1 read(3, 3771
  40989 read(3, 4096
      2 read(3, 832
      1 read(3, unknown
      1 read(4, 0
      1 read(4, 2241
  40959 read(4, 4096
      1 write(1,        1916
  81949 write(1,        4096
$ grep -E 'read|write' fromcat |field 1,5|sort | uniq -c 
      1 openat(AT_FDCWD,        3
      8 pread64(3,      =
      1 read(3, 3771
  40989 read(3, 4096
      2 read(3, 832
      1 read(3, unknown
      1 read(4, 2241
  40959 read(4, 4096
      1 read(4, unknown
      1 write(1,        1916
  81949 write(1,        4096

In both cases both read and write are 4K.

(Incidentally, sort does read (much) bigger chunks if reading from a file and not from a pipe, but that is not the case here).

Edit 2

The goal of the above is to demonstrate that an additional cat is not always useless; and to figure out what causes this.

The goal is not to sort data.

But if your goal was to sort data, why not just use sort's built-in --parallel?

By default sort seems to use --parallel 8 on a 64-core machine. top shows it uses up to 800% CPU. You can force it to use 64 cores with --parallel 64:

$ time sort {1..64} | md5sum
real    9m4.005s
user    29m56.454s
sys     5m49.560s

$ time sort --parallel 64 {1..64} | md5sum
real    6m50.332s
user    35m55.040s
sys     11m37.609s

So GNU sort's --parallel is much slower than the above. The above is now available as parsort: http://git.savannah.gnu.org/cgit/parallel.git/tree/src/parsort

Best Answer

This is by no means a "useless use of cat".

some_command | cat | some_command

This isn't a traditional "useless use of cat" which are usually derived from ignorance of the shell. Instead this appears to be a deliberate attempt to do something using the dynamics of cat. In this case I believe it's caching.


My Second thoughts

Even if the size of read and write isn't any different there are a couple of things which might be undetectable which could also be in play.

Firstly (and this is very important): Why is processing a sorted array faster than processing an unsorted array?. If you do anything to change the sequence in which the CPU is processing this, the timing may change. If cat succeeds in making each sort run for longer without suspending (and switching to a different process) then this could dramatically affect the CPU's branch prediction and result in a much larger or smaller time.

Secondly, even if the number and size of read is left unaffected number of times a task has to suspend (block) may be different. This in itself is likely to add or remove an overhead. So even if the reads and writes are of the same size, the cat (caching) layer might be reducing the number of times each read() and write() occurs.

Cat might simply be forcing sort to wait longer and thus have more available to do without suspending and reducing the number of times each process blocks. This would be very difficult to detect.


My First thoughts

My expectation here would be that if you put both versions in their own script and ran strace -f on each script, you would see fewer read or / write calls in the example with cat. At least, I would expect to see much larger reads at each layer using cat. My expectation for sort would be that it writes single lines and doesn't internally buffer much. Indeed I would expect it to read() in large enough blocks but only write() in single lines. This means it's not well designed for piping to itself.

As laktak points out in his answer, cat reads in blocks of 128KB (see here) but pipes typically only buffer 64KB. If I'm right then when cat is suspended waiting for a read() to complete this will give a large (128 + 64 KB) buffer for the writing sort operation to write into without ever needing to suspend. By the time cat is resumed there will be a good chunk of data (much more than sort sent in a single write) to pass onto the next sort. As a result the next sort can read from this quite a lot without being suspended.

I also suspect that adding a layer of cat closest the files would have next to no impact or negative impact on performance. These files are already cached on your ram disk. But the layers in-between calls to sort will be acting as a buffer and should reduce the number. That is the truly "useless uses of cat" are those which use cat to read from a file. That's those of the form:

cat some_file | some_command

An interesting experiment

I would be interested to know if the same effect can be induced by increasing the buffer size on the pipes. If you setup the same pipeline from a proper programming language (not a shell). Eg in C you could create your pipeline using pipe(),dup2(),fork(),exec() and call ioctl() on each pipe first to raise the buffer size (see Pipe Capacity)

Related Question