Bash – way to make this one-liner faster

Tags: awk, bash, grep, shell-script, xargs

Context

I have a directory of thousands of zip files that are dated in the form YYYYMMDD_hhmmss.zip, each about 300K. Within each zip file are about 400 XML files, each about 3K.

The problem

I need to be able to search for a given string within a date range of the zip files.

The current (albeit mediocre) solution

I have the following one-liner

find /home/mydir/ -type f | sort | \
awk "/xml_20140207_000016.zip/,/xml_20140207_235938.zip/" | \
xargs -n 1 -P 10 zipgrep "my search string"

The point of it is to

  1. list all the files in my thousand-file directory
  2. sort this list of files
  3. retrieve a range of files based on given dates (the awk range pattern prints lines starting from the first matched string up to and including the second matched string)
  4. pass each resulting line, which corresponds to a single file, to zipgrep

The question

This one-liner runs horribly slowly, even with 10 processes on a 24-core machine. I believe it's slow because of the zipgrep command, but I'm not wise enough to know how to improve it. I don't know if I should be, but I'm a little embarrassed that a colleague wrote a Java tool that runs faster than this script. I'd like to reverse that if possible. So, does anyone know how to make this command faster in this context, or how to improve any part of it at all?

Best Answer

There's a part you can easily improve, but it isn't the slowest part.

find /home/mydir/ -type f | sort | \
awk "/xml_20140207_000016.zip/,/xml_20140207_235938.zip/"

This is somewhat wasteful because it first lists all files, then sorts the file names and extracts the interesting ones. The find command has to run to completion before the sorting can begin.

It would be faster to list only the interesting files in the first place, or at least as small a superset as possible. If you need a finer-grained filter on names than find is capable of, pipe into awk, but don't sort: awk and other line-by-line filters can process lines one by one but sort needs the complete input.

find /home/mydir/ -name 'xml_20140207_??????.zip' -type f | \
awk 'match($0, /_[0-9]*\.zip$/) {
       # extract the hhmmss part and add 0 to force a numeric comparison
       time = substr($0, RSTART+1, RLENGTH-5) + 0
       if (time >= 16 && time <= 235938) print   # 000016 <= hhmmss <= 235938
     }' |
xargs -n 1 -P 10 zipgrep "my search string"

The part which is most obviously suboptimal is zipgrep. Here there is no easy way to improve performance because of the limitations of shell programming. The zipgrep script operates by listing the file names in the archive, and calling grep on each file's content, one by one. This means that the zip archive is parsed again and again for each file. A Java program (or Perl, or Python, or Ruby, etc.) can avoid this by processing the file only once.
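
For example, here is a minimal sketch of that single-pass idea in Python, using the standard zipfile module. The script name search_zips.py, the fixed-string matching and the zipgrep-like output format are assumptions for illustration, not a drop-in replacement:

#!/usr/bin/env python3
# Minimal sketch of a zipgrep-like search that parses each archive only once.
# Usage: search_zips.py "my search string" file1.zip [file2.zip ...]
# (script name, fixed-string matching and output format are assumptions)
import sys
import zipfile

def search_zip(needle, zip_path):
    """Print archive:member:line for every member line containing needle (bytes)."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            with archive.open(member) as f:
                for line in f:
                    if needle in line:
                        text = line.decode("utf-8", errors="replace").rstrip("\n")
                        print(f"{zip_path}:{member}:{text}")

if __name__ == "__main__":
    search_string = sys.argv[1].encode()
    for path in sys.argv[2:]:
        search_zip(search_string, path)

It would slot into the pipeline above in place of zipgrep, e.g. xargs -n 100 -P 2 python3 search_zips.py "my search string", so each invocation handles a batch of archives and each archive is parsed exactly once.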

If you want to stick to shell programming, you can try mounting each zip instead of using zipgrep.

… | xargs -n1 -P2 sh -c '
    # $0 is the search string (the argument after the script), $1 is one zip file
    d="mnt$$-$(basename "$1")"   # flat mount-point name; $1 itself contains slashes
    mkdir "$d"
    fuse-zip "$1" "$d"
    grep -R "$0" "$d"
    fusermount -u "$d"
    rmdir "$d"
' "my search string"

Note that parallelism isn't going to help you much: the limiting factor on most setups will be disk I/O bandwidth, not CPU time.

I haven't benchmarked anything, but I think the biggest place for improvement would be to use a zipgrep implementation in a more powerful language.
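
To sketch that direction a bit further (purely illustrative, reusing the hypothetical search_zip() helper above together with the directory and date range from the question), the name filtering, the range check and the search can all live in one program:

#!/usr/bin/env python3
# Rough sketch: pick the zips in the date range by name and search them in
# parallel, reusing the hypothetical search_zip() helper sketched above.
import glob
import os
from functools import partial
from concurrent.futures import ProcessPoolExecutor

from search_zips import search_zip  # hypothetical module from the sketch above

NEEDLE = b"my search string"
FIRST = "xml_20140207_000016.zip"   # range bounds taken from the question
LAST = "xml_20140207_235938.zip"

def in_range(path):
    # the fixed-width YYYYMMDD_hhmmss names compare correctly as plain strings
    return FIRST <= os.path.basename(path) <= LAST

if __name__ == "__main__":
    files = sorted(p for p in glob.glob("/home/mydir/*.zip") if in_range(p))
    # modest parallelism: the job is more likely limited by disk I/O than CPU
    with ProcessPoolExecutor(max_workers=2) as pool:
        list(pool.map(partial(search_zip, NEEDLE), files))

Because every archive is read exactly once and the range check is a plain string comparison on the fixed-width names, this does the same work as the pipeline without spawning a process per zip file.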