Shell – To count number of matches in a mega string quickly

find sed shell-script

I have a large amount of text data on a single line, with no spaces and no line breaks.
In reality, the streams arrive at 0.2 Gb/s. The situation is similar to the one here, but this task is about counting occurrences, which is computationally more challenging than just counting empty lines.
The string to match is

585e0000fe5a1eda480000000d00030007000000cd010000

An example subset of the data is here, called 30.6.2015_data.txt, and the full binary data is here, called 0002.raw.
The match occurs once in 30.6.2015_data.txt but 10 times in the full data 0002.raw, all on one line.
I prepared the text data with xxd -ps 0002.raw > /tmp/1 && fold -w2 /tmp/1 > /tmp/2 && gsed ':a;N;$!ba;s/\n//g' /tmp/2 > /tmp/3.
The faster the implementation, the better.
To prepare the mega string as a column, you can use xxd -ps 0002.raw > /tmp/1 && fold -w2 /tmp/1 > /tmp/2.
My current rate is 0.0012 s per match, i.e. 0.012 s per ten matches in the full data file, which is slow.

Grep matches by rows (lines), so it does not seem suitable for this kind of counting.
In Vim, %s/veryLongThing//gn is insufficient for the task.
The wc command gives only character, byte and line counts, so it is not the right tool on its own, but it might work in combination with something else.
Possibly a combination of GNU find and sed would work, but all the implementations seem too complicated.

Outputs of Mikeserv's answer

$ cat 1.7.2015.sh 
time \
    ( export ggrep="$(printf '^ \376Z\36\332H \r \3 \a \315\1')" \
             gtr='\1\3\a\r\36HZ^\315\332\376'
             LC_ALL=C
      gtr -cs "$gtr" ' [\n*]' |
      gcut -sd\  -f1-6       |
      ggrep -xFc "$ggrep"
    ) <0002.raw

$ sh 1.7.2015.sh 
1

real    0m0.009s
user    0m0.006s
sys 0m0.007s

-----------

$ cat 1.7.2015.sh 
time \
    (  set      x58 x5e x20 x20 xfe x5a x1e xda \
                x48 x20 x20 x20 x0d x20 x03 x20 \
                x07 x20 x20 x20 xcd x01 x20 x20
        export  ggrep="$(shift;IFS=\\;printf "\\$*")"    \
                gtr='\0\1\3\a\r\36HXZ^\315\332\376'      \
                LC_ALL=C i=0
        while [ "$((i+=1))" -lt 1000 ]
        do    gcat 0002.raw; done            |
        gtr -cd "$gtr" |gtr 'X\0' '\n '      |
        gcut -c-23    |ggrep -xFc "$ggrep"
    ) 

$ sh 1.7.2015.sh 
9990

real    0m4.371s
user    0m1.548s
sys 0m2.167s

All the tools here are GNU coreutils, and they support every option used in the code. They may, however, differ from GNU devtools.
Mikeserv's loop reads the file 999 times and the file contains 10 events, so a total of 9990 events is correct.

How can you count the number of matches in a megastring efficiently?

Best Answer

The GNU implementation of grep (also found in most modern BSDs though the latest versions are a complete (mostly compatible) rewrite) supports a -o option to output all the matched portions.

LC_ALL=C grep -ao CDA | wc -l

would then count all the occurrences.

LC_ALL=C grep -abo CDA

would locate them along with their byte offsets.

LC_ALL=C makes sure grep doesn't try to do any expensive UTF-8 parsing (though here, with a fixed ASCII string search, grep should be able to optimise away the UTF-8 parsing by itself). -a is another GNUism that tells grep to treat binary files as text.
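
As a minimal sketch of the approach above, assuming a grep that supports -o and -a (GNU grep or a modern BSD grep) and input that has already been flattened into a single hex string as in the question, the count could be wrapped in a small function. The name count_matches and the sample input are hypothetical, and -F is an addition here so that the pattern is treated as a fixed string rather than a regular expression.

```shell
# count_matches PATTERN
# Count non-overlapping occurrences of the fixed string PATTERN in the
# data read from standard input. Hypothetical helper, not from the answer.
count_matches() {
    # -a: treat binary input as text
    # -o: print every match on its own line
    # -F: fixed-string search (no regular expression parsing)
    # wc -l then counts the printed matches.
    LC_ALL=C grep -aoF -e "$1" | wc -l
}

pattern=585e0000fe5a1eda480000000d00030007000000cd010000

# Made-up sample input: the pattern twice, with filler around it.
printf 'ffff%s0000%sabcd' "$pattern" "$pattern" | count_matches "$pattern"
```

With the sample input this should report 2 (some wc implementations pad the number with leading spaces). Because grep -o reports non-overlapping matches, overlapping occurrences of a pattern would be undercounted.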

Related Solutions

Bash – Count the number of occurrences of a substring in a string

With perl:

printf '%s' "$SUB_STRING" |
  perl -l -0777 -ne '
    BEGIN{$sub = <STDIN>}
    @matches = m/\Q$sub\E/g;
    print scalar @matches' <(printf '%s' "$STRING")

With bash alone, you could always do something like:

s=${STRING//"$SUB_STRING"}
echo "$(((${#STRING} - ${#s}) / ${#SUB_STRING}))"

That is, $s contains $STRING with all occurrences of $SUB_STRING removed. We find the number of occurrences of $SUB_STRING that were removed by computing the difference in length between $STRING and $s and dividing by the length of $SUB_STRING itself.
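
To make the arithmetic concrete, here is a small worked example of the same technique. It assumes bash (the ${var//pattern} expansion is not POSIX), and the sample values are made up for illustration.

```shell
# Worked example of the length-difference trick (bash-specific syntax).
STRING=abcabcab       # 8 characters, contains "abc" twice
SUB_STRING=abc        # 3 characters

# Remove every occurrence of the substring: "abcabcab" becomes "ab".
s=${STRING//"$SUB_STRING"}

# (8 - 2) / 3 = 2 occurrences removed.
count=$(( (${#STRING} - ${#s}) / ${#SUB_STRING} ))
echo "$count"
```

Note that an empty $SUB_STRING would make the division fail, so a real script would probably want to check for that first.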

POSIXly, you could do something like:

s=$STRING count=0
until
  t=${s#*"$SUB_STRING"}
  [ "$t" = "$s" ]
do
  count=$((count + 1))
  s=$t
done
echo "$count"

find Command – How to Limit Number of Matches

As you're not using find for very much other than walking the directory tree, I'd suggest instead using the shell directly to do this. See variations for both zsh and bash below.

Using the zsh shell

mv ./**/*(-.D[1,1000]) /path/to/collection1    # move first 1000 files
mv ./**/*(-.D[1,1000]) /path/to/collection2    # move next 1000 files

The globbing pattern ./**/*(-.D[1,1000]) would match all regular files (or symbolic links to such files) in or under the current directory, and then return the first 1000 of these. The -. restricts the match to regular files or symbolic links to these, while D acts like dotglob in bash (matches hidden names).

This is assuming that the generated command would not grow too big through expanding the globbing pattern when calling mv.

The above is quite inefficient as it would expand the glob for each collection. You may therefore want to store the pathnames in an array and then move slices of that:

pathnames=( ./**/*(-.D) )

mv $pathnames[1,1000]    /path/to/collection1
mv $pathnames[1001,2000] /path/to/collection2

To randomise the pathnames array when you create it (you mentioned wanting to move random files):

pathnames=( ./**/*(-.Doe['REPLY=$RANDOM']) )

You could do a similar thing in bash (except you can't easily shuffle the result of a glob match in bash, apart from possibly feeding the results through shuf, so I'll skip that bit):

shopt -s globstar dotglob nullglob

pathnames=()
for pathname in ./**/*; do
    [[ -f $pathname ]] && pathnames+=( "$pathname" )
done

mv "${pathnames[@]:0:1000}"    /path/to/collection1
mv "${pathnames[@]:1000:1000}" /path/to/collection2
mv "${pathnames[@]:2000:1000}" /path/to/collection3
