Shell – How to quickly count the number of matches in a mega string

find sed shell-script

I have a large block of text data on a single line, with no spaces and no line breaks.
In reality, the streams run at 0.2 Gb/s. The situation is similar to counting empty lines, but counting occurrences of a pattern is computationally more challenging.
The string to match is

585e0000fe5a1eda480000000d00030007000000cd010000

An example subset of the data is called 30.6.2015_data.txt, and the full binary data is called 0002.raw.
The match occurs once in 30.6.2015_data.txt but 10 times in the full data 0002.raw, all on one line.
I prepared the text data with xxd -ps 0002.raw > /tmp/1 && fold -w2 /tmp/1 > /tmp/2 && gsed ':a;N;$!ba;s/\n//g' /tmp/2 > /tmp/3.
The faster the implementation, the better.
To prepare the mega string as a column instead (one byte per line), you can use xxd -ps 0002.raw > /tmp/1 && fold -w2 /tmp/1 > /tmp/2.
My current rate is 0.0012 s per match, i.e. 0.012 s for the ten matches in the full data file, which is slow.
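The preparation step above can probably be shortened. Splitting the hex dump into two-character lines with fold and then joining every line again with sed ends up with the same single line as simply deleting the newlines, and the sed loop holds the whole file in its pattern space. A minimal sketch, assuming a POSIX od and tr are available; hex_one_line is a hypothetical helper name, and od is swapped in here for xxd only because it is more widely installed:

```shell
# Hypothetical helper: print a binary file as one line of lowercase hex.
# With xxd, the equivalent would be: xxd -ps "$1" | tr -d '\n'
hex_one_line() {
    # -An: no offset column; -v: do not collapse repeated lines;
    # -tx1: one two-digit hex value per input byte
    od -An -v -tx1 < "$1" | tr -d ' \n'
}
```

Usage would look like hex_one_line 0002.raw > /tmp/3, which writes the mega string without the intermediate files /tmp/1 and /tmp/2.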

Grep matches line by line, so on its own it cannot count occurrences within a single line.
In Vim, %s/veryLongThing//gn is insufficient for the task.
The wc command reports only characters, bytes and lines, so it is not the right tool by itself, though it could probably be combined with something else.
A combination of GNU find and sed might work, but all the implementations seem too complicated.

Outputs of Mikeserv's answer

$ cat 1.7.2015.sh 
time \
    ( export ggrep="$(printf '^ \376Z\36\332H \r \3 \a \315\1')" \
             gtr='\1\3\a\r\36HZ^\315\332\376'
             LC_ALL=C
      gtr -cs "$gtr" ' [\n*]' |
      gcut -sd\  -f1-6       |
      ggrep -xFc "$ggrep"
    ) <0002.raw

$ sh 1.7.2015.sh 
1

real    0m0.009s
user    0m0.006s
sys     0m0.007s

-----------

$ cat 1.7.2015.sh 
time \
    (  set      x58 x5e x20 x20 xfe x5a x1e xda \
                x48 x20 x20 x20 x0d x20 x03 x20 \
                x07 x20 x20 x20 xcd x01 x20 x20
        export  ggrep="$(shift;IFS=\\;printf "\\$*")"    \
                gtr='\0\1\3\a\r\36HXZ^\315\332\376'      \
                LC_ALL=C i=0
        while [ "$((i+=1))" -lt 1000 ]
        do    gcat 0002.raw; done            |
        gtr -cd "$gtr" |gtr 'X\0' '\n '      |
        gcut -c-23    |ggrep -xFc "$ggrep"
    ) 

$ sh 1.7.2015.sh 
9990

real    0m4.371s
user    0m1.548s
sys     0m2.167s

All the tools here are the GNU implementations (the g-prefixed names), and they support every option used in the code. They may, however, differ from GNU devtools.
Mikeserv's loop feeds the file through the pipeline 999 times (i runs from 1 to 999), and each pass contains 10 matches, so the total of 9990 is correct.
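The iteration count of the loop in the second script can be checked in isolation. The sketch below reuses the same loop condition but replaces the gcat call with a counter; the variable n is introduced only for this illustration, and the figure of 10 matches per pass is taken from the question:

```shell
# Same loop condition as in the script, with a counter instead of gcat.
i=0 n=0
while [ "$((i+=1))" -lt 1000 ]
do    n=$((n+1)); done

# 10 matches per pass of 0002.raw, as stated in the question.
echo "$n passes, $((n * 10)) expected matches"
```

The condition increments i before comparing, so the body runs for i = 1 through 999 and stops when i reaches 1000.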

How can you count the number of matches in a megastring efficiently?

Best Answer

The GNU implementation of grep (also found in most modern BSDs though the latest versions are a complete (mostly compatible) rewrite) supports a -o option to output all the matched portions.

LC_ALL=C grep -ao CDA | wc -l

would then count all the occurrences.

LC_ALL=C grep -abo CDA

to locate them with their byte offset.

LC_ALL=C makes sure grep doesn't try to do any expensive UTF-8 parsing (though here, with a fixed ASCII string search, grep should be able to optimise away the UTF-8 parsing by itself). -a is another GNUism that tells grep to treat binary files as text.
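As a small illustration of the counting approach, here is a sketch wrapped in a function. The name count_matches is hypothetical, and -F (fixed-string matching, which is standard grep) is added on the assumption that the pattern is a literal string such as the hex sequence from the question:

```shell
# Hypothetical helper: count non-overlapping occurrences of a fixed
# string in the data arriving on standard input.
count_matches() {
    # -a: treat binary input as text; -o: print each match on its own line;
    # -F: fixed string, not a regex; -e: pattern may start with a dash
    LC_ALL=C grep -aoF -e "$1" | wc -l
}

# Toy input: the pattern 585e occurs three times in one line.
printf '%s' 'ab585eab585e00585e' | count_matches 585e
```

This should report 3 for the toy input. Two caveats: -o reports non-overlapping matches only, and in a hex dump a match may start on an odd character position, that is, in the middle of a byte, so the count on the hex mega string could in principle exceed the count of byte-aligned matches in the binary file.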
