I have a large amount of text data with no spaces and no line breaks: everything is on one line.
In reality, the streams run at 0.2 Gb/s (similar situation here), but this task is about counting occurrences of a match, which is computationally more challenging than just counting empty lines.
The match is
585e0000fe5a1eda480000000d00030007000000cd010000
An example subset of the data is called 30.6.2015_data.txt here, and the full binary data is called 0002.raw.
The match occurs once in 30.6.2015_data.txt but 10 times in the full data 0002.raw, all on one line.
I prepared the txt data with
xxd -ps 0002.raw > /tmp/1 && fold -w2 /tmp/1 > /tmp/2 && gsed ':a;N;$!ba;s/\n//g' /tmp/2 > /tmp/3
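The same single-line hex string can probably be produced in one pass, without the intermediate fold and gsed steps. A minimal sketch, assuming POSIX od and tr are available; the temporary sample file and its three bytes are made up for illustration and stand in for 0002.raw:

```shell
# Hypothetical 3-byte sample standing in for 0002.raw (bytes 58 5e 00).
sample=$(mktemp)
printf '\130\136\000' > "$sample"

# od prints the bytes as space-separated hex pairs; tr strips the spaces
# and newlines, leaving one line of hex digits with no separators.
hex=$(od -An -v -tx1 "$sample" | tr -d ' \n')
echo "$hex"

rm -f "$sample"
```

With xxd, piping xxd -ps 0002.raw through tr -d '\n' should give the same kind of result, since xxd -ps only adds line breaks to the plain hex output.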
The faster the implementation, the better.
To prepare the megastring as a column (one byte per line), you can use
xxd -ps 0002.raw > /tmp/1 && fold -w2 /tmp/1 > /tmp/2
My current rate is 0.0012 s per match, i.e. 0.012 s for the ten matches in the full data file, which is slow.
Grep works line by line, so it counts matching lines rather than occurrences; with everything on one line it cannot count the matches directly.
In Vim, %s/veryLongThing//gn is insufficient for the task.
The command wc gives only character, byte and line counts, so it is not the right tool on its own, though it might work in combination with something else.
Possibly a GNU Find and Sed combination would work, but all implementations seem to be too complicated.
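To illustrate the grep limitation mentioned above, here is a small sketch. The sample line and the shortened pattern 585e00 are hypothetical, chosen only so that the pattern occurs three times on a single line:

```shell
# Hypothetical one-line sample containing the pattern three times.
sample=$(mktemp)
printf 'xx585e00yy585e00zz585e00\n' > "$sample"

# -c counts matching lines, not matches, so a single line with
# three occurrences is still reported as one.
lines=$(grep -c '585e00' "$sample")
echo "$lines"

rm -f "$sample"
```

So on a megastring, the line count from grep -c says only whether the pattern is present, not how often.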
Outputs of Mikeserv's answer
$ cat 1.7.2015.sh
time \
( export ggrep="$(printf '^ \376Z\36\332H \r \3 \a \315\1')" \
gtr='\1\3\a\r\36HZ^\315\332\376'
LC_ALL=C
gtr -cs "$gtr" ' [\n*]' |
gcut -sd\ -f1-6 |
ggrep -xFc "$ggrep"
) <0002.raw
$ sh 1.7.2015.sh
1
real 0m0.009s
user 0m0.006s
sys 0m0.007s
-----------
$ cat 1.7.2015.sh
time \
( set x58 x5e x20 x20 xfe x5a x1e xda \
x48 x20 x20 x20 x0d x20 x03 x20 \
x07 x20 x20 x20 xcd x01 x20 x20
export ggrep="$(shift;IFS=\\;printf "\\$*")" \
gtr='\0\1\3\a\r\36HXZ^\315\332\376' \
LC_ALL=C i=0
while [ "$((i+=1))" -lt 1000 ]
do gcat 0002.raw; done |
gtr -cd "$gtr" |gtr 'X\0' '\n ' |
gcut -c-23 |ggrep -xFc "$ggrep"
)
$ sh 1.7.2015.sh
9990
real 0m4.371s
user 0m1.548s
sys 0m2.167s
All tools here are GNU coreutils, and they support all the options used in your code. They may, however, differ from GNU devtools.
Mikeserv's loop reads the file 999 times (i runs from 1 to 999) and each pass contains 10 events, so a total of 9990 events is correct.
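The expected total can be checked by counting the loop iterations on their own. This sketch reuses the loop condition from the second script; the variable runs is introduced here only for the check:

```shell
# Same loop condition as in the second script, counting passes only.
i=0 runs=0
while [ "$((i+=1))" -lt 1000 ]
do runs=$((runs+1)); done

# 10 matches per pass over 0002.raw, as stated in the question.
echo "$runs passes, $((runs*10)) matches expected"
```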
How can you count the number of matches in a megastring efficiently?
Best Answer
The GNU implementation of grep (also found in most modern BSDs, though the latest versions are a complete, mostly compatible, rewrite) supports a -o option to output all the matched portions, one per line. Piping that output to wc -l would then count all the occurrences. Adding the -b option prints each match with its byte offset, to locate them.
LC_ALL=C makes sure grep doesn't try to do some expensive UTF-8 parsing (though here, with a fixed ASCII string search, grep should be able to optimise away the UTF-8 parsing by itself). -a is another GNUism to tell grep to treat binary files as text.
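A minimal sketch of that approach on the hex text, under a few assumptions: the sample file and the shortened pattern 585e00 are hypothetical, and -F is an addition here to make the fixed-string search explicit. For the real data the pattern would be the full hex string from the question and the file would be the single-line hex dump:

```shell
# Hypothetical one-line hex sample; the pattern occurs three times.
sample=$(mktemp)
printf 'xx585e00yy585e00zz585e00' > "$sample"

# -o prints each match on its own line, -a treats the input as text,
# -F searches for a fixed string; wc -l then counts the matches.
# tr strips the padding some wc implementations add.
count=$(LC_ALL=C grep -aoF '585e00' "$sample" | wc -l | tr -d ' ')
echo "$count"

# With GNU grep, adding -b prefixes each match with its byte offset.
LC_ALL=C grep -aboF '585e00' "$sample"

rm -f "$sample"
```

Note that grep -o reports non-overlapping matches, and that in a hex dump a hit starting at an odd offset would straddle two bytes; the offsets printed with -b could be used to filter such hits out.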