Bash – String pattern-matching with =~

Tags: bash, regular expression, string

I have trouble understanding string pattern matching with =~ in bash.

I wrote the following function (don't be alarmed: it's just an experiment, not a security approach with md5sum):

md5 ()  { 
     [[ "$(md5sum $1)" =~ $2* ]] && echo fine || echo baarr; 
}

and tested it with some input. Here is some reference output:

md5sum wp.laenderliste
b1eb0d822e8d841249e3d68eeb3068d3  wp.laenderliste

It's unnecessarily hard to compare if the source of the checksum does not already contain the two blanks and the filename. That's where my experiment originates, but more interesting than the many ways to solve that problem was the following observation:

I define a control variable, and test my function with too short, but matching strings:

ok=b1eb0d822e8d841249e3d68eeb3068d3
for i in {29..32}; do md5 wp.laenderliste ${ok:1:$i} ;done 
fine
fine
fine
fine

That's expected and fine, since the purpose of the function is to ignore the mismatch caused by the missing "  wp.laenderliste", and therefore even longer missing tails.

Now, if I append random stuff, which does not match, I expect, of course, errors, and get them:

for i in {29..32}; do md5 wp.laenderliste ${ok:1:$i}GU ;done 
baarr
baarr
baarr
baarr

As expected. But when there is only one, last mismatching character, see what happens:

for i in {29..32}; do md5 wp.laenderliste ${ok:1:$i}G ;done 
fine
fine
fine
fine

Is this me not realizing how this is supposed to work (select is broken), or is there really an off-by-one error in bash's pattern matching?

Mismatches in the middle of the string matter from the very first one:

for i in 5 9 e ; do echo md5 wp.laenderliste ${ok//$i/_} ;done 
md5 wp.laenderliste b1eb0d822e8d841249e3d68eeb3068d3
md5 wp.laenderliste b1eb0d822e8d84124_e3d68eeb3068d3
md5 wp.laenderliste b1_b0d822_8d841249_3d68__b3068d3

for i in 5 9 e ; do md5 wp.laenderliste ${ok//$i/_} ;done 
fine
baarr
baarr

The bash-version:

bash -version
GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

Disclaimer: md5sum is only useful against unintentional mistakes, not against attacks. I don't encourage using it.

And this question is not a search for better solutions or workarounds. It's about the =~ operator: whether it should act as it does and, if so, why.

Best Answer

=~ inside [[ ]] is a regular expression match (or rather, a search; see below). That's different from = (or ==), which uses the same patterns as filename wildcards.

In particular, the asterisk in regular expressions means "zero or more copies of the preceding unit", so abc* means ab followed by zero or more cs.

In your case, the trailing asterisk makes the final character of the function argument optional. In your final example, the pattern becomes ...68d3G*, and since G* matches the empty string, it matches a string like ...68d3. With GU appended, the pattern is ...68d3GU*, where only the U is optional and the G is required, so it fails. Regexese for "any string" is .*, or "any character, any number of times".
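A minimal sketch of that behaviour. The helper name re_match is made up for this example, and the test line is a shortened stand-in for the real md5sum output, not the actual hash:

```shell
#!/usr/bin/env bash
# re_match is a hypothetical helper: prints "fine" if the string ($1)
# matches the regular expression ($2) under =~, "baarr" otherwise.
re_match() {
    local re=$2
    [[ $1 =~ $re ]] && echo fine || echo baarr
}

line='68d3  wp.laenderliste'   # shortened stand-in for md5sum output

re_match "$line" '68d3G*'    # fine:  G* matches zero Gs
re_match "$line" '68d3GU*'   # baarr: G is mandatory here, only U is optional
re_match "$line" '68d3.*'    # fine:  .* matches whatever follows
```

Storing the pattern in a variable and leaving it unquoted on the right-hand side keeps it a regular expression; quoting it there would make bash treat it as a literal string.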

Note that the regexp match searches for a match anywhere in the string; it doesn't need to cover the whole string. So the pattern cde would be found in the string abcdefgh.
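This is also why the substrings in the question still matched even though ${ok:1:$i} starts at offset 1 and so drops the first character of the hash. A small sketch of the search behaviour follows; the helper name re_search is invented for illustration, and the anchors ^ and $ are the usual way to tie a regular expression to the start and end of the string:

```shell
#!/usr/bin/env bash
# re_search is a hypothetical helper: reports whether the regular
# expression ($2) is found anywhere in the string ($1).
re_search() {
    local re=$2
    [[ $1 =~ $re ]] && echo found || echo "not found"
}

re_search abcdefgh 'cde'      # found:     matches in the middle
re_search abcdefgh '^cde'     # not found: string does not start with cde
re_search abcdefgh '^abc'     # found:     anchored at the start only
re_search abcdefgh '^cde$'    # not found: would have to be the whole string
re_search cde      '^cde$'    # found
```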

You might want to use something like this:

[[ "$(md5sum "$1")" = "$2 "* ]] && echo ok

We don't really need a regular expression match here. Since md5sum outputs the hash followed by spaces and the filename anyway, we can include the trailing space in the pattern to check that we match the full hash. The = match has to cover the whole string, so giving the function a truncated hash would not match.
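As a rough sketch of how that comparison behaves, here it is wrapped in a helper that takes the md5sum output line as an argument instead of running md5sum, so it can be tried without the file. The name check_hash and the output words are assumptions made for this example:

```shell
#!/usr/bin/env bash
# check_hash is a hypothetical helper: $1 is a line in md5sum's output
# format, $2 is the expected hash. The quoted part of the pattern is
# literal; the unquoted * is a wildcard for the rest of the line.
check_hash() {
    [[ $1 = "$2 "* ]] && echo ok || echo mismatch
}

line='b1eb0d822e8d841249e3d68eeb3068d3  wp.laenderliste'

check_hash "$line" b1eb0d822e8d841249e3d68eeb3068d3     # ok
check_hash "$line" b1eb0d822e8d841249e3d68eeb3068d      # mismatch: truncated
check_hash "$line" b1eb0d822e8d841249e3d68eeb3068d3G    # mismatch: extra G
```

Quoting "$2 " also means that any wildcard or regex characters in the argument are compared literally.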
