Escaping of meta characters in basic/extended posix regex strings in grep

escape-characters, grep, perl, regular expression

Is it possible to escape all meta characters of a string inside a variable before passing it to grep? I know similar questions have been asked before on SE (here), and there is also a good explanation here, but I was curious whether it is possible with a basic/extended POSIX regex pattern instead of a Perl pattern. (I'm currently reading up on Perl regex syntax to understand it first instead of jumping into the solution.)

Why this requirement: (Meta, not required for answer)

I was trying to write a small script for splitting large files, where I split files into file_name.ext.000, file_name.ext.001, etc., which works fine. Now I don't want to split files that are already split (i.e. there are files whose names have a 3-character extension that is all digits, and whose sizes sum to the original file size). If I use a plain shell expansion like file_name.ext.*, it also matches files such as file_name.ext.ext2, so the total size mismatches and a split occurs even though there is no need to resplit. So I want to check only files named file_name.ext.###, where ### are digits. My current expression to find the total size of these parts looks like this:

FILE_SIZE_EXISTING=$( (find "$DESTINATION" -type f -regextype posix-extended -regex "^$DESTINATION/$FILE_BASENAME(\.[[:digit:]]{3})?$" -print0 | xargs -0 stat --printf="%s\\n" 2>/dev/null || echo 0) | paste -sd+ | bc)

This works for simple file names. However, it does not work for fancier names, e.g. ones containing [ ]. Is there a workaround? I'm new to shell scripting and don't know much Perl.

Best Answer

How to quote special characters (portably)

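The following snippet adds a backslash before characters that are special in extended regular expressions, using sed to replace every occurrence of one of the characters ][()\.^$?*+ with a backslash followed by that character. Note that {, } and | are also special in extended regular expressions but are not in this list; add them to the bracket expression if your strings can contain them.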

raw_string='test[string]\.wibble'
quoted_string=$(printf %s "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')

Command substitution strips trailing newlines, so any newlines at the end of $raw_string are lost. If that's a problem, add an inert character at the end before the substitution, then strip off that character afterwards:

quoted_string=$(printf %sa "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')
quoted_string=${quoted_string%?}
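
As a minimal sketch of how the quoted string might then be used, the sed call can be wrapped in a small function and the result passed to grep -E. The function name quote_ere is hypothetical (it is not part of the answer), and the bracket expression here also includes {, } and |, which is an addition to the answer's list. The quoting targets extended regular expressions only: in a basic regular expression some escaped characters, such as \(, become special instead.

```shell
#!/bin/sh
# quote_ere: hypothetical helper, not from the original answer.
# Prints its argument with ERE metacharacters backslash-escaped.
# Assumption: { } and | are added to the answer's character list.
quote_ere() {
  printf %s "$1" | sed 's/[][(){}|\.^$?*+]/\\&/g'
}

raw_string='test[string]\.wibble'
quoted_string=$(quote_ere "$raw_string")
printf '%s\n' "$quoted_string"    # prints test\[string\]\\\.wibble

# With -E the quoted string matches only the literal text, so only the
# first of the two input lines is printed.
printf '%s\n' 'test[string]\.wibble' 'testsXwibble' |
  grep -E -x -- "$quoted_string"
```

Because the result is captured with command substitution, trailing newlines in the input are still dropped unless the inert-character trick above is applied.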

How to quote special characters (in bash or zsh)

Bash and zsh have a pattern replacement feature, which can be faster if the string is not very long. It's cumbersome here because the replacement must be a string, so each character needs to be replaced separately. Note that you must escape the backslashes first.

quoted_string=${raw_string//\\/\\\\}
for c in \[ \] \( \) \. \^ \$ \? \* \+; do
  quoted_string=${quoted_string//"$c"/"\\$c"}
done
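
To see why the backslashes have to be handled first, here is a rough sketch that wraps the same two steps in a function. It requires bash (not plain sh), and the name quote_ere_bash is hypothetical. If the backslash step ran last, the backslashes inserted in front of the other characters would themselves be doubled, turning a.b into a\\.b instead of a\.b.

```shell
#!/usr/bin/env bash
# quote_ere_bash: hypothetical wrapper, not from the original answer.
quote_ere_bash() {
  local s=$1 c
  s=${s//\\/\\\\}            # step 1: double every existing backslash
  for c in \[ \] \( \) \. \^ \$ \? \* \+; do
    s=${s//"$c"/"\\$c"}      # step 2: put a backslash before each metacharacter
  done
  printf '%s\n' "$s"
}

quote_ere_bash 'a.b'                     # prints a\.b
quote_ere_bash 'test[string]\.wibble'    # prints test\[string\]\\\.wibble
```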

How to quote special characters (in ksh93)

Ksh's string replacement construct is more powerful than the watered-down version in bash and zsh. It supports references to groups in the pattern.

quoted_string=${raw_string//@([][()\.^$?*+])/\\\1}

What you actually want

You don't need find here: shell patterns are sufficient to match files ending with three digits. If no part file exists, the glob pattern is left unexpanded. There's also a simpler way of adding the file sizes: rather than use stat (which exists on many unix variants but has a different syntax on each) and do complex pipelining to sum the values, you can call wc -c (on regular files, on most systems, wc will look at the file size and not bother to open the file and read the bytes).

set -- "$DESTINATION/$FILE_BASENAME".[0-9][0-9][0-9]
case $1 in
  *\]) # The glob was left intact, so no part exists
    do_split …;;
  *) # The glob was expanded, so at least one part exists
    # /dev/null forces a "total" line even when there is a single part;
    # otherwise digits in the file name would leak into the result.
    FILE_SIZE_EXISTING=$(wc -c "$@" /dev/null | sed -n '$s/[^0-9]//gp')
    if [ "$FILE_SIZE_EXISTING" -ne "$(wc -c <"$DESTINATION/$FILE_BASENAME")" ]; then
      do_split …
    fi;;
esac
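
The glob-based size check can be tried in isolation with a sketch like the following. The function name parts_size and the file names are made up for illustration, and mktemp -d is assumed to be available (it is common but not required by POSIX). The extra /dev/null operand is there so that wc always prints a "total" line, even when only one part exists. Because the base name is quoted and only the .[0-9][0-9][0-9] suffix is a pattern, brackets in the file name need no escaping.

```shell
#!/bin/sh
# parts_size: hypothetical helper, not from the original answer.
# Prints the combined size in bytes of "$1".000, "$1".001, ...,
# or 0 when no such part exists.
parts_size() {
  set -- "$1".[0-9][0-9][0-9]
  case $1 in
    *\]) echo 0 ;;   # glob left unexpanded: no parts
    *) wc -c "$@" /dev/null | sed -n '$s/[^0-9]//gp' ;;
  esac
}

dir=$(mktemp -d)
base="$dir/fancy [name].ext"
printf 'abc' > "$base.000"       # 3 bytes
printf 'de' > "$base.001"        # 2 bytes
printf 'ignored' > "$base.ext2"  # not a part: suffix is not three digits

parts_size "$base"           # prints 5
parts_size "$dir/missing"    # prints 0
rm -rf "$dir"
```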

Note that your test on the total size is not very reliable: if the file has changed but remained the same size, you'll end up with stale parts. That's ok if the files never change and the only risk is that parts may be truncated or missing.
