Regular Expression – Splitting Text Files Based on a Pattern

regular expression, split

I have a text file which I want to split into 64 unequal parts, according to the 64 hexagrams of the Yi Jing. Since the passage for each hexagram begins with some digit(s), a period, and two newlines, the regex should be pretty easy to write.

But how do I actually split the text file into 64 new files according to this regex? It seems like more of a task for perl. But maybe there's a more obvious way that I'm just totally missing.

Best Answer

This would be a job for csplit, except that its regex can only match within a single line, so it cannot require the blank line that follows each heading. That limitation also makes sed awkward; I'd go with Perl or Python.

You could see if

csplit foo.txt '/^[0-9][0-9]*\.$/' '{64}'

is good enough for your purposes. (csplit requires a POSIX BRE, so it can't use \d or +, among others.)
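If the blank line after each heading matters, a one-line lookahead is enough to check it. The following is only a sketch using awk from the shell, not the Perl or Python approach suggested above; the function name split_hexagrams and the part_NN.txt output names are made up for illustration, and any text before the first heading is silently dropped.

```shell
# Hypothetical helper: split_hexagrams INPUT OUTDIR
# Starts a new part at every line made of digits plus a period
# that is immediately followed by an empty line.
split_hexagrams() {
  input=$1 outdir=$2
  mkdir -p "$outdir" || return
  awk -v outdir="$outdir" '
    # Write a line to the current part, if a part has been opened.
    function emit(line) { if (out != "") print line > out }
    # Close the current part and switch to the next numbered file.
    function next_part() {
      if (out != "") close(out)
      out = sprintf("%s/part_%02d.txt", outdir, ++n)
    }
    NR > 1 {
      # The previous line is a heading only if this line is empty.
      if (prev ~ /^[0-9]+\.$/ && $0 == "") next_part()
      emit(prev)
    }
    { prev = $0 }                  # hold each line back by one
    END { if (NR > 0) emit(prev) } # flush the last held line
  ' "$input"
}
```

Called as split_hexagrams foo.txt parts (both names are placeholders), this should leave one file per heading in the parts directory. A line such as "see line 3." does not start a new part, because the pattern is anchored at both ends.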

How to quote special characters (portably)

The following snippet adds a backslash before each character that's special in extended regular expressions, using sed to replace any occurrence of one of the characters ][()\.^$?*+ by a backslash followed by that character:

raw_string='test[string]\.wibble'
quoted_string=$(printf %s "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')

This will remove trailing newlines in $raw_string; if that's a problem, ensure that the string doesn't end with a newline by adding an inert character at the end, then strip off that character.

quoted_string=$(printf %sa "$raw_string" | sed 's/[][()\.^$?*+]/\\&/g')
quoted_string=${quoted_string%?}
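As a usage sketch, the two steps can be wrapped in a function and the result handed to grep -E. The name quote_ere is invented here, and the sample lines are only for illustration. Because the escaped set targets extended regular expressions, the pattern is used with -E; characters outside that set (for example { or |) are not handled by this sketch.

```shell
# quote_ere STRING: hypothetical wrapper around the sed command above.
# Prints STRING with a backslash in front of each of ][()\.^$?*+
quote_ere() {
  # The extra "a" protects trailing newlines from command substitution.
  quoted=$(printf %sa "$1" | sed 's/[][()\.^$?*+]/\\&/g')
  printf %s "${quoted%?}"
}

raw_string='test[string]\.wibble'
pattern=$(quote_ere "$raw_string")

# Only the first sample line contains the raw string literally.
printf '%s\n' 'test[string]\.wibble' 'testsXwibble' | grep -E -e "$pattern"
```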

How to quote special characters (in bash or zsh)

Bash and zsh have a pattern replacement feature, which can be faster if the string is not very long. It's cumbersome here because the replacement must be a string, so each character needs to be replaced separately. Note that you must escape the backslashes first.

quoted_string=${raw_string//\\/\\\\}
for c in \[ \] \( \) \. \^ \$ \? \* \+; do
  quoted_string=${quoted_string//"$c"/"\\$c"}
done

How to quote special characters (in ksh93)

Ksh's string replacement construct is more powerful than the watered-down version in bash and zsh. It supports references to groups in the pattern.

quoted_string=${raw_string//@([][()\.^$?*+])/\\\1}

What you actually want

You don't need find here: shell patterns are sufficient to match files ending with three digits. If no part file exists, the glob pattern is left unexpanded. There's also a simpler way of adding the file sizes: rather than use stat (which exists on many unix variants but has a different syntax on each) and do complex pipelining to sum the values, you can call wc -c (on regular files, on most systems, wc will look at the file size and not bother to open the file and read the bytes).

set -- "$DESTINATION/$FILE_BASENAME".[0-9][0-9][0-9]
case $1 in
  *\]) # The glob was left intact, so no part exists
    do_split …;;
  *) # The glob was expanded, so at least one part exists
    FILE_SIZE_EXISTING=$(wc -c "$@" | sed -n '$s/[^0-9]//gp')
    if [ "$FILE_SIZE_EXISTING" -ne "$(wc -c <"$DESTINATION/$FILE_BASENAME")" ]; then
      do_split …
    fi;;
esac

Note that your test on the total size is not very reliable: if the file has changed but remained the same size, you'll end up with stale parts. That's ok if the files never change and the only risk is that parts may be truncated or missing.
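The glob-and-compare idea can also be packaged as a small predicate. This is a variant sketch rather than the exact code above: the name parts_complete is hypothetical, and it sums the sizes with one wc -c call per part instead of parsing the "total" line, which sidesteps the different output wc produces when given a single file.

```shell
# parts_complete FILE: hypothetical helper.
# Succeeds if FILE.000, FILE.001, ... exist and their sizes add up
# to the size of FILE; fails if no part exists or the sizes differ.
parts_complete() {
  set -- "$1" "$1".[0-9][0-9][0-9]
  original=$1
  shift
  case $1 in
    *\]) return 1;;   # the glob was left intact: no part exists
  esac
  total=0
  for part do
    size=$(wc -c <"$part") || return
    total=$((total + size))
  done
  # $(( )) normalises any padding some wc implementations add.
  [ "$total" -eq "$(($(wc -c <"$original")))" ]
}
```

It could then drive the decision with something like parts_complete "$DESTINATION/$FILE_BASENAME" || do_split …, with the same caveat about files that change without changing size.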

How to search for the word stored in the hold space with sed

That was a hard one. Assuming you have a file like this:

$ cat file
word
line with a word and words and wording wordy words.

Where:

Line 1: is the search pattern that should be held in the hold space and quoted to `word`.
Line 2: is the line to search and replace globally.

The sed command:

sed -n '1h; 2{x;G;:l;s/^\([^\n]\+\)\n\(.*[^`]\)\1\([^`]\)/\1\n\2`\1`\3/;tl;p}' file

Explanation:

1h; save the first line to the hold space (this is what we want to search for).
- hold space contains: word
2{...} applies to the second line.
x; exchange the pattern space and the hold space.
G; append the hold space to the pattern space. In the pattern space we have now:

word # I will call this line the "pattern line" from now on
line with a word and words and wording wordy words.

:l; set a label called l as a jump target for later.
s/// do the actual search/replace in the pattern space shown above:
- ^\([^\n]\+\)\n matches, from the beginning of the pattern space ^, one or more \+ characters that are not a newline [^\n], up to the first newline \n. The group is stored in the back-reference \1, which therefore contains the "pattern line".
- \(.*[^`]\) matches any characters .* followed by one character that is not a backtick [^`]. This is stored in \2. On the first pass \2 contains everything from "line with a word ..." up to the last occurrence of word, because .* is greedy and...
- \1 is the next search term (the back-reference \1, word), that is, whatever the "pattern line" contains.
- \([^`]\) requires one more character that is not a backtick after the word; it is saved to the back-reference \3. Without this check (and the one at the end of \2 above), we would end up in an endless loop quoting the same word again and again -> ````word````, because s/// would always be successful and tl; would keep jumping back to :l (see tl; further down).
- \1\n\2`\1`\3 all of the above is replaced by the back-references. The second \1 is the one we quote with backticks (the first one restores the "pattern line").
tl; if the s/// was successful (we replaced something), jump to the label called l and start again until there is nothing more to search and replace. This is the case when all occurrences of word are replaced/quoted.
p; when all is done, print the altered line (pattern space).

The output:

$ sed -n '1h; 2{x;G;:l;s/^\([^\n]\+\)\n\(.*[^`]\)\1\([^`]\)/\1\n\2`\1`\3/;tl;p}' file
word
line with a `word` and `word`s and `word`ing `word`y `word`s.
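To watch the loop work, the pattern space can be printed after every successful substitution instead of only at the end. This is a sketch that assumes GNU sed: it uses the T command (a GNU extension) in addition to the \+ and \n the original command already relies on, and the function name trace_quoting is made up.

```shell
# trace_quoting FILE: hypothetical helper.
# Same substitution as above, but prints the pattern space
# (pattern line + text line) after each pass of the loop.
trace_quoting() {
  sed -n \
    -e '1h' \
    -e '2{' \
    -e 'x;G' \
    -e ':l' \
    -e 's/^\([^\n]\+\)\n\(.*[^`]\)\1\([^`]\)/\1\n\2`\1`\3/' \
    -e 'T' \
    -e 'p' \
    -e 'bl' \
    -e '}' "$1"
}
```

Run on the sample file, each pass quotes one more occurrence, starting from the last one in the line and working backwards, because .* is greedy. Since the expression needs one non-backtick character on each side of the word, an occurrence at the very start or very end of line 2 is never quoted.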
