The ERE regex to split() string between a delimiter and end-of-word

awk, gawk, regular expression, split

I'm using a long gawk 3.1.6 script to do a complex conversion of Zim markdown text into GtkDialog code and am stuck on the following problem…

Sample ASCII input…

[[link|label label]] [[link]] @tag more text

Commandline test to find right regex…

re="[][][][]"; echo '[[link|label label]] [[link]] @tag more text' | awk -v RE="$re" '{split($0,A,RE); printf "\n(" A[1] ")(" A[2] ")(" A[3] ")(" A[4] ")(" A[5] ")(" A[6] ")(" A[7] ")(" A[8] ")\n"}'

The regex "[][][][]" splits out the two hyperlink forms quite nicely so that's not a problem.

It would be more understandable if we could divide it in two:
"[][]" and "[][]". Each half is a bracket expression that matches a
single "[" or "]", so together they match the "[[" or "]]" we want to
split on. The order of the characters in the class has to be reversed
("]" first) to comply with class meta-character restrictions.
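
A quick way to check what the combined regex matches is to count the pieces split() produces. This is only a sketch: count_fields is a made-up helper name, and the regex is passed in with -v as in the test command above. Because each bracket expression accepts either bracket, a mixed pair such as "][" also acts as a separator.

```shell
# count_fields is a hypothetical helper: it prints how many pieces
# split() produces when the question's regex is the separator.
count_fields() {
    awk -v re='[][][][]' '{ print split($0, parts, re) }'
}

printf '%s\n' 'a[[b]]c' | count_fields   # "[[" and "]]" both separate: 3 pieces
printf '%s\n' 'a][b' | count_fields      # "][" separates as well: 2 pieces
```

If mixed pairs must not split, the explicit alternation \[\[|\]\] is the stricter choice.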

The problem is in also splitting out the "@tag" into just "tag". "tag" could be any alphanumeric text either followed by a space or the end of the string.

Executing the commandline test above yields…

()(link|label label)( )(link)( @tag more text)()()

But I need it to yield…

()(link|label label)( )(link)( )(tag)(more text)

I've tried numerous regex strings like "[][][][]|@[[:alnum:]]*" which drops the entire word and yields…

()(link|label label)( )(link)( )( more text)()

and "[][][][]|@" which yields…

()(link|label label)( )(link)( )(tag more text)()

Any ideas?

Best Answer

I don't think you can do this in a single regex, but since you're using gawk, you can use some gawk extensions:

save the separators using the fourth argument of the split() function
use the match() function with an array argument to capture the tag name without the "@"

awk '{
    n = split($0, a, /\[\[|\]\]|@[[:alnum:]]+/, s)
    for (i=1; i<=n; i++) {
        printf "(%s)", a[i]
        if (match(s[i], /^@(.+)/, m))
            printf "(%s)", m[1]
    }
    print ""
}' <<END
[[link|label label]] [[link]] @tag more text
some text with @anothertag and [[another|link]]
END

()(link|label label)( )(link)( )(tag)( more text)
(some text with )(anothertag)( and )(another|link)()
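
The fourth argument to split() is, as far as I know, only available from gawk 4.0 onward, so it may not work on the gawk 3.1.6 mentioned in the question. A possible fallback, sketched here under that assumption, is to walk the line with plain match() and the RSTART/RLENGTH variables, which need no gawk extensions. The helper name tokenize is hypothetical.

```shell
# tokenize is a hypothetical helper: it prints each piece of a line in
# parentheses, splitting on "[[", "]]" or "@word" and keeping the word.
tokenize() {
    awk '{
        rest = $0
        out = ""
        # Repeatedly locate the next separator in what is left of the line.
        while (match(rest, /\[\[|\]\]|@[[:alnum:]]+/)) {
            out = out "(" substr(rest, 1, RSTART - 1) ")"   # text before the separator
            sep = substr(rest, RSTART, RLENGTH)
            if (substr(sep, 1, 1) == "@")
                out = out "(" substr(sep, 2) ")"            # tag name without the "@"
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out "(" rest ")"                              # trailing text, possibly empty
    }'
}

printf '%s\n' '[[link|label label]] [[link]] @tag more text' | tokenize
```

For the sample line this should print the same pieces as the answer above: ()(link|label label)( )(link)( )(tag)( more text)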

Related Solutions

Split string by delimiter and get N-th element

Use cut with _ as the field delimiter and get desired fields:

A="$(cut -d'_' -f2 <<<'one_two_three_four_five')"
B="$(cut -d'_' -f4 <<<'one_two_three_four_five')"

You can also use echo and a pipe instead of a here-string:

A="$(echo 'one_two_three_four_five' | cut -d'_' -f2)"
B="$(echo 'one_two_three_four_five' | cut -d'_' -f4)"

Example:

$ s='one_two_three_four_five'

$ A="$(cut -d'_' -f2 <<<"$s")"
$ echo "$A"
two

$ B="$(cut -d'_' -f4 <<<"$s")"
$ echo "$B"
four

Beware that if $s contains newline characters, that will return a multiline string that contains the 2nd/4th field in each line of $s, not the 2nd/4th field in $s.
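
To see that caveat in action, here is a small sketch. second_field is a hypothetical helper and the two-line value is made-up sample data. cut works line by line, so every input line contributes its own second field.

```shell
# second_field is a hypothetical helper: it prints field 2 of every line in $1.
second_field() {
    printf '%s\n' "$1" | cut -d'_' -f2
}

# Made-up sample value containing a newline.
s='one_two_three
four_five_six'

second_field "$s"    # prints "two" and "five" on separate lines
```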

Bash – RegEx in bash to extract string after the first delimiter

Just match the first 2 and then capture everything after it with .*:

[[ $string =~ 2(.*) ]] && echo "${BASH_REMATCH[1]}"

Example:

$ string="ananas1kiwi2apple1banana2tree"

$ [[ $string =~ 2(.*) ]] && echo "${BASH_REMATCH[1]}"
apple1banana2tree

What's wrong with your attempt:

.* is greedy, so .*2 matches up to the last 2. Non-greedy .*? is not available in ERE, so use [^2]*2 instead.
Also, {1,} is just +

So do:

[[ $string =~ [^2]*2([[:alnum:]]+) ]]

In any case, no need to match from the start, just do:

[[ $string =~ 2([[:alnum:]]+) ]]
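
The difference between the unanchored and the greedy form can be checked side by side. This sketch requires bash (for [[ =~ ]] and BASH_REMATCH); the function names are hypothetical, and the regexes are kept in variables so the shell does not reinterpret them.

```shell
# Hypothetical helpers contrasting the two patterns (bash only).
after_first_2() {
    local re='2(.*)'            # unanchored: the match starts at the first 2
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

after_last_2() {
    local re='^.*2(.*)$'        # the greedy .* runs up to the last 2
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

string="ananas1kiwi2apple1banana2tree"
after_first_2 "$string"    # prints apple1banana2tree
after_last_2 "$string"     # prints tree
```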
