The ERE regex to split() string between a delimiter and end-of-word

Tags: awk, gawk, regular expression, split

I'm using a long gawk 3.1.6 script to do a complex conversion of Zim markdown text into GtkDialog code and am stuck on the following problem…

Sample ASCII input…

[[link|label label]] [[link]] @tag more text

Commandline test to find the right regex…

re="[][][][]"; echo '[[link|label label]] [[link]] @tag more text' | awk -v RE="$re" '{split($0,A,RE); printf "\n(%s)(%s)(%s)(%s)(%s)(%s)(%s)(%s)\n", A[1], A[2], A[3], A[4], A[5], A[6], A[7], A[8]}'

The regex "[][][][]" splits out the two hyperlink forms quite nicely so that's not a problem.

It is easier to read as two bracket expressions in a row,
"[][]" and "[][]". Each one matches a single "[" or "]", so together
they match any two consecutive bracket characters, which covers the
"[[" and "]]" we want to split on. Inside each bracket expression the
"]" must come first so that it is taken literally instead of closing
the expression.
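
A quick way to check that reading is to split a short string on one bracket expression and then on two in a row. This is only a sketch: show_split is a hypothetical helper name used for illustration, not part of the original script.

```shell
# show_split ERE: split each input line on ERE and print the pieces in parentheses.
show_split() {
    awk -v re="$1" '{
        n = split($0, A, re)
        for (i = 1; i <= n; i++) printf "(%s)", A[i]
        print ""
    }'
}

echo 'a[b]c'   | show_split '[][]'       # one bracket character per separator
echo 'a[[b]]c' | show_split '[][][][]'   # two bracket characters per separator
```

Both commands should print (a)(b)(c). Note that "[][][][]" also matches mixed pairs such as "][", which does not matter for the sample input.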

The problem is in also splitting out the "@tag" into just "tag". "tag" could be any alphanumeric text either followed by a space or the end of the string.

Executing the commandline test above yields…

()(link|label label)( )(link)( @tag more text)()()

But I need it to yield…

()(link|label label)( )(link)( )(tag)(more text)

I've tried numerous regex strings like "[][][][]|@[[:alnum:]]*" which drops the entire word and yields…

()(link|label label)( )(link)( )( more text)()

and "[][][][]|@" which yields…

()(link|label label)( )(link)( )(tag more text)()

Any ideas?

Best Answer

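I don't think you can do this in a single regex, but since you're using gawk, you can use some gawk extensions: the four-argument split(), which also stores each separator in an array, and the three-argument match(), which captures the parenthesized group. Note that the fourth argument to split() needs gawk 4.0 or later, so it will not work on the 3.1.6 mentioned in the question.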

awk '{
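    # s[i] holds the separator that followed a[i] (gawk extension)
    n = split($0, a, /\[\[|\]\]|@[[:alnum:]]+/, s)
    for (i=1; i<=n; i++) {
        printf "(%s)", a[i]
        # for an @tag separator, also print the tag without the "@"
        if (match(s[i], /^@(.+)/, m))
            printf "(%s)", m[1]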
    }
    print ""
}' <<END
[[link|label label]] [[link]] @tag more text
some text with @anothertag and [[another|link]]
END

Output:

()(link|label label)( )(link)( )(tag)( more text)
(some text with )(anothertag)( and )(another|link)()
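
If the four-argument split() is not available (as far as I know it was added in gawk 4.0, after the 3.1.6 from the question), the same result can probably be had with only match(), RSTART, RLENGTH and substr(). The following is a rough sketch under that assumption; split_links is a made-up function name, and the loop simply walks the line from left to right.

```shell
# split_links: print text pieces and @tag names in parentheses,
# using only match()/substr() instead of the four-argument split().
split_links() {
    awk '{
        rest = $0
        out = ""
        # Find the next separator, emit the text before it, then continue after it.
        while (match(rest, /\[\[|\]\]|@[[:alnum:]]+/)) {
            out = out "(" substr(rest, 1, RSTART - 1) ")"
            sep = substr(rest, RSTART, RLENGTH)
            # For an @tag separator, keep the tag name without the "@".
            if (substr(sep, 1, 1) == "@")
                out = out "(" substr(sep, 2) ")"
            rest = substr(rest, RSTART + RLENGTH)
        }
        # Whatever follows the last separator is the final piece.
        print out "(" rest ")"
    }'
}

printf '%s\n' '[[link|label label]] [[link]] @tag more text' | split_links
```

For the sample line this should give the same output as the split() version, ()(link|label label)( )(link)( )(tag)( more text).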