Regex & Sed/Perl: Match word that ISN’T preceded by another word

perl, regular expression, sed

I'd like to use sed or perl to replace all occurrences of a word that doesn't have a certain word in front of it.

For example, I have a text file that contains a plot of a movie and I want to replace all occurrences of a character's last name with their first name, but only if their first name doesn't come immediately before their last name.

Sample text might look like this:

John Smith and Jane Johnson talk about Smith's car.

I want it to look like this:

John Smith and Jane Johnson talk about John's car.

If I just do sed 's/Smith/John/g' file, then I would have:

John John and Jane Johnson talk about John's car.

The first name that comes before the last name will always be the same. I don't have to deal with John Smith and Frank Smith. I just need a way to match Smith that doesn't have John preceding it.

Best Answer

This is easy in any language whose regular expressions support lookbehind. Perl is the obvious first choice:

perl -pe 's/(?<!John\W)Smith/John/g' <<< "John Smith and Jane Johnson talk about Smith's car."

The weak point is input with more than one non-word character between “John” and “Smith”. Unfortunately, adding a quantifier such as + to \W inside the lookbehind raises a “Variable length lookbehind not implemented” error.

Related Solutions

Why isn’t this sed regex matching

The 1 in the number 10 matches [^049] so it's deleted.
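The command from that question is not reproduced here, so the following is only an assumed reconstruction of the effect described: a negated bracket expression matches any single character outside the set, including one digit of a longer number. The function name strip_other_chars is hypothetical.

```shell
# Assumed reconstruction: delete every character that is not 0, 4, or 9.
strip_other_chars() {
  printf '%s\n' "$1" | sed 's/[^049]//g'
}

strip_other_chars 10
# prints: 0   (the 1 is outside the set, so it is deleted; the 0 stays)
```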

Shell – How to exchange words in a filename using the shell

Since the description part of the filename can contain the pattern - (a hyphen between two spaces), you can change that to some symbol that doesn't occur in the description part. I chose £, but that's purely arbitrary.

rename 's/ - /£/' *
rename 's/([^,]*), ([^£]*)£/$2 $1 - /' *

s/ - /£/ tells rename to replace the first instance of " - " it finds with £. The second command is a bit more complicated. Parentheses (()) group part of the pattern, so whatever matches inside the first set of parentheses can be referred to later as $1 (all the way up to $9). [^,] means 'any character except ,'; * means 'zero or more of the previous character'; so [^,]* means 'zero or more of any character except a comma'. Since the * quantifier is greedy by default, it matches the longest string possible.

The rest, I think, follows pretty naturally.

If you're not sure whether a symbol appears anywhere in the filename, just run:

printf '%s\n' *£*

If you have perl-rename:

rename 's/([^,]*), (.*) - /$2 $1 - /' *

This will break if the description part of a filename contains " - " (that is, a hyphen between two spaces).

You should test this with the -n flag, which won't rename any files, but will show what it would have done:

rename -n 's/([^,]*), (.*) - /$2 $1 - /' *
