Bash shell script to locate and remove substring within a filename

bashcommand lineuti

I am trying to write a bash shell script (which is called by an Automator Action) to rename TV show DVD rips that I have named badly over the years. I want to remove part of the text in the filenames. I want to remove the text that appears after a specific series of characters that I know will always appear in the filename. But I do not know how many characters will appear before or after the known series of characters. I also don't know if the before or after text will contain multiple periods or dashes. An example would probably help:

The.Big.Bang.Theory.S01E01.xxxxxxxx.mp4

I know that each file will always contain a string in the format of SxxExx where the x's are always numbers. But I do not know what the numbers will be. I want to get the filename up to and including the SxxExx string and the file extension but strip out everything else. So for the above example I would end-up with:

The.Big.Bang.Theory.S01E01.mp4

I have tried using bash's built-in string replacement commands. I thought the expr index command would give me the integer start point of the SxxExx string and then I could use ${filename:offset:length} to extract only the required part of the filename (i already know the extension so that can be re-added). But it seems the OS X version of expr doesn't include the index functionality. I have only scripted in Basic and LotusScript before. In those environments this would have been fairly easy using commands such as 'Like' and 'Instr' or 'Mid'. But in bash I just can't figure it out. I have spent hours googling trying to understand how to use regular expressions to locate the 'SxxExx' substring in the filename but I just can't figure it out. I hope some clever UNIX scripters will be able to help me!

Best Answer

ls | perl -nl -e '/(.*)(S[0-9]+E[0-9]+).*(\.mp4)/ && print "mv \"" . $_ . "\" \"". $1 . $2 . $3 . "\""'

How does this work? First ls outputs the list of files, one per line, like so:

The.Big.Bang.Theory.S01E01.xxxxxxxx.mp4
The.Big.Bang.Theory.S01E02.somecrap.mp4
The.Big.Bang.Theory.S04E12.otherjunk.mp4

Then perl -nl splits this into lines, feeding each to the regex, much like awk*. The regex captures 3 groups (denoted by parentheses), first the bit before SxxEyy, then that, then the file suffix. It then simply assembles a mv command suitable for renaming the files, like so:

mv "The.Big.Bang.Theory.S01E01.xxxxxxxx.mp4" "The.Big.Bang.Theory.S01E01.mp4"
mv "The.Big.Bang.Theory.S01E02.somecrap.mp4" "The.Big.Bang.Theory.S01E02.mp4"
mv "The.Big.Bang.Theory.S04E12.otherjunk.mp4" "The.Big.Bang.Theory.S04E12.mp4"

This can then be inspected and once you're satisfied it does what you want, piped into a shell by appending: | sh.

*awk would normally be a good tool to use for this, but sadly only GNU awk supports regex capture groups and Mac OS X doesn't include gawk by default.