Bash shell script to locate and remove substring within a filename

Tags: bash, filenames, shell-script, string

I am trying to write a bash shell script on Mac OS X 10.6, called by an Automator action, to rename TV show DVD rips that I have named badly over the years. I want to remove the part of each filename that follows a specific sequence of characters that I know will always be present. However, I do not know how many characters will appear before or after that sequence, or whether the surrounding text will contain multiple periods or dashes. An example would probably help:

The.Big.Bang.Theory.S01E01.xxxxxxxxxxx.mp4

I know that each file will always contain a string in the format SxxExx, where each x is a digit, but I do not know what the digits will be. I want to keep the filename up to and including the SxxExx string, plus the file extension, and strip out everything else. So for the above example I would end up with:

The.Big.Bang.Theory.S01E01.mp4

I have tried using bash's built-in string replacement commands. I thought the expr index command would give me the start position of the SxxExx string, and then I could use ${filename:offset:length} to extract only the required part of the filename (I already know the extension, so that can be re-added). But it seems the OS X version of expr doesn't include the index functionality. I have only scripted in Basic and LotusScript before, where this would have been fairly easy using commands such as 'Like', 'Instr' or 'Mid'. I have spent hours googling how to use regular expressions to locate the 'SxxExx' substring in the filename, but I just can't figure it out. I hope some clever UNIX scripters will be able to help me!

Best Answer

Try this:

newname=$(echo "$filename" | sed -e 's/\(S[0-9][0-9]E[0-9][0-9]\).*\.mp4/\1.mp4/')
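As a quick check, the same sed expression can be run against the sample filename from the question. This is only a sketch: the filename value is the example given above, and printf is used in place of echo purely as a defensive habit.

```shell
#!/bin/bash
# Sample name taken from the question.
filename='The.Big.Bang.Theory.S01E01.xxxxxxxxxxx.mp4'

# Keep the SxxExx tag (group 1), drop everything up to the extension.
newname=$(printf '%s\n' "$filename" |
    sed -e 's/\(S[0-9][0-9]E[0-9][0-9]\).*\.mp4/\1.mp4/')

echo "$newname"   # prints The.Big.Bang.Theory.S01E01.mp4
```

If a name contains no SxxExx tag, the substitution does not match and sed passes the name through unchanged.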

The regular expression is:

  • start a group ( \( )
  • match SXXEXX, where each X is a digit from 0 to 9
  • end group ( \) )
  • match zero or more of any character (except a newline)
  • match an explicit string ( .mp4 )

In the replacement expression:

  • replace with string matched in first group ( \1 )
  • replace with explicit string ( .mp4 )
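
The same idea can also be sketched without sed, using bash's own [[ ... =~ ... ]] regex operator and the BASH_REMATCH array. This is an alternative to the command above, not part of it. The function name strip_after_episode is hypothetical, and the sketch assumes the extension is whatever follows the last period, so it is not limited to .mp4.

```shell
#!/bin/bash
# Hypothetical helper: keeps everything up to and including the SxxExx
# tag, plus the final extension, and drops the text in between.
strip_after_episode() {
    local filename=$1
    # The pattern lives in a variable so it is not quoted inside [[ =~ ]].
    # Group 1: name through SxxExx, group 2: junk, group 3: extension.
    local re='^(.*S[0-9][0-9]E[0-9][0-9])(.*)(\.[^.]+)$'
    if [[ $filename =~ $re ]]; then
        printf '%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[3]}"
    else
        # No SxxExx tag followed by an extension: leave the name as it is.
        printf '%s\n' "$filename"
    fi
}

strip_after_episode 'The.Big.Bang.Theory.S01E01.xxxxxxxxxxx.mp4'
# prints The.Big.Bang.Theory.S01E01.mp4
```

Because the first group is greedy, a name containing more than one SxxExx tag is cut after the last one. The result could then be passed to mv to do the actual rename.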