Bash – Strategy to extract movie names from this non-uniform dataset

Tags: awk, bash, grep, regular expression, sed

I am working on a movie database problem to improve my regular expression skills, and this is the problem I'm running into. My dataset looks like this:

Movie Name <variable spaces and tabs> Year
Movie1 (name can contain single or multiple spaces) <variable whitespace: \t+, multiple spaces, or a single space> Year1
Movie2 (name can contain single or multiple spaces) <variable whitespace: \t+, multiple spaces, or a single space> Year2
Movie3 (name can contain single or multiple spaces) <variable whitespace: \t+, multiple spaces, or a single space> Year3
Movie4 (name can contain single or multiple spaces) <variable whitespace: \t+, multiple spaces, or a single space> Year4

I want to extract names of all of the movies. These are the challenges I'm facing while doing it:

1: The delimiter is variable. If it were a colon or something unique, I would have used an awk command to extract them, like this: awk -F 'separator' '{print $1}'
In this case, it can be a single space, two or more spaces, or a combination of tabs and spaces.

2: For rows where the delimiter is \t, I can split on \t, because tabs do not appear in movie names. But what if the delimiter is one or two spaces? Those can very easily appear in a movie's name. In those cases, I don't know what to do.

I know the question is very rigid and specific. But as I described earlier, I'm very much blocked here. I can't think of any way around this problem.

Is there any combination of grep/sed/awk with reg-ex that can be used to achieve the objective?

Best Answer

Using gawk and assuming that the year always ends the record:

awk -F"[0-9]{4}$" '{print $1}' movies
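Because the year itself is the field separator in the command above, $1 still carries the whitespace that sat between the title and the year. A minimal sketch of a sed variant that strips that whitespace together with the year, under the same assumption that every record ends in a four-digit year (the function name extract_titles and the sample titles are made up for illustration):

```shell
# Hypothetical helper: print only the title part of each "Title <blanks> Year" line.
# Assumes every record ends in a four-digit year, optionally followed by blanks.
extract_titles() {
  # Remove: the blanks before the year, the 4-digit year, any trailing blanks.
  sed -E 's/[[:space:]]+[0-9]{4}[[:space:]]*$//' "$@"
}

# Sample data using a tab, several spaces, and a single space as the delimiter.
printf 'Blade Runner\t1982\nThe  Matrix   1999\n2001: A Space Odyssey 1968\n' | extract_titles
```

Since the pattern is anchored at the end of the line, digits and spaces inside the title itself are left alone; only the final whitespace-plus-year run is removed. With a file, the call would be extract_titles movies.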

Related Solutions

How to strip multiple spaces to one using sed

The use of grep is redundant; sed can do the same. The problem is the use of *, which also matches zero spaces; you have to use \+ instead:

iostat | sed -n '/hdisk1/s/ \+/ /gp'

If your sed does not support the \+ metacharacter, then use:

iostat | sed -n '/hdisk1/s/  */ /gp'
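To try the portable form without iostat, here is a small sketch that wraps it in a function and feeds it a made-up line (the name squeeze_spaces and the sample text are illustrative, not real iostat output):

```shell
# Hypothetical helper: collapse every run of spaces into a single space.
squeeze_spaces() {
  # Two spaces before '*': one literal space followed by zero or more spaces,
  # which means "one or more spaces" in portable basic-regex syntax.
  sed 's/  */ /g'
}

# Made-up input line with uneven spacing.
printf 'hdisk1     0.0    12.5\n' | squeeze_spaces
```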

Ubuntu – sed regex issue

I don't have any problem with [[:space:]]. Here's a really silly little example showing the mixed-replacement of spaces and tabs:

$ echo -e 'A \t \t B' | sed 's/A[[:space:]]*B/WORKED/'
WORKED

You can also use \s (a GNU extension), which is often preferable with big sed strings because it's much shorter:

$ echo -e 'A \t \t B' | sed 's/A\s*B/WORKED/'
WORKED

Anyway, I think your actual problem is escaping those troublesome single quotes. I find the easiest way is to break out of the single quote string and have a double-quoted single quote and then (if needed) go back into the single quote line. Bash will automatically concatenate this all up for you.

$ echo 'This is a nice string and this is a single quote:'"'"' Nice?'
This is a nice string and this is a single quote:' Nice?

So all the space we saved with \s is about to get destroyed by this mega-quote situation:

$ echo -e '$RELEASE  \t = '"'"'1234'"'"';' |\
  sed 's/$RELEASE\s*=\s*'"'"'[0-9]*'"'"'\;/REPLACEMENT/'
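One way to tame the quoting, offered as a sketch rather than the only option: keep the single quote in a shell variable and write the sed script in double quotes. The names q and replace_release are hypothetical, and [[:space:]] stands in for \s so the pattern does not rely on a GNU extension.

```shell
# Hold the single quote in a variable (q is a hypothetical name).
q="'"

# Hypothetical helper: replace a  $RELEASE = '<digits>';  assignment.
replace_release() {
  # \$ inside double quotes hands sed a literal $; in a basic regex it is an
  # ordinary character here because it is not at the end of the pattern.
  sed "s/\$RELEASE[[:space:]]*=[[:space:]]*${q}[0-9]*${q};/REPLACEMENT/"
}

# Same sample input as above: spaces and a tab around the equals sign.
printf '$RELEASE  \t = %s1234%s;\n' "$q" "$q" | replace_release
```

The trade-off is that every $ meant for sed now has to be escaped from the shell, so this suits scripts with many quotes and few dollar signs.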

Of course there is an argument that (because this looks like a PHP script) you might be able to assume that if the line starts with $RELEASE followed by whitespace or an equals sign, you can just replace the whole line. Not always true obviously (the entire app could be one hideous line) but it makes your search and replace more palatable. Note that \s is not recognised inside a bracket expression, so use [[:space:]] there:

sed 's/$RELEASE[[:space:]=]*.*/REPLACEMENT/'

And yes, general sed usage rules apply. Don't read a file with a stream editor (like sed) and redirect the output back into that same file. If it works at all, you could easily knacker the file.

Either use the -i argument (works for sed) or pipe into an application like sponge (which is like a delayed output):

sed -i '...' file
sed '...' file | sponge file
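If neither -i nor sponge is available, the same "delayed output" idea can be sketched with a temporary file. This is only an illustration; the function name edit_in_place is made up, and it assumes mktemp is installed.

```shell
# Hypothetical helper: apply a sed script to a file via a temporary copy.
# Usage: edit_in_place 'sed-script' file
edit_in_place() {
  eip_script=$1
  eip_file=$2
  eip_tmp=$(mktemp) || return 1
  # Write the result to the temp file first; the original is only
  # overwritten after sed has finished reading it successfully.
  if sed "$eip_script" "$eip_file" > "$eip_tmp"; then
    cat "$eip_tmp" > "$eip_file"
    eip_status=$?
  else
    eip_status=1
  fi
  rm -f "$eip_tmp"
  return "$eip_status"
}

# Demo on a throwaway file.
demo=$(mktemp)
printf 'one   two\n' > "$demo"
edit_in_place 's/  */ /g' "$demo"
cat "$demo"
rm -f "$demo"
```

Copying the result back with cat, rather than moving the temp file over the original, keeps the original file's permissions and ownership.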
