Bash – Strategy to extract movie names from this non-uniform dataset

awk, bash, grep, regular expression, sed

I am working on a movie database problem to improve my regular expression skills, and this is the problem I'm running into. My dataset looks like this:

Movie Name (variable spaces and tabs) Year
Movie1 (name can contain single or multiple spaces) (separator: could be \t+, multiple spaces, or a single space) Year1
Movie2 (name can contain single or multiple spaces) (separator: could be \t+, multiple spaces, or a single space) Year2
Movie3 (name can contain single or multiple spaces) (separator: could be \t+, multiple spaces, or a single space) Year3
Movie4 (name can contain single or multiple spaces) (separator: could be \t+, multiple spaces, or a single space) Year4

I want to extract the names of all the movies. These are the challenges I'm facing:

1: The delimiter is variable. If it were a colon or some other unique character, I would have used an awk command to extract the names, like this: awk -F 'separator' '{print $1}'
In this case, it can be a single space, two or more spaces, or a combination of tabs (\t) and spaces.

2: For rows where the delimiter is \t, I can split on \t, because a tab never appears in a movie name. But what if the delimiter is one space or two spaces? Those can very easily appear in the movie's name. In those cases, I don't know what to do.

I know the question is very rigid and specific. But as I described earlier, I'm very much blocked here. I can't think of any way around this problem.

Is there any combination of grep/sed/awk with regular expressions that can be used to achieve this?

Best Answer

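Using gawk, and assuming that the year always ends the record, use the trailing four-digit year as the field separator (the printed name keeps whatever whitespace preceded the year):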

awk -F"[0-9]{4}$" '{print $1}' movies
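The awk command above leaves the separator whitespace at the end of each name. If that matters, one possible variation is to delete the separator and the year together with sed. This is only a sketch under the same assumption (every record ends with a four-digit year, possibly followed by stray whitespace); the function name extract_titles and the sample lines are made up for illustration and are not from the asker's data. It relies on sed -E, which GNU and BSD sed both accept.

```shell
#!/bin/sh
# Hypothetical helper: remove the trailing separator (any mix of spaces and
# tabs) plus the 4-digit year from each line, leaving only the movie name.
# Assumption: every record ends with a 4-digit year.
extract_titles() {
    sed -E 's/[[:space:]]+[0-9]{4}[[:space:]]*$//' "$@"
}

# Illustrative input covering a tab, a single space, and a mixed separator.
# Note that digits inside a name (e.g. "2001:") are untouched, because the
# pattern is anchored to the end of the line.
printf 'Blade Runner\t1982\nThe  Matrix 1999\n2001: A Space Odyssey   \t 1968\n' |
    extract_titles
```

To run it against a file instead of standard input, pass the file name as an argument, for example extract_titles movies. Spaces inside a name are preserved because only the last run of whitespace before the year is removed.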