Perl Text Processing – Add ‘Exception’ Words to Matching Titles Script

perl, text processing

I have been using this Perl script (thanks to Jeff Schaller) to match three or more words in the title fields of two separate CSV files, as answered here:

Matching 3 or more words from fields in separate csv files

The script is:

#!/usr/bin/env perl

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
for (@csv2) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  my @titlewords = split /\s+/, $title;    #/ get words
  my $desired = 3;
  my $matched = 0;
  foreach my $csv2 (keys %csv2hash) {
    my $count = 0;
    my $value = $csv2hash{$csv2};
    foreach my $word (@titlewords) {
      ++$count if $value =~ /\b\Q$word\E\b/i;  # \Q..\E: treat the word literally, not as a regex
      last if $count >= $desired;
    }
    if ($count >= $desired) {
      print "$csv2\n";
      ++$matched;
    }
  }
  print "$_\n" if $matched;
}
close CSV1;

I have since realised that I would like the script to ignore certain words in the titles and not count them as matching words. I've been using sed to remove them before the CSV files are compared, but this isn't ideal because I lose data in the process. How can I add words to this Perl script that would be treated as exceptions? For example, suppose I wanted the script to ignore the three separate words "and", "if" and "the" when matching the titles.

Best Answer

After the line

my @titlewords = split /\s+/, $title;    #/ get words

add this code to remove the exception words from the array:

my @new;
foreach my $t (@titlewords){
    push(@new, $t) if $t !~ /^(and|if|the)$/i;
}
@titlewords = @new;
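
If the list of exceptions is likely to grow, the same filtering could be written with a hash lookup and grep instead of a regex alternation. The following is only a sketch of that idea: the names %exception and significant_words are made up for illustration and are not part of the original script, and the sample title is invented.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical exception list (not in the original script); keep entries lowercase.
my %exception = map { $_ => 1 } qw(and if the);

# Hypothetical helper: split a title on whitespace and drop the
# exception words, comparing case-insensitively.
sub significant_words {
    my ($title) = @_;
    return grep { !$exception{ lc $_ } } split /\s+/, $title;
}

# Invented sample title, only to show the effect.
my @titlewords = significant_words('The Cat and the Hat if Ever');
print join(' ', @titlewords), "\n";   # prints "Cat Hat Ever"
```

In the script above, this would amount to replacing the split line with my @titlewords = significant_words($title);. Because only the words taken from the csv1 title are counted against each csv2 title, filtering @titlewords is enough, and neither CSV file has to be modified beforehand.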