I have been using this Perl script (thanks to Jeff Schaller) to match 3 or more words in the title fields of two separate CSV files, as answered here:
Matching 3 or more words from fields in separate csv files
The script is:
#!/usr/bin/env perl
my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
for (@csv2) {
    chomp;
    my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
    $csv2hash{$_} = $title;
}
open CSV1, "<csv1" or die;
while (<CSV1>) {
    chomp;
    my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
    my @titlewords = split /\s+/, $title; #/ get words
    my $desired = 3;
    my $matched = 0;
    foreach my $csv2 (keys %csv2hash) {
        my $count = 0;
        my $value = $csv2hash{$csv2};
        foreach my $word (@titlewords) {
            ++$count if $value =~ /\b$word\b/i;
            last if $count >= $desired;
        }
        if ($count >= $desired) {
            print "$csv2\n";
            ++$matched;
        }
    }
    print "$_\n" if $matched;
}
close CSV1;
I have since realised that I would like to ignore certain words in the titles and not count them as matching words. I've been using sed to remove them before the CSV files are compared, but this isn't ideal as I lose data in the process. How can I add words that would be treated as exceptions in this Perl script? For example, say I wanted the script to ignore the three separate words "and", "if" and "the" when matching the titles, so that they would be exceptions to the rule.
Best Answer
After the line
add the code to remove the words from the array:
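A minimal sketch of how that could look, assuming the line in question is the `split` that fills `@titlewords`. The `%ignore` hash and the sample `$title` are hypothetical names and values used for illustration, not part of the original script:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical list of words to ignore (an assumption; edit to taste).
my %ignore = map { $_ => 1 } qw(and if the);

# Stand-in for the $title captured from a CSV line in the script.
my $title = "The Cat and the Hat if Possible";

my @titlewords = split /\s+/, $title;                  # get words, as in the script
@titlewords = grep { !$ignore{ lc $_ } } @titlewords;  # drop ignored words, case-insensitively

print "@titlewords\n";    # prints "Cat Hat Possible"
```

In the script itself, `%ignore` would be declared once above the `while` loop and only the `grep` line added inside it. Because the match count is driven solely by the words in `@titlewords`, filtering that array should be enough; the csv2 titles and the printed lines are left untouched, so no data is lost.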