Perl Text Processing – Add ‘Exception’ Words to Matching Titles Script

perl, text processing

I have been using this Perl script (thanks to Jeff Schaller) to match three or more words in the title fields of two separate CSV files, as answered here:

Matching 3 or more words from fields in separate csv files

The script is:

#!/usr/bin/env perl

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
for (@csv2) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  my @titlewords = split /\s+/, $title;    #/ get words
  my $desired = 3;
  my $matched = 0;
  foreach my $csv2 (keys %csv2hash) {
    my $count = 0;
    my $value = $csv2hash{$csv2};
    foreach my $word (@titlewords) {
      ++$count if $value =~ /\b$word\b/i;
      last if $count >= $desired;
    }
    if ($count >= $desired) {
      print "$csv2\n";
      ++$matched;
    }
  }
  print "$_\n" if $matched;
}
close CSV1;

I have since realised that I would like to ignore certain words in the titles and not count them as matching words. I've been using sed to remove them before the CSV files are compared, but this isn't ideal as I lose data in the process. How can I add words to this Perl script that would be treated as exceptions? For example, say I wanted the script to ignore the three separate words "and", "if" and "the" when matching the titles, so that they would be exceptions to the rule.

Best Answer

After the line

my @titlewords = split /\s+/, $title;    #/ get words

add the code to remove the words from the array:

my @new;
foreach my $t (@titlewords){
    push(@new, $t) if $t !~ /^(and|if|the)$/i;
}
@titlewords = @new;

Related Solutions

How to search for the word stored in the hold space with sed

That was a hard one. Assuming you have a file like this:

$ cat file
word
line with a word and words and wording wordy words.

Where:

Line 1: is the search pattern that should be kept in the hold space; every occurrence of it gets quoted as `word`.
Line 2: is the line to search and replace globally.

The sed command:

sed -n '1h; 2{x;G;:l;s/^\([^\n]\+\)\n\(.*[^`]\)\1\([^`]\)/\1\n\2`\1`\3/;tl;p}' file

Explanation:

1h; save the first line to the hold space (this is what we want to search for).
- hold space contains: word
2{...} applies to the second line.
x; exchange the pattern space and the hold space.
G; append the hold space to the pattern space. In the pattern space we have now:

word # I will call this line the "pattern line" from now on
line with a word and words and wording wordy words.

:l; set a label called l as point for later.
s/// do the actual search/replace in the pattern space mentioned above:
- ^\([^\n]\+\)\n matches, from the beginning of the pattern space (^), all characters which are not a newline [^\n] (one or more times \+), up to a newline \n. The group is stored in the back-reference \1. It contains the "pattern line".
- \(.*[^`]\) matches any characters .* followed by one character which is not a backtick [^`]. This is stored in \2. Because .* is greedy, \2 now contains: line with a word and words and wording wordy, that is everything up to the last occurrence of word, because...
- \1 is the next search term (the back-reference \1, word), hence what the "pattern line" contains.
- \([^`]\) this is followed by another character which is not a backtick; saved to reference \3. If we didn't do this (and the [^`] at the end of \2 from above), we would end up in an endless loop quoting the same word again and again -> ````word````, because s/// would always be successful and tl; would always jump back to :l (see tl; further down).
- \1\n\2`\1`\3 all of the above is replaced by the back-references. The second \1, wrapped in backticks, is the one we quote (note the first reference is the "pattern line").
tl; if the s/// was successful (we replaced something), jump to the label called l and start again until there is nothing more to search and replace. This is the case when all occurrences of word are replaced/quoted.
p; when all is done, print the altered line (pattern space).
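
To watch the loop work, one can print the pattern space after every pass. This is a small sketch that assumes GNU sed, as the original command already does (\+ and \n inside brackets); the helper name trace_quoting and the extra p placed before tl are additions for illustration only.

```shell
#!/bin/sh
# trace_quoting is a hypothetical helper: it reads the two-line input on stdin
# and prints line 2 of the pattern space after every pass of the :l loop.
trace_quoting() {
  sed -n '1h; 2{x;G;:l;s/^\([^\n]\+\)\n\(.*[^`]\)\1\([^`]\)/\1\n\2`\1`\3/;p;tl}' |
    awk 'NR % 2 == 0'   # each p prints two lines; keep the text line only
}

printf '%s\n' 'word' 'line with a word and words and wording wordy words.' |
  trace_quoting
```

With the sample input this prints six lines. The backticks appear from right to left, one occurrence per pass, because the greedy .* always finds the last unquoted occurrence first; the sixth line repeats the fifth, since p also runs after the final pass in which s/// no longer matches.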

The output:

$ sed -n '1h; 2{x;G;:l;s/^\([^\n]\+\)\n\(.*[^`]\)\1\([^`]\)/\1\n\2`\1`\3/;tl;p}' file
word
line with a `word` and `word`s and `word`ing `word`y `word`s.

Count unique associated values in awk (or perl)

With awk:

awk 'function p(){print l,c,d; delete a; delete b; c=d=0} 
  NR!=1&&l!=$1{p()} ++a[$2]==1{c++} ++b[$3]==1{d++} {l=$1} END{p()}' file

Explanation:

function p(): defines a function called p(), which prints the values and deletes the used variables and arrays.
NR!=1&&l!=$1 if it's not the first line and the variable l differs from the first field $1 (a new group starts), then run the p() function.
++a[$2]==1{c++} if the incremented value of the element of the a array with index $2 equals 1, then that value is seen for the first time, so increment the c variable. The ++ before the element returns the new value, so the increment happens before the comparison with 1.
++b[$3]==1{d++} the same as above but with the 3rd field and the d variable.
{l=$1} sets l to the first field (for the comparison in the next iteration, above)
END{p()} after the last line is processed, awk has to print the values for the last block
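
To try the program without the original data, here is a small sketch with made-up input. The wrapper name count_unique and the sample lines are assumptions for illustration; the only requirement taken from the code itself is that lines sharing the same first field are adjacent, since each line is compared with the previous one.

```shell
#!/bin/sh
# count_unique is a hypothetical wrapper around the awk program above.
# It expects three whitespace-separated fields per line, grouped by field 1.
count_unique() {
  awk 'function p(){print l,c,d; delete a; delete b; c=d=0}
    NR!=1&&l!=$1{p()} ++a[$2]==1{c++} ++b[$3]==1{d++} {l=$1} END{p()}'
}

# Made-up sample data (not the input from the original question)
printf '%s\n' \
  'apple red small' \
  'apple red large' \
  'apple green small' \
  'banana yellow long' \
  'banana yellow short' |
  count_unique
```

For this sample the result is "apple 2 2" and "banana 1 2": apple has two distinct second fields (red, green) and two distinct third fields (small, large), while banana has one and two.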

With your given input the output is:

apple 3 2
banana 4 5
cucumber 2 3
