How to delete all occurrences of a list of words from a text file

grep, sed, text-processing

I have a file containing a list of words. I want to remove all occurrences of all the words in this file from a big text file.

Example:

File 1

queen
king

Text file sample

Both the king and queen are monarchs. Will the queen live? Queen, it is!

This is what I have tried:

sed -i 's/queen/ /g' page.txt
sed -i 's/Queen/ /g' page.txt

Output

Both the and are monarchs. Will the live? , it is!

The list of words I have is big (over 50000 words). How can I do this without having to specify the pattern in the command line?

Best Answer

For your actual use case I recommend terdon's answer using Perl.

However, the simple version, which does not handle words that are substrings of other words (e.g. it will also remove "king" from "hiking"), is to use one sed command to generate the script that a second sed instance then runs on your actual file.

In this case, with wordfile containing "king" and "queen" and textfile containing your text:

sed -e "$(sed 's:.*:s/&//ig:' wordfile)" textfile

Note that the i ("ignore case") flag of the s command is a GNU extension, not part of standard (POSIX) sed.
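
With a list of over 50000 words, the generated script may be too long to pass as a single command-line argument. A minimal sketch of one way around that, assuming GNU sed, is to write the generated commands to a temporary file and run them with -f. The sketch also wraps each word in \b (another GNU extension) so that "king" is not removed from "hiking". The function name remove_words is hypothetical, and the word file is assumed to hold one plain word per line, with no /, & or regex metacharacters.

```shell
#!/bin/sh
# remove_words is a hypothetical helper name, not part of the original answer.
# Assumes GNU sed (\b and the I flag are GNU extensions) and a word file
# with one plain word per line (no /, &, or regex metacharacters).
remove_words() {
    wordfile=$1
    textfile=$2
    script=$(mktemp) || return 1
    # Turn each word into a command such as: s/\bking\b//Ig
    sed 's:.*:s/\\b&\\b//Ig:' "$wordfile" > "$script"
    # Run every generated command over the text in a single sed process.
    sed -f "$script" "$textfile"
    rm -f "$script"
}
```

For example, remove_words wordfile textfile > cleaned.txt writes the filtered text to a new file and leaves the original untouched. Each line of text is still checked against every generated command, so this may be slow on very large inputs.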

Related Solutions

Shell – How Do I Remove Duplicate Words With Suffixes

You might need a word stemming algorithm for this. For example, Lingua::Stem is a word stemmer module written in Perl.

If this fits your needs, you would need to install Lingua::Stem via CPAN. Then, the following Perl script would do the job:

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Stem;

# Read lines into array
chomp(my @words = <STDIN>);

# Stem in English
my $s = Lingua::Stem->new( -locale => 'en' );
my $stemmed = $s->stem_in_place( @words );

# Output the stemmed words with duplicates removed
# (sorting puts identical words next to each other)
my $oldw = '';
foreach my $w (sort @$stemmed) {
    print "$w\n" unless $w eq $oldw;
    $oldw = $w;
}

Example output:

$ ./stem.pl < inputfile
curl
curler
iron
pan
park
parker
railroad

This deviates slightly from your example output because the stemmer's interpretation of word suffixes differs from yours in some cases. If this affects only a moderate number of words in your application, you can define exceptions with the add_exceptions method:

...
$s->add_exceptions( { "parker" => "park", "curler" => "curl" } );
$stemmed = $s->stem_in_place( @words );
...

Shell – Insert new line after all occurrences of a pattern

You can try this with GNU sed or other sed implementations that now also treat \n as newline in the replacement:

sed 's|optype[^>]*/>|&\n|g' test.pmml

POSIXly:

sed 's|optype[^>]*/>|&\
|g' test.pmml
