Bash – Remove lines from a file depending on lines found in another file

bash, csv, sed, shell-script, text-processing

File file1.txt contains lines like:

/api/purchase/<hash>/index.html

For example:

/api/purchase/12ab09f46/index.html

File file2.csv contains lines like:

<hash>,timestamp,ip_address

For example:

12ab09f46,20150812235200,22.231.113.64
a77b3ff22,20150812235959,194.66.82.11

I want to filter file2.csv, removing all lines whose hash value also appears in file1.txt. That is to say:

cat file1.txt | extract <hash> | sed '/<hash>/d' file2.csv

or something like this.

It should be straightforward, but I seem unable to make it work.

Can anyone please provide a working pipeline for this task?

Best Answer

cut -d / -f 4 file1.txt | paste -sd '|' | xargs -I{} grep -v -E {} file2.csv

Explanation:

cut -d / -f 4 file1.txt selects the hashes (the fourth /-separated field) from file1.txt

paste -sd '|' joins all the hashes into a single alternation pattern, e.g. H1|H2|H3

xargs -I{} grep -v -E {} file2.csv invokes grep with that pattern as its argument; xargs replaces {} with the pattern read from standard input, so grep -v prints only the lines of file2.csv that do not match any hash

If you don't have paste, you can replace it with tr '\n' '|' | sed 's/|$//'
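For what it's worth, here is an alternative sketch using awk rather than the grep pipeline above (my own variant, not part of the accepted answer). It assumes the hash is always the fourth /-separated field in file1.txt and the first comma-separated field in file2.csv:

# Remember the hashes from file1.txt, then print only the file2.csv lines
# whose first CSV field is not one of them. Hashes are compared as literal
# strings, and only against the hash column.
awk -F, '
    NR == FNR { split($0, parts, "/"); seen[parts[4]] = 1; next }  # file1.txt: store the 4th /-separated field
    !($1 in seen)                                                  # file2.csv: keep lines whose hash was not stored
' file1.txt file2.csv

Because the comparison is a literal match restricted to the hash column, this avoids surprises if a hash ever contains regex metacharacters or happens to occur inside a timestamp or IP address. With the sample data above, only the a77b3ff22 line is printed.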
