Bash – Remove lines from a file depending on lines found in another file

bash, csv, sed, shell-script, text-processing

File file1.txt contains lines like:

/api/purchase/<hash>/index.html

For example:

/api/purchase/12ab09f46/index.html

File file2.csv contains lines like:

<hash>,timestamp,ip_address

For example:

12ab09f46,20150812235200,22.231.113.64
a77b3ff22,20150812235959,194.66.82.11

I want to filter file2.csv, removing all lines whose hash value also appears in file1.txt. That is to say:

cat file1.txt | extract <hash> | sed '/<hash>/d' file2.csv

or something like this.

It should be straightforward, but I seem unable to make it work.

Can anyone please provide a working pipeline for this task?

Best Answer

cut -d / -f 4 file1.txt | paste -sd '|' | xargs -I{} grep -v -E {} file2.csv

Explanation:

cut -d / -f 4 file1.txt selects the hashes (the fourth /-separated field) from file1.txt

paste -sd '|' joins all the hashes into a single alternation pattern, e.g. H1|H2|H3

xargs -I{} grep -v -E {} file2.csv invokes grep with that pattern as its argument; xargs replaces {} with the pattern read from standard input, so grep -v prints only the lines of file2.csv that do not match any hash

If you don't have paste, you can replace it with tr '\n' '|' | sed 's/|$//'
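For what it's worth, here is an alternative sketch using awk rather than the grep pipeline above (my own variant, not part of the accepted answer). It assumes the hash is always the fourth /-separated field in file1.txt and the first comma-separated field in file2.csv:

# Remember the hashes from file1.txt, then print only the file2.csv lines
# whose first CSV field is not one of them. Hashes are compared as literal
# strings, and only against the hash column.
awk -F, '
    NR == FNR { split($0, parts, "/"); seen[parts[4]] = 1; next }  # file1.txt: store the 4th /-separated field
    !($1 in seen)                                                  # file2.csv: keep lines whose hash was not stored
' file1.txt file2.csv

Because the comparison is a literal match restricted to the hash column, this avoids surprises if a hash ever contains regex metacharacters or happens to occur inside a timestamp or IP address. With the sample data above, only the a77b3ff22 line is printed.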
