Shell – Select Lines from Text File with IDs Listed in Another File

csv shell

I use grep, awk, and sort a lot in my Unix shell to work with medium-sized (roughly 10M-100M lines) tab-separated text files. In this respect the Unix shell is my spreadsheet.

But I have one huge problem: selecting records given a list of IDs.

Given a table.csv file with the format id\tfoo\tbar... and an ids.csv file with a list of IDs, select only those records from table.csv whose id is present in ids.csv.

This is similar to https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids but with shell tools, not Perl.

grep -F obviously produces false positives if IDs are variable width.
join is a utility I could never figure out. First of all, it requires alphabetic sorting (my files are usually sorted numerically), but even then I can't get it to work without it complaining about incorrect order and skipping some records. So I don't like it.
grep -f against a file of ^id\t patterns is very slow when the number of IDs is large.
awk is cumbersome.

Are there any good solutions for this? Any specific tools for tab-separated files? Extra functionality will be most welcome too.

UPD: Corrected sort -> join

Best Answer

I guess you meant grep -f, not grep -F, but you actually need a combination of both, plus -w:

grep -Fwf ids.csv table.csv

The reason you were getting false positives is (I guess, since you did not explain) that if one ID is contained in another, both will be printed. -w removes this problem, and -F makes sure your patterns are treated as fixed strings, not regular expressions. From man grep:

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)
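To see what -w changes, here is a small sketch on made-up data. The temporary directory, the file contents, and the variable names loose and strict are assumptions for illustration, not the asker's real files:

```shell
# Hypothetical sample data: one ID, four table rows.
tmp=$(mktemp -d)
printf '12\n' > "$tmp/ids.csv"
printf '12\tfoo\n123\tbar\n312\tbaz\n7\t12\n' > "$tmp/table.csv"

# Without -w: every line containing the substring "12" is selected.
loose=$(grep -Ff "$tmp/ids.csv" "$tmp/table.csv")

# With -w: "12" must stand alone as a word, so 123 and 312 drop out.
# The row "7<TAB>12" still matches, because -w does not look at columns.
strict=$(grep -Fwf "$tmp/ids.csv" "$tmp/table.csv")

printf '%s\n' "$strict"
```

The row with 12 in its second column shows the remaining limitation: -w only checks word boundaries, not which field the ID sits in.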

If your false positives are because an ID can be present in a non-ID field, loop through your file instead:

while IFS= read -r pat; do grep -w "^$pat" table.csv; done < ids.csv

or, more compactly (this still runs one grep per ID):

xargs -I {} grep -w "^{}" table.csv < ids.csv

Personally, I would do this in perl though:

perl -lane 'BEGIN{open(A,"ids.csv"); while(<A>){chomp; $k{$_}++}} 
            print $_ if defined($k{$F[0]}); ' table.csv
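
The same hash-lookup idea can probably be written in awk as well, if Perl is not an option. This is a minimal sketch, not a tuned solution: the function name select_by_id is hypothetical, and it assumes the ID is the first tab-separated field and that ids.csv holds exactly one ID per line.

```shell
# Hypothetical helper: select_by_id IDS_FILE TABLE_FILE
# Prints the rows of TABLE_FILE whose first tab-separated field
# appears as a line in IDS_FILE. Each file is read once.
select_by_id() {
    awk -F'\t' '
        # While reading the first file, remember every ID as an array key.
        FILENAME == ARGV[1] { ids[$0]; next }
        # While reading the second file, print rows whose first field is a known ID.
        $1 in ids
    ' "$1" "$2"
}

# Usage: select_by_id ids.csv table.csv
```

Because the IDs are compared as whole strings against the first field only, this avoids both kinds of false positive, and neither file needs to be sorted. The IDs are held in memory, so the size of ids.csv is the limiting factor rather than the size of table.csv.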