Shell – Select Lines from Text File with IDs Listed in Another File

csv shell

I use grep, awk, and sort a lot in my Unix shell to work with medium-sized (around 10M-100M lines) tab-separated text files. In this respect, the Unix shell is my spreadsheet.

But I have one huge problem: selecting records given a list of IDs.

Given a table.csv file with the format id\tfoo\tbar... and an ids.csv file with a list of IDs, I want to select only the records from table.csv whose id is present in ids.csv.

It is similar to https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids but with shell, not Perl.

grep -F obviously produces false positives if IDs are variable width.
join is a utility I could never figure out. First of all, it requires alphabetic sorting (my files are usually numerically sorted), but even then I can't get it to work without it complaining about incorrect order and skipping some records. So I don't like it.
grep -f against a file of ^id\t patterns is very slow when the number of IDs is large.
awk is cumbersome.

Are there any good solutions for this? Any specific tools for tab-separated files? Extra functionality will be most welcome too.

UPD: Corrected sort -> join

Best Answer

I guess you meant grep -f, not grep -F, but you actually need a combination of both, plus -w:

grep -Fwf ids.csv table.csv

I guess (you did not explain) that you were getting false positives because one ID can be contained in another, in which case the lines for both are printed. -w removes this problem, and -F makes sure your patterns are treated as strings, not regular expressions. From man grep:

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)
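
To make the effect of -w concrete, here is a small sketch on made-up data (the rows and the temporary directory are assumptions for illustration, only the file names come from the question). It compares -F alone with -F plus -w:

```shell
# Hypothetical sample data; only the file names mirror the question.
tmp=$(mktemp -d)
printf '1\tfoo\tbar\n10\tbaz\tqux\n21\tabc\t1\n' > "$tmp/table.csv"
printf '1\n' > "$tmp/ids.csv"

# -F alone: "1" matches as a substring, so all three rows come back.
fixed_only=$(grep -Ff "$tmp/ids.csv" "$tmp/table.csv")

# -F with -w: "10" and "21" no longer match, but the row whose last
# field is "1" still does, because -w does not care which column matched.
fixed_word=$(grep -Fwf "$tmp/ids.csv" "$tmp/table.csv")

printf '%s\n' "$fixed_word"
rm -rf "$tmp"
```

With this data the second command keeps the row for ID 1 and the row for ID 21, so -w removes substring matches but not matches in other columns.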

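If your false positives are because an ID can be present in a non-ID field, anchor each ID to the start of the line and loop through your ID file instead:

while IFS= read -r pat; do grep -w "^$pat" table.csv; done < ids.csv

or, avoiding the shell loop:

xargs -I {} grep -w "^{}" table.csv < ids.csv

Both commands rescan table.csv once per ID, so they only suit short ID lists. Personally, I would do this in Perl, which reads each file only once: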

perl -lane 'BEGIN{open(A,"ids.csv"); while(<A>){chomp; $k{$_}++}} 
            print $_ if defined($k{$F[0]}); ' table.csv
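
The same hash-lookup idea can be sketched in awk for systems where Perl is not an option. This is a minimal sketch and not part of the original answer: the sample rows are made up, and it assumes ids.csv holds one ID per line and is not empty (with an empty ID file the NR == FNR test would also hold for table.csv).

```shell
# Hypothetical sample data; only the file names mirror the question.
tmp=$(mktemp -d)
printf '1\tfoo\tbar\n10\tbaz\t1\n2\tqux\tquux\n' > "$tmp/table.csv"
printf '1\n2\n' > "$tmp/ids.csv"

# While reading the first file (NR == FNR), store each ID as an array key.
# While reading the second file, print rows whose first tab-separated
# field is one of the stored keys.
selected=$(awk -F'\t' 'NR == FNR { ids[$1]; next } $1 in ids' \
    "$tmp/ids.csv" "$tmp/table.csv")

printf '%s\n' "$selected"
rm -rf "$tmp"
```

Each file is read once and the comparison is an exact match on the first field, so the row for ID 10 is skipped even though another of its fields contains 1.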

Faster Alternative

If your log files are large and you are grepping for fixed words, as opposed to fancy regular expressions, you may want to consider this approach:

inc='hello
animal
atttribute
metadata'

exc='timeout
runner'

ssh office "grep -F '$inc' ptd.log | grep -vF '$exc'"

By putting each word on a separate line, we can use grep's -F feature for fixed strings. This turns off regex processing, making the process faster.
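
The same include/exclude pattern can be tried locally without ssh. The log lines below are invented for illustration; the point is only that a pattern argument containing newlines is treated as a list of fixed strings.

```shell
# Newline-separated fixed strings to include and to exclude.
inc='hello
animal'
exc='timeout'

# Hypothetical log content standing in for ptd.log.
log='hello world
animal timeout
nothing here
an animal appears'

# Keep lines containing any "inc" string, then drop lines containing "exc".
result=$(printf '%s\n' "$log" | grep -F "$inc" | grep -vF "$exc")
printf '%s\n' "$result"
```

If a pattern could start with a dash, passing it with -e avoids having it parsed as an option.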

Shell – Two input pipes through file descriptor shuffling and /dev/fd

Unlike redirections on other commands, redirections on the exec builtin may be closed when the shell executes an external program. POSIX allows both behaviors. The ksh family (AT&T ksh as well as pdksh and mksh) closes these descriptors when executing an external utility (i.e. for a redirection on the exec builtin, after calling dup2 to perform the redirection, they set the FD_CLOEXEC flag on the new descriptor). The Bourne shell, dash, bash, zsh and BusyBox sh treat this redirection like any other redirection.

A more portable solution to the two-input-pipes problem (assuming the existence of /dev/fd) is to perform another redirection on the command that reads the input, moving the file descriptor to a new one. This extra redirection doesn't set the close-on-exec flag on the new descriptor.

sort a | { exec 3<&0; sort b | comm -12 /dev/fd/0 /dev/fd/4 4<&3; }

This works in pdksh/mksh, and in ksh93r but not in recent versions of ksh (93s+ 2008-01-31 or 93u+ 2012-08-01). I don't understand what ksh is doing there.
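
For experimenting, the command above can be wrapped in a small self-contained sketch. The file contents are made up, and the sketch assumes a system that provides /dev/fd (such as Linux) and a shell such as bash or dash; behavior in the ksh versions mentioned above may differ.

```shell
# Hypothetical unsorted inputs standing in for files a and b.
tmp=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$tmp/a"
printf 'fig\nkiwi\napple\n' > "$tmp/b"

# fd 3 keeps the first pipe (sorted a); the 4<&3 redirection on comm
# hands it over as fd 4, while comm's stdin is the second pipe (sorted b).
common=$(
  sort "$tmp/a" | {
    exec 3<&0
    sort "$tmp/b" | comm -12 /dev/fd/0 /dev/fd/4 4<&3
  }
)

printf '%s\n' "$common"
rm -rf "$tmp"
```

Here comm -12 prints only the lines present in both sorted streams, which for this data are apple and fig.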
