Filter File by Line Number – Efficient Text Processing Techniques

filter, high performance, text processing

Given a file L with one non-negative integer per line and a text file F, what would be a fast way to keep only those lines in F whose line number appears in file L?

Example:

$ cat L.txt
1
3

$ cat F.txt
Hello World
Hallo Welt
Hola mundo

$ command-in-question -x L.txt F.txt
Hello World
Hola mundo

I'm looking for a command that can handle a file L with 500 million or more entries; file L is sorted numerically.

Note: I'm halfway through an implementation of command-in-question, but I wondered whether one might be able to use some Unix tools here as well.

Update: Thanks for all the answers, I learned a lot today! I would like to accept more than one answer, but that's not possible.

Best Answer

In C, omitting meaningful error messages:

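#define _POSIX_C_SOURCE 200809L /* for getline() */
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {

    FILE *L;
    FILE *F;

    unsigned int to_print;
    unsigned int current = 0;   /* number of the last line read from F */
    char *line = NULL;
    size_t len = 0;

    if (argc < 3) {
        return 1;
    } else if ((L = fopen(argv[1], "r")) == NULL) {
        return 1;
    } else if ((F = fopen(argv[2], "r")) == NULL) {
        fclose(L);
        return 1;
    } else {

        /* L is sorted, so F is read strictly forward, exactly once. */
        while (fscanf(L, "%u", &to_print) > 0) {
            /* Advance F until line to_print is in the buffer (or EOF). */
            while (current < to_print && getline(&line, &len, F) != -1) {
                ++current;
            }
            /* current > 0 skips an entry of 0, which names no line. */
            if (current == to_print && current > 0) {
                printf("%s", line);
            }
        }

        free(line);
        fclose(L);
        fclose(F);
        return 0;
    }
}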
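
Since the question also asks about plain Unix tools, the same single-pass idea can be sketched with awk. This is only an illustration, not a benchmarked alternative: the helper name filter_by_line_number is made up here, L is assumed to be sorted numerically, and the list is read one entry at a time with getline so that it never has to fit in memory. It will likely be slower than the C program above.

```shell
# filter_by_line_number LIST FILE   (hypothetical helper name)
# Prints the lines of FILE whose 1-based line number occurs in LIST.
# Assumes LIST is sorted numerically, one number per line.
filter_by_line_number() {
    awk -v list="$1" '
        BEGIN { ok = ((getline want < list) > 0) }
        {
            # Discard entries that are already behind us (a 0 or a duplicate).
            while (ok && want + 0 < FNR)
                ok = ((getline want < list) > 0)
            if (!ok) exit                 # list exhausted: stop reading FILE
            if (want + 0 == FNR) print
        }
    ' "$2"
}

# Demo with the files from the question, created in a temporary directory.
tmp=$(mktemp -d)
printf '1\n3\n' > "$tmp/L.txt"
printf 'Hello World\nHallo Welt\nHola mundo\n' > "$tmp/F.txt"
filter_by_line_number "$tmp/L.txt" "$tmp/F.txt"
rm -rf "$tmp"
```

Unlike the C version, this sketch prints a line only once even if its number is repeated in L. Note that awk -v interprets backslash escapes in the value, so a list path containing backslashes would need different handling.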

Related Solutions

Shell Script – Cut Tab-Delimited Text File Lines to 80 Characters

I think you're looking for expand and/or unexpand. It seems you're trying to ensure a \tab width counts as 8 chars rather than the single one. fold will do that as well, but it will wrap its input to the next line rather than truncating it. I think you want:

expand < input | cut -c -80

expand and unexpand are both POSIX specified:

The expand utility shall write files or the standard input to the standard output with \tab characters replaced with one or more space characters needed to pad to the next tab stop. Any backspace characters shall be copied to the output and cause the column position count for tab stop calculations to be decremented; the column position count shall not be decremented below zero.

Pretty simple. So, here's a look at what this does:

unset c i; set --;
until [ "$((i+=1))" -gt 10 ]; do set -- "$@" "$i" "$i"; done
for c in 'tr \\t \ ' expand;  do eval '
    { printf "%*s\t" "$@"; echo; } |
      tee /dev/fd/2 |'"$c"'| {
      tee /dev/fd/3 | wc -c >&2; } 3>&1 |
      tee /dev/fd/2 | cut -c -80'
done

The until loop at the top builds an argument list like:

1 1 2 2 3 3 ...

printf consumes these arguments in pairs with the %*s format: the first of each pair is taken as the field width and the second as the string to print, so each number is right-aligned in a field as wide as its own value. A \tab character is appended after each one.
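
The width-from-argument behavior is easier to see with visible delimiters. The following is a small sketch, assuming a printf that accepts * as a field width, as the script above already does; the brackets are added here only to make the padding visible.

```shell
# Each pair of arguments is (width, string); the format is reused per pair.
out=$(printf '[%*s]' 1 1 2 2 3 3)
printf '%s\n' "$out"    # [1][ 2][  3]
```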

All of the tees are used to show the effects of each filter as it is applied.

And the effects are these:

1        2        3        4        5        6        7        8                9               10
1  2   3    4     5      6       7        8         9         10 
1  2   3    4     5      6       7        8         9         10 
66
1        2        3        4        5        6        7        8                9               10
1        2        3        4        5        6        7        8                9               10 
1        2        3        4        5        6        7        8                
105

Those rows are lined up in two sets like...

output of printf ...; echo
output of tr ... or expand
output of cut
output of wc

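The top four rows are the results of the tr filter, in which each \tab is converted to a single space. The line is then only 66 bytes long, so cut -c -80 leaves it intact.

The bottom four are the results of the expand chain. There the line grows to 105 bytes, and cut truncates it.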
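
A smaller check of the same effect, as a sketch with a limit of 10 columns instead of 80 and a made-up three-field line, assuming the default tab stops of expand (every 8 columns):

```shell
# One line with two tab characters: a<TAB>b<TAB>c
line=$(printf 'a\tb\tc')

# cut counts each tab as a single character, so all 5 characters fit in 10.
raw=$(printf '%s\n' "$line" | cut -c -10)

# expand pads to the next tab stop before cut counts, so the line is cut
# after column 10: "a", 7 spaces, "b", 1 space.
padded=$(printf '%s\n' "$line" | expand | cut -c -10)

printf '%s\n' "${#raw}"       # 5
printf '%s\n' "${#padded}"    # 10
```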

How to both extract a specific line in a text file as well as multiple lines containing a specific string

Just change the grep output redirection so that it appends:

grep "string" source.txt >> destination.txt
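
Going by the question title, the full sequence might look like the sketch below. The helper name extract_line_and_matches and the use of sed -n to pick the single line are assumptions, since the original commands are not shown here.

```shell
# extract_line_and_matches N PATTERN SOURCE DEST   (hypothetical helper)
extract_line_and_matches() {
    sed -n "${1}p" "$3" > "$4"     # line N only; '>' starts DEST afresh
    grep -- "$2" "$3" >> "$4"      # matching lines; '>>' appends to DEST
}

# Demo on a throwaway file.
tmp=$(mktemp -d)
printf 'header\nfoo string 1\nbar\nbaz string 2\n' > "$tmp/source.txt"
extract_line_and_matches 1 "string" "$tmp/source.txt" "$tmp/destination.txt"
cat "$tmp/destination.txt"
rm -rf "$tmp"
```

If line N itself matches the pattern, it ends up in the destination twice.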
