Filter File by Line Number – Efficient Text Processing Techniques

Tags: filter, high-performance, text-processing

Given a file L with one non-negative integer per line and a text file F, what would be a fast way to keep only those lines of F whose line numbers appear in L?

Example:

$ cat L.txt
1
3

$ cat F.txt
Hello World
Hallo Welt
Hola mundo

$ command-in-question -x L.txt F.txt
Hello World
Hola mundo

I'm looking for a command that can handle a file L with 500 million or more entries; file L is sorted numerically.

Note: I'm halfway through an implementation of command-in-question, but I wondered whether one might be able to use some Unix tools here as well.


Update: Thanks for all the answers, I learned a lot today! I would like to accept more than one answer, but that's not possible.

Best Answer

In C, omitting meaningful error messages:

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {

    FILE *L;
    FILE *F;

    unsigned int to_print;
    unsigned int current = 0;
    char *line = NULL;
    size_t len = 0;

    /* Expect two arguments: the line-number file L and the text file F. */
    if (argc < 3 || (L = fopen(argv[1], "r")) == NULL) {
        return 1;
    } else if ((F = fopen(argv[2], "r")) == NULL) {
        fclose(L);
        return 1;
    } else {

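        /* L is sorted, so F only needs to be read forward, once. */
        while (fscanf(L, "%u", &to_print) > 0) {
            /* Line numbers are 1-based; a 0 entry can never match. */
            if (to_print == 0)
                continue;
            /* Read lines until the wanted line number is reached or F ends. */
            while (getline(&line, &len, F) != -1 && ++current != to_print)
                ;
            if (current == to_print) {
                printf("%s", line);
            }
        }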

        free(line);
        fclose(L);
        fclose(F);
        return 0;
    }
}
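
The program above relies on L being sorted, which is what lets it read F once, front to back. As a rough sketch of the same single-pass idea, the hypothetical helper below (the name filter_lines and its return convention are my own, not part of the answer) works on already-open streams, uses a wider counter, and rejects input that is not strictly increasing instead of silently producing wrong output. It copies characters with getc/putc rather than getline, so it needs only standard C; whether that is faster or slower in practice is something you would have to measure.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not taken from the answer above.
 * Copies to `out` the lines of `text` whose 1-based numbers appear in `nums`.
 * Assumes `nums` is strictly increasing; returns -1 as soon as it sees a 0,
 * a duplicate, or an out-of-order entry, and 0 otherwise.
 */
int filter_lines(FILE *nums, FILE *text, FILE *out)
{
    unsigned long long want;
    unsigned long long prev = 0;
    unsigned long long current = 1;   /* number of the next unread line */
    int c = 0;

    while (fscanf(nums, "%llu", &want) == 1) {
        if (want <= prev)
            return -1;                /* zero, duplicate, or not sorted */
        prev = want;

        /* Skip whole lines until the next unread line is the wanted one. */
        while (current < want && (c = getc(text)) != EOF) {
            if (c == '\n')
                current++;
        }
        if (c == EOF)
            break;                    /* text has fewer lines than requested */

        /* Copy the wanted line, including its newline if it has one. */
        while ((c = getc(text)) != EOF) {
            putc(c, out);
            if (c == '\n')
                break;
        }
        current++;
        if (c == EOF)
            break;                    /* last line had no trailing newline */
    }
    return 0;
}
```

A caller would open L.txt and F.txt with fopen, pass them together with stdout, and treat a return value of -1 as a sign that L was not sorted as promised. Entries in L that lie beyond the end of F are ignored, and anything in L after that point is not validated.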