How to Insert New Lines with Missing Values (NA) in Shell

Tags: grep, shell, text processing

I would like to insert new lines into text files wherever values are missing.
For example, in the following text file (A.txt), line 5 is missing. In addition, since the file should have 12 lines, lines 11 and 12 are also missing.

1 2.30
2 3.01
3 3.22
4 3.34
6 3.01
7 2.90
8 2.99
9 3.00
10 3.02

My expected output is the following. For each missing case, a line should be added containing the number and NA, as shown here for lines 5, 11 and 12:

1 2.30
2 3.01
3 3.22
4 3.34
5 NA
6 3.01
7 2.90
8 2.99
9 3.00
10 3.02
11 NA
12 NA

I am able to do this by using the following script:

f1=/my-directory
: > "$f1/newfile.txt"    # start with an empty output file

for i in {1..12}; do
    # look for the line whose first field is exactly $i (scans the whole file each time)
    l=$(grep -wE "^$i" "$f1/A.txt")
    if [ -n "$l" ]; then
        echo "$l" >> "$f1/newfile.txt"
    else
        echo "$i NA" >> "$f1/newfile.txt"
    fi
done

This works fine. The problem, however, is that I need to do this for about 600 files containing more than 160000 lines. Because the loop searches through the whole file once for every expected line number, it would take far too long. My question is: is there a simpler solution that could do this?

Best Answer

You can do this with an awk script:

awk '{ while (NR + shift < $1) { print (NR + shift) " NA"; shift++ }; print } END { shift++; while (NR + shift < 13) { print (NR + shift) " NA"; shift++ } }' /tmp/test1

will produce the required output for /tmp/test1 (replace that with each file you wish to process).

In a more readable form:

#!/usr/bin/awk -f
{
    while (NR + shift < $1) {
        print (NR + shift) " NA"
        shift++
    }
    print
}
END {
    shift++
    while (NR + shift < 13) {
        print (NR + shift) " NA"
        shift++
    }
}

Save this as a file, say fill-missing, and make it executable (chmod +x fill-missing); you can then simply run

./fill-missing /tmp/test1
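To apply the same program to a whole directory of files, one option is a plain shell loop that starts awk once per file. The sketch below assumes the inputs end in .txt and writes each result next to its input with a .filled suffix; the function name fill_all and that suffix are made up for illustration. Running awk separately per file matters here, because NR keeps counting across files when several are passed to a single awk invocation.

```shell
# fill_all DIR: for every DIR/*.txt, write a gap-filled copy to DIR/<name>.txt.filled
# (fill_all and the .filled suffix are hypothetical names, not from the question)
fill_all() {
    for f in "$1"/*.txt; do
        [ -e "$f" ] || continue    # the glob matched nothing
        awk '
            {
                while (NR + shift < $1) {
                    print (NR + shift) " NA"
                    shift++
                }
                print
            }
            END {
                shift++
                # 13 = expected number of lines + 1, as in the answer
                while (NR + shift < 13) {
                    print (NR + shift) " NA"
                    shift++
                }
            }
        ' "$f" > "$f.filled"
    done
}

# Example call: fill_all /my-directory
```

Each file is read exactly once, so the cost grows with the file size rather than with file size times the number of expected lines.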

The awk script processes each line, keeping track in shift of the difference between the expected number and the current line number (NR). For every line, while NR + shift is smaller than the first number on the line, it prints NR + shift followed by NA and increments shift; once the two match, it prints the current line. At the end of the input, it prints any remaining lines needed to reach 12 (hence the limit of 13 in the END block).
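The same bookkeeping can also be written with a single counter for the next expected number, and with the target line count passed in rather than hard-coded, which may be handier for the larger files. This is only a sketch: the names fill_missing, max and expected are illustrative, and it assumes, as in the example, that the first column is sorted in ascending order without duplicates.

```shell
# fill_missing MAX FILE: print FILE, adding "<n> NA" for every n in 1..MAX that is absent
# (fill_missing, max and expected are hypothetical names chosen for this sketch)
fill_missing() {
    awk -v max="$1" '
        BEGIN { expected = 1 }
        {
            # emit a placeholder for every number skipped before this line
            while (expected < $1) {
                print expected " NA"
                expected++
            }
            print
            expected = $1 + 1
        }
        END {
            # pad the tail up to max
            while (expected <= max) {
                print expected " NA"
                expected++
            }
        }
    ' "$2"
}

# Example call: fill_missing 12 /tmp/test1
```

For the sample file this should give the same twelve lines as the answer's script; for the real data, pass the expected number of lines as the first argument.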
