Text Processing – Removing Duplicate IDs and Keeping Latest Date

csv, text-processing

I am working with a multi-line text file (.csv) that contains 7 columns.

Each line contains what "should" be a unique ID. There are also some date columns, one of which is a "last modified" date.

I have found that the supposedly "unique" IDs are in fact repeated on occasion, and it's a problem I need to resolve by removing all but one of each.

I have an example using gawk below, but is there a way to use gawk, awk, grep, etc. to delete any duplicated lines EXCEPT the most recently modified one? In other words, with some logic deciding what goes and what stays.

As an example, this CSV extract has two lines. Every field but one is the same. The ID number being the same makes the lines "duplicates" for my purposes.

The two lines are not completely identical, though.

The date in the final (7th) field of the CSV file makes one entry older than the other.

ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-08-26 17:32:00
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00

Is it possible to use gawk, cat, grep, cut, awk, etc. on the file to:

a) identify anything with a duplicate ID.
b) of each set of duplicates, retain only the line with the "newest" date in the last field.

Ideally the first line would be left in place, because it holds the column headings for the CSV, which is being fed into a database.

That is why this almost works well:

gawk -i inplace '!a[$0]++' *.csv

It does remove exact duplicate lines, leaving one in place, but it has no logic to decide which line to keep based on the date value in the final field.
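Keying the array on just the ID instead, e.g.

gawk -i inplace -F',' '!a[$1]++' *.csv

presumably wouldn't help either, since it keeps whichever line for each ID happens to come first in the file and ignores the dates entirely.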

Can you please help…

Best Answer

Assuming you only want to test for duplicates within each file, not across all files, and that you don't care about preserving the input order of your data, this will do what you want using any version of the mandatory POSIX tools, so it'll work on any Unix box:

$ cat tst.sh
#!/usr/bin/env bash

tmp=$(mktemp) || exit 1
sep=','
for file in "$@"; do
    {
        head -n 1 "$file" &&            # pass the header line through untouched
        tail -n +2 "$file" |            # then take everything after the header,
            sort -t "$sep" -r -k 7,7 |                      # sort it newest-first on the 7th field,
            awk -F "$sep" '$1 != prev { print; prev=$1 }'   # and keep the first (newest) line per ID
    } > "$tmp" &&
    mv -- "$tmp" "$file"
done

For example:

$ cat file
foo,bar
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-08-26 17:32:00
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00

$ ./tst.sh file*

$ cat file
foo,bar
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00

Note that only sort above has to handle all of the input at once; the other tools just handle one line at a time, and sort is designed to handle large files by using demand paging, etc., so you're unlikely to run into memory issues even if your input files are massive.
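To see why the awk stage keeps the newest entry, it can help to look at the intermediate stream for the example file: the reverse sort on the 7th field puts the newest line for each ID first, and the awk then prints only the first line it sees for each ID value:

$ tail -n +2 file | sort -t',' -r -k 7,7
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-08-26 17:32:00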

If you DO want to retain the input line order, then you can change the above to apply the DSU (Decorate-Sort-Undecorate) idiom:

$ cat tst.sh
#!/usr/bin/env bash

tmp=$(mktemp) || exit 1
sep=','
for file in "$@"; do
    awk -v OFS="$sep" '{ print (NR>1), NR, $0 }' "$file" |   # decorate: prepend a header flag and the line number
        sort -t "$sep" -k1,1 -k9,9r |                        # sort: header first, then newest date (now field 9) first
        awk -F "$sep" 'NR==1{print; next} $3 != prev{ print; prev=$3 }' |   # keep the first (newest) line per ID (now field 3)
        sort -t "$sep" -k1,1 -k2,2n |                        # undecorate: re-sort by line number to restore input order
        cut -d "$sep" -f3- \
    > "$tmp" &&
    mv -- "$tmp" "$file"
done

but it does require a second sort after the lines have been selected to get the input back to its original order.
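For reference, this is what the decorated stream looks like for the example file before the first sort. Field 1 flags whether a line comes after the header, field 2 is the original line number, and both are stripped back off by the final cut:

$ awk -v OFS=',' '{ print (NR>1), NR, $0 }' file
0,1,foo,bar
1,2,ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-08-26 17:32:00
1,3,ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00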

If you really wanted to do it all using a single call to GNU awk while retaining input order then it'd be:

$ cat tst.awk
BEGIN { FS="," }
FNR == 1 {
    delete id2maxTs
    delete id2fnr
    delete fnr2input
    print
    next
}
{ id=$1; ts=$7 }
!(id in id2maxTs) || (ts > id2maxTs[id]) {
    if ( id in id2fnr ) {
        prevFnr = id2fnr[id]
        delete fnr2input[prevFnr]
    }
    id2maxTs[id]   = ts
    id2fnr[id]     = FNR
    fnr2input[FNR] = $0
}
ENDFILE {
    for ( i=1; i<=FNR; i++ ) {
        if ( i in fnr2input ) {
            print fnr2input[i]
        }
    }
}

$ gawk -i inplace -f tst.awk file*

$ cat file
foo,bar
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00

That gawk script would retain the original input order but it would have to read all of each input file into memory.
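Note that ENDFILE and -i inplace are GNU awk extensions. If you only have a POSIX awk and a single file to process, a rough sketch of the same idea using END (and a manual temp file instead of in-place editing) would be:

$ cat tst_posix.awk
BEGIN { FS="," }
NR == 1 { print; next }                     # print the header
{ id=$1; ts=$7 }
!(id in id2maxTs) || (ts > id2maxTs[id]) {  # first, or newer, occurrence of this ID
    if ( id in id2nr )
        delete nr2input[id2nr[id]]          # forget the older line for this ID
    id2maxTs[id] = ts
    id2nr[id]    = NR
    nr2input[NR] = $0
}
END {                                       # print the survivors in input order
    for ( i=1; i<=NR; i++ )
        if ( i in nr2input )
            print nr2input[i]
}

$ awk -f tst_posix.awk file > tmp && mv tmp file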