I am working with a multi-line text file (.csv) that contains 7 columns.
Each line contains what "should" be a unique ID. There are also some date columns, one of which is a "last modified" date.
I have found that these supposedly "unique" IDs are actually repeated on occasion, and it's a problem I need to resolve by removing all but one of each.
I have an example using gawk below, but is there a way to use gawk, awk, grep, etc. to delete any lines that are duplicated, with the exception of the most recently modified one? In other words, with some logic deciding what goes and what stays.
As an example, this CSV extract has two lines. Every field but one is the same, and the ID number being the same is what makes them "duplicates" for my purposes.
The lines are not completely identical, though: the date in the final (7th) field makes one entry older than the other.
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-08-26 17:32:00
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00
Is it possible to use gawk, cat, grep, cut, awk, etc. on the file to:
a) identify anything with a duplicate ID, and
b) of those duplicates, retain only the one with the "newest" date in the last field?
Ideally I would need the first line left in place, because it holds the headings for the CSV, which is being fed into a database.
That is why this almost works well:
gawk -i inplace '!a[$0]++' *.csv
It does seem to remove duplicates, leaving one line in place; however, it has no logic to decide which line to keep based on the date value in the final field.
Can you please help…
Best Answer
Assuming you only want to test for duplicates within each file, not across all files, and that you don't care about preserving the input order of your data, this will do what you want using any versions of the mandatory POSIX tools, so it'll work on any Unix box:
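(A sketch of such a pipeline for a single file, here called file.csv; it assumes plain comma-separated fields with no embedded commas, the ID in field 1, and that the ISO-style timestamps in field 7 compare chronologically as plain strings.)

{
  head -n 1 file.csv                                 # keep the header line untouched
  tail -n +2 file.csv |                              # data lines only
    sort -t ',' -k1,1 -k7,7r |                       # group by ID, newest timestamp first within each ID
    awk -F ',' '$1 != prev { print; prev = $1 }'     # keep the first (newest) line of each ID group
} > deduped.csv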
For example:
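Given a test file file.csv containing the header plus the two sample rows from the question (the heading names on the first line are invented for illustration):

id,text,code,count1,count2,created,modified
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-08-26 17:32:00
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00

the pipeline above writes to deduped.csv:

id,text,code,count1,count2,created,modified
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00

i.e. the header stays and only the most recently modified ID12345 line survives.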
Note that only sort above has to handle all of the input at once; the other tools just handle one line at a time, and sort is designed to handle large files by using demand paging, etc., so you're unlikely to run into memory issues even if your input files are massive.

If you DO want to retain the input line order then you can change the above to apply the DSU (decorate-sort-undecorate) idiom:
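(Again a sketch under the same assumptions: prefix each data line with its input line number, select the newest line per ID as before on the shifted field positions, then sort on the line number and strip it off again.)

{
  head -n 1 file.csv
  tail -n +2 file.csv |
    awk '{ print NR "," $0 }' |                      # decorate: prepend the input line number
    sort -t ',' -k2,2 -k8,8r |                       # group by ID (now field 2), newest date (field 8) first
    awk -F ',' '$2 != prev { print; prev = $2 }' |   # keep the newest line of each ID group
    sort -t ',' -k1,1n |                             # undecorate step 1: restore original line order
    cut -d ',' -f 2-                                 # undecorate step 2: drop the line number
} > deduped.csv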
but it does require a second sort after the lines have been selected to get the input back to its original order.

If you really wanted to do it all using a single call to GNU awk while retaining input order then it'd be:
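(A sketch of such a script, again assuming simple comma-separated fields, the ID in field 1, and string-comparable timestamps in field 7; it processes one file and writes the result to a new file.)

gawk -F ',' '
FNR == 1 { print; next }                     # keep the header line
!($1 in newest) || $7 > newest[$1] {         # first sighting of this ID, or a newer timestamp
    newest[$1] = $7                          # newest timestamp seen so far for this ID
    line[$1]   = $0                          # the line that carries it
    nr[$1]     = FNR                         # where that line sits in the input
}
END {
    PROCINFO["sorted_in"] = "@val_num_asc"   # iterate in order of the stored line numbers
    for (id in nr) print line[id]
}' file.csv > deduped.csv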
That gawk script would retain the original input order, but it would have to hold data from each input file in memory rather than streaming it a line at a time.