@Thor is absolutely right, but in case you're in a hurry to wrap this up, and so you or others with a similar problem can learn how to do such things for yourself in future, here's how to do what you asked for using any awk and a 2-pass approach:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    id = $2
    sub(/-[^-]+$/,"",id)
    vals[id] = $2 OFS $3
    next
}
{ print (FNR>1 ? $1 OFS vals[$1] : $0) }
$ awk -f tst.awk file file
DB Instance Identifier,Reservation ID,State
alpha-db-dev,alpha-db-dev-26aug2022,active
alpha-db-prod,alpha-db-prod-26aug2022,active
alpha-db-staging,alpha-db-staging-26aug2022,active
beta-db-dev-primary,
beta-db-prod-primary,
beta-db-prod-replica,
beta-db-staging-primary,
charlie-db-dev,charlie-db-dev-8dec2021,active
charlie-db-prod-dms,
charlie-db-prod-dms-replica,
charlie-db-staging,
charlie-db-staging-loadtest,charlie-db-staging-loadtest-8dec2021,active
I think it's very obvious what it's doing, but if after reading the man pages, googling, adding print statements to track variable values, etc. (so you can learn more about awk in general) you can't figure it out, then just ask specific questions in comments and I'll be happy to answer them. To learn awk in general, get the book Effective AWK Programming, 5th Edition, by Arnold Robbins.
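For example, to see how the vals[] array gets populated in the first block, you could temporarily add a print statement like this (just a debugging sketch, not part of the script above):

NR == FNR {
    id = $2
    sub(/-[^-]+$/,"",id)
    vals[id] = $2 OFS $3
    # trace every assignment on stderr; "cat>&2" works with any awk
    printf "NR=%d, id=%s, vals[id]=%s\n", NR, id, vals[id] | "cat>&2"
    next
}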
Assuming you only want to test for duplicates within each file, not across all files, and that you don't care about preserving the input order of your data, this will do what you want using any versions of the mandatory POSIX tools, so it'll work on any Unix box:
$ cat tst.sh
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
sep=','
for file in "$@"; do
    {
        head -n 1 "$file" &&
        tail -n +2 "$file" |
        sort -t "$sep" -k 1,1 -k 7,7r |
        awk -F "$sep" '$1 != prev { print; prev=$1 }'
    } > "$tmp" &&
    mv -- "$tmp" "$file"
done
For example:
$ cat file
foo,bar
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-08-26 17:32:00
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00
$ ./tst.sh file*
$ cat file
foo,bar
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00
Note that only sort above has to handle all of the input at once; the other tools just handle 1 line at a time, and sort is designed to handle large files by using demand paging, etc., so you're unlikely to run into memory issues even if your input files are massive.
If you DO want to retain the input line order then you can change the above to apply the DSU (Decorate/Sort/Undecorate) idiom to do that:
$ cat tst.sh
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
sep=','
for file in "$@"; do
    awk -v OFS="$sep" '{ print (NR>1), NR, $0 }' "$file" |
    sort -t "$sep" -k1,1 -k3,3 -k9,9r |
    awk -F "$sep" 'NR==1 { print; next } $3 != prev { print; prev=$3 }' |
    sort -t "$sep" -k1,1 -k2,2n |
    cut -d "$sep" -f3- \
        > "$tmp" &&
    mv -- "$tmp" "$file"
done
but it does require a second sort after the lines have been selected to get the input back to its original order.
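For example, here's what the decorate stage alone produces for the original 3-line sample file shown earlier (before the first script modified it) - the 2 extra fields are a header flag and the line number, which the later stages sort on and which cut finally removes:

$ awk -v OFS=',' '{ print (NR>1), NR, $0 }' file
0,1,foo,bar
1,2,ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-08-26 17:32:00
1,3,ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00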
If you really wanted to do it all using a single call to GNU awk while retaining input order then it'd be:
$ cat tst.awk
BEGIN { FS="," }
FNR == 1 {
    delete id2maxTs
    delete id2fnr
    delete fnr2input
    print
    next
}
{ id=$1; ts=$7 }
!(id in id2maxTs) || (ts > id2maxTs[id]) {
    if ( id in id2fnr ) {
        prevFnr = id2fnr[id]
        delete fnr2input[prevFnr]
    }
    id2maxTs[id] = ts
    id2fnr[id] = FNR
    fnr2input[FNR] = $0
}
ENDFILE {
    for ( i=1; i<=FNR; i++ ) {
        if ( i in fnr2input ) {
            print fnr2input[i]
        }
    }
}
$ gawk -i inplace -f tst.awk file*
$ cat file
foo,bar
ID12345,Here is some text,ABCDEFG,7,9,2022-08-18 20:15:00,2022-09-11 22:15:00
That gawk script would retain the original input order but it would have to read all of each input file into memory.
Best Answer
The following awk command sets the awk variable file to the value of the file key whenever such a key is found (the key is the first field on the line, the value is the second). If the current line has no file key, the current value of the file variable is outputted together with the value of the current line.

Note that this does not attempt to quote the values correctly for CSV and that it assumes that no value contains the field delimiter ": " (colon+space).
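The code block itself didn't survive in this copy, but a minimal sketch of the kind of awk command being described (assuming input lines of the form "file: NAME" and "value: TEXT", the same shape the sed variant below relies on, with "input" as a placeholder filename) would be:

awk -F ': ' '
    $1 == "file" { file = $2; next }   # remember the most recent file value
    { print file "," $2 }              # any other line: output file,value
' input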
With sed:

When a file: line is found, the file: prefix string is stripped off and the remainder is stored in the hold space. When a value: line is found, the value: prefix string is stripped off, and the text in the hold space is appended to the end of the buffer with a literal newline character as the delimiter. The newline-delimited parts of the buffer are swapped (newline replaced by a comma) and outputted. The result is the same as what is expected.
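That sed command was also lost from this copy; a sketch matching the description (GNU sed syntax, again assuming "file: " and "value: " prefixed lines and a placeholder filename) could look like:

sed -n '
    # "file:" line: strip the prefix and copy the rest to the hold space
    /^file: / { s///; h; }
    # "value:" line: strip the prefix, append the held file name after a
    # newline (G), swap the two parts around a comma, and print
    /^value: / { s///; G; s/\(.*\)\n\(.*\)/\2,\1/; p; }
' input

The empty s/// reuses the address regex, which is what strips the prefix.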
This does not have the restriction that the values after the initial key: string can't include a colon+space. Again, the final output will not have any special CSV encoding of the text, so fields containing embedded commas and double quotes would confuse a CSV parser.

The following modifies the input by adding an empty line between each line in the original file. This makes the file a valid "XTAB" file with ":" as the key-value delimiter. This is then read by Miller (mlr), which is aware of the special quoting rules of CSV and which can read the XTAB format.

Miller reads the records from the awk output, and performs a "fill-down" operation with the file data, assigning the previous file value to each of the records that do not have one. The subsequent "filter" operation removes all records that do not have a value field. The data is then outputted without a CSV header.
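The actual commands are missing from this copy as well, but based on the description the pipeline would be along these lines (the trailing reorder, which puts the file field first in the output, is my guess since the field order wasn't specified):

awk '{ print; print "" }' input |
mlr --ixtab --ips ': ' --ocsv --headerless-csv-output \
    fill-down -f file \
    then filter 'is_present($value)' \
    then reorder -f file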
I've modified the test data to show that this is able to properly produce fully compliant CSV output even if the input contains commas and quotes.