Linux – Aligning Values in CSV File Using Bash and Awk

awkbashcsvlinux

I have a csv file with the following information this csv file was created by combining results from aws rds describe-db-instances and
aws rds describe-reserved-db-instances and by parsing the results with jq then I pasted together the csv files.

DB Instance Identifier,Reservation ID,State
alpha-db-dev,alpha-db-dev-26aug2022,active
alpha-db-prod,alpha-db-prod-26aug2022,active
alpha-db-staging,alpha-db-staging-26aug2022,active
beta-db-dev-primary,beta-db-dev-8dec2021,active
beta-db-prod-primary,charlie-db-dev-8dec2021,active
beta-db-prod-replica,charlie-db-staging-loadtest-8dec2021,active
beta-db-staging-primary,,
charlie-db-dev,,
charlie-db-prod-dms,,
charlie-db-prod-dms-replica,,
charlie-db-staging,,
charlie-db-staging-loadtest,,

What I'm trying to do is match the first few words of the 2nd column and the first few words of the 1st column.

The result should be like this

DB Instance Identifier,Reservation ID,State
alpha-db-dev,alpha-db-dev-26aug2022,active
alpha-db-prod,alpha-db-prod-26aug2022,active
alpha-db-staging,alpha-db-staging-26aug2022,active
beta-db-dev-primary,beta-db-dev-8dec2021,active
beta-db-prod-primary,,
beta-db-prod-replica,,
beta-db-staging-primary,,
charlie-db-dev,charlie-db-dev-8dec2021,active
charlie-db-prod-dms,,
charlie-db-prod-dms-replica,,
charlie-db-staging,,
charlie-db-staging-loadtest,charlie-db-staging-loadtest-8dec2021,active

Basically what I would want to happen is that in the csv file, I would want to match the charlie-db-staging-loadtest with charlie-db-staging-loadtest-8dec2021 to be in the same row.

I have been clueless on what to do next and how to match the texts between the column DB Instance Identifier and Reservation ID. More over, how to move matching texts to be in the same row. So I have no attempts on solving this so far.

I am open to using any commands as long as it would work with bash. And I would really appreciate the breakdown of commands as I will be applying this to a somewhat similar but not quite the same csv file that has tons of data that would be way too big to put here.

I have been playing around with sed and awk but I can't seem to make it work just yet. But maybe I'm thinking of the wrong commands here. Any insights would be helpful.

Thank you so much!

Best Answer

@Thor is absolutely right but in case you're in a hurry to wrap this up and so you or others with a similar problem can learn how to do such things for yourself in future, here's how to do what you asked for using any awk and a 2-pass approach:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    id = $2
    sub(/-[^-]+$/,"",id)
    vals[id] = $2 OFS $3
    next
}
{ print (FNR>1 ? $1 OFS vals[$1] : $0) }

$ awk -f tst.awk file file
DB Instance Identifier,Reservation ID,State
alpha-db-dev,alpha-db-dev-26aug2022,active
alpha-db-prod,alpha-db-prod-26aug2022,active
alpha-db-staging,alpha-db-staging-26aug2022,active
beta-db-dev-primary,
beta-db-prod-primary,
beta-db-prod-replica,
beta-db-staging-primary,
charlie-db-dev,charlie-db-dev-8dec2021,active
charlie-db-prod-dms,
charlie-db-prod-dms-replica,
charlie-db-staging,
charlie-db-staging-loadtest,charlie-db-staging-loadtest-8dec2021,active

I think it's very obvious what it's doing but if after reading the man pages, googling, adding print statements to track variables values, etc. (so you can learn more about awk in general) you can't figure it out then just ask specific questions in comments and I'll be happy to answer them. To learn awk in general, get the book Effective AWK Programming, 5th edition, by Arnold Robbins.

Related Solutions

AWK, SED, CSV – Converting Key+N-Values Text File to CSV

The following awk command sets the awk variable file to the value of the file key whenever such a key is found (the key is the first field on the line, the value is the second). If the current line has no file key, the current value of the file variable is outputted together with the value of the current line.

$ awk -F ': ' 'BEGIN { OFS="," } $1 == "file" { file = $2; next } { print file, $2 }' file
abc,123
abc,234
abc,567
def,999
ghi,123
ghi,999

Note that this does not attempt to quote the values correctly for CSV and that it assumes that no value contains the field delimiter : (colon+space).

With sed:

sed -n \
    -e '/^file: /  { s///; h; }' \
    -e '/^value: / { s///; G; s/\(.*\)\n\(.*\)/\2,\1/p; }' file

When a file: line is found, the file: prefix string is stripped off and the remainder is stored in the hold space.

When a value: line is found, the value: prefix string is stripped off, and the text in the hold-space is appended to the end of the buffer with a literal newline character as the delimiter. The newline-delimited parts of the buffer are swapped (newline replaced by a comma) and outputted.

The result is the same as what is expected.

This does not have the restriction that the values after the initial key: string can't include a colon+space. Again, the final output will not have any special CSV encoding of the text, so fields containing embedded commas and double quotes would confuse a CSV parser.

The following modifies the input by adding an empty line between each line in the original file. This makes the file a valid "XTAB" file with : as the key-value delimiter. This is then read by Miller (mlr), which is aware of the special quoting rules of CSV and that can read the XTAB format.

Miller reads the records from the awk output, and performs a "fill-down" operation with the file data, assigning the previous file value to each of the records that do not have one.

The subsequent "filter" operation removes all records that do not have a value field.

The data is then outputted without a CSV header.

awk '{ print; print "" }' file | 
mlr --ixtab --ips ': ' \
    --ocsv --headerless-csv-output \
    fill-down -f file then filter -x 'is_absent($value)'

I've modified the test data to show that this is able to properly produce fully compliant CSV output even if the input contains commas and quotes:

$ cat file
file: test: here's a test
value: this is, the value
value: another so called "value"
file: abc
value: 123
value: 234
value: 567
file: def
value: 999
file: ghi
value: 123
value: 999

$ awk '{ print; print "" }' file | mlr --ixtab --ips ': ' --ocsv --headerless-csv-output fill-down -f file then filter -x 'is_absent($value)'
"this is, the value",test: here's a test
"another so called ""value""",test: here's a test
123,abc
234,abc
567,abc
999,def
123,ghi
999,ghi

CSV – Visually Align Columns in File

A CSV file of this type:

a, b, c, some stuff,"some, other, stuff",     d, 2023-03-10 18:37:00<EOL>
x, y, z,          t,                cool, thing, 2022-04-12 21:44:00<EOL>

is not really the same data file anymore since you are modifying the fields. When parsed, what was originally "t" will now parse out as " t" because of the width of "some stuff" above (unless you use a regex to parse the non-standard ,[variable space] delimiters.)

You can force quotes on all fields to get a more csv file that shows these new fields clearly. Here is a Ruby to do that:

ruby -r csv -e '
cols={}
data=CSV.parse($<.read)
data.transpose.each_with_index{|sa,i| 
    cols[i]=sa.max_by{|e| e.length}; cols[i]=cols[i].length 
}
puts CSV.generate(force_quotes:true){|csv|
    data.each{|row|
        csv<<row.map.with_index{|e, i| e.rjust(cols[i] ) }
    }
}
' file

Prints:

"a","b","c","some stuff","some, other, stuff","    d","2023-03-10 18:37:00"
"y","x","z","         t","              cool","thing","2022-04-12 21:44:00"

Or, if you really want quoted and unquoted fields, you can do:

ruby -r csv -e '
lcl_csv_opt={:row_sep=>nil}
data=CSV.parse($<.read)
cols=data.transpose.map.with_index{|sa,i| 
    x=sa.max_by{|e| [e].to_csv(**lcl_csv_opt).length}
    [i,"#{[x].to_csv(**lcl_csv_opt)}"]
}.to_h
puts CSV.generate(){|csv|
    data.each{|row|
        csv<<row.map.with_index{|e, i| 
            [e].to_csv(**lcl_csv_opt)==cols[i] ? e : e.rjust(cols[i].length ) 
        }
    }
}
' file

Prints:

a,b,c,some stuff,"some, other, stuff",    d,2023-03-10 18:37:00
y,x,z,         t,                cool,thing,2022-04-12 21:44:00

Which also handles nasty escaped quotes within fields. Given:

$ cat file
a,b,c,some stuff,"some, other, stuff",d,2023-03-10 18:37:00
y,x,z,t,cool,"""thing"", quoted",2022-04-12 21:44:00

The second version prints:

a,b,c,some stuff,"some, other, stuff",                  d,2023-03-10 18:37:00
y,x,z,         t,                cool,"""thing"", quoted",2022-04-12 21:44:00

Best Answer

Related Solutions

AWK, SED, CSV – Converting Key+N-Values Text File to CSV

CSV – Visually Align Columns in File

Related Question