Bash – Parsing CSV with AWK and Returning Fields to Variable with Line Breaks

awk, bash, csv, scripting, shell-script

I have to migrate a password database (KeePass), exported as a CSV file, to a new application through its API. The API is updated with POST requests, which require JSON-formatted data.
What I need to do is take the passwords and the other pieces of information linked to them from the KeePass CSV and push them to the API. I decided to do this with a script using bash and awk.

The columns of the CSV file are arranged like this:

"Group","Title","Username","Password","URL","Notes","TOTP","Icon","Last Modified","Created"

The "Notes" field is multiline, because some of the comments have line breaks in them:

"That's an important note,
some extra infos
concerning a password"

Here's an example of the API request used to post the data; the data field is in JSON format.

I didn't put all the needed fields in this request, but you can already see how it would work. Some of the field names are different because KeePass and the API name their fields differently.

var1=name
var2=my.name
var3=password456

curl -s --request PUT -u username123:password123 -H 'Content-Type: application/json; charset=utf-8' https://tpm.mydomain.com/index.php/api/v5/passwords/1659.json --data-binary @- <<DATA
{
"name": "$var1",
"username": "$var2",
"password": "$var3"
}
DATA
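One caveat with the heredoc above: if a value contains a double quote, backslash, or newline (likely in passwords and notes), the interpolated JSON becomes invalid. A hedged alternative, assuming the jq tool is available, is to let jq do the escaping; the variable names and the commented-out curl call mirror the example above:

```shell
var1=name
var2=my.name
var3='password456 with a " quote'

# jq --arg passes the shell values in as JSON strings, escaping
# quotes, backslashes and newlines for us.
body=$(jq -n \
  --arg name "$var1" \
  --arg username "$var2" \
  --arg password "$var3" \
  '{name: $name, username: $username, password: $password}')

printf '%s\n' "$body"
# Then send the body exactly as in the request above, e.g.:
# curl -s --request PUT -u username123:password123 \
#   -H 'Content-Type: application/json; charset=utf-8' \
#   https://tpm.mydomain.com/index.php/api/v5/passwords/1659.json \
#   --data-binary "$body"
```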

My plan is to parse the CSV file field by field and, once a row is fully parsed, make the API request to post that password into the database. Then I repeat this for every remaining row.

To process the CSV I found the AWK language, which seems very handy and quite useful for my situation. I've run multiple tests on my file with the gsub function, which helps me replace the line breaks (\n), but I don't really know how to go further. Here are some of them (only the first works):

cat keepass.csv | awk NF=NF RS=/\n/ OFS=\n
cat keepass.csv |awk 'BEGIN {RS=","}{gsub("/\n/","",$0); print $0}'
cat keepass.csv | awk 'BEGIN {RS=""}{gsub(/\n/,"",$6); print $0}'

I also know that you can pass bash variables to awk with the -v option. Here's the closest code I've got:

awk -v RS='"\n' -v FPAT='"[^"]*"|[^,]*' '{
    print "Row n°", NR
    for (i = 1; i <= NF; i++) {
        sub(/^"/, "", $i)
        printf "Field %d, value=[%s]\n", i, $i
    }
}' keepass.csv

What I'm looking for is a command that parses every column of my CSV, taking the multiline Notes into account, and puts the values into global bash variables in JSON format.

I think it needs to be structured something like:

awk 'BEGIN {parse and replace, keeping the \n of Notes}
if end of row,
return a boolean to bash to process the API request, wait,
restart the loop'
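One concrete way to structure that loop, sketched under the assumption that Miller 6 (mlr) and jq are installed: convert the CSV to JSON Lines (one JSON object per row, with the Notes newlines escaped as \n, so each row really is one line for bash to read), then build each request body with jq. The echo stands in for the real curl call from the question:

```shell
# Miller's "cat" verb copies the records through unchanged;
# --icsv --ojsonl converts CSV rows to one JSON object per line.
mlr --icsv --ojsonl cat keepass.csv | while IFS= read -r row; do
    name=$(printf '%s' "$row" | jq -r '.Title')
    user=$(printf '%s' "$row" | jq -r '.Username')
    pass=$(printf '%s' "$row" | jq -r '.Password')
    body=$(jq -n --arg n "$name" --arg u "$user" --arg p "$pass" \
        '{name: $n, username: $u, password: $p}')
    # Replace this echo with the curl request from the question:
    echo "$body"
done
```

The field names (Title, Username, Password) come from the CSV header row shown earlier; adjust them to match your export.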

I'm new to scripting; I think it can be done in only a few lines, but I'm unsure how to proceed. I can switch to Python if needed, and I can add some tools to my setup.

Best Answer

Multiline cells are a feature of CSV, and you can use a CSV-aware utility such as Miller.

As examples, if you have a CSV like the sample shown below, you can run:

  • mlr --csv cut -f fieldA acr.csv to cut the first column
  • mlr --icsv --ojson cut -f fieldA acr.csv to cut the first column and convert all to JSON
[
  {
    "fieldA": "That's an important note,\nsome extra infos\nConcerning a password\nIpsum"
  },
  {
    "fieldA": "hello"
  }
]

As you can see, Miller is aware of carriage returns inside cells (it is RFC 4180 compliant).

Below is an image of the sample input file.

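The screenshot itself is not reproduced here, but reconstructed from the JSON output above, the fieldA column of the sample file presumably looks something like this (any other columns in the screenshot are unknown):

```csv
fieldA
"That's an important note,
some extra infos
Concerning a password
Ipsum"
hello
```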
