AWK and KSH – Wrapping Comma Delimited Values to Next Line

awkksh

I have the following input, in which some of the values are comma delimited. I would like decouple each value and print it in the new line. I tried many ways in awk, but no luck. If I have three comma delimited values, the first value should be retained in the same row, but the rest of the values should be wrapped up in the new lines.

Input File
===========
key1|0|11881|0|0|0|0|11769|0|0|0
key2|2027|345,712|0|1|0|2040|364,729|0|1|0
key3|0|670944|0|0|0|0|495554|0|0|0
key4|1847|1,21|0|0|0|1814|1,22|0|0|0
key5|1880|11,402|0|154|0|1886|11,397|0|151|0
key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
key7|1851|11,757|0|202|0|1856|13,751|0|193|0

Expected Output
================


key1|0|11881|0|0|0|0|11769|0|0|0
key2|2027|345|0|1|0|2040|364|0|1|0
key2|-|712|-|-|-|-|729|-|-|-
key3|0|670944|0|0|0|0|495554|0|0|0
key4|1847|1|0|0|0|1814|1|0|0|0
key4|-|21|-|-|-|-|22|-|-|-
key5|1880|11|0|154|0|1886|11|0|151|0
key5|-|402|-|-|-|-|397|-|-|-
key6|1|65|0|8|0|16684|51|0|8|0
key6|1|4570|-|-|-|0|4176|-|-|-
key6|19137|-|-|-|-|-|-|-|-|-
key7|1851|11|0|202|0|1856|13|0|193|0
key7|-|757|-|-|-|-|751|-|-|-

EDIT #1

Per a comment left by @Avinash, here are my attempts. I'm new to the awk world, so may be beating around the bush. I followed the below steps. Please suggest me if any easier solution if you have in mind too. I am not getting the solution as expected.

step 1: $ awk -f test.awk a.txt > index.txt --> creates the index files
step 2: $ awk -f test2.awk a.txt > main.txt --> extracts the lines which are comma delimited and duplicate them number of times equal to the comma delimited values
step 3: $ awk -f updt_db1.awk index.txt main.txt --> updates the respective column
index.txt
=========
0|key2|3|345
1|key2|3|712
2|key2|8|364
3|key2|8|729
4|key4|3|1
5|key4|3|21
6|key4|8|1
7|key4|8|22
8|key5|3|11
9|key5|3|402
10|key5|8|11
11|key5|8|397
12|key6|2|1
13|key6|2|1
14|key6|2|19137
15|key6|3|65
16|key6|3|4570
17|key6|7|16684
18|key6|7|0
19|key6|8|51
20|key6|8|4176
21|key7|3|11
22|key7|3|757
23|key7|8|13
24|key7|8|751

main.txt
========
0|key2|2027|345,712|0|1|0|2040|364,729|0|1|0
1|key2|2027|345,712|0|1|0|2040|364,729|0|1|0
2|key2|2027|345,712|0|1|0|2040|364,729|0|1|0
3|key2|2027|345,712|0|1|0|2040|364,729|0|1|0
4|key4|1847|1,21|0|0|0|1814|1,22|0|0|0
5|key4|1847|1,21|0|0|0|1814|1,22|0|0|0
6|key4|1847|1,21|0|0|0|1814|1,22|0|0|0
7|key4|1847|1,21|0|0|0|1814|1,22|0|0|0
8|key5|1880|11,402|0|154|0|1886|11,397|0|151|0
9|key5|1880|11,402|0|154|0|1886|11,397|0|151|0
10|key5|1880|11,402|0|154|0|1886|11,397|0|151|0
11|key5|1880|11,402|0|154|0|1886|11,397|0|151|0
12|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
13|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
14|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
15|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
16|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
17|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
18|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
19|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
20|key6|1,1,19137|65,4570|0|8|0|16684,0|51,4176|0|8|0
21|key7|1851|11,757|0|202|0|1856|13,751|0|193|0
22|key7|1851|11,757|0|202|0|1856|13,751|0|193|0
23|key7|1851|11,757|0|202|0|1856|13,751|0|193|0
24|key7|1851|11,757|0|202|0|1856|13,751|0|193|0

Output

0|key2|2027|345|0|1|0|2040|364,729|0|1|0|
1|key2|2027|712|0|1|0|2040|364,729|0|1|0|
2|key2|2027|345,712|0|1|0|2040|364|0|1|0|
3|key2|2027|345,712|0|1|0|2040|729|0|1|0|
4|key4|1847|1|0|0|0|1814|1,22|0|0|0|
5|key4|1847|21|0|0|0|1814|1,22|0|0|0|
6|key4|1847|1,21|0|0|0|1814|1|0|0|0|
7|key4|1847|1,21|0|0|0|1814|22|0|0|0|
8|key5|1880|11|0|154|0|1886|11,397|0|151|0|
9|key5|1880|402|0|154|0|1886|11,397|0|151|0|
10|key5|1880|11,402|0|154|0|1886|11|0|151|0|
11|key5|1880|11,402|0|154|0|1886|397|0|151|0|
12|key6|1|65,4570|0|8|0|16684,0|51,4176|0|8|0|
13|key6|1|65,4570|0|8|0|16684,0|51,4176|0|8|0|
14|key6|19137|65,4570|0|8|0|16684,0|51,4176|0|8|0|
15|key6|1,1,19137|65|0|8|0|16684,0|51,4176|0|8|0|
16|key6|1,1,19137|4570|0|8|0|16684,0|51,4176|0|8|0|
17|key6|1,1,19137|65,4570|0|8|0|16684|51,4176|0|8|0|
18|key6|1,1,19137|65,4570|0|8|0|0|51,4176|0|8|0|
19|key6|1,1,19137|65,4570|0|8|0|16684,0|51|0|8|0|
20|key6|1,1,19137|65,4570|0|8|0|16684,0|4176|0|8|0|
21|key7|1851|11|0|202|0|1856|13,751|0|193|0|
22|key7|1851|757|0|202|0|1856|13,751|0|193|0|
23|key7|1851|11,757|0|202|0|1856|13|0|193|0|
24|key7|1851|11,757|0|202|0|1856|751|0|193|0|


$ cat test.awk
#!/bin/awk

BEGIN{
FS="|";
counter=0
}

NR == FNR {
for(i=1; i<= NF; i++)
{
  n=split($i,arr,",")
  if ( n > 1)
  for(j=1; j<=n; j++)
       printf "%s|%s|%s|%s\n",counter++,$1,i,arr[j]
}

}



$ cat test2.awk
#!/bin/awk
BEGIN { FS="|"
k=0;
}

NR == FNR {

for(i=1; i<= NF; i++)
{

  n=split($i,arr,",")
  if (n > 1)
     for(j=1; j<=n; j++)
            printf("%s|%s\n",k++,$0)

    }
}


$ cat updt_db1.awk
#!/bin/awk
BEGIN {
    FS = "|"
}
( NR == FNR ) {
    lookup[$1] = $0
}

( NR > FNR ) {
    key = toupper($1)
if (key in lookup){
    split(lookup[key], replacements, "|")
    for (i = 1; i <= NF; i++)
        col[i] = $i;
    for (i=3; replacements[i+1] != "" ; i=i+1){
    j=replacements[i]
    col[j+1] = replacements[i+1]
    }
    for (i = 1; i <= NF; i++)
        printf "%s|", col[i]
    print ""
  }
  else
    print $0;
}

Best Answer

You could use sed like this:

    sed 'h;s/,[^|]*//g;x
    /,/{s/|[^,|]*,*/|-/g;H;}
    x;s/-\([^|]\)/\1/g;P;D'

It wound up being relatively simple after all. Applying that little script to your data gets:

key1|0|11881|0|0|0|0|11769|0|0|0
key2|2027|345|0|1|0|2040|364|0|1|0
key2|-|712|-|-|-|-|729|-|-|-
key3|0|670944|0|0|0|0|495554|0|0|0
key4|1847|1|0|0|0|1814|1|0|0|0
key4|-|21|-|-|-|-|22|-|-|-
key5|1880|11|0|154|0|1886|11|0|151|0
key5|-|402|-|-|-|-|397|-|-|-
key6|1|65|0|8|0|16684|51|0|8|0
key6|1|4570|-|-|-|0|4176|-|-|-
key6|19137|-|-|-|-|-|-|-|-|-
key7|1851|11|0|202|0|1856|13|0|193|0
key7|-|757|-|-|-|-|751|-|-|-

Basically sed just tackles each field from both ends. It first saves a copy of its current iteration to hold space. Then sed removes everything from every field following the first comma on. After that sed switches back to its saved copy so it can remove the field it just saved in the other buffer.

If commas remain it appends the second copy to the first following an inserted \newline character so it can recurse at least once more when P;D Print then Delete only up to the first occurring \newline character in pattern space before starting over with what remains.

Related Question