Shell – Tokenize string from $REPLY in bash script

shell-scripttext processing

This is my first post, and I have no idea how I managed anything before StackExchage, Google, Wiki, GNU, Internet, the list goes on 🙂

I am trying to find a way to construct a SQL database generator bash script, which currently looks like this…

renice -n 19 $$;

idx=32768;
dbt='Radix_en';

cat Domains_en.txt;
cat Tables_en.txt;

while read; 
do
checks="$(echo -n $REPLY | md5sum)";
checks=${checks%"  -"};

echo "insert into $dbt values ($idx,'$(uuidgen)','${checks}',$REPLY);";
idx=$((idx+1));

done < Data.txt;
echo "commit;";

The data comes from Data.txt, currently in the form of:

'NUMBER','US_EN','LATIN','GREEK','GERMAN'
0,'zero','nulla','μηδέν','Null'
1,'one','Unum','ένα','ein'

The output is valid SQL (for Firebird):

create domain ...;
create domain ...;
commit;

create table ( ... );
create table ( ... );
commit;

insert into Radix_en values (32768,'dff0207a-591f-4435-9f8b-7b9b3e6ba2c1','d1f77359b3f7236806489ba3108c771f','NUMBER','US_EN','LATIN','GREEK','GERMAN');
insert into Radix_en values (32769,'5ef0e634-5c96-4ae4-92a8-0d68c02ffeb6','4e3f710600230cf0520bf32269511062',0,'zero','nulla','μηδέν','Null');
insert into Radix_en values (32770,'eae9cacc-3ee3-4471-afad-e5af201da435','9ab2f782988416431238ec63277b11df',1,'one','Unum','ένα','ein');
commit;

I would like to find a way to generate the MD5 checksum for every field, instead of the entire line of text including delimiters.

Data.txt format is not yet finalized, and I may change its format to anything which would makes this possible or easier.

Also, if there has to be several separate steps – fine, since the entire process should be scripted and automated. I was considering processing Data.txt first, then run it through this script, but I would like to simplify the process as much as possible. The number of different Data.txt files could be large, and I still have numerous other processors to include.

As the matter of fact I am also trying to learn more about bash scripting, and I am rather hooping to find expert approach and advise to this problem more than a specific solution.

I am not even sure if my post title is the solution I need, and thus it is related to my question. I was not sure if I should post this on Superuser where I usually visit or here. So I post here first, and sorry if I am off a bit.

Thanks!

Sandor


Edited to add more on 08/23/2014 3:00 AM

Thanks to mikeserv using IFS is working, so my scripts now looks like this:

renice -n 19 $$ > /dev/null; #for now

idx=32768;
dbt='Radix_en';

cat Domains_en.txt;
cat Tables_en.txt;

while read; 
do

gid="$idx,'$(uuidgen)'";

IFS=,; set -f # no gobbling!
echo "insert into $dbt values ($gid";
    for field in $REPLY
    do  
    printf '%s' ",$field,'";
    printf '%s' "$field" | md5sum;
    done | cut -d\  -f1;


echo "$var);";

idx=$((idx+1));

done < Data.txt;

The output is great, the line breaks are making text edit/search much easier while Firebird is still happy, except one thing..

Here is the output:

create domain ...;
create domain ...;
commit;

create table ( ... );
create table ( ... );
commit;

insert into Radix_en values (32768,'303f8957-57cf-4485-ace4-d21c7cf144e6'
,'NUMBER','722d79c16b51fe86610972b8d950038c
,'US_EN','b63fb39e32b062c76694bec58c4f8c67
,'LATIN','fd6f27a3c59111fc2a0b5e452595ef3d
,'GREEK','c081310697bb6b7d7bed5034824e2530
,'GERMAN','15db1d0e1b0861d8ac1f391db801493a
);
insert into Radix_en values (32769,'e7fdf095-d31c-4c59-a23b-7ea67db7aefb'
,0,'cfcd208495d565ef66e7dff9f98764da
,'zero','01b40535afbfd9611e910f58f4ab5146
,'nulla','584edd0b6638798dee53e2c23e84e2d1
,'μηδέν','cd3ed2f1039ed8668b4d48e742bd2e5b
,'Null','e0a93a9e6b0eb1688837d8bab9b4badb
);
insert into Radix_en values (32770,'a21916b5-2a05-4656-ad4e-c8cfee1abfcc'
,1,'c4ca4238a0b923820dcc509a6f75849b
,'one','7e31533231a12e4a560a18ac8cd02112
,'Unum','05d92bcbffbf59b375f25945e9af2dd0
,'ένα','826f5e2d5ba7ace48f4d6fe3c5e2925f
,'ein','dcc09a2cb665ca332d1689cb11aff592
);
commit;

The md5 hash is missing a delimiter at the end, and I have no idea how to negotiate the output with the pipes. What is it I am not understanding here?

Since in this particular case the data fields are going to hold code for programmable ICs no extra characters are going to be acceptable in the checksum between the delimiters, and so far it looks like so. Again, the code is in ASCII and my delimiter is going be something that is not part of ASCII so it is safe. Firebird also going to reject anything not ASCII.

If you would be so kind to point me to how to finish this script, as I am already banging my head against some new issues IFS just showed me (yes, file paths on Windows). I'll try and see how this script is going to work with pure ASCII, then I would like to move on and make another post about some more issues.

Thanks again for your help!

Sandor


Edited to final on 08/30/2014 7:00 PM

Replacing cut with sed seems to work. Firebird field input still needs to be escaped for semi-colons (') with another of the same added, and the current comma IFS delimiter in data files still has to be replaced with non-ASCII. Instead of recursive file lists this script is still a single-file input. echo should probably be replaced by printf, and a whole lot of more…

Here is the final script excluding the shebang:

renice -n 19 $$ >> Radix_en_log.txt;

idx=32768; dbt='Radix_en';
cat Domains_en.txt; cat Tables_en.txt;

while read;     do
gid="$idx,'$(uuidgen)'";

IFS=,; set -f;
echo "insert into $dbt values ($gid";
    for field in $REPLY
    do  
        printf '%s' ",$field,'"; printf '%s' "$field" | md5sum;
    done | sed "s/[ ][ ][-]/\'/g"; printf '%s\n' ");";
    idx=$((idx+1));
        done < Data.txt;
echo "commit;";

Here is the output:

create domain ...;
create domain ...;
commit;

create table ( ... );
create table ( ... );
commit;

insert into Radix_en values (32768,'2f675b86-b2b4-4e52-b000-e6a8cf0f3dca'
,'NUMBER','722d79c16b51fe86610972b8d950038c'
,'US_EN','b63fb39e32b062c76694bec58c4f8c67'
,'LATIN','fd6f27a3c59111fc2a0b5e452595ef3d'
,'GREEK','c081310697bb6b7d7bed5034824e2530'
,'GERMAN','15db1d0e1b0861d8ac1f391db801493a'
);
insert into Radix_en values (32769,'e2afcd65-9a1b-49e3-baf1-74b0619a4776'
,0,'cfcd208495d565ef66e7dff9f98764da'
,'zero','01b40535afbfd9611e910f58f4ab5146'
,'nulla','584edd0b6638798dee53e2c23e84e2d1'
,'μηδέν','cd3ed2f1039ed8668b4d48e742bd2e5b'
,'Null','e0a93a9e6b0eb1688837d8bab9b4badb'
);
insert into Radix_en values (32770,'f51b72eb-d64f-4e9e-ab49-8954df9505cd'
,1,'c4ca4238a0b923820dcc509a6f75849b'
,'one','7e31533231a12e4a560a18ac8cd02112'
,'Unum','05d92bcbffbf59b375f25945e9af2dd0'
,'ένα','826f5e2d5ba7ace48f4d6fe3c5e2925f'
,'ein','dcc09a2cb665ca332d1689cb11aff592'
);
commit;

Thanks!

Sandor

Best Answer

The shell has a built-in variable expansion field separator. So if you have a string and your delimiter is solid you can do:

var=32768,'dff0207a-591f-4435-9f8b-7b9b3e6ba2c1','d1f77359b3f7236806489ba3108c771f','NUMBER','US_EN','LATIN','GREEK','GERMAN'
(   IFS=,; set -f
    for field in $var
    do  printf '\n%s\n\t' "$field - md5:" >&2
        printf %s "$field" |
        md5sum
    done |
    cut -d\  -f1
)

32768 - md5:
        f43764367fa4b73ba947fae71b0223a4

dff0207a-591f-4435-9f8b-7b9b3e6ba2c1 - md5:
        0983e6c45209f390461c1b1df9320674

d1f77359b3f7236806489ba3108c771f - md5:
        07d82ab57ba81f991ab996bd7c5a0441

NUMBER - md5:
        34f55eca38e0605a84f169ff61a2a396

US_EN - md5:
        c9d3e580b7b102e864d9aea8703486ab

LATIN - md5:
        0e869135050d24ea6e7a30fc6edbac6c

GREEK - md5:
        d4cacc28e56302bcec9d7af4bba8c9a7

GERMAN - md5:
        ed73cca110623766d7a2457331a4f373

That should give you a newline separated list of md5s - as it did me.

IFS=, is used to specify that when any variable type shell expansion is performed the shell should split it out on the , character rather than <space><newline><tab> - which is the default. set -f is used to specify that if the shell should encounter any file globs within an unquoted expansion it should not expand them - so echo * would print only * regardless of the contents of the current directory.

For every comma separated field in $var the shell does printf "$field" | md5sum - so once per field without separator strings as I take the question to mean. And last cut trims the few spaces and the - at the end of each output line as it receives them. Most of the output is actually to stderr - cut only ever sees the md5sums.

Related Question