Concurrency – Find, Hash, and Replace Across Rows


I have a bunch of files, and each row contains a unique value that I'm trying to obscure with a hash.

However, there are 3M rows across the files, and a rough calculation of the time needed to complete the process comes out at a hilariously long 32 days.

for y in files*; do
  cat "$y" | while read -r z; do
    KEY=$(echo "$z" | awk '{ print $1 }' | tr -d '"')
    HASH=$(echo "$KEY" | sha1sum | awk '{ print $1 }')
    sed -i -e "s/$KEY/$HASH/g" "$y"
  done
done

To improve this process's speed, I assume I'm going to have to introduce some concurrency.

A hasty attempt led me to:

for y in gta*; do
  cat "$y" | while read -r z; do
    ((i=i%N)); ((i++==0)) && wait
    (KEY=$(echo "$z" | awk '{ print $1 }' | tr -d '"')
    HASH=$(echo "$KEY" | sha1sum | awk '{ print $1 }')
    sed -i -e "s/$KEY/$HASH/g" "$y") &
  done
done

Which performs no better.

Example input

"2000000000" : ["200000", "2000000000"]
"2000000001" : ["200000", "2000000001"]

Example output

"e8bb6adbb44a2f4c795da6986c8f008d05938fac" : ["200000", "e8bb6adbb44a2f4c795da6986c8f008d05938fac"]
"aaac41fe0491d5855591b849453a58c206d424df" : ["200000", "aaac41fe0491d5855591b849453a58c206d424df"]

Perhaps I should read the lines concurrently then perform the hash-replace on each line?

Best Answer

FWIW I think this is the fastest way you could do it in a shell script:

$ cat
#!/usr/bin/env bash

for file in "$@"; do
    # Split each row on double quotes: a[1] is the key, a[5] is its repeat.
    while IFS='"' read -ra a; do
        sha=$(printf '%s' "${a[1]}" | sha1sum)
        # sha1sum prints "<hash>  -"; keep only the hash.
        sha="${sha%% *}"
        printf '%s"%s"%s"%s"%s"%s"%s\n' "${a[0]}" "$sha" "${a[2]}" "${a[3]}" "${a[4]}" "$sha" "${a[6]}"
    done < "$file"
done

$ ./ file

$ cat file
"e8bb6adbb44a2f4c795da6986c8f008d05938fac" : ["200000", "e8bb6adbb44a2f4c795da6986c8f008d05938fac"]
"aaac41fe0491d5855591b849453a58c206d424df" : ["200000", "aaac41fe0491d5855591b849453a58c206d424df"]

but as I mentioned in the comments, you'd be better off, for speed of execution, using a tool with sha1sum functionality built in, e.g. Python.
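To illustrate that last point, here is a minimal sketch of the same hash-and-replace done in Python with the standard `hashlib` module, so no external process is spawned per row. It is only a sketch under a few assumptions: the key is taken to be the first double-quoted field on each row, the key is hashed without a trailing newline (like `printf '%s'` above, unlike `echo`), and the function names `hash_key`, `rewrite_line` and `rewrite_file` are made up for this example.

```python
#!/usr/bin/env python3
"""Replace the quoted key on each row with its SHA-1 hex digest."""
import hashlib
import os
import sys
import tempfile


def hash_key(key: str) -> str:
    # Assumption: hash the bare key with no trailing newline,
    # i.e. the equivalent of: printf '%s' "$key" | sha1sum
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def rewrite_line(line: str) -> str:
    # Assumption: the key is the first double-quoted field on the row.
    parts = line.split('"', 2)
    if len(parts) < 3 or not parts[1]:
        return line  # no usable quoted key: leave the row untouched
    key = parts[1]
    # Replace every quoted occurrence of the key, on this row only.
    return line.replace(f'"{key}"', f'"{hash_key(key)}"')


def rewrite_file(path: str) -> None:
    # Write to a temporary file in the same directory, then swap it in,
    # so the file is read and written exactly once.
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8", newline="") as src, \
            tempfile.NamedTemporaryFile(
                "w", encoding="utf-8", newline="",
                dir=directory, delete=False) as dst:
        for line in src:
            dst.write(rewrite_line(line))
    os.replace(dst.name, path)


if __name__ == "__main__":
    for path in sys.argv[1:]:
        rewrite_file(path)
```

Saved under a hypothetical name such as `hash_keys.py`, it would be run as `python3 hash_keys.py files*`. If the hashes must match what `echo "$KEY" | sha1sum` produces, the newline that `echo` appends would need to be hashed too (for example by encoding `key + "\n"`). Since each file is rewritten independently, separate files could also be handed to separate processes if a single core turns out to be the bottleneck.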
