Linux – How to correctly format the output with Awk printf command

awk, linux, text processing

I have the following file:

cat filename
    dfT08r352|30.5|2010/06/01|2016/08/29|2281|6.24503764544832|74.9404517453799|
    zm00dr121|37|2008/03/05|2011/09/12|1285.95833333333|3.52076203513575|42.249144421629|
    ccvd00121|41.6|2008/03/05|2012/03/05|1461|4|48|
    sddf00121|39.6|2008/03/05|2012/09/10|1649.95833333333|4.51733972165184|54.208076659822|
    fttt00121|41|2008/03/05|2013/09/16|2020.95833333333|5.53308236367785|66.3969883641342|
    ghhyy0121|42.2|2008/03/05|2014/03/18|2203.95833333333|6.03410905772302|72.4093086926762|

I am trying to format this file using awk printf to have the following desired format:

keep the same order of fields (left to right)
use a comma (",") as the output field separator
for the last three fields only ($5, $6, $7), print every number with 4 digits before the decimal point (padded with leading zeros if shorter) and exactly 2 digits after it, like 0123.12 or 1234.10

I wrote the following awk command:

awk -F"|" '{print $1","$2","$3","$4}{format = "%04.2f,%04.2f,%04.2f,"}{printf format, $5,$6,$7}' filename

However, the output below has the following issues:

the fields are not in order (left to right)

the numbers do not have leading zeros

dfT08r352,30.5,2010/06/01,2016/08/29
2281.00,6.25,74.94,zm00dr121,37,2008/03/05,2011/09/12
1285.96,3.52,42.25,ccvd00121,41.6,2008/03/05,2012/03/05
1461.00,4.00,48.00,sddf00121,39.6,2008/03/05,2012/09/10
1649.96,4.52,54.21,fttt00121,41,2008/03/05,2013/09/16
2020.96,5.53,66.40,ghhyy0121,42.2,2008/03/05,2014/03/18

Can someone please let me know what my mistake is and how to fix it?

Best Answer

You have the fields in the right order, but your first print statement ends its output with a newline (the Output Record Separator, ORS), while the printf that follows it does not. Your data is all there, but the line breaks land in the wrong place.
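A minimal illustration of that difference, using made-up strings rather than the question's data (the function name is only for this sketch): print appends ORS, while printf emits nothing beyond what its format string spells out.

```shell
# print ends its output with ORS (a newline by default);
# printf writes only what the format string contains.
demo_print_vs_printf() {
  awk 'BEGIN { print "a,b"; printf "%s,", "c"; printf "%s\n", "d" }'
}
demo_print_vs_printf
```

This prints a,b on one line and c,d on the next: the line break comes from print and from the explicit \n, never from the first printf.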

The second issue is that you're telling printf to use a minimum width of 4. That width covers the whole number, including the decimal point and the two digits after it, which leaves only one position for the integer part and none for any padding. Try using 5 as the width, so that your data is zero-padded to four digits in total (two before the point and two after). If you want 4 digits before the decimal point, as in 0123.12, change the width to 7 instead.
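To see the effect of the width, here is a small sketch that formats one value from the question with widths 4, 5 and 7 (the function name is only for illustration):

```shell
# The 0 flag pads with zeros; the width counts the integer digits,
# the decimal point and the decimals together.
demo_widths() {
  awk 'BEGIN { x = 6.24503764544832; printf "%04.2f %05.2f %07.2f\n", x, x, x }'
}
demo_widths
```

The output is 6.25 06.25 0006.25: width 4 leaves no room for padding, width 5 gives one leading zero, and width 7 gives four digits before the point.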

This is the smallest change to your program that outputs what I think you want:

awk -F"|" '{
  format = "%05.2f,%05.2f,%05.2f"; 
  print $1","$2","$3","$4"," sprintf(format, $5,$6,$7)}' filename

I combined multiple { } blocks into one, and also combined the print statements into one.

If I were to write your awk statement from scratch, I might do something like this:

awk -v FS=\| -v OFS=, '{
  $5=sprintf("%05.2f", $5); 
  $6=sprintf("%05.2f", $6); 
  $7=sprintf("%05.2f", $7); 
  print $1,$2,$3,$4,$5,$6,$7}' filename

It explicitly sets the input field separator (FS) and the output field separator (OFS), converts each of the three fields on its own, then prints the desired fields with OFS between them.
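If the goal really is four digits before the decimal point, as the question's 0123.12 example suggests, a variant along the following lines might work. It is only a sketch: format_fields is a hypothetical helper name, and it uses width 7 and a loop instead of three separate sprintf calls.

```shell
# Hypothetical helper: reads |-separated records on stdin, writes CSV.
format_fields() {
  awk -F'|' -v OFS=',' '{
    # Zero-pad fields 5..7 to width 7: 4 integer digits, the point, 2 decimals.
    for (i = 5; i <= 7; i++) $i = sprintf("%07.2f", $i)
    print $1, $2, $3, $4, $5, $6, $7
  }'
}
printf '%s\n' 'ccvd00121|41.6|2008/03/05|2012/03/05|1461|4|48|' | format_fields
```

For that sample record the result is ccvd00121,41.6,2008/03/05,2012/03/05,1461.00,0004.00,0048.00.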

Related Solutions

Why awk says “syntax error” for the comma I placed between the two patterns

awk 'BEGIN {
        ...
     }
     # the next line should NOT be within curly braces
     $1 ~ /^Observation/, $1 ~ /^@@@/ { ... }
     {
        ...
     }
     END{
        ...
     }' input.txt > out.csv
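The comma between the two patterns forms a range pattern, which is why it has to sit outside the braces. A small self-contained sketch with made-up input lines (print_block is a hypothetical name):

```shell
# Run the action on every line from a match of the start pattern
# through the next match of the end pattern, both inclusive.
print_block() {
  awk '$1 ~ /^Observation/, $1 ~ /^@@@/ { print }'
}
printf '%s\n' 'header' 'Observation 1' 'value 42' '@@@' 'footer' | print_block
```

Only the three lines from "Observation 1" through "@@@" are printed; "header" and "footer" fall outside the range.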

Adding leading zeros into date and time

A great tool for text processing is awk. The following example is using plain standard awk on FreeBSD 11.1. @RomanPerekhrest has an elegant solution in another answer if you prefer GNU awk.

Your input is comma-separated. Because of this we invoke awk with the -F, parameter.

We can then print out columns using the print statement. $1 is the first column. $2 is the second column.

$ awk -F, '{ print $8 }' inputfile.csv
2017-1-5 1:07:09
2017-11-25 19:57:17

This gives us the 8th column for each row.

This is the date field you want to manipulate. Rather than setting the delimiter with a command-line parameter, we can do it as part of the script: FS is the input delimiter and OFS is the output delimiter.

$ awk 'BEGIN { FS = "," } ; { print $8 }' inputfile.csv
2017-1-5 1:07:09
2017-11-25 19:57:17
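The example above only sets FS. As a quick sketch of how OFS comes into play (made-up input, hypothetical function name), note that OFS is inserted between the comma-separated arguments of print:

```shell
# FS splits the input on commas; OFS is written between the
# comma-separated arguments of print.
swap_delimiter() {
  awk 'BEGIN { FS = ","; OFS = ";" } { print $1, $3 }'
}
echo 'a,b,c' | swap_delimiter
```

This prints a;c. Writing print $1 $3 without the comma would concatenate the two fields and ignore OFS.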

When working with dates I often prefer to use the date util to make sure I handle them correctly. And I do not need to worry if I am using regular or GNU awk. Furthermore I get a big fat failure if the date does not parse correctly.

The interesting parameters (for the BSD date shipped with FreeBSD) are:

-j     Specify we do not want to set the date at all
-f     The format string we use for input
+      The format string we use for output

So if we run this for one date:

$ date -j -f "%Y-%m-%d %H:%M:%S" +"%Y-%m-%d %H:%M:%S" "2017-1-5 1:07:09"
2017-01-05 01:07:09

We can then combine this with awk. Notice how the quotes are escaped. This is probably the biggest stumbling block for a beginner.

$ awk -F, '{ system("date -j -f \"%Y-%m-%d %H:%M:%S\" +\"%Y-%m-%d %H:%M:%S\" \""$8"\"")}' inputfile.csv
2017-01-05 01:07:09
2017-11-25 19:57:17

The system call does the job, but unfortunately it only lets us capture the return code, and the command's output goes straight to standard output. To capture the output instead, we use the cmd | getline pattern. The following simple example reads the current date into mydate:

$ awk 'BEGIN { cmd = "date"; cmd | getline mydate; close(cmd); print mydate }'
Thu Mar  1 16:26:15 CET 2018

We use the BEGIN keyword as we have no input to this simple example.

So let us expand this:

awk 'BEGIN { FS=","; OFS=FS };
     { 
         cmd = "date -j -f \"%Y-%m-%d %H:%M:%S\" +\"%Y-%m-%d %H:%M:%S\" \""$8"\"";
         cmd | getline firstdate;
         close(cmd);
         cmd = "date -j -f \"%Y-%m-%d %H:%M:%S\" +\"%Y-%m-%d %H:%M:%S\" \""$9"\"";
         cmd | getline seconddate;
         close(cmd);
         print $1,$2,$3,$4,$5,$6,$7,firstdate,seconddate
     }' inputfile.csv

And we can collapse it to a one-liner:

awk 'BEGIN {FS=",";OFS=FS};{cmd="date -j -f \"%Y-%m-%d %H:%M:%S\" +\"%Y-%m-%d %H:%M:%S\" \""$8"\"";cmd | getline firstdate;close(cmd);cmd="date -j -f \"%Y-%m-%d %H:%M:%S\" +\"%Y-%m-%d %H:%M:%S\" \""$9"\"";cmd | getline seconddate;close(cmd);print $1,$2,$3,$4,$5,$6,$7,firstdate,seconddate}' inputfile.csv

Which gives me the output:

1111,2222,3333,4444,5555,6666,7777,2017-01-05 01:07:09,2017-01-05 01:11:53
1111,2222,3333,4444,5555,6666,7777,2017-11-25 19:57:17,2017-11-25 19:58:54
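On systems where date does not accept -j (GNU date, for example), the padding itself can be done in plain awk with split and printf. This is a different technique from the date-based approach above and, unlike it, performs no validation of the date; pad_datetime is a hypothetical helper name, and the sketch assumes each input line holds just one "date time" pair.

```shell
# Hypothetical helper: zero-pads "Y-M-D H:M:S" read from stdin.
# No validation: whatever numbers are present are simply padded.
pad_datetime() {
  awk '{
    split($1, d, "-")   # date parts: year, month, day
    split($2, t, ":")   # time parts: hour, minute, second
    printf "%04d-%02d-%02d %02d:%02d:%02d\n", d[1], d[2], d[3], t[1], t[2], t[3]
  }'
}
echo '2017-1-5 1:07:09' | pad_datetime
```

For the first sample date this yields 2017-01-05 01:07:09, the same as the date call shown earlier.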

Addendum

As the purpose here is to learn good habits, I had better update this answer. Repeating code is a bad habit; when you start doing that, you should split things into a function. As you will notice, the code below immediately becomes more readable.

awk 'function convertdate(the_date) {
         cmd = "date -j -f \"%Y-%m-%d %H:%M:%S\" +\"%Y-%m-%d %H:%M:%S\" \""the_date"\"";
         cmd | getline formatted_date;
         close(cmd);
         return formatted_date
     }
     BEGIN { FS=","; OFS=FS };
     { 
         print $1,$2,$3,$4,$5,$6,$7,convertdate($8),convertdate($9)
     }' inputfile.csv

Make a habit of this and you will notice how much easier it will become to introduce error handling later on.
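As a rough sketch of what such error handling could look like: getline returns a value greater than 0 only when it actually read a line, so the function can check that before trusting the result. The names first_line and run_demo and the "ERROR" fallback are assumptions for illustration, and echo and true stand in for the date command so the example runs anywhere.

```shell
run_demo() {
  awk '
    # Hypothetical helper: return the first output line of cmd, or the
    # fallback when the command produced no output.
    # line and status are locals (extra parameters), so a failed call
    # cannot return a value left over from an earlier one.
    function first_line(cmd, fallback,    line, status) {
      status = (cmd | getline line)
      close(cmd)
      return (status > 0) ? line : fallback
    }
    BEGIN {
      print first_line("echo hello", "ERROR")
      print first_line("true", "ERROR")
    }'
}
run_demo
```

The first call prints hello; the second prints ERROR because true writes nothing. A date invocation that fails to parse its input and writes only to standard error would take the same fallback path.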
