How to adjust numeric fields in a text file

text processing

Suppose I have the following <Tab> separated text file:

file name    size      owner    
file1.txt    12.345    root
file2.txt    0.172222  user1
file3.txt    2.46e2    user2
file4.txt    12345     root
file5.txt    21        user3
file6.txt    246.0     user1
file name    owner     last modified    last accessed
text4.txt    root      12.73            13.53
text5.txt    user3     15.3333          34
file1.txt    root      23               31.0032

This file consists of several "tables", each of which starts with a header line and then contains some data lines.
Some columns are numeric, but each table can have different number, as well as different types of columns. The types of columns are not known in advance and they can not be determined according to the table header.

The numeric values in the table use various formats – there might be integers, as well as floating point decimal numbers or also numbers in scientific notation.

My question is how to convert all the numeric fields in this table into the same format. For example, I might want to have every numeric field formatted with a "%.2f" printf format specifier. Naturally, the other, non-numeric fields must remain unchanged.

Also, I would like to be able to arbitrarily adjust (e.g. add 42 and then multiply by 7) every numeric field contained in this file.

The solution I am looking for should be field-based.
It should scan the entire file and for each field it should determine whether it is numeric or not. If it is numeric, it should print its adjusted and formatted value. Otherwise, it should just print the original.

I know that something like that can be done with awk. But if I remember correctly, awk uses double for internal representation of numbers and therefore it might have problems with precision and larger values. So, ideally, I would like to use something else, something which should correctly handle at least 64 bit integers.

Is there any simple way to achieve this?

Best Answer

perl has a module called Scalar::Util (included with perl since v5.8) which has a useful function called looks_like_number(), which can be used to detect whether a field is a number or not.

looks_like_number is not perfect, but is pretty good.

The bare outline of a simple perl program to do what you want might look something like this:

#! /usr/bin/perl

use Scalar::Util qw(looks_like_number);

while(<>) {
  chomp;
  my @fields=split("\t");
  foreach my $f (0..scalar @fields-1) {
    if (looks_like_number($fields[$f])) {
      $fields[$f] += 42;
      $fields[$f] *= 7;
      $fields[$f] = sprintf("%.2f",$fields[$f]);
    }
  }
  print join("\t",@fields),"\n";
}

If given your sample data above as input, it prints this:

file name   size    owner    
file1.txt   380.41  root
file2.txt   295.21  user1
file3.txt   2016.00 user2
file4.txt   86709.00    root
file5.txt   441.00  user3
file6.txt   2016.00 user1
file name   owner   last modified   last accessed
text4.txt   root    383.11  388.71
text5.txt   user3   401.33  532.00
file1.txt   root    455.00  511.02

Here's another version of the script that uses Math::BigFloat for all calculations, rounding decimals to 2 digits.

#! /usr/bin/perl

use Scalar::Util qw(looks_like_number);
use Math::BigFloat;

while(<>) {
  chomp;
  my @fields=split("\t");
  foreach my $f (0..scalar @fields-1) {
    if (looks_like_number($fields[$f])) {
      my $BF = Math::BigFloat->new($fields[$f]);
      $BF->badd(42);
      $BF->bmul(7);
      $BF->ffround(-2);

      $fields[$f] = $BF->bstr();
    }
  }
  print join("\t",@fields),"\n";
}

example input:

file name   owner   last modified   last accessed
text4.txt   root    12.73   13.53
text5.txt   user3   15.3333 34
file6.txt   root    903709792518875002.42857142857142857142 903709792518875002
file7.txt   root    6659166111488656281486807152009765625   539422123247359763587428687890625

output:

file name   owner   last modified   last accessed
text4.txt   root    383.11  388.71
text5.txt   user3   401.33  532.00
file6.txt   root    6325968547632125311.00  6325968547632125308.00
file7.txt   root    46614162780420593970407650064068359669.00   3775954862731518345112000815234669.00
Related Question