Delete Lines with Specific Word Count Using awk or sed

awk sed text-processing

I have a large file which looks like:

India 07 1800 BAHRAICH 42273 +28.4 +26.7 NA 997.1 1 NA NA
India 07 1800 BAHRAICH 42273 +28.4 +26.7 NA 997.1 NA NA NA
India 07 1800 BALASORE 42895 +29.0 +26.8 NA 999.7 NA NA NA
India 07 1800 BANGALORE 43295 +23.0 +17.4 908.1 geopotential_of_850mb_=_492 NA NA NA
India 07 1800 BANGALORE 43295 +23.0 +17.4 908.1 geopotential_of_850mb_=_492 Trace NA NA
India 07 1800 BAREILLY 42189 +28.4 +26.2 NA 997.4 NA NA NA
India 07 1800 BAREILLY 42189 +28.4 +26.2 NA 997.4 Trace NA NA
India 07 1800 BARMER 42435 +35.6 +22.6 NA 997.6 NA NA NA
India 07 1800 BHOPAL_BAIRAGHAR 42667 +23.6 +23.3 942.7 1000.5 13 NA NA
India 07 1800 BHOPAL_BAIRAGHAR 42667 +23.6 +23.3 942.7 1000.5 NA NA NA
India 07 1800 BHUBANESHWAR 42971 +28.0 +25.7 NA 1000.7 NA NA NA
India 07 1800 BHUJ-RUDRAMATA 42634 +29.6 +25.7 NA 999.5 NA NA NA
India 07 1800 BIKANER 42165 +33.8 +25.1 NA 994.0 NA NA NA
India 07 1800 BIKANER 42165 +33.8 +25.1 NA 994.0 NA NA NA
India 07 1800 BOMBAY_SANTACRUZ 43003 +29.0 +26.8 NA 1004.4 10 NA NA
India 07 1800 BOMBAY_SANTACRUZ 43003 +29.0 +26.8 NA 1004.4 NA NA NA

In this file, 2-3 lines are the same except for a single differing entry, an "NA", which can occur at any position. I want to keep the line with the fewest "NA" entries.

I am not able to think of a solution for this.

I want output as:

India 07 1800 BAHRAICH 42273 +28.4 +26.7 NA 997.1 1 NA NA
India 07 1800 BALASORE 42895 +29.0 +26.8 NA 999.7 NA NA NA
India 07 1800 BANGALORE 43295 +23.0 +17.4 908.1 geopotential_of_850mb_=_492 Trace NA NA
India 07 1800 BAREILLY 42189 +28.4 +26.2 NA 997.4 Trace NA NA
India 07 1800 BARMER 42435 +35.6 +22.6 NA 997.6 NA NA NA
India 07 1800 BHOPAL_BAIRAGHAR 42667 +23.6 +23.3 942.7 1000.5 13 NA NA
India 07 1800 BHUBANESHWAR 42971 +28.0 +25.7 NA 1000.7 NA NA NA
India 07 1800 BHUJ-RUDRAMATA 42634 +29.6 +25.7 NA 999.5 NA NA NA
India 07 1800 BIKANER 42165 +33.8 +25.1 NA 994.0 NA NA NA
India 07 1800 BOMBAY_SANTACRUZ 43003 +29.0 +26.8 NA 1004.4 10 NA NA

I would appreciate even just the logic to do so.

Thanks

Best Answer

Assuming the key is the 4th field and records with identical keys are consecutive (and that I understood your question correctly), you could do something like:

perl -lane '
  # count the NA fields on this line
  $na = grep {$_ eq "NA"} @F;

  if ($F[3] eq $last_key) {    # same key as the previous line
    if ($na < $min_na) {
      $min_na = $na; $min = $_
    }
  } else {                     # new key: print the best line of the previous group
    print $min unless $. == 1;
    $last_key = $F[3]; $min = $_; $min_na = $na;
  }
  END{print $min if $.}' < your-file

Among consecutive lines with the same 4th field, this prints the first one with the least number of NA fields.
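If perl is not available, the same consecutive-key logic can be expressed in plain awk. A minimal sketch; the here-document with two of your BAREILLY lines is only for demonstration, you would pass your real file instead:

```shell
awk '
{
  # count the NA fields on this line
  n = 0
  for (i = 1; i <= NF; i++) if ($i == "NA") n++

  if ($4 == key) {             # same station as the previous line
    if (n < min) { min = n; best = $0 }
  } else {                     # new station: emit the best line of the last group
    if (NR > 1) print best
    key = $4; best = $0; min = n
  }
}
END { if (NR) print best }     # flush the final group
' <<'EOF'
India 07 1800 BAREILLY 42189 +28.4 +26.2 NA 997.4 NA NA NA
India 07 1800 BAREILLY 42189 +28.4 +26.2 NA 997.4 Trace NA NA
EOF
```

This prints only the second BAREILLY line, which has one NA fewer than the first.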

If they're not consecutive, you could use some sorting:

< yourfile awk '{n = 0; for (i = 1; i <= NF; i++) if ($i == "NA") n++; print n, $0}' |
  sort -k5,5 -k1,1n |
  sort -muk5,5 |
  cut -d ' ' -f 2-

With busybox sort, you'd want to add the -s option to the second invocation, as it seems to do some sorting of the input again despite -m.
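To make the decorate/sort/undecorate idea concrete, here is a toy run on three made-up lines (the station names and field counts are illustrative only, not your real data). The first awk prepends the NA count, the first sort groups by station with the lowest count first, the merge-unique sort keeps only the first (best) line per station, and cut strips the count again:

```shell
printf '%s\n' \
  'India 07 1800 BAREILLY 42189 NA NA' \
  'India 07 1800 AGRA 42001 NA NA' \
  'India 07 1800 BAREILLY 42189 Trace NA' |
  awk '{n = 0; for (i = 1; i <= NF; i++) if ($i == "NA") n++; print n, $0}' |
  sort -k5,5 -k1,1n |
  sort -muk5,5 |
  cut -d ' ' -f 2-
```

Only the AGRA line and the BAREILLY line containing "Trace" survive; note that after decoration the station name has shifted to field 5, which is why the sort keys use -k5,5.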
