Remove fields containing specific string

perl sed text-processing

I have a file, file1, containing multiple tab-separated fields. I would like to remove only the fields that contain a specific string, in my case the underscore character _, without removing the whole row:

cat file1
357M        2054_
357_        154=        1900_
511_        419X        1481_        34=

I would like to obtain the following:

cat file2
357M
154=
419X        34=

I managed to remove the fields as follows:

cat file1 | perl -pe 's/\w+_\s*//g'
357M    154=        419X        34=

But the format is not good, because I would like not to alter the number of columns.

I also tried:

cat file1 | sed 's/[0-9]*_//g'
357M
          154=
          419X         34=

But I would like to get rid of those empty columns.

A brute force approach that actually also works:

cat file1 | sed 's/[0-9]*_//g' | tr -s '\t' '\t' | sed 's/^[ \t]*//g'
357M
154=
419X         34=

This last command (1) removes all fields containing an underscore, (2) replaces runs of consecutive tabs with a single tab, and (3) removes leading tabs. It is not very elegant, though.

Any suggestions?

Best Answer

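You could use this simple sed command (\w and \s are GNU sed extensions):

sed 's/\w*_\s*//g;/^$/d' infile.txt

The g flag removes every matching field on a line, not just the first one. /^$/d then deletes any line left empty, which happens when every field on the line contained an underscore (for example a lone foo_ or _).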

Giving result:

357M
154=
419X    34=
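
The sed command can leave a trailing tab when the last field of a line is removed. If that matters, a field-by-field approach may be easier to reason about. The following is a minimal sketch, not part of the answer above: it assumes tab-separated input and rebuilds each line from the fields that contain no underscore anywhere. The function name drop_underscore_fields is made up for illustration.

```shell
# drop_underscore_fields is a hypothetical helper name.
# Reads tab-separated lines from the given files (or stdin) and prints
# each line with every field containing "_" removed.
drop_underscore_fields() {
    awk -F'\t' '{
        n = 0      # number of fields kept so far on this line
        out = ""   # the rebuilt line
        for (i = 1; i <= NF; i++) {
            if (index($i, "_")) continue          # skip fields with "_"
            out = (n++ ? out "\t" : "") $i         # join kept fields with tabs
        }
        if (n) print out                           # drop lines with nothing left
    }' "$@"
}
```

It could be called as drop_underscore_fields file1 > file2. Because the line is rebuilt from the kept fields, no leading, trailing, or doubled tabs are left behind.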

Related Solutions

Shell – compare two columns of different files and print if it matches

Explanation

-F'|' : sets the field separator to |.
NR==FNR : NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
c[$1$2]++; next : if this is the 1st file, save the 1st two fields in the c array. Then, skip to the next line so that this is only applied on the 1st file.
c[$1$2]>0 : this part is only reached while the second file is being read, so we check whether fields 1 and 2 of the current line have already been seen (c[$1$2]>0). In awk, the default action is to print the line, so if c[$1$2]>0 is true, the line is printed.
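
The awk command that these notes describe is not reproduced here. A sketch consistent with the pieces listed above might look like the following, assuming file2 supplies the keys and file1 the lines to filter (the same roles as in the Perl version). The function name and its argument order are assumptions made for illustration.

```shell
# match_first_two is a hypothetical wrapper; the awk program is assembled
# from the fragments explained above.
# Usage: match_first_two KEYS_FILE DATA_FILE
match_first_two() {
    # While reading the first file (NR==FNR), remember fields 1 and 2.
    # For the second file, print lines whose fields 1 and 2 were seen.
    awk -F'|' 'NR==FNR { c[$1$2]++; next } c[$1$2] > 0' "$1" "$2"
}
```

With the file names used in this answer, the call would be match_first_two file2 file1.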

Alternatively, since you tagged with Perl:

perl -e 'open(A, "file2"); while(<A>){/.+?\|[^|]+/ && $k{$&}++};
         while(<>){/.+?\|[^|]+/ && do{print if defined($k{$&})}}' file1

Explanation

The first line will open file2, read everything up to the 2nd | (.+?\|[^|]+) and save that (the $& is the result of the last match operator) in the %k hash.

The second line processes file1, uses the same regex to extract the 1st two columns and print the line if those columns are defined in the %k hash.

Both of the above approaches need to hold the first two columns of file2 in memory. That shouldn't be a problem if you only have a few hundred thousand lines, but if it is, you could do something like:

cut -d'|' -f 1,2 file2 | while IFS= read -r pat; do grep "^$pat" file1; done

But that will be slower, because it runs grep over file1 once for every line of file2.

Text Processing – Removal of Lines with No More or Fewer Than ‘N’ Fields

You almost have it already:

awk -F'\t' 'NF==13 {print}' infile  > newfile

And, if you're on one of those systems where you're charged by the keystroke ( :) ) you can shorten that to

awk -F'\t' 'NF==13' infile  > newfile

To process multiple files in one sweep, and to actually change the files (rather than just create new ones), pick a filename that's not in use (for example, scharf) and run a loop like this:

for f in list
do
    awk -F'\t' 'NF==13 {print}' "$f" > scharf  &&  mv -f -- scharf "$f"
done

The list can be one or more filenames and/or wildcard filename expansion patterns; for example,

for f in blue.data green.data *.dat orange.data red.data /ultra/violet.dat

The mv command overwrites the input file (e.g., blue.data) with the temporary scharf file (which has only the lines from the input file with 13 fields). (Be sure this is what you want to do, and be careful. To be safe, you should probably back up your data first.) The -f tells mv to overwrite the input file, even though it already exists. The -- protects you against weirdness if any of your files has a name beginning with -.
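
A fixed temporary name such as scharf could collide with a real file. As a sketch of the same loop, assuming the mktemp utility is available on your system, the temporary name can be generated instead. The function name keep_n_fields and its argument order are made up for illustration.

```shell
# keep_n_fields is a hypothetical helper, not part of the answer above.
# Usage: keep_n_fields N file...
# Rewrites each file in place, keeping only lines with exactly N
# tab-separated fields.
keep_n_fields() {
    n=$1
    shift
    for f in "$@"; do
        # mktemp creates a new, uniquely named empty file and prints its path.
        tmp=$(mktemp) || return 1
        if awk -F'\t' -v n="$n" 'NF == n' "$f" > "$tmp"; then
            mv -f -- "$tmp" "$f"
        else
            rm -f -- "$tmp"   # awk failed: leave the original untouched
        fi
    done
}
```

For example, keep_n_fields 13 blue.data green.data *.dat. The same caution applies as above: the input files are overwritten, and each one ends up with the permissions of the temporary file rather than its original ones, so back up your data first.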
