Shell – Remove lines from tab-delimited file with missing values

csv, sed, shell-script, text-processing

I have a large (~900MB) tab-delimited text file that I will be processing in a downstream program. I need to delete any line with a missing value. Each line has the correct number of columns, so a missing value corresponds to two consecutive tabs.

Note: My actual data has ~2 million lines and 80-300 columns. Possible characters are a-z, A-Z, 0-9, - (hyphen), _ (underscore), and tab (the delimiter). No spaces or special characters are in the file.

I am new to this kind of scripting, so an explanation of any code provided would be appreciated. I normally use R, but my file sizes have outgrown R's data-manipulation capabilities.

How can I at the terminal (or within a shell script) delete lines with missing values from a file (e.g. using sed)?

Example Input File:

Col1    Col2    Col3
A        B        C
D                 F
G        H        I
J        K        

Example Output File:

Col1    Col2    Col3
A        B        C
G        H        I 

Best Answer

If your fields can never contain whitespace, an empty field means either a tab as the first character (^\t), a tab as the last character (\t$), or two consecutive tabs (\t\t). You can therefore filter out lines containing any of those:

grep -Ev $'^\t|\t\t|\t$' file
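To see the pieces in action, here is a sketch that builds the example file from the question (under a hypothetical name, sample.tsv) and runs the filter. The $'...' syntax is bash's ANSI-C quoting, which turns each \t into a literal tab before grep sees it; -E enables extended regular expressions (needed for the | alternation), and -v inverts the match so that only non-matching lines are printed:

```shell
# Build the question's sample data (hypothetical file name: sample.tsv)
printf 'Col1\tCol2\tCol3\n' >  sample.tsv
printf 'A\tB\tC\n'          >> sample.tsv
printf 'D\t\tF\n'           >> sample.tsv   # empty middle field (two tabs)
printf 'G\tH\tI\n'          >> sample.tsv
printf 'J\tK\t\n'           >> sample.tsv   # empty last field (trailing tab)

# Keep only lines with no leading tab, no double tab, and no trailing tab
grep -Ev $'^\t|\t\t|\t$' sample.tsv
```

Only the header, "A B C", and "G H I" lines survive, matching the example output above.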

If you can have whitespace, things get more complex. If your fields can begin with spaces, use this instead (it considers a field with only spaces to be empty):

grep -Pv '\t\s*(\t|$)|\t$|^\t' file

The added \t\s*(\t|$) alternative filters out lines containing a tab followed by zero or more whitespace characters and then either another tab or the end of the line.
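As a quick demonstration, the sketch below builds a file with a spaces-only middle field (hypothetical name: sp.tsv) and applies the filter. Note that -P enables Perl-compatible regular expressions, which is a GNU grep extension and may be unavailable on other implementations:

```shell
# A field that contains only spaces (hypothetical file name: sp.tsv)
printf 'A\tB\tC\n'   >  sp.tsv
printf 'D\t   \tF\n' >> sp.tsv   # middle field is spaces only

# \t\s*(\t|$) catches a tab, optional whitespace, then a tab or line end
grep -Pv '\t\s*(\t|$)|\t$|^\t' sp.tsv
```

The first line is printed; the second is dropped because its spaces-only field matches \t\s*\t.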

That will also fail if the last field contains nothing but spaces. To avoid that too, use perl: -a splits each input line into the @F array, -F'\t' makes tab the split separator, -l handles the trailing newline, and -n with -e runs the given one-liner on every input line. Print the line unless one of its fields is empty or all-whitespace (/^\s*$/):

perl -F'\t' -lane 'print unless grep {/^\s*$/} @F' file

One caveat: perl's split drops trailing empty fields, so a line ending in a bare tab still slips through this version; the grep solutions above already catch that case.
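An equivalent awk sketch, for comparison (hypothetical file name: data.tsv): awk splits each record on tabs and, unlike perl's split, keeps trailing empty fields, so a line ending in a bare tab is caught as well:

```shell
# Build a tab-delimited sample (hypothetical file name: data.tsv)
printf 'A\tB\tC\n'  >  data.tsv
printf 'D\t\tF\n'   >> data.tsv   # empty middle field
printf 'G\tH\tI\n'  >> data.tsv
printf 'J\tK\t\n'   >> data.tsv   # empty last field (trailing tab)

# Skip any record that has an empty or whitespace-only field
awk -F'\t' '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[[:space:]]*$/) next
    print
}' data.tsv
```

Only "A B C" and "G H I" are printed; both the double-tab line and the trailing-tab line are skipped.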