Shell – Remove lines from tab-delimited file with missing values

csv, sed, shell-script, text-processing

I have a large (~900MB) tab-delimited text file that I will be processing in a downstream program. I need to delete any line with a missing value. Each line has the correct number of columns, so a missing value corresponds to two consecutive tabs.

Note: My actual data has ~2 million lines and 80-300 columns. Possible characters are a-z, A-Z, 0-9, - (hyphen), _ (underscore), and tab (the delimiter). No spaces or special characters are in the file.

I am new to this kind of scripting, so an explanation of any code provided would be appreciated. I normally use R, but my file sizes have outgrown R's data-manipulation capabilities.

How can I at the terminal (or within a shell script) delete lines with missing values from a file (e.g. using sed)?

Example Input File:

Col1    Col2    Col3
A        B        C
D                 F
G        H        I
J        K        

Example Output File:

Col1    Col2    Col3
A        B        C
G        H        I 

Best Answer

If your fields can never contain whitespace, an empty field means either a tab as the first character (^\t), a tab as the last character (\t$), or two consecutive tabs (\t\t). You can therefore filter out lines containing any of those:

grep -Ev $'^\t|\t\t|\t$' file
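To see the pieces in action, here is a sketch that builds the example file from the question (under a hypothetical name, sample.tsv) and runs the filter. The $'...' syntax is bash's ANSI-C quoting, which turns each \t into a literal tab before grep sees it; -E enables extended regular expressions (needed for the | alternation), and -v inverts the match so that only non-matching lines are printed:

```shell
# Build the question's sample data (hypothetical file name: sample.tsv)
printf 'Col1\tCol2\tCol3\n' >  sample.tsv
printf 'A\tB\tC\n'          >> sample.tsv
printf 'D\t\tF\n'           >> sample.tsv   # empty middle field (two tabs)
printf 'G\tH\tI\n'          >> sample.tsv
printf 'J\tK\t\n'           >> sample.tsv   # empty last field (trailing tab)

# Keep only lines with no leading tab, no double tab, and no trailing tab
grep -Ev $'^\t|\t\t|\t$' sample.tsv
```

Only the header, "A B C", and "G H I" lines survive, matching the example output above.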

If you can have whitespace, things get more complex. If your fields can begin with spaces, use this instead (it considers a field with only spaces to be empty):

grep -Pv '\t\s*(\t|$)|\t$|^\t' file

The added \t\s*(\t|$) alternative filters out lines containing a tab followed by zero or more whitespace characters and then either another tab or the end of the line.
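As a quick demonstration, the sketch below builds a file with a spaces-only middle field (hypothetical name: sp.tsv) and applies the filter. Note that -P enables Perl-compatible regular expressions, which is a GNU grep extension and may be unavailable on other implementations:

```shell
# A field that contains only spaces (hypothetical file name: sp.tsv)
printf 'A\tB\tC\n'   >  sp.tsv
printf 'D\t   \tF\n' >> sp.tsv   # middle field is spaces only

# \t\s*(\t|$) catches a tab, optional whitespace, then a tab or line end
grep -Pv '\t\s*(\t|$)|\t$|^\t' sp.tsv
```

The first line is printed; the second is dropped because its spaces-only field matches \t\s*\t.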

That will also fail if the last field contains nothing but spaces. To avoid that too, use perl: -a splits each input line into the @F array, -F'\t' makes tab the split separator, -l handles the trailing newline, and -n with -e runs the given one-liner on every input line. Print the line unless one of its fields is empty or all-whitespace (/^\s*$/):

perl -F'\t' -lane 'print unless grep {/^\s*$/} @F' file

One caveat: perl's split drops trailing empty fields, so a line ending in a bare tab still slips through this version; the grep solutions above already catch that case.
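An equivalent awk sketch, for comparison (hypothetical file name: data.tsv): awk splits each record on tabs and, unlike perl's split, keeps trailing empty fields, so a line ending in a bare tab is caught as well:

```shell
# Build a tab-delimited sample (hypothetical file name: data.tsv)
printf 'A\tB\tC\n'  >  data.tsv
printf 'D\t\tF\n'   >> data.tsv   # empty middle field
printf 'G\tH\tI\n'  >> data.tsv
printf 'J\tK\t\n'   >> data.tsv   # empty last field (trailing tab)

# Skip any record that has an empty or whitespace-only field
awk -F'\t' '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[[:space:]]*$/) next
    print
}' data.tsv
```

Only "A B C" and "G H I" are printed; both the double-tab line and the trailing-tab line are skipped.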