I am working on a Mac with sed, perl, awk, and bash.
I have a largish (10 GB) text file that has 13 fields (columns) of TAB-delimited data. Unfortunately, some of the lines have extraneous TABs, and thus an unequal number of fields, so I want to delete those lines entirely. (I don't mind discarding the lines in their entirety.)
What I currently have writes the number of fields into another file.
awk -F'\t' '{print NF}' infile > fieldCount
head fieldCount
13
13
10
13
13
13
14
13
13
13
I would like to construct a short script that removes any line with more (or less) than 13 proper fields (from the original file).
- speed is helpful as I have to do this on multiple files
- doing it in one sweep would be cool
- I am currently reading the fieldCount file into Python and processing it line by line.
EDIT:
valid (13 columns)
a b c d e f g h i j k l m
invalid (14 columns)
a b c d e f g h i j k l m n
Best Answer
You almost have it already:
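That is, instead of printing NF, use it as a condition. A minimal self-contained sketch (the sample data, infile, and outfile here are illustrative, not from the question):

```shell
# Build a tiny sample: two 13-field lines and one 14-field line.
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tl\tm\n'       > infile
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tl\tm\tn\n'   >> infile
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\n'  >> infile

# Print only lines whose TAB-delimited field count is exactly 13.
awk -F'\t' 'NF == 13 {print}' infile > outfile
```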
And, if you're on one of those systems where you're charged by the keystroke ( :) ), you can shorten that to
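Since awk's default action for a true pattern is to print the line, the {print} block can be dropped entirely (again, sample data and filenames are illustrative):

```shell
# Sample input: one 13-field line, one 14-field line.
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tl\tm\n'      > infile
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tl\tm\tn\n'  >> infile

# With no action given, awk prints every line matching the pattern.
awk -F'\t' 'NF==13' infile > outfile
```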
To do multiple files in one sweep, and to actually change the files (and not just create new files), identify a filename that's not in use (for example, scharf), and perform a loop. The list can be one or more filenames and/or wildcard (filename expansion) patterns. The mv command overwrites the input file (e.g., blue.data) with the temporary scharf file (which has only the lines from the input file with 13 fields). (Be sure this is what you want to do, and be careful. To be safe, you should probably back up your data first.) The -f tells mv to overwrite the input file, even though it already exists. The -- protects you against weirdness if any of your files has a name beginning with -.
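Putting the loop together, a self-contained sketch (blue.data and green.data are made-up sample files standing in for the list; scharf is the temporary name):

```shell
# Create two sample files, each with one valid 13-field line
# and one invalid 14-field line.
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tl\tm\n'        > blue.data
printf 'x\ty\tz\t1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\n'  >> blue.data
cp blue.data green.data

# Rewrite each file in place, keeping only its 13-field lines.
for f in blue.data green.data
do
    awk -F'\t' 'NF == 13' "$f" > scharf  &&  mv -f -- scharf "$f"
done
```

The list after `for f in` could just as well be a glob pattern such as *.data.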