I have a huge CSV file with 10 fields separated by commas. Unfortunately, some lines are malformed and do not contain exactly 10 commas (which causes problems when I want to read the file into R). How can I keep only the lines that contain exactly 10 commas?
Keep only the lines containing exact number of delimiters
csvfilter
Related Solutions
This (adapted from here) should do what you need, though @rici's Perl one is much simpler:
$ sed -r ':a;s/(("[0-9,]*",?)*"[0-9,]*),/\1/;ta; s/""/","/g;
s/"([0-9]*)",?/\1,/g ' file
56,72,123454,x,y,"foo,a,b,bar"
56,92,1234,x,y,"foo,a,b,bar"
56,72,12345478765467,x,y,"foo,a,b,bar"
56,72,x,y,"foo,a,b,bar",123454,
56,72,x,y,"foo,a,b,bar",123454,45578492,"bar,foo"
Explanation
:a : define a label called a.
s/(("[0-9,]*",?)*"[0-9,]*),/\1/ : this one needs to be broken down. First of all, using the construct (foo(bar)), \1 will be foobar and \2 will be bar. Then:
"[0-9,]*",? : match 0 or more of 0-9 or ,, followed by 0 or 1 ,.
("[0-9,]*",?)* : match 0 or more of the above.
"[0-9,]* : match 0 or more of 0-9 or , that come right after a ".
ta; : go back to the label a and run again if the substitution was successful.
s/""/","/g; : post-processing; replace "" with ",".
s/"([0-9]*)",?/\1,/g : remove all quotes around numbers.
This might be easier to understand with another example:
$ echo '"1,2,3,4"' | sed -nr ':a;s/(("[0-9,]*",?)*"[0-9,]*),/\1/;p;ta;'
"1,2,34"
"1,234"
"1234"
"1234"
So, as long as sed can find a number that comes right after a quote and is followed by a comma and another number, it joins the two numbers together and repeats the process until no further join is possible.
At this point I believe it is useful to mention a quote from info sed that appears in the section describing advanced functions such as the label used above (thanks for finding it, @Braiam):
In most cases, use of these commands indicates that you are probably better off programming in something like `awk' or Perl.
The 9 paths to OpenSim enlightenment:
With sed and some shell magic:
sed 's/=.*//' OpenSimStats.txt | paste -sd, >out.csv
sed 's/.*=//; s/[^0-9]*$//' OpenSimStats.txt | paste -sd, >>out.csv
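A quick sanity check of this first path. The sample data below is hypothetical: the real OpenSimStats.txt format is assumed here to be Key=Value lines where values may carry a trailing unit suffix.

```shell
# Hypothetical sample input (assumed Key=Value format, optional unit suffix)
cat >OpenSimStats.txt <<'EOF'
RootFrames=4.2 ms
ChildFrames=17
SimFPS=55.9 fps
EOF

# Line 1: everything before '='; line 2: value with trailing non-digits stripped
sed 's/=.*//' OpenSimStats.txt | paste -sd, >out.csv
sed 's/.*=//; s/[^0-9]*$//' OpenSimStats.txt | paste -sd, >>out.csv
cat out.csv
# RootFrames,ChildFrames,SimFPS
# 4.2,17,55.9
```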
With sed, without shell magic:
sed -n 's/=.*//; 1{ h; b; }; $! H; $ { x; s/\n/,/g; p; }' OpenSimStats.txt >out.csv
sed -n 's/.*=//; 1{ s/[0-9]*$//; h; b; }; s/[^0-9]*$//; $! H; $ { x; s/\n/,/g; p; }' OpenSimStats.txt >>out.csv
With shell magic and a tiny bit of sed:
paste -sd, <(cut -d= -f1 OpenSimStats.txt) <(cut -d= -f2 OpenSimStats.txt | sed 's/[^0-9]*$//')
With cut and some shell magic:
cut -d= -f1 OpenSimStats.txt | paste -sd, >out.csv
cut -d= -f2 OpenSimStats.txt | sed 's/[^0-9]*$//' | paste -sd, >>out.csv
With GNU datamash:
sed 's/=/,/; s/[^0-9]*$//' OpenSimStats.txt | datamash -t, transpose
With perl:
perl -lnE 's/\D+$//o;
($a, $b) = split /=/;
push @a, $a; push @b, $b;
END { $, = ","; say @a; say @b }' OpenSimStats.txt
With grep:
grep -o '^[^=]*' OpenSimStats.txt | paste -sd, >out.csv
egrep -o '[0-9.]+' OpenSimStats.txt | paste -sd, >>out.csv
With bash:
#! /usr/bin/env bash
line1=()
line2=()
while IFS='=' read -r a b; do
line1+=("$a")
[[ $b =~ ^[0-9.]+ ]]
line2+=("$BASH_REMATCH")
done <OpenSimStats.txt
( set "${line1[@]}"; IFS=,; echo "$*" ) >out.csv
( set "${line2[@]}"; IFS=,; echo "$*" ) >>out.csv
With awk:
awk -F= '
NR==1 { a = $1; sub(/[^0-9]+$/, "", $2); b = $2; next }
{ a = a "," $1; sub(/[^0-9]+$/, "", $2); b = b "," $2 }
END { print a; print b }' OpenSimStats.txt
Bonus 10th path for data nerds, with csvtk:
csvtk replace -d= -f 2 -p '\D+$' -r '' <OpenSimStats.txt | csvtk transpose
Bonus 11th path with vim:
:%s/\D*$//
:%s/=/\r/
qaq
:g/^\D/y A | normal dd
:1,$-1 s/\n/,/
"aP
:2,$-2 s/\n/,/
:d 1
:w out.csv
Best Answer
Another POSIX one:
awk -F, 'NF == 11' file

If a line has 10 commas, then it has 11 fields. So we simply make awk use , as the field separator (-F,). If the number of fields is 11, the condition NF == 11 is true, and awk then performs the default action, print $0.
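As a quick demonstration of the NF rule (the sample lines below are made up), a row with exactly 10 commas has 11 comma-separated fields and is kept; anything else is dropped:

```shell
# Hypothetical sample: one well-formed row (10 commas = 11 fields)
# and one malformed row.
printf '%s\n' \
  'a,b,c,d,e,f,g,h,i,j,k' \
  'too,short,line' >sample.csv

# Keep only lines with exactly 11 comma-separated fields.
awk -F, 'NF == 11' sample.csv
# a,b,c,d,e,f,g,h,i,j,k
```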