I have a huge CSV file with 10 fields separated by commas. Unfortunately, some lines are malformed and do not contain exactly 10 commas (which causes problems when I want to read the file into R). How can I keep only the lines that contain exactly 10 commas?
Keep only the lines containing exact number of delimiters
csvfilter
Related Solutions
This (adapted from here) should do what you need, though @rici's Perl one is much simpler:
$ sed -r ':a;s/(("[0-9,]*",?)*"[0-9,]*),/\1/;ta; s/""/","/g;
s/"([0-9]*)",?/\1,/g ' file
56,72,123454,x,y,"foo,a,b,bar"
56,92,1234,x,y,"foo,a,b,bar"
56,72,12345478765467,x,y,"foo,a,b,bar"
56,72,x,y,"foo,a,b,bar",123454,
56,72,x,y,"foo,a,b,bar",123454,45578492,"bar,foo"
Explanation
:a : define a label called a.
s/(("[0-9,]*",?)*"[0-9,]*),/\1/ : this one needs to be broken down. First of all, using the construct (foo(bar)), \1 will be foobar and \2 will be bar. Then:
"[0-9,]*",? : match 0 or more of 0-9 or ,, followed by 0 or 1 ,.
("[0-9,]*",?)* : match 0 or more of the above.
"[0-9,]* : match 0 or more of 0-9 or , that come right after a ".
ta; : go back to the label a and run again if the substitution was successful.
s/""/","/g; : post-processing; replace "" with ",".
s/"([0-9]*)",?/\1,/g : remove all quotes around numbers.
This might be easier to understand with another example:
$ echo '"1,2,3,4"' | sed -nr ':a;s/(("[0-9,]*",?)*"[0-9,]*),/\1/;p;ta;'
"1,2,34"
"1,234"
"1234"
"1234"
So, as long as sed can find a number that comes right after a quote and is followed by a comma and another number, it joins the two numbers together and repeats the process until no further join is possible.
At this point I believe it is useful to mention a quote from info sed that appears in the section describing advanced functions such as the label used above (thanks for finding it, @Braiam):
In most cases, use of these commands indicates that you are probably better off programming in something like `awk' or Perl.
The 9 paths to OpenSim enlightenment:
With sed and some shell magic:
sed 's/=.*//' OpenSimStats.txt | paste -sd, >out.csv
sed 's/.*=//; s/[^0-9]*$//' OpenSimStats.txt | paste -sd, >>out.csv
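A quick sanity check of this first path. The sample data below is hypothetical: the real OpenSimStats.txt format is assumed here to be Key=Value lines where values may carry a trailing unit suffix.

```shell
# Hypothetical sample input (assumed Key=Value format, optional unit suffix)
cat >OpenSimStats.txt <<'EOF'
RootFrames=4.2 ms
ChildFrames=17
SimFPS=55.9 fps
EOF

# Line 1: everything before '='; line 2: value with trailing non-digits stripped
sed 's/=.*//' OpenSimStats.txt | paste -sd, >out.csv
sed 's/.*=//; s/[^0-9]*$//' OpenSimStats.txt | paste -sd, >>out.csv
cat out.csv
# RootFrames,ChildFrames,SimFPS
# 4.2,17,55.9
```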
With sed, without shell magic:
sed -n 's/=.*//; 1{ h; b; }; $! H; $ { x; s/\n/,/g; p; }' OpenSimStats.txt >out.csv
sed -n 's/.*=//; 1{ s/[0-9]*$//; h; b; }; s/[^0-9]*$//; $! H; $ { x; s/\n/,/g; p; }' OpenSimStats.txt >>out.csv
With shell magic and a tiny bit of sed:
paste -sd, <(cut -d= -f1 OpenSimStats.txt) <(cut -d= -f2 OpenSimStats.txt | sed 's/[^0-9]*$//')
With cut and some shell magic:
cut -d= -f1 OpenSimStats.txt | paste -sd, >out.csv
cut -d= -f2 OpenSimStats.txt | sed 's/[^0-9]*$//' | paste -sd, >>out.csv
With GNU datamash:
sed 's/=/,/; s/[^0-9]*$//' OpenSimStats.txt | datamash -t, transpose
With perl:
perl -lnE 's/\D+$//o;
($a, $b) = split /=/;
push @a, $a; push @b, $b;
END { $, = ","; say @a; say @b }' OpenSimStats.txt
With grep:
grep -o '^[^=]*' OpenSimStats.txt | paste -sd, >out.csv
egrep -o '[0-9.]+' OpenSimStats.txt | paste -sd, >>out.csv
With bash:
#! /usr/bin/env bash
line1=()
line2=()
while IFS='=' read -r a b; do
line1+=("$a")
[[ $b =~ ^[0-9.]+ ]]
line2+=("$BASH_REMATCH")
done <OpenSimStats.txt
( set "${line1[@]}"; IFS=,; echo "$*" ) >out.csv
( set "${line2[@]}"; IFS=,; echo "$*" ) >>out.csv
With awk:
awk -F= '
NR==1 { a = $1; sub(/[^0-9]+$/, "", $2); b = $2; next }
{ a = a "," $1; sub(/[^0-9]+$/, "", $2); b = b "," $2 }
END { print a; print b }' OpenSimStats.txt
Bonus 10th path for data nerds, with csvtk:
csvtk replace -d= -f 2 -p '\D+$' -r '' <OpenSimStats.txt | csvtk transpose
Bonus 11th path with vim:
:%s/\D*$//
:%s/=/\r/
qaq
:g/^\D/y A | normal dd
:1,$-1 s/\n/,/
"aP
:2,$-2 s/\n/,/
:d 1
:w out.csv
Best Answer
Another POSIX one:
awk -F, 'NF == 11' file

If a line has 10 commas, then it has 11 fields. So we simply make awk use , as the field separator (-F,). If the number of fields is 11, the condition NF == 11 is true, and awk then performs the default action, print $0.
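As a quick demonstration of the NF rule (the sample lines below are made up), a row with exactly 10 commas has 11 comma-separated fields and is kept; anything else is dropped:

```shell
# Hypothetical sample: one well-formed row (10 commas = 11 fields)
# and one malformed row.
printf '%s\n' \
  'a,b,c,d,e,f,g,h,i,j,k' \
  'too,short,line' >sample.csv

# Keep only lines with exactly 11 comma-separated fields.
awk -F, 'NF == 11' sample.csv
# a,b,c,d,e,f,g,h,i,j,k
```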