Shell – How to split a CSV file per initial column (with headers)

shell, text processing

This is a combination of two other questions (how to split a file by each line prefix and how to split a file according to a column, including the header). I want to go from this content in input.csv:

id,first,second,third
1,a,b,c
333,b,b,b
1,d,e,f
2,d,e,f
1,c,d,e
333,a,a,a
[more lines in the same format]

to this content in 1.csv:

id,first,second,third
1,a,b,c
1,d,e,f
1,c,d,e

, this content in 2.csv:

id,first,second,third
2,d,e,f

, and this content in 333.csv:

id,first,second,third
333,b,b,b
333,a,a,a

, that is:

  1. Put all the lines with ID of N into N.csv.
  2. Keep the sequence of lines as in the original.
  3. Include the header from the original file in all the output files.

This must also be really fast, so a while read loop is not going to cut it.
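
For reference, the kind of while read loop I am trying to avoid looks roughly like this (just a sketch, and it assumes there are no leftover N.csv files from an earlier run; it gives the right output but handles one line per shell loop iteration, which is far too slow on large files):

IFS= read -r header < input.csv
tail -n +2 input.csv | while IFS= read -r line; do
    id=${line%%,*}                                            # everything before the first comma
    [ -e "$id.csv" ] || printf '%s\n' "$header" > "$id.csv"   # new ID: start its file with the header
    printf '%s\n' "$line" >> "$id.csv"
done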

Best Answer

This GNU awk command does the trick:

awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f=$1".csv"; print h > f};{f=$1".csv"; print >> f; close(f)}' input.csv
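
The same program spread over several lines, which may be easier to read (functionally identical; save it as, say, split.awk and run awk -F ',' -f split.awk input.csv — the script name is only an example):

NR == 1 {                 # first line: remember the header, do not write anything yet
    h = $0
    next
}
!seen[$1]++ {             # first time this ID appears: start its file with the header
    f = $1 ".csv"
    print h > f
}
{                         # every data line: append it to its ID's file, then close the file
    f = $1 ".csv"
    print >> f
    close(f)
}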

Caveat: This will not work if the first field can contain quoted or escaped commas, because -F ',' knows nothing about CSV quoting rules. Commas in the other fields are harmless, since only $1 is inspected and each line is written out unchanged.

Explanation:

  • -F ',' (field separator) makes $1 etc. refer to the comma-separated CSV columns rather than to awk's default whitespace-separated fields.
  • NR==1{h=$0; next} treats the first line specially (NR==1) by storing the full header line in a variable h (h=$0) and then skipping to the next line (next), so the header itself never reaches the rules below.
  • !seen[$1]++{f=$1".csv"; print h > f} fires only on the first occurrence of any given $1 (the ++ makes seen[$1] non-zero afterwards, so !seen[$1]++ is true exactly once per ID); it builds the output file name from $1 followed by .csv (f=$1".csv") and writes the header to that file (print h > f).
  • {f=$1".csv"; print >> f; close(f)} runs for every data line: it appends the current line to that ID's file (print >> f) and closes the file again (close(f)). Closing after each line keeps at most one output file open at a time, which matters when there are many distinct IDs; >> reopens the file in append mode, so nothing already written is lost.

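Applied to the sample input.csv, this leaves 1.csv, 2.csv and 333.csv with exactly the contents shown in the question. A quick way to eyeball the result (GNU or BSD tail prints a ==> filename <== banner before each file when given more than one):

tail -n +1 1.csv 2.csv 333.csv
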
Bonus: If you replace every occurrence of $1 with another field, it does what you would expect: it creates one file per unique value in that column, containing the header and the lines that have that value in that column.
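
For example, splitting on the second column instead (a direct substitution of $2 for $1; on the sample data above this would produce a.csv, b.csv, c.csv and d.csv):

awk -F ',' 'NR==1{h=$0; next};!seen[$2]++{f=$2".csv"; print h > f};{f=$2".csv"; print >> f; close(f)}' input.csv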
