Shell – How to split a CSV file per initial column (with headers)

shell, text processing

This is a combination of two other questions (how to split a file by each line prefix and how to split a file according to a column, including the header). I want to go from this content in input.csv:

id,first,second,third
1,a,b,c
333,b,b,b
1,d,e,f
2,d,e,f
1,c,d,e
333,a,a,a
[more lines in the same format]

to this content in 1.csv:

id,first,second,third
1,a,b,c
1,d,e,f
1,c,d,e

, this content in 2.csv:

id,first,second,third
2,d,e,f

, and this content in 333.csv:

id,first,second,third
333,b,b,b
333,a,a,a

, that is:

  1. Put all the lines with ID of N into N.csv.
  2. Keep the sequence of lines as in the original.
  3. Include the header from the original file in all the output files.

This must also be really fast, so a while read loop is not going to cut it.
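
For reference, the kind of while read loop I am trying to avoid looks roughly like this (just a sketch, and it assumes there are no leftover N.csv files from an earlier run; it gives the right output but handles one line per shell loop iteration, which is far too slow on large files):

IFS= read -r header < input.csv
tail -n +2 input.csv | while IFS= read -r line; do
    id=${line%%,*}                                            # everything before the first comma
    [ -e "$id.csv" ] || printf '%s\n' "$header" > "$id.csv"   # new ID: start its file with the header
    printf '%s\n' "$line" >> "$id.csv"
done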

Best Answer

This GNU awk command does the trick:

awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f=$1".csv"; print h > f};{f=$1".csv"; print >> f; close(f)}' input.csv
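
The same program spread over several lines, which may be easier to read (functionally identical; save it as, say, split.awk and run awk -F ',' -f split.awk input.csv — the script name is only an example):

NR == 1 {                 # first line: remember the header, do not write anything yet
    h = $0
    next
}
!seen[$1]++ {             # first time this ID appears: start its file with the header
    f = $1 ".csv"
    print h > f
}
{                         # every data line: append it to its ID's file, then close the file
    f = $1 ".csv"
    print >> f
    close(f)
}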

Caveat: This will not work if the first field can contain quoted or escaped commas, because -F ',' knows nothing about CSV quoting rules. Commas in the other fields are harmless, since only $1 is inspected and each line is written out unchanged.

Explanation:

  • -F ',' (field separator) makes $1 etc. refer to the comma-separated CSV columns rather than to awk's default whitespace-separated fields.
  • NR==1{h=$0; next} treats the first line specially (NR==1) by storing the full header line in a variable h (h=$0) and then skipping to the next line (next), so the header itself never reaches the rules below.
  • !seen[$1]++{f=$1".csv"; print h > f} fires only on the first occurrence of any given $1 (the ++ makes seen[$1] non-zero afterwards, so !seen[$1]++ is true exactly once per ID); it builds the output file name from $1 followed by .csv (f=$1".csv") and writes the header to that file (print h > f).
  • {f=$1".csv"; print >> f; close(f)} runs for every data line: it appends the current line to that ID's file (print >> f) and closes the file again (close(f)). Closing after each line keeps at most one output file open at a time, which matters when there are many distinct IDs; >> reopens the file in append mode, so nothing already written is lost.

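Applied to the sample input.csv, this leaves 1.csv, 2.csv and 333.csv with exactly the contents shown in the question. A quick way to eyeball the result (GNU or BSD tail prints a ==> filename <== banner before each file when given more than one):

tail -n +1 1.csv 2.csv 333.csv
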
Bonus: If you replace every occurrence of $1 with another field, it does what you would expect: it creates one file per unique value in that column, containing the header and the lines that have that value in that column.
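
For example, splitting on the second column instead (a direct substitution of $2 for $1; on the sample data above this would produce a.csv, b.csv, c.csv and d.csv):

awk -F ',' 'NR==1{h=$0; next};!seen[$2]++{f=$2".csv"; print h > f};{f=$2".csv"; print >> f; close(f)}' input.csv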
