This is a combination of two other questions (how to split a file by each line prefix and how to split a file according to a column, including the header). I want to go from this content in `input.csv`:
id,first,second,third
1,a,b,c
333,b,b,b
1,d,e,f
2,d,e,f
1,c,d,e
333,a,a,a
[more lines in the same format]
to this content in `1.csv`:
id,first,second,third
1,a,b,c
1,d,e,f
1,c,d,e
to this content in `2.csv`:
id,first,second,third
2,d,e,f
and to this content in `333.csv`:
id,first,second,third
333,b,b,b
333,a,a,a
That is:

- Put all the lines with an ID of N into `N.csv`.
- Keep the sequence of lines as in the original.
- Include the header from the original file in all the output files.
This must also be really fast, so a `while read` loop is not going to cut it.
Best Answer
This GNU awk command does the trick:
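The command itself, assembled from the pieces walked through in the explanation below:

```shell
awk -F ',' '
    NR==1      {h=$0; next}                       # remember the header line, do not print it yet
    !seen[$1]++{f=$1".csv"; print h > f}          # first time this ID appears: start N.csv with the header
               {f=$1".csv"; print >> f; close(f)} # append the line; close to avoid exhausting file descriptors
' input.csv
```

Run against the `input.csv` above, this produces `1.csv`, `2.csv` and `333.csv` with the contents shown.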
Caveat: This will not work if there are escaped commas in the first field. Commas in other fields should work fine.
Explanation:

- `-F ','` (field separator) ensures that `$1` etc. refer to the CSV columns rather than space-separated values.
- `NR==1{h=$0; next}` treats the first line specially (`NR==1`) by storing the full header line in a variable `h` (`h=$0`) and skipping the line (`next`).
- `!seen[$1]++{f=$1".csv"; print h > f}` treats the first occurrence of any `$1` specially (`!seen[$1]`) by storing `$1` followed by `.csv` in a filename variable `f` and writing the header to that file (`print h > f`).
- `{f=$1".csv"; print >> f; close(f)}` appends the current line to the file (`print >> f`) and closes the file descriptor (`close(f)`) to avoid keeping it open once all lines with a specific ID have been processed.

Bonus: If you replace `$1` with another field, it should do what you expect: create a file per unique value in that column, containing the lines with that value in that column.
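For instance, switching `$1` to `$2` (a sketch, run against the same sample `input.csv`) splits on the second column, so the values a, b, c and d each get their own file:

```shell
awk -F ',' '
    NR==1      {h=$0; next}                       # remember the header
    !seen[$2]++{f=$2".csv"; print h > f}          # first occurrence of a value in column 2: start its file
               {f=$2".csv"; print >> f; close(f)} # append the line, then close the descriptor
' input.csv
```

With the sample data, `a.csv` ends up holding the header plus the rows `1,a,b,c` and `333,a,a,a`.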