AWK – Split File by Column Name and Add Header Row

awk, split, text processing

I have a pipe delimited file a.txt which includes a header row. The first column holds a filename.

I would like to split a.txt into several files, each named after the value in the first column. I would also like the header row of a.txt repeated at the top of each file.

So I have a.txt:

filename|count|age
1.txt|1|15
1.txt|2|14
2.txt|3|1
41.txt|44|1
2.txt|1|3

and I want to create 1.txt

filename|count|age
1.txt|1|15
1.txt|2|14

and 2.txt

filename|count|age
2.txt|3|1
2.txt|1|3

and 41.txt

filename|count|age
41.txt|44|1

I have a basic split working:

awk -F\| '{print>$1}' a.txt

but I am struggling to work out how to get the header included, could anybody help? Thanks!

Best Answer

The solution is to store the header in a separate variable and print it on the first occurrence of each new $1 value (= file name):

awk -F'|' 'FNR==1{hdr=$0;next} {if (!seen[$1]++) print hdr>$1; print>$1}' a.txt 
  • The first rule stores the entire first line of a.txt in the variable hdr and then skips to the next line, so the header itself is not written anywhere yet.
  • On all subsequent lines, we first check whether the $1 value (= the desired output filename) has already been encountered, by looking it up in an array seen which holds an occurrence count for each $1 value. If the counter is still zero for the current $1 value, we write the header to the file named by $1; the ++ then increases the counter, which suppresses header output for all later occurrences. The final print>$1 is the part you already figured out yourself.
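For readability, the same logic can be spelled out over several lines. This is only a sketch of the one-liner above, run against the sample a.txt from the question; the variable name out is introduced here for illustration and is not part of the original command.

```shell
# Recreate the sample input from the question in the current directory.
cat > a.txt <<'EOF'
filename|count|age
1.txt|1|15
1.txt|2|14
2.txt|3|1
41.txt|44|1
2.txt|1|3
EOF

awk -F'|' '
FNR == 1 { hdr = $0; next }   # remember the header line and skip it
{
    out = $1                  # output file name comes from the first column
    if (!seen[out]++)         # first time this name appears ...
        print hdr > out       # ... start that file with the header
    print > out               # then write the data line itself
}' a.txt
```

Within one awk run, > truncates a file only when it is first opened, so later prints to the same name are added after the header rather than overwriting it.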

Addendum:

If you have more than one input file, and each of them has a header line, you can simply pass them all as arguments to the awk call, as in

awk -F'|' ' ... ' a.txt b.txt c.txt ...

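If, however, only the first file has a header line, you would need to change FNR to NR in the first rule. NR counts lines across all input files, whereas FNR restarts at 1 for each file, so with NR only the very first line of the first file is treated as the header.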
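As a small illustration of the NR variant, assume a.txt carries the header and b.txt contains data lines only. The contents of both files below are made up for this example.

```shell
# Hypothetical inputs: only a.txt has a header line.
cat > a.txt <<'EOF'
filename|count|age
1.txt|1|15
EOF
cat > b.txt <<'EOF'
1.txt|2|14
2.txt|3|1
EOF

# NR==1 matches only the first line of the first file, so the first
# line of b.txt is handled as ordinary data.
awk -F'|' 'NR==1{hdr=$0;next} {if (!seen[$1]++) print hdr>$1; print>$1}' a.txt b.txt
```

With FNR==1 instead, the first line of b.txt would be taken as a new header and would be missing from the output.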

Caveat

As noted by Ed Morton, the simple approach only works if the number of different output files is small (at most around 10). GNU awk will continue working beyond that, but will become slower because it automatically closes and reopens files in the background as needed; other awk implementations may simply fail with a "too many open files" error.
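One common way around this limit is to close output files explicitly, so that only one is open at any time. The following is a sketch under that assumption, not necessarily the variant the comment had in mind; the variable names out and prev are illustrative. The header is written with > (which creates or truncates the file) and the file is closed right away, after which all data lines are appended with >>.

```shell
# Recreate the sample input from the question in the current directory.
cat > a.txt <<'EOF'
filename|count|age
1.txt|1|15
1.txt|2|14
2.txt|3|1
41.txt|44|1
2.txt|1|3
EOF

awk -F'|' '
FNR == 1 { hdr = $0; next }          # remember the header line and skip it
{
    out = $1
    if (out != prev) {               # output target changed
        if (prev != "") close(prev)  # release the previous file
        prev = out
    }
    if (!seen[out]++) {              # first occurrence of this file name
        print hdr > out              # create or truncate, write the header
        close(out)                   # close so the file can be reopened for append
    }
    print >> out                     # append the data line
}' a.txt
```

If the input is sorted by the first column, each output file is opened only a couple of times; on unsorted input the files are reopened whenever the target changes, which is slower but stays within the open-file limit.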
