AWK – How to Separate Lines with Same Content in a Column into Different Files

awk

I have a big csv file (Test.csv), which looks like this:

1,2,3,A,5
1,2,3,B,5
1,2,3,E,5
1,2,3,D,5
1,2,3,Z,5
1,2,3,B,5

I want to print the lines in which the 4th column has the same content in different files. Actually, I need to join these lines that have the same content in a new csv or txt file, named as the 4th column content. For example:

Output:

File A

1,2,3,A,5
1,2,3,A,5
1,2,3,A,5

File B

1,2,3,B,5
1,2,3,B,5

Since the input file is large, I have no idea how many different patterns there are in this 4th column. Column 4 contains only words and the other columns contain words and/or numbers.

As I have no experience, I researched similar questions and even tried the following code:

awk 'NR==FNR{a[$4]=NR; next} $NF in a {print > "outfile" a[$NF]}' Test.csv

but nothing worked. Can anyone help me, please? Thanks in advance.

Best Answer

This will work efficiently using POSIX sort and any awk in any shell on every UNIX box:

$ sort -t, -k4,4 test.csv |
    awk -F, '$4!=prev{close(out); out="File"$4; prev=$4} {print > out}'

$ head -n 20 File*
==> FileA <==
1,2,3,A,5

==> FileB <==
1,2,3,B,5
1,2,3,B,5

==> FileD <==
1,2,3,D,5

==> FileE <==
1,2,3,E,5

==> FileZ <==
1,2,3,Z,5

Some things to note:

  1. some awks need putting parens around an expression on the right side of output redirection, and
  2. some awks fail if you don't close output files as they go and so trying to retain too many open files once they get past a dozen or so output files, and
  3. keeping multiple open output files is very inefficient in all awks that allow it, and
  4. closing output files line by line to account for that will be very inefficient in all awks.
Related Question