Uniq a CSV file ignoring a column, awk maybe

awk, csv, sort, text-processing

Given this file (the annotations are not part of the file, but form part of the explanation)…

x,a,001,b,c,d,y
x,a,002,b,c,e,yy
x,bb,003,b,d,e,y
x,c,004,b,d,e,y
x,c,005,b,d,e,y   # nb - dupe of row 4
x,dd,006,b,d,e,y
x,c,007,b,d,e,y   # nb - dupe of row 4 and 5
x,dd,008,b,d,f,y
x,dd,009,b,d,e,y   # nb - dupe of row 6
x,e,010,b,d,f,y

… I would like to derive the following output:

x,a,001,b,c,d,y
x,a,002,b,c,e,yy
x,bb,003,b,d,e,y
x,c,004,b,d,e,y
x,dd,006,b,d,e,y
x,dd,008,b,d,f,y
x,e,010,b,d,f,y

If column 3 were cut from the file, uniq were run over the result, and the surviving rows then had their column-3 values spliced back in at the right places, I'd get the output above.

But I'm really struggling to come up with something that would do this. I'd welcome the opportunity to learn about Linux's text-processing utilities.

Performance: files are unlikely to grow beyond 1 MB, and there is only one file per day.

Target: Debian GNU/Linux 7 amd64, 256 MB RAM / Xeon.

Edit: tweaked the example, as the fields are not fixed-width, so a solution involving uniq --skip-chars=n will not work as far as I can tell.
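
For reference: uniq can only skip a fixed number of leading characters or blank-separated leading fields, and it only compares adjacent lines, so it cannot ignore a variable-width field in the middle of the line. The closest thing to the cut-and-splice idea above seems to be a decorate/sort/undecorate pipeline. A rough sketch, assuming bash (for the $'\t' tab syntax), no tab characters in the data, and a hypothetical input file named input.csv:

# Prefix each line with a dedup key (the line with field 3 blanked)
# and its line number, keep the first line per key, restore the
# original order, then strip the decoration again.
awk -F, -v OFS=, '{line = $0; $3 = ""; print $0 "\t" NR "\t" line}' input.csv |
  sort -t $'\t' -s -u -k1,1 |   # stable + unique: keep the first line per key
  sort -t $'\t' -k2,2n |        # put the survivors back in input order
  cut -f3-                      # strip the key and line number

Three passes over a 1 MB file are harmless, but this is clearly clunkier than a single-pass tool.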

Best Answer

With awk, you could do:

awk -F, -vOFS=, '{l=$0; $3=""}; ! ($0 in seen) {print l; seen[$0]}'
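
Spelled out, that is the same one-liner expanded with comments; the input file name input.csv is an assumption — as written above, the command reads standard input:

awk -F, -v OFS=, '
  {
    l = $0     # remember the full original line
    $3 = ""    # blank out field 3; assigning a field rebuilds $0 using OFS
  }
  ! ($0 in seen) {   # $0 is now the line with field 3 blanked
    print l          # first time this key appears: print the original line
    seen[$0]         # referencing the element creates it, marking the key as seen
  }' input.csv

The trick is that assigning to $3 rebuilds $0 with OFS, so the test ! ($0 in seen) compares lines with field 3 blanked out, while the saved copy l is what gets printed. Every distinct key stays in the seen array for the duration of the run, which is negligible at 1 MB per file.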