Unix CSV Column – How to Truncate Column in CSV File on Unix

text processing

How do I truncate column "test10" to 5 characters from Unix command line?

From this

test1,test2,test3,test4,test10,test11,test12,test17
rh,mbn,ccc,khj,ee3 eeeeeEeee ee$eeee e.eeeee2eeeee5eeeeeeee,a2,3,u
hyt,bb,mb,khj,R ee3ee eeEeee ee$eeee e.eeeee2eeeee5eeeeeeee,a,5,r
mbn,htr,ccc,fdf,F1ee eeeeEeee ee$eeee e.eeeee2eeeee5eeeeeeee,a,e,r

To this

test1,test2,test3,test4,test10,test11,test12,test17
rh,mbn,ccc,khj,ee3 e,a2,3,u
hyt,bb,mb,khj,R ee3,a,5,r
mbn,htr,ccc,fdf,F1ee ,a,e,r

Best Answer

If your file really is as simple as your example, you can do one of:

  • awk

    $ awk -F, -vOFS=, 'NR>1{$5=substr($5,1,5)}1' file 
    test1,test2,test3,test4,test10,test11,test12,test17
    rh,mbn,ccc,khj,ee3 e,a2,3,u
    hyt,bb,mb,khj,R ee3,a,5,r
    mbn,htr,ccc,fdf,F1ee ,a,e,r
    

    Explanation

    The -F, sets the input field separator to , and the -vOFS=, sets the variable OFS (the output field separator) to ,. NR is the current line number, so the script above will change the 5th field to a 5-character substring of itself. The lone 1 is awk shorthand for "print this line".

  • perl

    $ perl -F, -lane '$F[4]=~s/(.{5}).*/$1/ if $.>1; print join ",", @F' file 
    test1,test2,test3,test4,test10,test11,test12,test17
    rh,mbn,ccc,khj,ee3 e,a2,3,u
    hyt,bb,mb,khj,R ee3,a,5,r
    mbn,htr,ccc,fdf,F1ee ,a,e,r
    

    Explanation

    The -a makes perl act like awk and split its input lines on the character given by -F and saves them as elements of the array @F. We then remove all but the 1st 5 characters of the 5th field (they start counting at 0) and then print the resulting @F array joined with commas.

  • sed

    $ sed  -E '1!s/(([^,]+,){4}[^,]{5,5})[^,]*,/\1,/' file
    test1,test2,test3,test4,test10,test11,test12,test17
    rh,mbn,ccc,khj,ee3 e,a2,3,u
    hyt,bb,mb,khj,R ee3,a,5,r
    mbn,htr,ccc,fdf,F1ee ,a,e,r
    

    Explanation

    This is the substitution operator whose general format is s/original.replacement/. The 1! means "don't do this for the 1st line". The regular expression matches a set of non-, followed by a , 4 times (([^,]+,){4}), then any 5 non-, characters ([^,]{5})—these are the 1st 5 of the 5th field—and then anything else until the end of the field ([^,]+,). All this is replaced with the first part of the line, effectively truncating the field.

Related Question