Unix CSV Column – How to Truncate Column in CSV File on Unix

text processing

How do I truncate column "test10" to 5 characters from Unix command line?

From this

test1,test2,test3,test4,test10,test11,test12,test17
rh,mbn,ccc,khj,ee3 eeeeeEeee ee$eeee e.eeeee2eeeee5eeeeeeee,a2,3,u
hyt,bb,mb,khj,R ee3ee eeEeee ee$eeee e.eeeee2eeeee5eeeeeeee,a,5,r
mbn,htr,ccc,fdf,F1ee eeeeEeee ee$eeee e.eeeee2eeeee5eeeeeeee,a,e,r

To this

test1,test2,test3,test4,test10,test11,test12,test17
rh,mbn,ccc,khj,ee3 e,a2,3,u
hyt,bb,mb,khj,R ee3,a,5,r
mbn,htr,ccc,fdf,F1ee ,a,e,r

Best Answer

If your file really is as simple as your example, you can do one of:

awk
```
$ awk -F, -vOFS=, 'NR>1{$5=substr($5,1,5)}1' file 
test1,test2,test3,test4,test10,test11,test12,test17
rh,mbn,ccc,khj,ee3 e,a2,3,u
hyt,bb,mb,khj,R ee3,a,5,r
mbn,htr,ccc,fdf,F1ee ,a,e,r
```
Explanation

The -F, sets the input field separator to , and the -vOFS=, sets the variable OFS (the output field separator) to ,. NR is the current line number, so the script above will change the 5th field to a 5-character substring of itself. The lone 1 is awk shorthand for "print this line".
perl
```
$ perl -F, -lane '$F[4]=~s/(.{5}).*/$1/ if $.>1; print join ",", @F' file 
test1,test2,test3,test4,test10,test11,test12,test17
rh,mbn,ccc,khj,ee3 e,a2,3,u
hyt,bb,mb,khj,R ee3,a,5,r
mbn,htr,ccc,fdf,F1ee ,a,e,r
```
Explanation

The -a makes perl act like awk and split its input lines on the character given by -F and saves them as elements of the array @F. We then remove all but the 1st 5 characters of the 5th field (they start counting at 0) and then print the resulting @F array joined with commas.
sed
```
$ sed  -E '1!s/(([^,]+,){4}[^,]{5,5})[^,]*,/\1,/' file
test1,test2,test3,test4,test10,test11,test12,test17
rh,mbn,ccc,khj,ee3 e,a2,3,u
hyt,bb,mb,khj,R ee3,a,5,r
mbn,htr,ccc,fdf,F1ee ,a,e,r
```
Explanation

This is the substitution operator whose general format is s/original.replacement/. The 1! means "don't do this for the 1st line". The regular expression matches a set of non-, followed by a , 4 times (([^,]+,){4}), then any 5 non-, characters ([^,]{5})—these are the 1st 5 of the 5th field—and then anything else until the end of the field ([^,]+,). All this is replaced with the first part of the line, effectively truncating the field.

Related Solutions

Text Processing – A Unix Command to Truncate Each Line of a File

For the sake of variety, here's another way with cut:

cut -d \; -f -3

text-processing – How to extract the 4th column from a CSV file using Unix command

UPDATE:

Actually, a much easier way is to set the record separator in gawk:

$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4}' myFile.csv
"col4
"4th column
"4th column2

However, this will remove the trailing " from the end of each column. To fix that you can print it yourself:

$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4"\""}' myFile.csv
"col4"
"4th column"
"4th column2"

If you don't want the quotes at all, you can set the field separator to ",":

$ gawk 'BEGIN{RS="\"\n"; FS="\",\""}{print $3}' myFile.csv
col3
4th column
4th column2

~~The only way I can think of~~ One way of doing this is to first modify the file and then parse it. In your example, the newline that actually separates two records is always following a ":

"col1","col2","col3","col4"   <-- here 
1,"text1","<p>big             <-- no "

If that is the case for the entire file, you can replace all newlines that are not immediately after a " with a placeholder and so have everything in a single line. You can then parse normally with gawk and finally replace the placeholder with the newline again. I will use the string &%& as a placeholder since it is unlikely to exist in your file:

$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv | awk -F, '{print $4}'
"col4"
"4th column"
"4th column2"

The -p flag for perl means print each line of the input file after applying the script given by -e. Then there are 3 substitution (s/foo/bar/) commands:

s/"\s*\n/"&%&/ : This will find any " which is followed by 0 or more whitespace characters (\s*) and then a newline character (\n). It will replace that with "&%&. The quotes are added to preserve the format and the &%& is just a random placeholder, it could be anything that does not appear in your file.
s/\n//g; : since the real newlines have been replaced with the placeholder, we can now safely remove all remaining newlines in this record. This means that all lines of the current record have now been concatenated into the current line.
s/&%&/\n/ : This turns the placeholder back into a normal new line.

To understand the output of the command run it without gawk:

$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv 
"col1","col2","col3","col4"
1,"text1","<p>big      html     text</p>","4th column"
2,"text2","<p>big2      html2     text2</p>","4th column2"

So, you now have your long records on single lines and this is perfect food for gawk.

You can also do it directly in Perl:

perl -ne '$/="\"\n"; chomp;@a=split(/,/);print "$a[3]\"\n"' myFile.csv
"col4"
"4th column"
"4th column2"

This is using a bit more Perl magic. The $/special variable is the input record separator. By setting it to "\n we tell Perl to split lines not at \n but only at "\n" so that each record will be treated as a single line. Once that is done, chomp removes the newline from the end of the line (for printing later) and split splits each record (on ,) and saves it in the array @a. Finally, we print the 4th element of the array (arrays are numbered from 0 so that is $a[3]) which is the 4th column.

And even more magic, turn on auto spitting (-a) and split on commas (F","). This will split each record into the special @F array and you can print the 4th element of the array:

$ perl -F"," -ane '$/="\"\n";chomp;print "$F[3]"' myFile.csv
"col4"
"4th column"
"4th column2"

Best Answer

Explanation

Explanation

Explanation

Related Solutions

Text Processing – A Unix Command to Truncate Each Line of a File

text-processing – How to extract the 4th column from a CSV file using Unix command

Related Question