text-processing – How to extract the 4th column from a CSV file using Unix command

awkcsvperltext processing

I have a csv file formatted as below.

"col1","col2","col3","col4"
"1","text1","<p>big
      html
     text</p>
","4th column"
"2","text2","<p>big2
      html2
     text2</p>
","4th column2"

I want to extract the 4th column using. I think that awk is the best tool for this ( let me know if I am wrong). I tried this

awk -F, '{print $4}' myFile.csv 

but it fails. I think because the 3rd column is multiline one.
How can I use awk or any other unix command to extract the 4th column.
I am looking for an efficient solution since my real file is big (> 2GB)

Best Answer

UPDATE:

Actually, a much easier way is to set the record separator in gawk:

$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4}' myFile.csv
"col4
"4th column
"4th column2

However, this will remove the trailing " from the end of each column. To fix that you can print it yourself:

$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4"\""}' myFile.csv
"col4"
"4th column"
"4th column2"

If you don't want the quotes at all, you can set the field separator to ",":

$ gawk 'BEGIN{RS="\"\n"; FS="\",\""}{print $3}' myFile.csv
col3
4th column
4th column2

The only way I can think of One way of doing this is to first modify the file and then parse it. In your example, the newline that actually separates two records is always following a ":

"col1","col2","col3","col4"   <-- here 
1,"text1","<p>big             <-- no "

If that is the case for the entire file, you can replace all newlines that are not immediately after a " with a placeholder and so have everything in a single line. You can then parse normally with gawk and finally replace the placeholder with the newline again. I will use the string &%& as a placeholder since it is unlikely to exist in your file:

$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv | awk -F, '{print $4}'
"col4"
"4th column"
"4th column2"

The -p flag for perl means print each line of the input file after applying the script given by -e. Then there are 3 substitution (s/foo/bar/) commands:

  • s/"\s*\n/"&%&/ : This will find any " which is followed by 0 or more whitespace characters (\s*) and then a newline character (\n). It will replace that with "&%&. The quotes are added to preserve the format and the &%& is just a random placeholder, it could be anything that does not appear in your file.

  • s/\n//g; : since the real newlines have been replaced with the placeholder, we can now safely remove all remaining newlines in this record. This means that all lines of the current record have now been concatenated into the current line.

  • s/&%&/\n/ : This turns the placeholder back into a normal new line.

To understand the output of the command run it without gawk:

$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv 
"col1","col2","col3","col4"
1,"text1","<p>big      html     text</p>","4th column"
2,"text2","<p>big2      html2     text2</p>","4th column2"

So, you now have your long records on single lines and this is perfect food for gawk.

You can also do it directly in Perl:

perl -ne '$/="\"\n"; chomp;@a=split(/,/);print "$a[3]\"\n"' myFile.csv
"col4"
"4th column"
"4th column2"

This is using a bit more Perl magic. The $/special variable is the input record separator. By setting it to "\n we tell Perl to split lines not at \n but only at "\n" so that each record will be treated as a single line. Once that is done, chomp removes the newline from the end of the line (for printing later) and split splits each record (on ,) and saves it in the array @a. Finally, we print the 4th element of the array (arrays are numbered from 0 so that is $a[3]) which is the 4th column.

And even more magic, turn on auto spitting (-a) and split on commas (F","). This will split each record into the special @F array and you can print the 4th element of the array:

$ perl -F"," -ane '$/="\"\n";chomp;print "$F[3]"' myFile.csv
"col4"
"4th column"
"4th column2"
Related Question