UPDATE:
Actually, a much easier way is to set the record separator in gawk
:
$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4}' myFile.csv
"col4
"4th column
"4th column2
However, this will remove the trailing "
from the end of each column. To fix that you can print it yourself:
$ gawk 'BEGIN{RS="\"\n"; FS=","}{print $4"\""}' myFile.csv
"col4"
"4th column"
"4th column2"
If you don't want the quotes at all, you can set the field separator to ","
:
$ gawk 'BEGIN{RS="\"\n"; FS="\",\""}{print $3}' myFile.csv
col3
4th column
4th column2
The only way I can think of One way of doing this is to first modify the file and then parse it. In your example, the newline that actually separates two records is always following a "
:
"col1","col2","col3","col4" <-- here
1,"text1","<p>big <-- no "
If that is the case for the entire file, you can replace all newlines that are not immediately after a "
with a placeholder and so have everything in a single line. You can then parse normally with gawk
and finally replace the placeholder with the newline again. I will use the string &%&
as a placeholder since it is unlikely to exist in your file:
$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv | awk -F, '{print $4}'
"col4"
"4th column"
"4th column2"
The -p
flag for perl
means print each line of the input file
after applying the script given by -e
. Then there are 3 substitution (s/foo/bar/
) commands:
s/"\s*\n/"&%&/
: This will find any "
which is followed by 0 or more whitespace characters (\s*
) and then a newline character (\n
). It will replace that with "&%&
. The quotes are added to preserve the format and the &%&
is just a random placeholder, it could be anything that does not appear in your file.
s/\n//g;
: since the real newlines have been replaced with the placeholder, we can now safely remove all remaining newlines in this record. This means that all lines of the current record have now been concatenated into the current line.
s/&%&/\n/
: This turns the placeholder back into a normal new line.
To understand the output of the command run it without gawk
:
$ perl -pe 's/"\s*\n/"&%&/; s/\n//g; s/&%&/\n/;' myFile.csv
"col1","col2","col3","col4"
1,"text1","<p>big html text</p>","4th column"
2,"text2","<p>big2 html2 text2</p>","4th column2"
So, you now have your long records on single lines and this is perfect food for gawk
.
You can also do it directly in Perl:
perl -ne '$/="\"\n"; chomp;@a=split(/,/);print "$a[3]\"\n"' myFile.csv
"col4"
"4th column"
"4th column2"
This is using a bit more Perl magic. The $/
special variable is the input record separator. By setting it to "\n
we tell Perl to split lines not at \n
but only at "\n"
so that each record will be treated as a single line. Once that is done, chomp
removes the newline from the end of the line (for printing later) and split
splits each record (on ,
) and saves it in the array @a
. Finally, we print the 4th element of the array (arrays are numbered from 0 so that is $a[3]
) which is the 4th column.
And even more magic, turn on auto spitting (-a
) and split on commas (F","
). This will split each record into the special @F
array and you can print the 4th element of the array:
$ perl -F"," -ane '$/="\"\n";chomp;print "$F[3]"' myFile.csv
"col4"
"4th column"
"4th column2"
Best Answer
If your file really is as simple as your example, you can do one of:
awk
Explanation
The
-F,
sets the input field separator to,
and the-vOFS=,
sets the variableOFS
(the output field separator) to,
.NR
is the current line number, so the script above will change the 5th field to a 5-character substring of itself. The lone1
is awk shorthand for "print this line".perl
Explanation
The
-a
makes perl act like awk and split its input lines on the character given by-F
and saves them as elements of the array@F
. We then remove all but the 1st 5 characters of the 5th field (they start counting at0
) and then print the resulting@F
array joined with commas.sed
Explanation
This is the substitution operator whose general format is
s/original.replacement/
. The1!
means "don't do this for the 1st line". The regular expression matches a set of non-,
followed by a,
4 times (([^,]+,){4}
), then any 5 non-,
characters ([^,]{5}
)—these are the 1st 5 of the 5th field—and then anything else until the end of the field ([^,]+,
). All this is replaced with the first part of the line, effectively truncating the field.