Ubuntu – Extracting a part of a massive csv file from command line

command linecsvscripts

I want to read parts of a big csv file between rows n and m and between columns p and q.
Is there an easy way to do this easily with the shell? (Are there commands I should read the doc's? otherwise, I'll write a python script)

Best Answer

I had a script that I adjusted (good idea the (N+1)q part!) thanks to @chronitis comment and the SO answer:

#! /bin/bash 
#
N=10
M=20
P=2
Q=3
sed -n "$N,${M}p; $((M+1))q" $1 | cut -d, -f$P-$Q

Save the file as for example cut_csv, make it executable and use as

cut_csv file

It can be made fancier by accepting the N,M,P,Q parameters as input etc, but I use it seldomly so I normally simply edit the file.

How it works:

The main command is the following (let's suppose N=10, M=20, P=2, Q=3); the shell substitutes the variables and the last line will become: (1)

sed -n "10,20p ; 21q" file | cut -d, -f2-3

Let's start with the first command:

sed -n "10,20p ; 21q" file 

This call sed (stream editor, man sed) in no-print mode (-n) and execute the following commands on the file:

  1. print (p) the lines between 10 and 20 (this is the 10,20p part)
  2. quit (q) when reading line 21 (21q) so that discard the rest of the file

The output of sed is piped (|) to cut:

cut -d, -f2-3

This command (man cut) selects fields of a line (and repeat for each line). In this case, I am telling it that the separator between fields (columns) is a commad (-d,), and to print out the columns between 2 and 3.

As another more complex example I often use this one:

sed -n "1p; 10,14p; 21q" data.csv | cut -d, -f1,4-8

This will select row 1 (where I have titles :-)) and rows from 10 to 14 (5 lines); then select columns 1 (time in my data...) and column from 4 to 8. It is really powerful once you get grips with it.

(1) one great way to see what the shell is doing is change the first line (which is called a shebang) like that:

#! /bin/bash -xv

The shell will now print every command it reads and the result of the substitutions:

(0)asus-rmano: part_of_csv.sh p20dedo.csv
#! /bin/bash -xv
#
N=10
+ N=10
M=20
+ M=20
P=2
+ P=2
Q=3
+ Q=3
sed -n "$N,${M}p; $((M+1))q" $1 | cut -d, -f$P-$Q
+ cut -d, -f2-3
+ sed -n '10,20p; 21q' p20dedo.csv
16:05:49,000
16:05:51,000
[...]
Related Question