Ubuntu – Extracting a part of a massive csv file from command line

command-line, csv, scripts

I want to read part of a big CSV file: the rows between n and m and the columns between p and q.
Is there an easy way to do this with the shell? (Are there commands whose documentation I should read? Otherwise, I'll write a Python script.)

Best Answer

I had a script that I adjusted thanks to @chronitis's comment and the SO answer (the (M+1)q part is a good idea!):

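#! /bin/bash
#
# First and last row to print
N=10
M=20
# First and last column to print
P=2
Q=3
# Print rows N..M, quit at row M+1, then keep columns P..Q
sed -n "$N,${M}p; $((M+1))q" "$1" | cut -d, -f"$P-$Q"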

Save the file as, for example, cut_csv, make it executable, and use it as:

cut_csv file

It could be made fancier by accepting N, M, P and Q as parameters, but I use it so seldom that I normally just edit the file.
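
For what it's worth, a parameterised variant might look like the sketch below. The function name cut_csv_range and the argument order are my own assumptions, not part of the original script; the sed | cut pipeline is the same one used above.

```shell
# cut_csv_range: hypothetical helper, not part of the original script.
# Usage: cut_csv_range N M P Q FILE
# Prints rows N..M and comma-separated columns P..Q of FILE.
cut_csv_range() {
    if [ "$#" -ne 5 ]; then
        echo "usage: cut_csv_range N M P Q FILE" >&2
        return 1
    fi
    local n=$1 m=$2 p=$3 q=$4 file=$5
    # Print lines n..m, then quit at line m+1 so the rest is not processed.
    sed -n "${n},${m}p; $((m+1))q" "$file" | cut -d, -f"${p}-${q}"
}
```

With this loaded in the shell, cut_csv_range 10 20 2 3 file should behave like the hard-coded script. It does no validation beyond counting arguments, so non-numeric values will simply make sed or cut complain.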

How it works:

The main command is the following (let's suppose N=10, M=20, P=2, Q=3); the shell substitutes the variables and the last line will become: (1)

sed -n "10,20p ; 21q" file | cut -d, -f2-3

Let's start with the first command:

sed -n "10,20p ; 21q" file

This calls sed (stream editor, man sed) in no-print mode (-n) and executes the following commands on the file:

print (p) the lines from 10 to 20 (this is the 10,20p part)
quit (q) on reading line 21 (21q), so that the rest of the file is skipped

The output of sed is piped (|) to cut:

cut -d, -f2-3

This command (man cut) selects fields from each line of its input. Here I am telling it that the separator between fields (columns) is a comma (-d,) and that it should print columns 2 to 3 (-f2-3).
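
To watch the two stages separately, here is a small sketch on made-up data (the sample rows and variable names are assumptions for illustration only). It also shows one limitation worth knowing: cut splits on every comma, so a quoted field that contains a comma gets cut in the middle.

```shell
# Hypothetical sample data, written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
    'id,time,x,y' \
    '1,16:05:49,10,20' \
    '2,16:05:51,11,21' \
    '3,16:05:53,12,22' > "$sample"

# Stage 1: sed prints only lines 2-3 and quits at line 4.
rows=$(sed -n '2,3p; 4q' "$sample")

# Stage 2: cut keeps fields 2-3 of every line it is given.
cols=$(printf '%s\n' "$rows" | cut -d, -f2-3)
printf '%s\n' "$cols"

# Caveat: cut splits on every comma, including one inside a quoted field.
quoted=$(printf '%s\n' '1,"Smith, John",42' | cut -d, -f2)
printf '%s\n' "$quoted"

rm -f "$sample"
```

The first printf shows 16:05:49,10 and 16:05:51,11; the second shows "Smith (with the opening quote), because cut knows nothing about CSV quoting. For files with quoted commas a real CSV parser is the safer choice.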

As a more complex example, I often use this one:

sed -n "1p; 10,14p; 21q" data.csv | cut -d, -f1,4-8

This will select row 1 (where I have titles :-)) and rows 10 to 14 (5 lines); then it selects column 1 (time in my data...) and columns 4 to 8. It is really powerful once you get to grips with it.
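
If you want to try that command without a real data file, the sketch below generates one. The generated contents (a header plus cells named rRcC for row R, column C) are purely an assumption so the selected rows and columns are easy to recognise.

```shell
# Generate hypothetical data: a header on line 1, then rows on lines 2..25,
# each with 8 columns labelled by row and column number.
data=$(mktemp)
{
    echo 'c1,c2,c3,c4,c5,c6,c7,c8'
    for r in $(seq 2 25); do
        echo "r${r}c1,r${r}c2,r${r}c3,r${r}c4,r${r}c5,r${r}c6,r${r}c7,r${r}c8"
    done
} > "$data"

# Header plus rows 10-14, then column 1 and columns 4-8.
result=$(sed -n '1p; 10,14p; 21q' "$data" | cut -d, -f1,4-8)
printf '%s\n' "$result"
rm -f "$data"
```

This prints six lines: the header reduced to c1,c4,c5,c6,c7,c8, followed by the same columns of rows 10 to 14.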

(1) A great way to see what the shell is doing is to change the first line (which is called the shebang) like this:

#! /bin/bash -xv

The shell will now print every command it reads and the result of the substitutions:

(0)asus-rmano: part_of_csv.sh p20dedo.csv
#! /bin/bash -xv
#
N=10
+ N=10
M=20
+ M=20
P=2
+ P=2
Q=3
+ Q=3
sed -n "$N,${M}p; $((M+1))q" $1 | cut -d, -f$P-$Q
+ cut -d, -f2-3
+ sed -n '10,20p; 21q' p20dedo.csv
16:05:49,000
16:05:51,000
[...]
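
Tracing can also be switched on without editing the script, by passing -x to bash on the command line. A minimal sketch, using a throwaway script whose contents are made up for the demonstration:

```shell
# Write a throwaway script (hypothetical contents) and trace it.
script=$(mktemp)
cat > "$script" <<'EOF'
N=10
M=20
echo "$N,${M}p; $((M+1))q"
EOF

# -x prints each command after expansion, prefixed with "+", on stderr.
# Here stderr is captured and the script's normal output is discarded.
trace=$(bash -x "$script" 2>&1 >/dev/null)
printf '%s\n' "$trace"
rm -f "$script"
```

The trace shows the echo command with the variables already substituted, which is the same information the -xv shebang gives you.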

Related Solutions

Ubuntu – How to batch rename files (images) based on CSV file

This should work for you:

sed 's/"//g' files.csv | while IFS=, read -r orig new; do mv "$orig" "$new"; done

Explanation:

sed 's/"//g' files.csv : removes the quotes.
IFS=, : splits the input on commas.
while read -r orig new; do ... done : reads each input line, splits it on the value of $IFS (here a comma), and saves the 1st field as $orig and the rest as $new. The -r option stops read from treating backslashes as escape characters.
mv "$orig" "$new" : renames the file as requested.

If your file only contains file names (like orig.jpg) and no paths (not /home/take2/orig.jpg or similar), the command above will only affect files in your current directory. So, you need to open a terminal, cd to the target directory and run it there.

Test first:

To test this, you can do a dry run first by printing the commands that will be run without actually executing them:

sed 's/"//g' files.csv | while IFS=, read -r orig new; do echo mv "$orig" "$new"; done
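
If you do this often, the real run and the dry run can live in one function. This is only a sketch: the name rename_from_csv and the --dry-run argument are my own inventions, and it assumes a two-column CSV whose names contain no commas.

```shell
# rename_from_csv: hypothetical helper wrapping the loop above.
# Usage: rename_from_csv FILE [--dry-run]
rename_from_csv() {
    local csv=$1 mode=${2:-}
    # Strip double quotes, then split each line at the first comma.
    sed 's/"//g' "$csv" | while IFS=, read -r orig new; do
        # Skip blank or incomplete lines.
        [ -n "$orig" ] && [ -n "$new" ] || continue
        if [ "$mode" = "--dry-run" ]; then
            echo mv -- "$orig" "$new"      # only print what would be done
        elif [ -e "$orig" ]; then
            mv -- "$orig" "$new"
        else
            echo "skipping missing file: $orig" >&2
        fi
    done
}
```

Run rename_from_csv files.csv --dry-run first, check the printed mv commands, then run it again without the flag. The -- keeps mv from mistaking a file name that starts with a dash for an option.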

Ubuntu – Converting numbers in a CSV file to their corresponding URLs

Using sed:

sed -r 's#^([0-9]+),([0-9]+)$#http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/\1\.jpg,http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/\2\.jpg#' input-file

Redirect the output to a new file with > output-file, or use the option -i.bak to make the changes in place and create a backup file.
-r, --regexp-extended: use extended regular expressions in the script.
The command s means substitute: s#<string-or-regexp>#<replacement>#.
# is used as the delimiter. Usually / plays this role, but here the <replacement> contains many slashes, and with # we do not need to escape each of them.
^ matches the beginning of the line. $ matches the end of the line.
[0-9]+ matches any string of one or more digits.
Within the <replacement>, the text matched by the capture groups ([0-9]+) is available as \1 and \2.
\. escapes the special meaning of the dot. In the replacement part the dot has no special meaning, so the backslash there is harmless but optional.

Here is a simplification proposed by @dessert:

sed -r 's#([0-9]+)#http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/\1\.jpg#g' input-file

Here we assume the file format is homogeneous, as it is in the example, so we do not need to match the whole line.
The g flag (at the end) repeats the substitution for every match of the regex on the line.

In addition, it is possible to use variables for the base URL and the file extension:

URL='http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/'; EXT='.jpg'
sed -r "s#([0-9]+)#$URL\1$EXT#g" input-file

Note: double quotes are used here so that the shell expands $URL and $EXT inside the sed script.
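
The same idea can be wrapped in a small function that reads from standard input. This is a sketch only: the name ids_to_urls is hypothetical, and it assumes the base URL and extension contain no #, & or backslash, since those characters are special inside the sed expression.

```shell
# ids_to_urls: hypothetical wrapper around the sed command above.
# Usage: ids_to_urls BASE_URL EXT < input
ids_to_urls() {
    local url=$1 ext=$2
    # -E has the same meaning as -r in GNU sed (extended regular expressions).
    # Every run of digits is replaced by BASE_URL + digits + EXT.
    sed -E "s#([0-9]+)#${url}\1${ext}#g"
}
```

For example, with the placeholder base URL http://example.com/img/, running printf '123,456\n' | ids_to_urls 'http://example.com/img/' '.jpg' gives http://example.com/img/123.jpg,http://example.com/img/456.jpg.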
