Ubuntu – Converting numbers in a CSV file to their corresponding URLs

command line, csv, text processing

I have a CSV file with two columns and a header row, where each element is a number between 0 and 199. I want to convert these numbers to their corresponding URLs. Here is an example:

41,51

should become:

http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/41.jpg,http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/51.jpg

Here is the list.csv file I want to convert:

$ head list.csv
imageA,imageB
41,51
172,100
99,149
83,72
84,160
186,8
93,198
150,21
63,102

Best Answer

Using sed:

sed -r 's#^([0-9]+),([0-9]+)$#http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/\1\.jpg,http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/\2\.jpg#' input-file
  • redirect the output to a new file with > output-file, or use the option -i.bak to edit the file in place and keep a backup copy.
  • -r, --regexp-extended - use extended regular expressions in the script.
  • the command s means substitute: s#<string-or-regexp>#<replacement>#.
  • # is used as the delimiter. Usually / plays this role, but here the <replacement> contains many slashes, and with # as the delimiter none of them needs escaping.
  • ^ matches the beginning of the line. $ matches the end of the line.
  • [0-9]+ matches any run of one or more digits.
  • within the <replacement>, the text matched by the two capture groups ([0-9]+) is available as \1 and \2.
  • \. escapes the dot. In the <replacement> the dot has no special meaning, so the backslash is optional and a plain . works as well.
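
To try the command without touching the real data, here is a minimal sketch that builds a small sample from the rows in the question and writes the result to a new file. The temporary directory and the file name urls.csv are assumptions made for illustration, and the optional backslash before the dot is left out.

```shell
# Build a small sample in a temporary directory (rows taken from the question).
tmpdir=$(mktemp -d)
printf '%s\n' 'imageA,imageB' '41,51' '172,100' > "$tmpdir/list.csv"

# Same substitution as above, redirected to a new file.
# The dot needs no backslash inside the replacement.
sed -r 's#^([0-9]+),([0-9]+)$#http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/\1.jpg,http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/\2.jpg#' \
    "$tmpdir/list.csv" > "$tmpdir/urls.csv"

# The header does not match the anchored pattern, so it passes through unchanged.
head -n 1 "$tmpdir/urls.csv"
sed -n '2p' "$tmpdir/urls.csv"
```

Because the pattern is anchored with ^ and $, any line that is not exactly two comma-separated numbers is copied to the output as it is.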

Here is a simplification proposed by @dessert:

sed -r 's#([0-9]+)#http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/\1\.jpg#g' input-file
  • here we assume the file format is homogeneous, as it is in the example, and we do not need to match the whole line.
  • the g flag (at the end) repeats the substitution for each occurrence of the matched regex to the end of the line.
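
The effect of the g flag, and the risk of not anchoring the pattern, can be seen in a small sketch. The angle brackets are only stand-ins for the URL prefix and suffix, used to keep the output short, and the header image1,image2 is a hypothetical one that does not occur in the question's file.

```shell
# Without g, only the first number on the line is replaced.
first_only=$(printf '41,51\n' | sed -r 's#([0-9]+)#<\1>#')

# With g, every number on the line is replaced.
every_match=$(printf '41,51\n' | sed -r 's#([0-9]+)#<\1>#g')

# An unanchored pattern also rewrites digits inside a header (hypothetical header).
header_hit=$(printf 'image1,image2\n' | sed -r 's#([0-9]+)#<\1>#g')

echo "$first_only"    # <41>,51
echo "$every_match"   # <41>,<51>
echo "$header_hit"    # image<1>,image<2>
```

The header in the question (imageA,imageB) contains no digits, which is why the simplified command is safe for that file.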

It is also possible to use variables for the base URL and the file extension:

URL='http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/'; EXT='.jpg'
sed -r "s#([0-9]+)#$URL\1$EXT#g" input-file
  • Note: double quotes are used here so that the shell expands $URL and $EXT before sed sees the script.
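
The difference the quoting makes can be checked with a short sketch that feeds one sample row to both variants. The variable names expanded and literal are assumptions chosen for illustration.

```shell
URL='http://www.cs.bu.edu/~betke/research/vc-crowd/MSCOCO/'; EXT='.jpg'

# Double quotes: the shell substitutes $URL and $EXT before sed runs.
expanded=$(printf '41,51\n' | sed -r "s#([0-9]+)#$URL\1$EXT#g")

# Single quotes: sed receives the literal text $URL and $EXT.
literal=$(printf '41,51\n' | sed -r 's#([0-9]+)#$URL\1$EXT#g')

echo "$expanded"
echo "$literal"    # $URL41$EXT,$URL51$EXT
```

One caveat with this approach: the variable contents become part of the sed script, so a base URL containing # (the delimiter here) or & (which stands for the whole match in a replacement) would need those characters escaped with a backslash first.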