Ubuntu – convert txt file to csv seperated with tabs

bashcommand linecsvtext processing

Sorry people I'm new to Linux, and while I looked through the list of answered questions, I don't know enough to recognize if my question was answered. Or if I can adapt one of the answer to my particular little problem.

I get a text file of data from my boss, who learned to use computers one way; and he won't change. The data is almost a csv file, except the fields are all separated by a space characters rather then a comma or tab character. And the text fields of data include embedded spaces also.

Each field is either a number or is numbers and text, all fields are of varying lengths, and none are off-set with single or double quotes. The number fields predominate, and no text field is adjacent to any other text field. Rarely is an embedded number in a text field preceded or followed by a [space] character.

Unfortunately not every [space] character can just be replaced. Instead because, generally, field breaks come in the form of either [space][0-9] or [0-9][space], this is how I determine if a [space] character should be converted to a [tab] character or not. If the [space] character is beside a digit its to be converted to a [tab] character.

So using the Find/Replace function in Notepad for Windows, I search for a digit-space or a space-digit combination, converting that [space] character to a [tab] character. I have to do this ten times [0-9][space] and then ten more times [space][0-9].
I'm looking for a script to do this automatically.

Here is an example of the file I get. It contains four fields separated by [space] characters (first line). Each following line is one record, so the second line is the first record. Account is 2281, Units are 19, Description is Toshiba PX-1982GRSUB{, and finally the Delta field contains the 0:

Account Units Description Delta  
2281 19 Toshiba PX-1982GRSUB 0  
9618 200 HP MX19942-228b -25  
19246 4 CompuCom HD300g Hard Drive 4

So what I'm looking for is a script that will read the original file, convert the [space] characters that are field separators into characters and write it all to a new file. And I want the explanation — so I don't keep asking the same questions over and over again.

Best Answer

Ok, so you need to replace the first two and the last space in every line with a comma. You can't just replace every space, because the 3rd field may contain spaces itself. You can do this with regular expression replacement. Here's a sed script/command, that works:

sed -re 's/^(\S*) (\S*) (.*) (\S+)\s*$/\1,\2,\3,\4/' in.txt > out.csv

With the above example this returns:

Account,Units,Description,Delta
2281,19,Toshiba PX-1982GRSUB,0
9618,200,HP MX19942-228b,-25
19246,4,CompuCom HD300g Hard Drive,4

This is still quite fragile with handling empty fields and breaks entirely, if columns other than the 3rd contain spaces. It's very easy to introduce such malformed data if it is formatted manually as done by your boss. You should suggest to him to switch to a more robust table format (e. g. proper CSV & Co.) and editor (common spread sheet tools can manipulate CSV quite well and flexibly, e. g. LibreOffice/OpenOffice Calc, Microsoft Excel and Google Docs).

Related Solutions

Ubuntu – Data extraction from a text file using bash

You don't need a script for such a simple thing. You can use awk:

awk '$2 ~ "^[0-9][0-9]*$" { print $2 }' file.txt | head -n -1 | awk '{print}' ORS=',' | sed 's/,$/\n/'

Some explanations:

awk '$2 ~ "^[0-9][0-9]*$" { print $2 }' file.txt - print from the file.txt only the fields which are numbers.
head -n -1 - remove last line / last number.
awk '{print}' ORS=',' - concatenate all lines in one single line, each number separated by ,.
sed 's/,$/\n/' - replace last , with a newline character.

Or, shorter:

awk '$2 ~ "^[0-9][0-9]*$" { print $2 }' ORS=',' file.txt | sed 's/,[0-9]*,$/\n/'

Ubuntu – Retain white spaces in fields when using awk

Using awk

from=1; to=3; delimiter="|"; awk -F"$delimiter" -v from="$from" -v to="$to" '!/^[[:blank:]]*$/ {printf "Index="; for(i=from; i<=to; i++) {printf $i; if(i<to) {printf "|"};} printf "\n"; printf "Key="; for(i=to+1; i<=NF; i++) {printf $i; if(i<NF) {printf "|"};} printf "\n"}' foo

Best Answer

Related Solutions

Ubuntu – Data extraction from a text file using bash

Ubuntu – Retain white spaces in fields when using awk

Related Question