Join Lines of Text with Repeated Beginning – Command Line Tips

command line, text processing

I have a long text file (a tab-file for stardict-editor) which consists of lines in the following format:

word1  some text
word1  some other text
word2  more text
word3  even more

and would like to convert it to

word1  some text<br>some other text
word2  more text
word3  even more

In other words, consecutive lines that start with the same word (the file is sorted) should be merged into a single line, with the definitions separated by <br>. The same word can begin more than two lines. A single tab character separates the word from its definition, and it is the only tab on the line. word1, word2 and word3 are placeholders for arbitrary strings (containing no tab or newline characters) that I don't know in advance.

I can think of a longer piece of Perl code that does this, but I wonder whether there is a short solution in Perl or some other command-line tool. Any ideas?

Best Answer

This is a standard job for awk:

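awk -F'\t' '
{
  # $1 is the word, $2 the definition; splitting on tabs keeps
  # words and definitions that contain spaces intact
  if ($1 in a)
    a[$1] = a[$1] "<br>" $2
  else
    a[$1] = $2
}
END {
  # the order of a for-in loop is unspecified in awk,
  # so sort the output afterwards if the order matters
  for (i in a)
    print i "\t" a[i]
}' long.text.file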
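
Because awk does not guarantee the order in which a `for (... in ...)` loop visits array keys, the merged lines may come out in a different order than the input. If the input order should be kept, one possible variation is to remember each word the first time it appears. This is only a sketch: `merge_keep_order`, `defs` and `order` are names made up for illustration, and it assumes exactly one tab per line, as stated in the question.

```shell
#!/bin/sh
# merge_keep_order (hypothetical name): reads tab-separated "word<TAB>definition"
# lines on stdin and merges definitions of equal words, keeping first-seen order.
merge_keep_order() {
    awk -F '\t' '
        # first time we see this word: record its position and definition
        !($1 in defs) { order[++n] = $1; defs[$1] = $2; next }
        # word seen before: append the new definition
        { defs[$1] = defs[$1] "<br>" $2 }
        # print the words in the order they first appeared
        END { for (i = 1; i <= n; i++) print order[i] "\t" defs[order[i]] }
    '
}

# Small unsorted sample: "b" appears twice, with "a" in between.
printf 'b\tone\na\ttwo\nb\tthree\n' | merge_keep_order
```

Like the array-based script above, this keeps every definition in memory until the end of the input, so it suits unsorted files of moderate size.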

If the file is sorted by the first word on each line, the script can be simpler:

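awk -F'\t' '
{
  if (NR > 1 && $1 == k)
    printf("<br>%s", $2)      # same word as before: append the definition
  else {
    if (NR != 1)
      print ""                # finish the previous output line
    printf("%s\t%s", $1, $2)  # start a new output line
  }
  k = $1
}
END {
  if (NR > 0)
    print ""                  # terminate the last output line
}' long.text.file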
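
As a quick sanity check, the sorted-input approach can be wrapped in a shell function and run on the sample from the question. This is a minimal sketch, not a definitive implementation: `merge_sorted` and `prev` are made-up names, and it assumes the input is sorted and contains exactly one tab per line.

```shell
#!/bin/sh
# merge_sorted (hypothetical name): streams sorted "word<TAB>definition" lines
# from stdin and joins the definitions of consecutive equal words with <br>.
merge_sorted() {
    awk -F '\t' '
        # same word as the previous line: append to the current output line
        NR > 1 && $1 == prev { printf "<br>%s", $2; next }
        # new word: close the previous output line (if any) and start a new one
        { if (NR > 1) print ""; printf "%s\t%s", $1, $2; prev = $1 }
        # terminate the last output line, unless the input was empty
        END { if (NR > 0) print "" }
    '
}

# The sample from the question, fed through the function.
printf 'word1\tsome text\nword1\tsome other text\nword2\tmore text\nword3\teven more\n' | merge_sorted
```

Unlike the array-based version, this one holds only the previous word in memory, which is convenient for a long dictionary file.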

Or just bash:

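unset n last
# IFS is set to a tab so that words containing spaces are not split
while IFS=$'\t' read -r word definition
do
    if [ -n "$n" ] && [ "$last" = "$word" ]
    then
        # same word as the previous line: append the definition
        printf "<br>%s" "$definition"
    else
        if [ -n "$n" ]
        then
            echo    # finish the previous output line
        else
            n=1     # first line: nothing to finish yet
        fi
        printf "%s\t%s" "$word" "$definition"
        last="$word"
    fi
done < long.text.file
echo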