Join Lines of Text with Repeated Beginning – Command Line Tips

command line, text processing

I have a long text file (a tab-file for stardict-editor) which consists of lines in the following format:

word1  some text
word1  some other text
word2  more text
word3  even more

and would like to convert it to

word1  some text<br>some other text
word2  more text
word3  even more

This means that consecutive lines (the file is sorted) that start with the same word should be merged into a single line, with the definitions separated by <br>. The same word can also start more than two lines. The character that separates word and definition is a tab, and it occurs exactly once on each line. word1, word2, word3 are of course placeholders for arbitrary strings (containing no tab or newline characters) which I don't know in advance.

I can think of a longer piece of Perl code which does this, but wonder if there is a short solution in Perl or something for the command line. Any ideas?

Best Answer

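This is a standard job for awk:

awk -F'\t' '
{
  # The tab is the only separator, so $1 is the word and $2 the definition.
  # Test key membership with "in" so that a definition of "0" is not lost.
  if ($1 in a)
    a[$1] = a[$1] "<br>" $2
  else
    a[$1] = $2
}
END {
  # Iteration order of "for (i in a)" is unspecified; pipe through sort if needed.
  for (i in a)
    print i "\t" a[i]
}' long.text.file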
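
One caveat with the array-based script is that awk visits the keys of `for (i in a)` in an unspecified order, so the merged output may no longer be sorted. A minimal sketch of a variant that remembers the order in which words first appear, assuming tab-separated input; the sample data and the names `order`, `n` and `merged` are made up for illustration:

```shell
# Hypothetical sample input (tab-separated), deliberately not sorted by word.
merged=$(printf 'word2\tmore text\nword1\tsome text\nword1\tsome other text\n' |
awk -F'\t' '
{
  if ($1 in a)
    a[$1] = a[$1] "<br>" $2
  else {
    a[$1] = $2
    order[++n] = $1      # remember the order in which words first appear
  }
}
END {
  # Walk the words in first-seen order instead of using "for (i in a)".
  for (j = 1; j <= n; j++)
    print order[j] "\t" a[order[j]]
}')

printf '%s\n' "$merged"
```

With this sample, word2 is printed before word1 because it appears first in the input; on sorted input the output stays sorted.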

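If the file is sorted by the first word on each line, the script can be simpler:

awk -F'\t' '
{
  # Same word as on the previous line: append the definition to the open line.
  if (NR > 1 && $1 == k)
    printf "<br>%s", $2
  else {
    # New word: finish the previous output line (if any) and start a new one.
    if (NR != 1)
      print ""
    printf "%s\t%s", $1, $2
  }
  k = $1
}
END {
  # Terminate the last output line, unless the input was empty.
  if (NR)
    print ""
}' long.text.file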

Or just bash:

unset n last
# Split on the tab only, so that words containing spaces stay intact.
while IFS=$'\t' read -r word definition
do
    if [ "$last" = "$word" ]
    then
        printf "<br>%s" "$definition"
    else
        if [ "$n" ]
        then
            echo
        else
            n=1
        fi
        printf "%s\t%s" "$word" "$definition"
        last="$word"
    fi
done < long.text.file
echo

Related Solutions

Bash – Get lines matching a pattern in one file and put them into a second file matching the same pattern

With GNU sed:

sed -e '/^b/{R 1.txt' -e 'd}' 2.txt

To edit 2.txt in place, add sed's -i option.
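
The R command used here is a GNU extension: each time it runs, it queues the next unread line of 1.txt for output, and the following d drops the matching line of 2.txt. Where GNU sed is not available, roughly the same effect can be sketched in awk with getline. This is an assumed equivalent rather than the answer's method, and the sample file contents and the variable names `dir`, `src` and `replaced` are invented for illustration:

```shell
# Hypothetical sample files, created in a temporary directory.
dir=$(mktemp -d)
printf 'b-new-1\nb-new-2\n' > "$dir/1.txt"
printf 'a1\nb-old-1\na2\nb-old-2\n' > "$dir/2.txt"

# For each line of 2.txt starting with "b", print the next unread line of
# 1.txt instead; all other lines pass through unchanged. Once 1.txt is
# exhausted, matching lines are simply dropped.
replaced=$(awk -v src="$dir/1.txt" '
/^b/ {
  if ((getline line < src) > 0)
    print line
  next
}
{ print }' "$dir/2.txt")

printf '%s\n' "$replaced"
rm -r "$dir"
```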

Delete all lines which don’t have n characters before delimiter

$ awk '$1 ~ /^[[:xdigit:]]{6}$/' file
00107B  Cisco Systems, Inc
00906D  Cisco Systems, Inc
0090BF  Cisco Systems, Inc
000C6E  ASUSTek COMPUTER INC.
001BFC  ASUSTek COMPUTER INC.
001E8C  ASUSTek COMPUTER INC.
0015F2  ASUSTek COMPUTER INC.
001FC6  ASUSTek COMPUTER INC.
60182E  ShenZhen Protruly Electronic Ltd co.
F4CFE2  Cisco Systems, Inc
501CBF  Cisco Systems, Inc

This uses awk to extract the lines that contain exactly six hexadecimal digits in the first field. The [[:xdigit:]] pattern matches a single hexadecimal digit, and {6} requires six of them. Because the expression is anchored to the start and end of the field with ^ and $ respectively, it only matches the wanted lines.

Redirect to some file to save it under a new name.

Note that this seems to work with GNU awk (commonly found on Linux), but not with the awk on e.g. OpenBSD, or with mawk.
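
The portability problem presumably comes from the {6} interval expression (and possibly the [[:xdigit:]] class), which not every awk implementation supports. A minimal sketch of a test that avoids both by checking the field length and rejecting any non-hexadecimal character; the sample lines and the variable name `kept` are made up for illustration:

```shell
# Hypothetical sample: only the first line has exactly six hex digits in field 1.
kept=$(printf '00107B\tCisco Systems, Inc\n00107\ttoo short\n00107BZ\ttoo long\nG0107B\tnot hex\n' |
  awk 'length($1) == 6 && $1 !~ /[^0-9A-Fa-f]/')

printf '%s\n' "$kept"
```

The negated bracket expression matches any character that is not a hexadecimal digit, so a field passes only if it has six characters and none of them falls outside 0-9, A-F and a-f.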

A similar approach with sed:

$ sed -n '/^[[:xdigit:]]\{6\}\>/p' file
00107B  Cisco Systems, Inc
00906D  Cisco Systems, Inc
0090BF  Cisco Systems, Inc
000C6E  ASUSTek COMPUTER INC.
001BFC  ASUSTek COMPUTER INC.
001E8C  ASUSTek COMPUTER INC.
0015F2  ASUSTek COMPUTER INC.
001FC6  ASUSTek COMPUTER INC.
60182E  ShenZhen Protruly Electronic Ltd co.
F4CFE2  Cisco Systems, Inc
501CBF  Cisco Systems, Inc

In this expression, \> is used to match the end of the hexadecimal number. This ensures that longer numbers are not matched. The \> pattern matches the end of a word, i.e. the zero-width position between a word character and a following non-word character (or the end of the line).

To sort the resulting data, just pipe the result through sort, or sort -f if your hexadecimal numbers use both upper and lower case letters.
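
As a small illustration of the sorting step, the sketch below filters a few invented lines with mixed-case prefixes and sorts them case-insensitively. The sample data and the variable name `sorted` are hypothetical, and LC_ALL=C is set only to make the ordering predictable across systems:

```shell
# Hypothetical sample data: mixed-case prefixes, tab-separated vendor names.
sorted=$(printf '00906d\tCisco\n000C6E\tASUSTek\n00107B\tCisco\n' |
  awk 'length($1) == 6 && $1 !~ /[^0-9A-Fa-f]/' |
  LC_ALL=C sort -f)

printf '%s\n' "$sorted"
```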
