Delete all lines which don’t have n characters before delimiter

grep, sed, text formatting, text processing

I have a very long text file (from here) which should contain 6 hexadecimal characters then a 'break' (which appears as one character and doesn't seem to show up properly in the code markdown below) followed by a few words:

00107B  Cisco Systems, Inc
00906D  Cisco Systems, Inc
0090BF  Cisco Systems, Inc
5080    Cisco Systems, Inc
0E+00   ASUSTek COMPUTER INC.
000C6E  ASUSTek COMPUTER INC.
001BFC  ASUSTek COMPUTER INC.
001E8C  ASUSTek COMPUTER INC.
0015F2  ASUSTek COMPUTER INC.
2354    ASUSTek COMPUTER INC.
001FC6  ASUSTek COMPUTER INC.
60182E  ShenZhen Protruly Electronic Ltd co.
F4CFE2  Cisco Systems, Inc
501CBF  Cisco Systems, Inc

I've done some looking around and can't see something which would work in this situation. My question is, how can I use grep/sed/awk/perl to delete all lines of this text file which do not start with exactly 6 hexadecimal characters and then a 'break'?

P.S. For bonus points, what's the best way of sorting the file alphabetically and numerically according to the hex characters (i.e. 000000 -> FFFFFF)? Should I just use sort?

Best Answer

$ awk '$1 ~ /^[[:xdigit:]]{6}$/' file
00107B  Cisco Systems, Inc
00906D  Cisco Systems, Inc
0090BF  Cisco Systems, Inc
000C6E  ASUSTek COMPUTER INC.
001BFC  ASUSTek COMPUTER INC.
001E8C  ASUSTek COMPUTER INC.
0015F2  ASUSTek COMPUTER INC.
001FC6  ASUSTek COMPUTER INC.
60182E  ShenZhen Protruly Electronic Ltd co.
F4CFE2  Cisco Systems, Inc
501CBF  Cisco Systems, Inc

This uses awk to extract the lines that contain exactly six hexadecimal digits in the first field. The [[:xdigit:]] pattern matches a single hexadecimal digit, and {6} requires six of them. Because the expression is anchored to the start and end of the field with ^ and $ respectively, it only matches the wanted lines.

Redirect to some file to save it under a new name.

Note that this seems to work with GNU awk (commonly found on Linux), but not with awk on e.g. OpenBSD, or with mawk, most likely because those implementations do not support the {6} interval expression.
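If the {6} interval expression is what trips up a particular awk, the same test can be written without it. The following is only a sketch: filter_hex6 is a hypothetical helper name, and it assumes the first field is separated from the rest by whitespace, as in the question's sample.

```shell
#!/bin/sh
# Hypothetical helper: keep lines whose first field is exactly six
# hexadecimal digits, without using an interval expression.
filter_hex6() {
    # length($1) == 6        -> the field is six characters long
    # $1 !~ /[^0-9A-Fa-f]/   -> and contains no non-hex character
    awk 'length($1) == 6 && $1 !~ /[^0-9A-Fa-f]/' "$@"
}

# A few lines in the question's format (tab after the first field).
printf '00107B\tCisco Systems, Inc\n5080\tCisco Systems, Inc\n0E+00\tASUSTek COMPUTER INC.\n001FC6\tASUSTek COMPUTER INC.\n' |
    filter_hex6
```

Called as filter_hex6 file, it reads the named file instead of standard input.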

A similar approach with sed:

$ sed -n '/^[[:xdigit:]]\{6\}\>/p' file
00107B  Cisco Systems, Inc
00906D  Cisco Systems, Inc
0090BF  Cisco Systems, Inc
000C6E  ASUSTek COMPUTER INC.
001BFC  ASUSTek COMPUTER INC.
001E8C  ASUSTek COMPUTER INC.
0015F2  ASUSTek COMPUTER INC.
001FC6  ASUSTek COMPUTER INC.
60182E  ShenZhen Protruly Electronic Ltd co.
F4CFE2  Cisco Systems, Inc
501CBF  Cisco Systems, Inc

In this expression, \> matches the end of the hexadecimal number, which ensures that longer numbers are not matched. The \> pattern matches the end of a word, i.e. the zero-width position between a word character and a following non-word character (or the end of the line).
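\> is an extension rather than part of POSIX basic regular expressions, so a sed without it needs another way to mark the end of the number. One option, sketched here, is to require a whitespace character right after the six digits. The function names are hypothetical, and the approach assumes every wanted line has at least one space or tab after the hex field (a line consisting of nothing but six hex digits would be dropped). The grep variant is included because the question also asks about grep.

```shell
#!/bin/sh
# Hypothetical helpers: require whitespace after the six hex digits
# instead of relying on the \> word-boundary extension.
hex6_sed() {
    sed -n '/^[[:xdigit:]]\{6\}[[:space:]]/p' "$@"
}

hex6_grep() {
    # Same pattern as an extended regular expression.
    grep -E '^[[:xdigit:]]{6}[[:space:]]' "$@"
}

# A few lines in the question's format (tab after the first field).
printf '60182E\tShenZhen Protruly Electronic Ltd co.\n2354\tASUSTek COMPUTER INC.\nF4CFE2\tCisco Systems, Inc\n' |
    hex6_sed
```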

For sorting the resulting data, just pipe the result through sort, or sort -f if your hexadecimal numbers use both upper and lower case letters.
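A possible way to write the sorting step is shown below. This is a sketch: sort_hex is a hypothetical name, and restricting the key to the first field and forcing the C locale are additions not in the answer above. Because the hex field has a fixed width, byte order with case folding is the same as numeric order (0-9 before A-F).

```shell
#!/bin/sh
# Hypothetical helper: sort already-filtered lines by the hex field.
sort_hex() {
    # LC_ALL=C  -> plain byte-wise comparison, independent of the user's locale
    # -f        -> fold lower case to upper case, so "a" sorts like "A"
    # -k1,1     -> compare on the first field only
    LC_ALL=C sort -f -k1,1 "$@"
}

printf 'F4CFE2\tCisco Systems, Inc\n000C6E\tASUSTek COMPUTER INC.\n00107B\tCisco Systems, Inc\n' |
    sort_hex
```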

Related Solutions

How to print the inputted pattern which don’t have matching lines

Here's an sh script that produces the results you need.

#!/bin/sh

grep -f /path/to/patterns.txt /path/to/*_856_2017* | sort -u > /path/to/foundFiles.txt 

while IFS= read -r LINE
do
    # Print the matching lines; report the pattern if grep finds none.
    if ! grep -F -- "$LINE" /path/to/foundFiles.txt
    then
        echo "$LINE not found"
    fi
done < /path/to/patterns.txt

In this script, I assume you store your patterns in the file /path/to/patterns.txt; the results of your grep are written to the file /path/to/foundFiles.txt.

As you can see, the grep in the loop produces the same contents as foundFiles.txt while adding "$LINE not found" for the patterns that have no match.

I also devised a second approach to your case:

#!/bin/sh

grep -f /path/to/patterns.txt /path/to/*_856_2017* |
    sort -u > /path/to/foundFiles.txt

comm -23 /path/to/patterns.txt /path/to/foundFiles.txt |
    xargs -L 1 -I {} echo {} not found > /path/to/notFoundFiles.txt

cat /path/to/foundFiles.txt /path/to/notFoundFiles.txt > /path/to/finalList.txt

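In this case, patterns.txt must already be sorted for comm to work (foundFiles.txt is sorted by sort -u).

The comm command compares the two files and, with -23 suppressing the columns for lines unique to the second file and lines common to both, returns only the lines present solely in patterns.txt, which is the list of patterns not found by grep.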

Then, xargs grabs every line (-L 1) and echoes the line ({}) with " not found" appended to it. The result of xargs is redirected to the notFoundFiles.txt file.

Finally, you simply concatenate foundFiles.txt and notFoundFiles.txt into finalList.txt.

Print all lines that don’t have numbers, using sed

@steeldriver already explained why your attempt didn't work (should work with GNU sed, though).

But why not keep it simple? Printing all lines with only non-numeric characters is the same as dropping all lines with numeric characters:

sed '/[0-9]/d' direcciones.csv

Easier to write and easier to read, isn't it?
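The same "drop every line containing a digit" idea can also be written with grep's -v (invert match) option. This is a small sketch; no_digits is a hypothetical helper name and the sample input is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines that contain no digit.
no_digits() {
    grep -v '[0-9]' "$@"
}

# Made-up sample input; with a real file, call: no_digits direcciones.csv
printf 'Calle Mayor\nAvenida 12\nPlaza Nueva\n' | no_digits
```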
