Shell – How to remove dot character from string without calling sed or awk again

awk, regular expression, sed, shell-script, string

I have a file called hostlist.txt that contains text like this:

host1.mydomain.com
host2.mydomain.com
anotherhost
www.mydomain.com
login.mydomain.com
somehost
host3.mydomain.com

I have the following small script:

#!/usr/local/bin/bash

while read host; do
        dig +search @ns1.mydomain.com $host ALL \
        | sed -n '/;; ANSWER SECTION:/{n;p;}';
done <hostlist.txt \
        | gawk '{print $1","$NF}' >fqdn-ip.csv

Which outputs to fqdn-ip.csv:

host1.mydomain.com.,10.0.0.1
host2.mydomain.com.,10.0.0.2
anotherhost.internal.mydomain.com.,10.0.0.11
www.mydomain.com.,10.0.0.10
login.mydomain.com.,10.0.0.12
somehost.internal.mydomain.com.,10.0.0.13
host3.mydomain.com.,10.0.0.3

My question is how do I remove the . just before the comma without invoking sed or gawk again? Is there a step I can perform in the existing sed or gawk calls that will strip the dot?

hostlist.txt will contain 1000s of hosts so I want my script to be fast and efficient.

Best Answer

The sed command, the awk command, and the removal of the trailing period can all be combined into a single awk command:

while read -r host; do dig +search "$host" ALL; done <hostlist.txt | awk 'f{sub(/\.$/,"",$1); print $1", "$NF; f=0} /ANSWER SECTION/{f=1}'

Or, as spread out over multiple lines:

while read -r host
do
    dig +search "$host" ALL
done <hostlist.txt | awk 'f{sub(/\.$/,"",$1); print $1", "$NF; f=0} /ANSWER SECTION/{f=1}'

Because the awk command follows the done statement, only one awk process is invoked for the entire loop. Although efficiency may not matter much here, this is more efficient than creating a new sed or awk process on each iteration.
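For comparison, the trailing dot alone can also be removed inside the shell itself with parameter expansion, which starts no external process at all. The following is only a sketch: the line variable holds a made-up answer line rather than real dig output, and the field layout (name first, address last) is an assumption.

```shell
# Hypothetical answer line; not real dig output.
line='host1.mydomain.com. 3600 IN A 10.0.0.1'

set -f                      # turn off globbing before unquoted splitting
set -- $line                # split the line into positional parameters
name=${1%.}                 # remove one trailing dot from the first field
for ip in "$@"; do :; done  # after the loop, ip holds the last field
set +f

result="$name,$ip"
printf '%s\n' "$result"
```

For thousands of lines a single awk process is still likely to be the better fit; this form is mostly useful when the value is already in a shell variable.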

Example

With this test file:

$ cat hostlist.txt 
www.google.com
fd-fp3.wg1.b.yahoo.com

The command produces:

$ while read -r host; do dig +search "$host" ALL; done <hostlist.txt | awk 'f{sub(/\.$/,"",$1); print $1", "$NF; f=0} /ANSWER SECTION/{f=1}'
www.google.com, 216.58.193.196
fd-fp3.wg1.b.yahoo.com, 206.190.36.45

How it works

awk implicitly reads its input one record (line) at a time. This awk script uses a single variable, f, which signals whether the previous line was an answer section header or not.

f{sub(/\.$/,"",$1); print $1", "$NF; f=0}

If the previous line was an answer section header, then f will be true and the commands in curly braces are executed. The first removes the trailing period from the first field. The second prints the first field, followed by a comma and a space, followed by the last field. The third statement resets f to zero (false).

In other words, f here functions as a logical condition. The commands in curly braces are executed if f is nonzero (which, in awk, means 'true').

/ANSWER SECTION/{f=1}

If the current line contains the string ANSWER SECTION, then the variable f is set to 1 (true).

Here, /ANSWER SECTION/ serves as a logical condition. It evaluates to true if the current line matches the regular expression ANSWER SECTION. If it does, then the command in curly braces is executed.
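The flag logic can be tried without a name server by feeding awk some canned text. This is a rough sketch: the sample below only imitates the shape of dig output (the host and address are made up), and it runs the same two-rule script, with the dot in the regex escaped so that it matches only a literal period.

```shell
# Canned text standing in for dig output (hypothetical host and address).
sample=';; QUESTION SECTION:
;host1.mydomain.com.        IN  A

;; ANSWER SECTION:
host1.mydomain.com.  3600  IN  A  10.0.0.1

;; Query time: 1 msec'

# Rule 1 fires only on the line after the header, because f is set by
# rule 2 on the header line and cleared again once rule 1 has run.
result=$(printf '%s\n' "$sample" |
    awk 'f{sub(/\.$/,"",$1); print $1", "$NF; f=0} /ANSWER SECTION/{f=1}')

printf '%s\n' "$result"
```

Only the line directly after the header is printed; the question section and the trailing statistics are skipped because f is zero when awk reads them.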

Examples:

Modifying FS:

awk -F" +|;|=" '

$3 == "gene" {
    printf("%s\t%s\t%s\t%s\t%s\t%s\t\n",
    $1, $4, $5, $10, $6, $7);
}
' data.file

Using split:

awk '
$3 == "gene" {
    split($9, a, ";")
    printf("%s\t%s\t%s\t%s\t%s\t%s\t\n",
    $1, $4, $5, substr(a[1], 3), $6, $7);
}
' data.file

OFS and FS:

Set the output field separator (OFS) to a tab and assign FS inside awk instead of using -F. FS is also extended to include tabs:

awk '
BEGIN {
    FS="[ \t]+|;|="
    OFS="\t"
}
$3 == "gene" {
    print $1, $4, $5, $10, $6, $7
}

' data.file
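To see what the FS and OFS settings do, the same program can be run on a single record. The record below is invented for illustration (the real layout of data.file is not shown here), so the field numbers only line up under that assumption.

```shell
# One made-up record; the column layout is an assumption, not the real data.file.
record='chr1 src gene 100 200 . + . ID=g1;Name=foo'

result=$(printf '%s\n' "$record" | awk '
BEGIN {
    FS = "[ \t]+|;|="   # split on runs of blanks, on semicolons, and on equals signs
    OFS = "\t"          # join printed fields with a tab
}
$3 == "gene" {
    print $1, $4, $5, $10, $6, $7
}')

printf '%s\n' "$result"
```

With this record, "ID" becomes field 9 and "g1" field 10, which is why $10 is printed rather than $9.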

Also see the Open Group awk specification, sections "Variables and Special Variables" and "Examples".

The gawk manual usually notes when a feature is a gawk extension to awk.

sed Command – Remove Everything After FQDN Dot

sed uses POSIX basic regular expressions (BRE) by default. \s is a PCRE (Perl-compatible regular expression) escape; its closest BRE equivalents are [[:blank:]], which matches spaces and tabs, and [[:space:]], which matches a larger set of whitespace characters. The + is a POSIX extended regular expression (ERE) modifier, which is equivalent to \{1,\} as a BRE.

So try

sed 's/\.[[:blank:]].*//'

instead. You may replace [[:blank:]] by a space character if you don't need to match tabs:

sed 's/\. .*//'

Note that there is no need to do the substitution with the g flag as there will only ever be a single match. Also, the .+ that you use could just be replaced by .* instead of .\{1,\} as we don't care whether there are any further characters at all (just delete all of them).
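As a quick check, the BRE form and an ERE form can be compared on one line. This is a sketch on a made-up input line; the -E option, which switches sed to ERE, is supported by GNU and BSD sed.

```shell
# Hypothetical input: an FQDN with a trailing dot, followed by other fields.
line='host1.mydomain.com. 3600 IN A 10.0.0.1'

# BRE: a literal dot, one blank, then everything up to the end of the line.
bre=$(printf '%s\n' "$line" | sed 's/\.[[:blank:]].*//')

# ERE (needs -E): here + can be used directly for "one or more".
ere=$(printf '%s\n' "$line" | sed -E 's/\.[[:blank:]]+.*//')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands delete from the first dot that is followed by a blank, so the dots inside the host name are left alone.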

Why does my regular expression work in X but not in Y?
