Linux DNS – Unable to Run Queries When Response is Bigger than 512 Bytes

Tags: dig, dns, linux, networking, permissions

Situation:

  • Linux machine running in Azure
  • querying a public domain name that returns 112 A records
  • the DNS response size is 1905 bytes

Case 1:

  • querying Google DNS (8.8.8.8) returns an untruncated response. Everything is OK.

Case 2:

  • querying Azure DNS (168.63.129.16) returns a truncated response, and the client tries to switch to TCP but fails there with the error "unable to connect to server address". However, it works perfectly well if I run the query with "sudo".

The problem is reproducible every time:

  1. Without sudo:

    $ dig  aerserv-bc-us-east.bidswitch.net @8.8.8.8
    
    ; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> aerserv-bc-us-east.bidswitch.net @8.8.8.8
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49847
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 112, AUTHORITY: 0, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 512
    ;; QUESTION SECTION:
    ;aerserv-bc-us-east.bidswitch.net. IN   A
    
    ;; ANSWER SECTION:
    aerserv-bc-us-east.bidswitch.net. 119 IN CNAME  bidcast-bcserver-gce-sc.bidswitch.net.
    bidcast-bcserver-gce-sc.bidswitch.net. 119 IN CNAME bidcast-bcserver-gce-sc-multifo.bidswitch.net.
    bidcast-bcserver-gce-sc-multifo.bidswitch.net. 59 IN A 35.211.189.137
    bidcast-bcserver-gce-sc-multifo.bidswitch.net. 59 IN A 35.211.205.98
    --------
    bidcast-bcserver-gce-sc-multifo.bidswitch.net. 59 IN A 35.211.28.65
    bidcast-bcserver-gce-sc-multifo.bidswitch.net. 59 IN A 35.211.213.32
    
    ;; Query time: 12 msec
    ;; SERVER: 8.8.8.8#53(8.8.8.8)
    ;; WHEN: Thu Oct 03 22:28:09 EEST 2019
    ;; MSG SIZE  rcvd: 1905
    
    
    [azureuser@testserver~]$ dig  aerserv-bc-us-east.bidswitch.net
    ;; Truncated, retrying in TCP mode.
    ;; Connection to 168.63.129.16#53(168.63.129.16) for aerserv-bc-us-east.bidswitch.net failed: timed out.
    ;; Connection to 168.63.129.16#53(168.63.129.16) for aerserv-bc-us-east.bidswitch.net failed: timed out.
    
    ; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> aerserv-bc-us-east.bidswitch.net
    ;; global options: +cmd
    ;; connection timed out; no servers could be reached
    ;; Connection to 168.63.129.16#53(168.63.129.16) for aerserv-bc-us-east.bidswitch.net failed: timed out.
    
  2. With sudo:

    [root@testserver ~]# dig  aerserv-bc-us-east.bidswitch.net
    ;; Truncated, retrying in TCP mode.
    
    ; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> aerserv-bc-us-east.bidswitch.net
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8941
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 112, AUTHORITY: 0, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 1280
    ;; QUESTION SECTION:
    ;aerserv-bc-us-east.bidswitch.net. IN   A
    
    ;; ANSWER SECTION:
    aerserv-bc-us-east.bidswitch.net. 120 IN CNAME  bidcast-bcserver-gce-sc.bidswitch.net.
    bidcast-bcserver-gce-sc.bidswitch.net. 120 IN CNAME bidcast-bcserver-gce-sc-multifo.bidswitch.net.
    bidcast-bcserver-gce-sc-multifo.bidswitch.net. 60 IN A 35.211.56.153
    .......
    bidcast-bcserver-gce-sc-multifo.bidswitch.net. 60 IN A 35.207.61.237
    bidcast-bcserver-gce-sc-multifo.bidswitch.net. 60 IN A 35.207.23.245
    
    ;; Query time: 125 msec
    ;; SERVER: 168.63.129.16#53(168.63.129.16)
    ;; WHEN: Thu Oct 03 22:17:18 EEST 2019
    ;; MSG SIZE  rcvd: 1905
    

I checked everything I could find on the internet, but nowhere did I see an explanation of why this works as intended only when run from the root account or with sudo permissions when the response packet is too big and gets truncated, forcing the DNS query to switch from UDP to TCP.

Adding "options edns0" or "options use-vc" or "options edns0 use-vc" to /etc/resolv.conf doesn't help either.

Same behavior on CentOS 7.x, Ubuntu 16.04, and Ubuntu 18.04.

Update: tested with curl and telnet; the behavior is the same. It works with sudo or from the root account, and fails without sudo or from a standard account.

Can anyone please provide some insight into why switching from UDP to TCP requires superuser permissions, and help with a solution, if any?

UPDATE:

  • I know this is a long post, but please read it all before answering.
  • Firewall is set to allow any to any.
  • Port 53 is open on TCP and UDP in all the test environments I have.
  • SELinux/AppArmor is disabled.

Update2:

Debian 9 (kernel 4.19.0-0.bpo.5-cloud-amd64) works correctly without sudo.
RHEL 8 (kernel 4.18.0-80.11.1.el8_0.x86_64) works correctly without sudo, but with huge delays (up to 30 sec).

Update3:
Distributions I was able to test on which it does not work:

  • RHEL 7.6, kernel 3.10.0-957.21.3.el7.x86_64
  • CentOS 7.6, kernel 3.10.0-862.11.6.el7.x86_64
  • Oracle Linux 7.6, kernel 4.14.35-1902.3.2.el7uek.x86_64
  • Ubuntu 14.04, kernel 3.10.0-1062.1.1.el7.x86_64
  • Ubuntu 16.04, kernel 4.15.0-1057-azure
  • Ubuntu 18.04, kernel 5.0.0-1018-azure
  • Ubuntu 19.04, kernel 5.0.0-1014-azure
  • SLES 12 SP4, kernel 4.12.14-6.23-azure
  • SLES 15, kernel 4.12.14-5.30-azure

So, basically, the only distribution I tested that has no problems is Debian 9. Since RHEL 8 has huge delays, which may trigger timeouts, I cannot consider it fully working.

So far, the biggest difference between Debian 9 and the rest of the distributions I tested is systemd-resolved (not in use on Debian 9)… not sure how to check whether this is the cause.

Thank you!

Best Answer

"Can anyone please provide some insight about why this works like this and help with some solution, if any?"

SHORT ANSWER:

A default Azure VM is created with broken DNS: systemd-resolved needs further configuration. sudo systemctl status systemd-resolved will quickly confirm this. /etc/resolv.conf points to 127.0.0.53, a local, unconfigured stub resolver.

The local stub resolver systemd-resolved was unconfigured. It had no forwarder set, so after hitting 127.0.0.53 it had nobody else to ask. Ugh. Jump to the end to see how to configure it for Ubuntu 18.04.

If you care about how that conclusion was reached, then please read the Long Answer.

LONG ANSWER:

Why DNS Responses Are Truncated over 512 Bytes:

TCP [RFC793] is always used for full zone transfers (using AXFR) and is often used for messages whose sizes exceed the DNS protocol's original 512-byte limit.

Source: https://tools.ietf.org/html/rfc7766
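Mechanically, a server signals this by setting the TC (TrunCation) bit in the DNS header, and the client retries over TCP, where the same message is simply prefixed with a two-byte length field (RFC 1035 §4.1.1 and §4.2.2). A minimal Python sketch of that wire format (the header values are made up for illustration):

```python
import struct

def parse_dns_flags(header: bytes) -> dict:
    """Parse the flags word from a 12-byte DNS header (RFC 1035 s4.1.1)."""
    _id, flags = struct.unpack("!HH", header[:4])
    return {
        "qr": bool(flags & 0x8000),   # response bit
        "tc": bool(flags & 0x0200),   # TrunCation bit: answer didn't fit in UDP
        "rcode": flags & 0x000F,      # response code (0 = NOERROR)
    }

# Hypothetical truncated response header: ID 0x1234, QR=1 and TC=1 (0x8200)
header = struct.pack("!HHHHHH", 0x1234, 0x8200, 1, 0, 0, 0)
assert parse_dns_flags(header)["tc"] is True

# Over TCP, the same message is prefixed with a two-byte length field
# (RFC 1035 s4.2.2) so responses larger than 512 bytes can be carried.
msg = header  # stands in for a full 1905-byte response
tcp_framed = struct.pack("!H", len(msg)) + msg
```

Note that nothing in this framing is privileged: the TCP retry connects *to* remote port 53 from an ephemeral local port, which ordinary users can do.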

ANALYSIS:

This was trickier than I thought, so I spun up an Ubuntu 18.04 VM in Azure to test from the vantage point of the OP.

My starting point was to validate that nothing was choking off the DNS queries:

sudo iptables -nvx -L
sudo apparmor_status

All chains in iptables had their default policy set to ACCEPT, and although AppArmor was set to "enforcing", it wasn't enforcing anything involved with DNS. So no connectivity or permission issues were observed on the host at this point.

Next I needed to establish how the DNS queries were winding through the gears.

cat /etc/resolv.conf 

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "systemd-resolve --status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0
search ns3yb2bs2fketavxxx3qaprsna.zx.internal.cloudapp.net

So according to resolv.conf, the system expects a local stub resolver called systemd-resolved. Checking the status of systemd-resolved per the hint given in the text above we see it's erroring:

sudo systemctl status systemd-resolved

● systemd-resolved.service - Network Name Resolution
   Loaded: loaded (/lib/systemd/system/systemd-resolved.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2019-10-08 12:41:38 UTC; 1h 5min ago
     Docs: man:systemd-resolved.service(8)
           https://www.freedesktop.org/wiki/Software/systemd/resolved
           https://www.freedesktop.org/wiki/Software/systemd/writing-network-configuration-managers
           https://www.freedesktop.org/wiki/Software/systemd/writing-resolver-clients
 Main PID: 871 (systemd-resolve)
   Status: "Processing requests..."
    Tasks: 1 (limit: 441)
   CGroup: /system.slice/systemd-resolved.service
           └─871 /lib/systemd/systemd-resolved

Oct 08 12:42:14 test systemd-resolved[871]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
<Snipped repeated error entries>

/etc/nsswitch.conf sets the order of sources used to resolve DNS queries. What does this tell us?

hosts:          files dns

Well, queries will never reach the local systemd-resolved stub resolver through its native resolve NSS module, as that module is not listed in /etc/nsswitch.conf; lookups fall through to the dns module, which hands them to the unconfigured 127.0.0.53 stub from resolv.conf.
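The hosts: line can be pictured as trying each source in order until one returns an answer. A toy simulation of that walk (the lookup tables are invented stand-ins, not real NSS modules):

```python
def nss_lookup(name, order, sources):
    """Try each configured source in order; the first answer wins,
    mimicking how glibc walks the 'hosts:' line in nsswitch.conf."""
    for src in order:
        answer = sources[src](name)
        if answer is not None:
            return src, answer
    return None, None

# Invented stand-ins for the real NSS modules:
sources = {
    "files":   lambda n: "127.0.1.1" if n == "testserver" else None,  # /etc/hosts
    "dns":     lambda n: None,           # unconfigured 127.0.0.53 stub: no answer
    "resolve": lambda n: "35.211.28.65", # systemd-resolved, once configured
}

# With the default "files dns" order, an external name dead-ends at the stub:
assert nss_lookup("example.net", ["files", "dns"], sources) == (None, None)
# Prepending "resolve" gives systemd-resolved first crack at the name:
assert nss_lookup("example.net", ["resolve", "files", "dns"], sources) == ("resolve", "35.211.28.65")
```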

Are forwarders even set for the systemd-resolved stub resolver? Let's review its configuration in /etc/systemd/resolved.conf:

[Resolve]
#DNS=
#FallbackDNS=
#Domains=
#LLMNR=no
#MulticastDNS=no
#DNSSEC=no
#Cache=yes
#DNSStubListener=yes

Nope: systemd-resolved has no forwarder to ask when a local IP:name mapping is not found.

The net result of all this is:

  • /etc/nsswitch.conf sends queries to DNS if no local IP:name mapping is found in /etc/hosts

  • The DNS server to be queried is 127.0.0.53, and we just saw from its config file /etc/systemd/resolved.conf that it is not configured. With no forwarder specified there, there's no way we'll successfully resolve anything.

TESTING:

I tried to override the stub resolver 127.0.0.53 by directly specifying 168.63.129.16. This failed:

dig aerserv-bc-us-east.bidswitch.net 168.63.129.16

; <<>> DiG 9.11.3-1ubuntu1.9-Ubuntu <<>> aerserv-bc-us-east.bidswitch.net 168.63.129.16
;; global options: +cmd
;; connection timed out; no servers could be reached
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 24224
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;168.63.129.16.         IN  A

;; Query time: 13 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Tue Oct 08 13:26:07 UTC 2019
;; MSG SIZE  rcvd: 42

Nope: without a leading @, dig treats 168.63.129.16 as another name to look up rather than as the server to query. Seeing ;; SERVER: 127.0.0.53#53(127.0.0.53) in the output tells us we haven't overridden anything and the local, unconfigured stub resolver is still being used.
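The failure mode above comes from dig's documented argument convention: only an @-prefixed argument selects the server; bare arguments are treated as additional query names. A toy illustration of that parsing rule (not dig's actual code):

```python
def split_dig_args(args):
    """Split dig-style arguments into (server, names): only an @-prefixed
    argument selects the server; bare arguments become query names."""
    server, names = None, []
    for arg in args:
        if arg.startswith("@"):
            server = arg[1:]
        else:
            names.append(arg)
    return server, names

# Without the @, the IP becomes a second query name (hence the NXDOMAIN
# for ;168.63.129.16. IN A in the transcript above):
assert split_dig_args(["aerserv-bc-us-east.bidswitch.net", "168.63.129.16"]) \
    == (None, ["aerserv-bc-us-east.bidswitch.net", "168.63.129.16"])
# With the @, it actually selects the server:
assert split_dig_args(["aerserv-bc-us-east.bidswitch.net", "@168.63.129.16"]) \
    == ("168.63.129.16", ["aerserv-bc-us-east.bidswitch.net"])
```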

However, using either of the following commands (note the @) overrode the default 127.0.0.53 stub resolver and therefore succeeded in returning NOERROR results:

sudo dig aerserv-bc-us-east.bidswitch.net @168.63.129.16

or

dig +trace aerserv-bc-us-east.bidswitch.net @168.63.129.16 

So any queries that relied on using the systemd-resolved stub resolver were doomed until it was configured.

SOLUTION:

My initial (incorrect) belief was that TCP/53 was being blocked: the whole "Truncated 512" issue was a bit of a red herring. The stub resolver was not configured. I made the assumption (I know, I know: "NEVER ASSUME" ;-) ) that DNS was otherwise configured.

How to configure systemd-resolved:

Ubuntu 18.04

Edit the hosts directive in /etc/nsswitch.conf as below by prepending resolve to set systemd-resolved as the first source of DNS resolution:

hosts:          resolve files dns

Edit the DNS directive (at a minimum) in /etc/systemd/resolved.conf to specify your desired forwarder, which in this example would be:

[Resolve]
DNS=168.63.129.16

Restart systemd-resolved:

sudo systemctl restart systemd-resolved
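To confirm the change from the application side, you can exercise the system resolver (which honors the hosts: order in nsswitch.conf) rather than dig, for example with Python's getaddrinfo. The snippet below resolves localhost so it works on any machine; substitute a real external name to verify the full path through systemd-resolved:

```python
import socket

def resolve(name: str) -> list:
    """Resolve a name via the system resolver, which walks the
    'hosts:' sources configured in /etc/nsswitch.conf."""
    infos = socket.getaddrinfo(name, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# localhost is answered from /etc/hosts, so this succeeds anywhere:
assert "127.0.0.1" in resolve("localhost")
```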

RHEL 8:

Red Hat does almost everything for you with respect to setting up systemd-resolved as a stub resolver, except they didn't tell the system to use it!

Edit the hosts directive in /etc/nsswitch.conf as below by prepending resolve to set systemd-resolved as the first source of DNS resolution:

hosts:          resolve files dns

Then restart systemd-resolved:

sudo systemctl restart systemd-resolved

Source: https://www.linkedin.com/pulse/config-rhel8-local-dns-caching-terrence-houlahan/

CONCLUSION:

Once systemd-resolved was configured my test VM's DNS behaved in the expected way. I think that about does it....
