BIND9: DNS resolves sometimes (!) take very long or don’t work at all

bindbind9dns

I have a BIND 9.9.5-9+deb8u8-Raspbian DNS server running on a RPi3 in my network. It is – for everything that's not my home-zone – configured as a "forward only" with the forwarders "{ 8.8.8.8; 8.8.4.4; 208.67.222.222; 208.67.220.220; };".

a) the normal case

Usually, dns resolution works perfectly. Even when results are not in cache, they are fetched from the forward servers and delivered back to my clients often within less than 100 ms. Here's an example:

client:~ $ time nslookup faz.net
Server:     [my_server]
Address:    [my_server]#53

Non-authoritative answer:
Name:   faz.net
Address: 40.118.6.229

real    0m0.095s […]

This is how the traffic looks in tcpdump, everything's perfect as far as I can see, and DNSSEC-validation seems to work fine as well:

06:48:21.880240 IP [my_client].59563 > [my_server].domain: 614+ A? faz.net. (25)
06:48:21.881246 IP [my_server].28766 > google-public-dns-a.google.com.domain: 30021+% [1au] A? faz.net. (36)
06:48:21.916031 IP google-public-dns-a.google.com.domain > [my_server].28766: 30021 1/0/1 A 40.118.6.229 (52)
06:48:21.917093 IP [my_server].44792 > google-public-dns-a.google.com.domain: 10551+% [1au] DS? faz.net. (36)
06:48:21.952356 IP google-public-dns-a.google.com.domain > [my_server].44792: 10551 0/6/1 (757)
06:48:21.956635 IP [my_server].domain > [my_client].59563: 614 1/0/0 A 40.118.6.229 (41)

b) the problematic case

However, in some cases it takes very long – up to 14 seconds – to deliver a result. Or I don't get one at all (even though a dig @8.8.4.4 domain gives me a proper address within around 50 ms.) Here's one example:

client:~ $ time nslookup sueddeutsche.net
Server:     [my_server]
Address:    [my_server]#53

Non-authoritative answer:
Name:   sueddeutsche.net
Address: 213.61.179.40
Name:   sueddeutsche.net
Address: 213.61.179.41

real    0m4.994s […]

And this is what goes on for these 4,9 seconds in traffic via tcpdump:

06:48:47.608104 IP [my_client].53592 > [my_server].domain: 51417+ A? sueddeutsche.net. (34)
06:48:47.609158 IP [my_server].1507 > google-public-dns-a.google.com.domain: 56678+% [1au] A? sueddeutsche.net. (45)
06:48:48.809517 IP [my_server].27279 > google-public-dns-b.google.com.domain: 22525+% [1au] A? sueddeutsche.net. (45)
06:48:50.009592 IP [my_server].37926 > resolver2.opendns.com.domain: 41597+% [1au] A? sueddeutsche.net. (45)
06:48:50.081468 IP resolver2.opendns.com.domain > [my_server].37926: 41597 2/0/1 A 213.61.179.41, A 213.61.179.40 (77)
06:48:50.082768 IP [my_server].1301 > resolver2.opendns.com.domain: 24793+% [1au] DS? sueddeutsche.net. (45)
06:48:50.233748 IP resolver2.opendns.com.domain > [my_server].1301: 24793 0/1/1 (121)
06:48:50.235373 IP [my_server].57628 > resolver1.opendns.com.domain: 8635+% [1au] DS? sueddeutsche.net. (45)
06:48:50.282862 IP resolver1.opendns.com.domain > [my_server].57628: 8635 0/1/1 (121)
06:48:50.287796 IP [my_server].32127 > google-public-dns-a.google.com.domain: 924+% [1au] DS? sueddeutsche.net. (45)
06:48:51.487853 IP [my_server].61208 > google-public-dns-b.google.com.domain: 39031+% [1au] DS? sueddeutsche.net. (45)
06:48:51.547262 IP google-public-dns-b.google.com.domain > [my_server].61208: 39031 0/6/1 (766)
06:48:51.551509 IP [my_server].52786 > resolver2.opendns.com.domain: 28853+% [1au] Type32769? sueddeutsche.net.dlv.isc.org. (57)
06:48:51.589595 IP resolver2.opendns.com.domain > [my_server].52786: 28853 NXDomain 0/1/1 (125)
06:48:51.592942 IP [my_server].30477 > resolver2.opendns.com.domain: 17693+% [1au] DS? sueddeutsche.net.dlv.isc.org. (57)
06:48:51.790903 IP resolver2.opendns.com.domain > [my_server].30477: 17693 NXDomain 0/1/1 (125)
06:48:51.792342 IP [my_server].6503 > resolver1.opendns.com.domain: 17946+% [1au] DS? sueddeutsche.net.dlv.isc.org. (57)
06:48:52.005244 IP resolver1.opendns.com.domain > [my_server].6503: 17946 NXDomain 0/1/1 (125)
06:48:52.006662 IP [my_server].52356 > google-public-dns-b.google.com.domain: 39821+% [1au] DS? sueddeutsche.net.dlv.isc.org. (57)
06:48:52.334093 IP google-public-dns-b.google.com.domain > [my_server].52356: 39821 NXDomain 0/6/1 (748)
06:48:52.342161 IP [my_server].56473 > resolver1.opendns.com.domain: 17279+% [1au] Type32769? sueddeutsche.net.dlv.isc.org. (57)
06:48:52.382211 IP resolver1.opendns.com.domain > [my_server].56473: 17279 NXDomain 0/1/1 (125)
06:48:52.383674 IP [my_server].52741 > google-public-dns-b.google.com.domain: 65018+% [1au] Type32769? sueddeutsche.net.dlv.isc.org. (57)
06:48:52.424757 IP google-public-dns-b.google.com.domain > [my_server].52741: 65018 NXDomain$ 0/6/1 (748)
06:48:52.430544 IP [my_server].domain > [my_client].53592: 51417 2/0/0 A 213.61.179.40, A 213.61.179.41 (66)

Now from what I can understand, it seems that at 06:48:50.081468 my_server got a proper response from resolver2.opendns.com, while google didn't deliver anything. Then my_server asked back for DNSSEC-validation to which resolver2.opendns.com replied. Shouldn't that be the point at which my_server delivers the result back to my_client? Why doesn't it do this?

Here's a second case, for dasoertliche.de, a domain that should work without any problems:

23:39:01.569234 IP [my_client].52174 > [my_server].domain: 57873+ A? dasoertliche.de. (33)
23:39:01.570413 IP [my_server].60368 > resolver2.opendns.com.domain: 39796+% [1au] A? dasoertliche.de. (44)
23:39:01.617721 IP resolver2.opendns.com.domain > [my_server].60368: 39796 1/0/1 A 82.98.79.52 (60)
23:39:01.618712 IP [my_server].41112 > resolver2.opendns.com.domain: 47487+% [1au] DS? dasoertliche.de. (44)
23:39:01.667144 IP resolver2.opendns.com.domain > [my_server].41112: 47487 0/1/1 (100)
23:39:01.668616 IP [my_server].24077 > resolver1.opendns.com.domain: 13310+% [1au] DS? dasoertliche.de. (44)
23:39:01.854327 IP resolver1.opendns.com.domain > [my_server].24077: 13310 0/1/1 (100)
23:39:01.856006 IP [my_server].38412 > google-public-dns-a.google.com.domain: 56597+% [1au] DS? dasoertliche.de. (44)
23:39:03.056263 IP [my_server].35182 > google-public-dns-b.google.com.domain: 45370+% [1au] DS? dasoertliche.de. (44)
23:39:04.256333 IP [my_server].47744 > google-public-dns-a.google.com.domain: 50222+% [1au] DS? dasoertliche.de. (44)
23:39:04.305620 IP google-public-dns-a.google.com.domain > [my_server].47744: 50222| 0/0/1 (44)
23:39:04.306296 IP [my_server].45040 > google-public-dns-a.google.com.domain: Flags [S], seq 3961861791, win 29200, options [mss 1460,sackOK,TS val 11766745 ecr 0,nop,wscale 7], length 0
23:39:04.345835 IP google-public-dns-a.google.com.domain > [my_server].45040: Flags [S.], seq 2448404480, ack 3961861792, win 42408, options [mss 1380,sackOK,TS val 4100658423 ecr 11766745,nop,wscale 7], length 0
23:39:04.345899 IP [my_server].45040 > google-public-dns-a.google.com.domain: Flags [.], ack 1, win 229, options [nop,nop,TS val 11766749 ecr 4100658423], length 0
23:39:04.346167 IP [my_server].45040 > google-public-dns-a.google.com.domain: Flags [P.], seq 1:47, ack 1, win 229, options [nop,nop,TS val 11766749 ecr 4100658423], length 4662876+% [1au] DS? dasoertliche.de. (44)
23:39:04.385803 IP google-public-dns-a.google.com.domain > [my_server].45040: Flags [.], ack 47, win 332, options [nop,nop,TS val 4100658463 ecr 11766749], length 0
23:39:04.394945 IP google-public-dns-a.google.com.domain > [my_server].45040: Flags [P.], seq 1:752, ack 47, win 332, options [nop,nop,TS val 4100658472 ecr 11766749], length 75162876 0/6/1 (749)
23:39:04.394975 IP [my_server].45040 > google-public-dns-a.google.com.domain: Flags [.], ack 752, win 240, options [nop,nop,TS val 11766753 ecr 4100658472], length 0
23:39:04.398143 IP [my_server].45040 > google-public-dns-a.google.com.domain: Flags [F.], seq 47, ack 752, win 240, options [nop,nop,TS val 11766754 ecr 4100658472], length 0
23:39:04.401876 IP [my_server].37878 > resolver2.opendns.com.domain: 50849+% [1au] Type32769? dasoertliche.de.dlv.isc.org. (56)
23:39:04.437774 IP google-public-dns-a.google.com.domain > [my_server].45040: Flags [F.], seq 752, ack 48, win 332, options [nop,nop,TS val 4100658515 ecr 11766754], length 0
23:39:04.437858 IP [my_server].45040 > google-public-dns-a.google.com.domain: Flags [.], ack 753, win 240, options [nop,nop,TS val 11766758 ecr 4100658515], length 0
23:39:04.456088 IP resolver2.opendns.com.domain > [my_server].37878: 50849 NXDomain 0/1/1 (124)
23:39:04.457411 IP [my_server].38743 > resolver2.opendns.com.domain: 45844+% [1au] DS? dasoertliche.de.dlv.isc.org. (56)
23:39:04.658497 IP resolver2.opendns.com.domain > [my_server].38743: 45844 NXDomain 0/1/1 (124)
23:39:04.659855 IP [my_server].39296 > resolver1.opendns.com.domain: 17204+% [1au] DS? dasoertliche.de.dlv.isc.org. (56)
23:39:04.708134 IP resolver1.opendns.com.domain > [my_server].39296: 17204 NXDomain 0/1/1 (124)
23:39:04.713195 IP [my_server].55899 > google-public-dns-a.google.com.domain: 5854+% [1au] DS? dasoertliche.de.dlv.isc.org. (56)
23:39:04.780837 IP google-public-dns-a.google.com.domain > [my_server].55899: 5854 NXDomain 0/6/1 (736)
23:39:04.786940 IP [my_server].27908 > resolver1.opendns.com.domain: 4148+% [1au] Type32769? dasoertliche.de.dlv.isc.org. (56)
23:39:05.267688 IP resolver1.opendns.com.domain > [my_server].27908: 4148 NXDomain 0/1/1 (124)
23:39:05.269026 IP [my_server].38523 > google-public-dns-a.google.com.domain: 60609+% [1au] Type32769? dasoertliche.de.dlv.isc.org. (56)
23:39:06.469277 IP [my_server].58501 > google-public-dns-b.google.com.domain: 5485+% [1au] Type32769? dasoertliche.de.dlv.isc.org. (56)
23:39:06.572296 IP [my_client].52174 > [my_server].domain: 57873+ A? dasoertliche.de. (33)
23:39:07.669762 IP [my_server].52520 > google-public-dns-a.google.com.domain: 43149+% [1au] Type32769? dasoertliche.de.dlv.isc.org. (56)
23:39:07.706440 IP google-public-dns-a.google.com.domain > [my_server].52520: 43149| 0/0/1 (56)
23:39:07.706903 IP [my_server].59047 > google-public-dns-a.google.com.domain: Flags [S], seq 4227748459, win 29200, options [mss 1460,sackOK,TS val 11767085 ecr 0,nop,wscale 7], length 0
23:39:07.747595 IP google-public-dns-a.google.com.domain > [my_server].59047: Flags [S.], seq 719567413, ack 4227748460, win 42408, options [mss 1380,sackOK,TS val 3894129283 ecr 11767085,nop,wscale 7], length 0
23:39:07.747657 IP [my_server].59047 > google-public-dns-a.google.com.domain: Flags [.], ack 1, win 229, options [nop,nop,TS val 11767089 ecr 3894129283], length 0
23:39:07.747982 IP [my_server].59047 > google-public-dns-a.google.com.domain: Flags [P.], seq 1:59, ack 1, win 229, options [nop,nop,TS val 11767089 ecr 3894129283], length 5821405+% [1au] Type32769? dasoertliche.de.dlv.isc.org. (56)
23:39:07.788998 IP google-public-dns-a.google.com.domain > [my_server].59047: Flags [.], ack 59, win 332, options [nop,nop,TS val 3894129324 ecr 11767089], length 0
23:39:07.789344 IP google-public-dns-a.google.com.domain > [my_server].59047: Flags [P.], seq 1:739, ack 59, win 332, options [nop,nop,TS val 3894129325 ecr 11767089], length 73821405 NXDomain$ 0/6/1 (736)
23:39:07.789372 IP [my_server].59047 > google-public-dns-a.google.com.domain: Flags [.], ack 739, win 240, options [nop,nop,TS val 11767093 ecr 3894129325], length 0
23:39:07.790414 IP [my_server].59047 > google-public-dns-a.google.com.domain: Flags [F.], seq 59, ack 739, win 240, options [nop,nop,TS val 11767093 ecr 3894129325], length 0
23:39:07.796565 IP [my_server].domain > [my_client].52174: 57873 1/0/0 A 82.98.79.52 (49)
23:39:07.831137 IP google-public-dns-a.google.com.domain > [my_server].59047: Flags [F.], seq 739, ack 60, win 332, options [nop,nop,TS val 3894129366 ecr 11767093], length 0
23:39:07.831221 IP [my_server].59047 > google-public-dns-a.google.com.domain: Flags [.], ack 740, win 240, options [nop,nop,TS val 11767097 ecr 3894129366], length 0

Again, at 23:39:01.617721 my_server receives the proper answer from resolver2.opendns.com but for a reason I don't understand a whole flood of communication ensues between my_server and the google dns servers.

Any ideas what is happening here and how I can improve the situation?

c) Update

As per request, here's the output of free

[my_server]:~ $ free
             total       used       free     shared    buffers     cached
Mem:        996448     331696     664752      15752      29808     180668
-/+ buffers/cache:     121220     875228
Swap:            0          0          0

… and vmstat:

[my_server]:~ $ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 664752  29808 180676    0    0     0     1   20   23  0  0 100  0  0

d) Update 2:

Just found out that my /var/log/syslog has entries concerning the resolution problem with dasoertliche.de (the second problematic case). It's the following:

Dec 13 23:39:01 raspi-server named[642]:   validating @0x713c0030: de SOA: got insecure response; parent indicates it should be secure
Dec 13 23:39:01 raspi-server named[642]: error (no valid RRSIG) resolving 'dasoertliche.de/DS/IN': 208.67.220.220#53
Dec 13 23:39:01 raspi-server named[642]:   validating @0x712c0030: de SOA: got insecure response; parent indicates it should be secure
Dec 13 23:39:01 raspi-server named[642]: error (no valid RRSIG) resolving 'dasoertliche.de/DS/IN': 208.67.222.222#53
Dec 13 23:39:04 raspi-server named[642]: success resolving 'dasoertliche.de/DS' (in '.'?) after reducing the advertised EDNS UDP packet size to 512 octets
Dec 13 23:39:04 raspi-server named[642]:   validating @0x711c2040: dlv.isc.org SOA: got insecure response; parent indicates it should be secure
Dec 13 23:39:04 raspi-server named[642]:   validating @0x73401378: dlv.isc.org SOA: got insecure response; parent indicates it should be secure
Dec 13 23:39:04 raspi-server named[642]: error (no valid RRSIG) resolving 'dasoertliche.de.dlv.isc.org/DS/IN': 208.67.220.220#53
Dec 13 23:39:04 raspi-server named[642]:   validating @0x713c0030: dlv.isc.org SOA: got insecure response; parent indicates it should be secure
Dec 13 23:39:04 raspi-server named[642]: error (no valid RRSIG) resolving 'dasoertliche.de.dlv.isc.org/DS/IN': 208.67.222.222#53
Dec 13 23:39:04 raspi-server named[642]: error (insecurity proof failed) resolving 'dasoertliche.de.dlv.isc.org/DLV/IN': 208.67.220.220#53
Dec 13 23:39:05 raspi-server named[642]:   validating @0x712c0030: dlv.isc.org SOA: got insecure response; parent indicates it should be secure
Dec 13 23:39:05 raspi-server named[642]: error (insecurity proof failed) resolving 'dasoertliche.de.dlv.isc.org/DLV/IN': 208.67.222.222#53
Dec 13 23:39:07 raspi-server named[642]: success resolving 'dasoertliche.de.dlv.isc.org/DLV' (in '.'?) after reducing the advertised EDNS UDP packet size to 512 octets

Now it all starts to make a lot more sense, I believe. What I suppose happened: It seems OpenDNS does not support DNSSEC, so when my_server got back the initial (correct) resolution from OpenDNS and asked for DNSSEC, it must have figured the response to be insecure, because there was no proper DNSSEC response. Hence from 23:39:01.856006 onwards, it tried to get confirmation from google DNS. But what do the following lines in syslog and tcpdump mean exactly? Why are there seconds in which google didn't reply and what do the replies and the following exchange of information between google and my_server mean?

Best Answer

After more research, the problem appears to be the following:

The initial setup contained both forwarders that were DNSSEC-capable (GoogleDNS 8.8.8.8, 8.8.4.4) and some that weren't (OpenDNS 208.67.222.222, 208.67.220.220). I had BIND9 running with DNSSEC fully enabled, as per the following configuration:

dnssec-enable      yes;
dnssec-validation  yes;
dnssec-lookaside   auto;

a) Whenever a request (A?) was forwarded to the GoogleDNS servers, my_server got a reply (A), sent a DNSSEC-query (DS?) and got a proper answer. Resolution done, signature confirmed, all working like a charm, case closed (see above, the normal case).

b) Whenever a request (A?) was forwarded to the OpenDNS servers, my_server got a reply (A), sent a DNSSEC-query (DS?) which OpenDNS failed to answer properly as it does not support DNSSEC. So BIND9 threw an error in syslog, stating that it got insecure response and tried to get it's DNSSEC-validation elsewhere (see above, the problematic case).

I still don't fully understand what happened then, but obviously that's when the hiccups started. Now I don't know if the GoogleDNS-servers didn't like giving out DS?-answers without having served the corresponding A?-query first. But in both of the problematic cases (sueddeutsche.net, dasoertliche.de) it seems that the entries were also not properly signed so DNSSEC didn't produce correct validation. So DNSSEC-lookaside validation (DLV) was started (Type32769?) and again it all went south. No idea why.

c) Solution: After all this, I have done the following and have not yet run into any problem (so it seems the question is solved):

First, I switched forwarders to only

forwarders         { 8.8.8.8; 8.8.4.4; };

so that there's no longer a mixup of DNSSEC support. Second, I commented out

//dnssec-lookaside   auto;

because after digging through lots of tcpdumps, it appears that whenever resolution is slow, the delays are either caused by GoogleDNS taking some time to give back an answer (very rarely happens) or - regularly - happen during DLV. Since DLV is currently being phased out anyway with entries no longer available in 2017

https://www.isc.org/blogs/dlv/

I believe that's acceptable from a security-standpoint.

Now an alternative solution would be to ditch the GoogleDNS servers and only use OpenDNS as forwarders. But then I'd have to comment out all dnssec-entries mentioned above completely disabling DNSSEC, because OpenDNS does not support it. That would leave my dns queries open to attacks which would require adding an alternative layer of security like dnscrypt (as mentioned by Rui F Ribeiro). While that looks like a worthwile project (as it completely encrypts dns traffic, therefore not only keeping it unaltered but also unreadable by attackers), it's a little over my current time budget.

Any DNS experts out there who want to chime in if the explanations above make sense?