Phillip/449 iterative cache extra queries regression #452

Merged
merged 3 commits into main from phillip/449-iterative-cache-extra-queries-regression
Sep 20, 2024

Conversation

phillip-stephens
Contributor

@phillip-stephens phillip-stephens commented Sep 20, 2024

Closes #449

Context

I noticed that when querying ./zdns A google.com yahoo.com --iterative --threads=1, we make two queries to the root servers to fetch the .com gTLD servers (one per domain). This is inefficient and sends unnecessary load to the root servers.

In my investigation, I realized the regression was introduced by #413. At the time, I didn't understand what that section of code was doing; removing it both eliminated the many SERVFAIL errors we were seeing and improved performance, so it seemed like an easy win.

It turns out that code was attempting to prevent repeated lookups for the authorities, which is exactly what we're trying to do here. I'm still not sure why it caused SERVFAIL errors, but I used it as a starting point for this PR.

Description

This change is only applicable to --iterative mode.

  1. At every layer, we get the "nextAuthority". Ex: if layer = ., the next layer is .com for google.com
  2. We look up the NS record for that authority. Ex: NS .com
  3. That NS record will contain the authorities in its Answers section
  4. We use that record to populate the Authorities and check the cache for each nameserver's A and AAAA records to populate the Additionals (see the sketch below)
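
To make the flow concrete, here is a minimal Go sketch of the four steps above. The cache layout and all names (Cache, CachedRecord, nextAuthority, cachedDelegation) are hypothetical and simplified; they do not reflect ZDNS's actual internals:

```go
package main

import (
	"fmt"
	"strings"
)

// CachedRecord is a simplified stand-in for a cached DNS record.
type CachedRecord struct {
	Name, Type, Answer string
}

// Cache maps "name/type" keys to record sets (hypothetical layout).
type Cache map[string][]CachedRecord

func (c Cache) Get(name, rrType string) ([]CachedRecord, bool) {
	recs, ok := c[name+"/"+rrType]
	return recs, ok
}

// nextAuthority returns the delegation point one label closer to name.
// Ex: layer "." and name "google.com" -> "com".
func nextAuthority(name, layer string) string {
	name = strings.TrimSuffix(name, ".")
	if layer == "." || layer == "" {
		labels := strings.Split(name, ".")
		return labels[len(labels)-1]
	}
	rest := strings.TrimSuffix(name, "."+layer)
	labels := strings.Split(rest, ".")
	return labels[len(labels)-1] + "." + layer
}

// cachedDelegation tries to satisfy the referral for one layer entirely
// from cache: the cached NS set becomes the Authorities, and any cached
// A/AAAA records for those nameservers become the Additionals.
func cachedDelegation(c Cache, name, layer string) (auths, additionals []CachedRecord, ok bool) {
	auth := nextAuthority(name, layer)
	nsRecs, found := c.Get(auth, "NS")
	if !found {
		return nil, nil, false // cache miss: must go to the wire
	}
	for _, ns := range nsRecs {
		auths = append(auths, ns)
		for _, t := range []string{"A", "AAAA"} {
			if glue, hit := c.Get(ns.Answer, t); hit {
				additionals = append(additionals, glue...)
			}
		}
	}
	return auths, additionals, true
}

func main() {
	c := Cache{
		"com/NS":               {{Name: "com", Type: "NS", Answer: "a.gtld-servers.net"}},
		"a.gtld-servers.net/A": {{Name: "a.gtld-servers.net", Type: "A", Answer: "192.5.6.30"}},
	}
	auths, adds, hit := cachedDelegation(c, "google.com", ".")
	fmt.Println(hit, auths, adds)
}
```

On a cache miss for the NS set, the resolver still has to go to the wire; the point of this PR is that the second .com domain in a batch hits the cached NS set instead of re-querying the roots.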

Simple Test Case

The test case laid out in the open issue, ./zdns A google.com yahoo.com --iterative --threads=1, now passes: we only query the root servers once for the two .com domains.

$ ./zdns A google.com yahoo.com --iterative --threads=1
{"name":"google.com","results":{"A":{"data":{"answers":[{"answer":"142.250.190.14","class":"IN","name":"google.com","ttl":300,"type":"A"}],"protocol":"udp","resolver":"216.239.34.10:53"},"duration":0.134841792,"status":"NOERROR","timestamp":"2024-09-20T14:25:53-04:00"}}}
{"name":"yahoo.com","results":{"A":{"data":{"answers":[{"answer":"74.6.231.20","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"74.6.143.25","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"98.137.11.163","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"98.137.11.164","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"74.6.143.26","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"74.6.231.21","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"}],"protocol":"udp","resolver":"68.180.131.16:53"},"duration":0.073707208,"status":"NOERROR","timestamp":"2024-09-20T14:25:53-04:00"}}}

Performance

This approach means many more cache lookups. In the case of .com, there are 13 gTLD servers, each with an A and an AAAA record, so a cached delegation costs 1 (NS) + 13 (A) + 13 (AAAA) = 27 cache lookups. The benefit is eliminating an unnecessary over-the-wire DNS query.
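
As a quick generalization of that arithmetic: a zone served by n nameservers costs 1 + 2n cache lookups per cached delegation. A small illustrative Go helper (not ZDNS code):

```go
// cacheLookupsPerDelegation returns the number of cache lookups needed
// to assemble one cached referral for a zone served by n nameservers:
// one NS lookup plus one A and one AAAA lookup per nameserver.
func cacheLookupsPerDelegation(n int) int {
	return 1 + 2*n // .com: n = 13 -> 27 lookups
}
```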

To measure performance, I compared against main, using iptables to count outgoing UDP packets to port 53. The results look very promising: 42% fewer packets sent and a 20% reduction in runtime.

Setup

sudo iptables -A OUTPUT -p udp --dport 53 -j ACCEPT - install the packet-counting rule
sudo iptables -L OUTPUT -v -n - check the number of UDP packets
sudo iptables -Z OUTPUT - clear the packet counters

7k domains, A lookups, --iterative --threads=100

main

  • 41,260 packets
  • 35.94 s
  • 7 timeouts
  • 79 failures

Phillip/449

  • 23,773 packets
  • 28.81 s
  • 6 timeouts
  • 80 failures

Accuracy

Testing accuracy with the cache is a bit tricky. One observation: over time the cache fills up, so it becomes less and less likely that a lookup has to go to the wire. These later, cache-served lookups therefore have the greatest chance of being incorrect.

As a test, I used a domain from the current.csv CrUX dataset, order.starkeypro.com.

The result from a single lookup for this domain:

echo "order.starkeypro.com" | ./zdns A --iterative
{"name":"order.starkeypro.com","results":{"A":{"data":{"answers":[{"answer":"40.67.187.107","class":"IN","name":"order.starkeypro.com","ttl":600,"type":"A"}],"protocol":"udp","resolver":"167.100.118.217:53"},"duration":4.099243666,"status":"NOERROR","timestamp":"2024-09-20T18:18:05Z"}}}

This was identical to the result for the same domain after 7k other domains had been looked up first:

$ head -n 7000 benchmark/10k_crux_top_domains.input <(echo "order.starkeypro.com") | ./zdns A --iterative --output-file=out


$ grep "starkeypro" out
{"name":"order.starkeypro.com","results":{"A":{"data":{"answers":[{"answer":"40.67.187.107","class":"IN","name":"order.starkeypro.com","ttl":600,"type":"A"}],"protocol":"udp","resolver":"167.100.118.217:53"},"duration":4.083825326,"status":"NOERROR","timestamp":"2024-09-20T18:20:29Z"}}}

@phillip-stephens phillip-stephens marked this pull request as ready for review September 20, 2024 18:32
@phillip-stephens phillip-stephens requested a review from a team as a code owner September 20, 2024 18:32
@zakird zakird merged commit dc826cd into main Sep 20, 2024
3 checks passed
@phillip-stephens phillip-stephens deleted the phillip/449-iterative-cache-extra-queries-regression branch September 20, 2024 19:10
Successfully merging this pull request may close these issues: [Bug, Regression] Unnecessary queries to root nameservers in --iterative (#449)