Phillip/449 iterative cache extra queries regression #452

Merged
merged 3 commits into main from phillip/449-iterative-cache-extra-queries-regression
Sep 20, 2024

Conversation

phillip-stephens
Contributor

@phillip-stephens phillip-stephens commented Sep 20, 2024

Closes #449

Context

I noticed that when querying ./zdns A google.com yahoo.com --iterative --threads=1, we make two queries to the root servers to fetch the .com gTLD servers (one per domain). This is inefficient and sends unnecessary load to the root servers.

In my investigation, I realized the regression was introduced by #413. At the time, I didn't understand what that section of code was doing; removing it both eliminated the many SERVFAIL errors we were seeing and improved performance, so it seemed like an easy win.

It turns out that code was attempting to prevent repeated lookups for the authorities, which is exactly what we're trying to do here. I'm still not sure why it caused SERVFAIL errors, but I used it as a starting point for this PR.

Description

This change is only applicable to --iterative mode.

  1. At every layer, we get the "nextAuthority". Ex: if layer = ., the next layer is .com for google.com
  2. We look up the NS record for that authority. Ex: NS .com
  3. That NS record will contain the authorities in its Answers section
  4. We use that record to populate the Authorities and check the cache for each nameserver's A and AAAA records to populate the Additionals (see the sketch below)
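
To make the flow concrete, here is a minimal Go sketch of the four steps above. The cache layout and all names (Cache, CachedRecord, nextAuthority, cachedDelegation) are hypothetical and simplified; they do not reflect ZDNS's actual internals:

```go
package main

import (
	"fmt"
	"strings"
)

// CachedRecord is a simplified stand-in for a cached DNS record.
type CachedRecord struct {
	Name, Type, Answer string
}

// Cache maps "name/type" keys to record sets (hypothetical layout).
type Cache map[string][]CachedRecord

func (c Cache) Get(name, rrType string) ([]CachedRecord, bool) {
	recs, ok := c[name+"/"+rrType]
	return recs, ok
}

// nextAuthority returns the delegation point one label closer to name.
// Ex: layer "." and name "google.com" -> "com".
func nextAuthority(name, layer string) string {
	name = strings.TrimSuffix(name, ".")
	if layer == "." || layer == "" {
		labels := strings.Split(name, ".")
		return labels[len(labels)-1]
	}
	rest := strings.TrimSuffix(name, "."+layer)
	labels := strings.Split(rest, ".")
	return labels[len(labels)-1] + "." + layer
}

// cachedDelegation tries to satisfy the referral for one layer entirely
// from cache: the cached NS set becomes the Authorities, and any cached
// A/AAAA records for those nameservers become the Additionals.
func cachedDelegation(c Cache, name, layer string) (auths, additionals []CachedRecord, ok bool) {
	auth := nextAuthority(name, layer)
	nsRecs, found := c.Get(auth, "NS")
	if !found {
		return nil, nil, false // cache miss: must go to the wire
	}
	for _, ns := range nsRecs {
		auths = append(auths, ns)
		for _, t := range []string{"A", "AAAA"} {
			if glue, hit := c.Get(ns.Answer, t); hit {
				additionals = append(additionals, glue...)
			}
		}
	}
	return auths, additionals, true
}

func main() {
	c := Cache{
		"com/NS":               {{Name: "com", Type: "NS", Answer: "a.gtld-servers.net"}},
		"a.gtld-servers.net/A": {{Name: "a.gtld-servers.net", Type: "A", Answer: "192.5.6.30"}},
	}
	auths, adds, hit := cachedDelegation(c, "google.com", ".")
	fmt.Println(hit, auths, adds)
}
```

On a cache miss for the NS set, the resolver still has to go to the wire; the point of this PR is that the second .com domain in a batch hits the cached NS set instead of re-querying the roots.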

Simple Test Case

The test case laid out in the open issue, ./zdns A google.com yahoo.com --iterative --threads=1, now passes: we only query the root servers once for the two .com domains.

$ ./zdns A google.com yahoo.com --iterative --threads=1
{"name":"google.com","results":{"A":{"data":{"answers":[{"answer":"142.250.190.14","class":"IN","name":"google.com","ttl":300,"type":"A"}],"protocol":"udp","resolver":"216.239.34.10:53"},"duration":0.134841792,"status":"NOERROR","timestamp":"2024-09-20T14:25:53-04:00"}}}
{"name":"yahoo.com","results":{"A":{"data":{"answers":[{"answer":"74.6.231.20","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"74.6.143.25","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"98.137.11.163","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"98.137.11.164","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"74.6.143.26","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"},{"answer":"74.6.231.21","class":"IN","name":"yahoo.com","ttl":1800,"type":"A"}],"protocol":"udp","resolver":"68.180.131.16:53"},"duration":0.073707208,"status":"NOERROR","timestamp":"2024-09-20T14:25:53-04:00"}}}

Performance

This approach means many more cache lookups. In the case of .com, there are 13 gTLD servers, each with an A and an AAAA record, so a cached delegation costs 1 (NS) + 13 (A) + 13 (AAAA) = 27 cache lookups. The benefit is eliminating an unnecessary over-the-wire DNS query.
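
As a quick generalization of that arithmetic: a zone served by n nameservers costs 1 + 2n cache lookups per cached delegation. A small illustrative Go helper (not ZDNS code):

```go
// cacheLookupsPerDelegation returns the number of cache lookups needed
// to assemble one cached referral for a zone served by n nameservers:
// one NS lookup plus one A and one AAAA lookup per nameserver.
func cacheLookupsPerDelegation(n int) int {
	return 1 + 2*n // .com: n = 13 -> 27 lookups
}
```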

To measure performance, I compared against main, using iptables to count outgoing UDP packets to port 53. The results look very promising: 42% fewer packets sent and a 20% reduction in runtime.

Setup

sudo iptables -A OUTPUT -p udp --dport 53 -j ACCEPT - install the packet-counting rule
sudo iptables -L OUTPUT -v -n - check the number of UDP packets
sudo iptables -Z OUTPUT - clear the packet counters

7k domains, A lookups, --iterative --threads=100

main

  • 41,260 packets
  • 35.94 s
  • 7 timeouts
  • 79 failures

Phillip/449

  • 23,773 packets
  • 28.81 s
  • 6 timeouts
  • 80 failures

Accuracy

Testing accuracy with the cache is a bit tricky. One observation: over time the cache fills up, so it becomes less and less likely that a lookup has to go to the wire. These later, cache-served lookups therefore have the greatest chance of being incorrect.

As a test, I used a domain from the current.csv CrUX dataset, order.starkeypro.com.

The result from a single lookup for this domain:

echo "order.starkeypro.com" | ./zdns A --iterative
{"name":"order.starkeypro.com","results":{"A":{"data":{"answers":[{"answer":"40.67.187.107","class":"IN","name":"order.starkeypro.com","ttl":600,"type":"A"}],"protocol":"udp","resolver":"167.100.118.217:53"},"duration":4.099243666,"status":"NOERROR","timestamp":"2024-09-20T18:18:05Z"}}}

This was identical to the result for the same domain after 7k other domains had been looked up first:

$ head -n 7000 benchmark/10k_crux_top_domains.input <(echo "order.starkeypro.com") | ./zdns A --iterative --output-file=out


$ grep "starkeypro" out
{"name":"order.starkeypro.com","results":{"A":{"data":{"answers":[{"answer":"40.67.187.107","class":"IN","name":"order.starkeypro.com","ttl":600,"type":"A"}],"protocol":"udp","resolver":"167.100.118.217:53"},"duration":4.083825326,"status":"NOERROR","timestamp":"2024-09-20T18:20:29Z"}}}

@phillip-stephens phillip-stephens marked this pull request as ready for review September 20, 2024 18:32
@phillip-stephens phillip-stephens requested a review from a team as a code owner September 20, 2024 18:32
@zakird zakird merged commit dc826cd into main Sep 20, 2024
3 checks passed
@phillip-stephens phillip-stephens deleted the phillip/449-iterative-cache-extra-queries-regression branch September 20, 2024 19:10
Successfully merging this pull request may close these issues: [Bug, Regression] Unnecessary queries to root nameservers in --iterative (#449)