kube-dns never resolves if a domain returns NOERROR with 0 answer records once #121
Comments
@ahmetb I think it's legal in DNS to cache if we get an rcode == 0 response with 0 entries. This looks to be the behavior of the Cloudflare server (sending rcode 0 instead of NXDOMAIN). It looks like the TTL was around 30 minutes for the DNS record. If the record is going to be changing, it would be advisable to reduce the TTL to get faster cache updates.
@bowei I think @viglesiasce reproduced this with Google Domains (or Cloud DNS) too. In my experience, the cache was not invalidated even after 24 hours when I left it at that.
Linking to #119 with respect to Cloudflare.
RCODE=0 with no answer records is the NODATA pseudo-rcode. For the purpose of caching, it shouldn't be treated differently from NXDOMAIN, with one exception: it doesn't say anything about non-existence of names below the requested name. See https://tools.ietf.org/html/rfc2308#section-2.2 for guidance. It's possibly related to miekg/dns#428
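To make the distinction above concrete, here is a minimal sketch of how a client might classify a response from its RCODE and answer count. The names `Kind` and `classify` are hypothetical and not taken from kube-dns, dnsmasq, or miekg/dns; the sketch also ignores referrals, which can look like NODATA on the wire (RFC 2308 section 2.2 discusses how to tell them apart).

```python
from enum import Enum

# RCODE values as assigned in RFC 1035.
NOERROR = 0
NXDOMAIN = 3


class Kind(Enum):
    POSITIVE = "positive"  # NOERROR with at least one answer record
    NODATA = "nodata"      # NOERROR with an empty answer section
    NXDOMAIN = "nxdomain"  # the name itself does not exist
    OTHER = "other"        # SERVFAIL, REFUSED, and so on


def classify(rcode: int, answer_count: int) -> Kind:
    """Classify a DNS response (simplified: referrals are not detected)."""
    if rcode == NXDOMAIN:
        return Kind.NXDOMAIN
    if rcode == NOERROR:
        return Kind.POSITIVE if answer_count > 0 else Kind.NODATA
    return Kind.OTHER
```

Under this sketch, the response described in this issue (`status: NOERROR`, `ANSWER: 0`) falls into the `NODATA` bucket, which is a negative answer even though the status is not an error.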
I reported this to various folks at CloudFlare, still waiting for a response. However, if anyone can help pinpoint where the caching happens, under what circumstances and why it lasts so long (i.e. >24h or in my experience, indefinitely), those would help fix this problem, too.
I work at Cloudflare, so I'm happy to answer any questions. It's not, however, specific to Cloudflare DNS; NODATA is a kind of answer you get from an authoritative server when the requested name exists, but the record type you're looking for doesn't, which is quite common. RFC 2308, which I linked, provides a guideline on how clients should implement negative caching for all cases of negative answers. Hope that helps.
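As a rough illustration of the negative caching guideline mentioned above: as I read RFC 2308 section 5, the lifetime of a cached negative answer is the smaller of the SOA record's own TTL and its MINIMUM field, and a negative answer that carries no SOA should not be cached. The `Soa` type and `negative_ttl` function below are hypothetical names for this sketch, not part of any resolver's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Soa:
    ttl: int      # TTL of the SOA record in the authority section, in seconds
    minimum: int  # SOA MINIMUM field, in seconds


def negative_ttl(soa: Optional[Soa]) -> int:
    """Return how long a negative answer may be cached; 0 means do not cache."""
    if soa is None:
        # Assumption based on RFC 2308 section 5: without an SOA there is
        # no bound on the lifetime, so the answer is not cached at all.
        return 0
    return min(soa.ttl, soa.minimum)
```

For example, `negative_ttl(Soa(ttl=1800, minimum=300))` gives 300 seconds, so a cache following this rule would re-query the name after five minutes rather than holding the empty answer indefinitely.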
Got an answer from the CloudFlare support:
We should look at fixing the caching behavior in kube-dns (or miekg/dns, or wherever it is) as a mitigation. Not caching 0-record answers sounds like it would yield a low-impact cache-miss rate to me. @bowei thoughts?
@ahmetb caching is done with dnsmasq (http://www.thekelleys.org.uk/dnsmasq/doc.html) with no special tunables. Maybe there is a flag that can disable caching that response? I'm surprised this does not impact more people, not just users of Kubernetes. dnsmasq is a popular piece of software and the standard resolver on some Linux distros.
I have this exact same issue (using kube-lego on GCE), but with Google Cloud DNS. kube-lego cannot resolve my domain when trying to request the token in order to issue a certificate. Outside of any kube pod, the domain name resolves fine. When digging the domain within the pod it still gets
I tried restarting the kube-dns with
but to no avail. Is there anything I can do to expedite invalidating the DNS cache, or is it a matter of waiting it out? (It's been close to 24h for me.)
Can you post the output of
From the pod:
but i will occasionally get this answer instead:
This SOA is from AWS, which was my prior DNS provider.
I will try playing with dnsmasq flags and see if we can change its negative caching behavior.
@bowei Any luck?
Looks like
@coresolve whoa this is amazing. @bowei do you think it's sensible to incorporate this as a default in kube-dns distribution?
yes, since we don't enable neg caching
yes, that should be a one-line change to the yaml
Opened kubernetes/kubernetes#53604 to add this
Has anyone looked into the impact of removing negative caching on the volume of DNS requests that now need to be resolved again and again?
@miekg I think we don't know what this change will break. However, unless it is changed, a lot of software that relies on domains eventually resolving stays broken. I'm not sure if we have enough tools to answer this question properly.
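One rough way to reason about the trade-off raised above is a toy model. The sketch below is not how dnsmasq works internally; `ToyCache`, `simulate`, the name `example.test`, the address `192.0.2.1`, and the 10-second polling interval are all assumptions made up for illustration. It models a client that polls a name until it resolves (as kube-lego is described doing in this issue), and counts how many queries reach the upstream server with and without negative caching.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class ToyCache:
    """Toy negative cache; positive answers are deliberately not cached."""

    def __init__(self, upstream: Callable[[str], List[str]],
                 negcache: bool, neg_ttl: float) -> None:
        self.upstream = upstream
        self.negcache = negcache
        self.neg_ttl = neg_ttl
        self.negative: Dict[str, float] = {}  # name -> expiry time
        self.upstream_queries = 0

    def lookup(self, name: str, now: float) -> List[str]:
        expiry = self.negative.get(name)
        if expiry is not None and now < expiry:
            return []  # served from the negative cache
        self.upstream_queries += 1
        answers = self.upstream(name)
        if not answers and self.negcache:
            self.negative[name] = now + self.neg_ttl
        else:
            self.negative.pop(name, None)
        return answers


def simulate(negcache: bool, neg_ttl: float, record_created_at: int = 90,
             polls: Iterable[int] = range(0, 300, 10)
             ) -> Tuple[int, Optional[int]]:
    """Poll until the name resolves; return (upstream queries, resolve time)."""
    zone: Dict[str, List[str]] = {}
    cache = ToyCache(lambda name: zone.get(name, []), negcache, neg_ttl)
    for t in polls:
        if t >= record_created_at:
            zone["example.test"] = ["192.0.2.1"]  # the A record appears
        if cache.lookup("example.test", t):
            return cache.upstream_queries, t
    return cache.upstream_queries, None
```

In this model, `simulate(negcache=False, neg_ttl=60)` sends 10 upstream queries and resolves at t=90, `simulate(negcache=True, neg_ttl=60)` sends 3 and resolves at t=120, and `simulate(negcache=True, neg_ttl=float("inf"))` sends 1 and never resolves, which mirrors the behavior reported in this issue. The numbers only show the shape of the trade-off (more upstream queries versus bounded staleness), not real-world volumes.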
Automatic merge from submit-queue (batch tested with PRs 53604, 53751). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add no-negcache flag to kube-dns
**What this PR does / why we need it**: Adds the [`--no-negcache`](https://linux.die.net/man/8/dnsmasq) flag to prevent dnsmasq from caching negative (NXDOMAIN) responses. More details on why this is desirable [here](kubernetes/dns#121).
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes/dns#121
**Special notes for your reviewer**: Thanks to @rsmitty (https://rsmitty.github.io/KubeDNS-Tweaks/) and @coresolve (kubernetes/dns#121 (comment)) for pointing us in the right direction.
**Release note**:
```release-note
Add --no-negcache flag to kube-dns to prevent caching of NXDOMAIN responses.
```
Why did we disable neg-caching as default instead of setting reasonable TTL value with
tl;dr If a nameserver replies status=NOERROR with no answer section to a DNS A question, kube-dns always caches this result. If the domain name actually gets an A record after it's queried through kube-dns, it never (I waited a few days) resolves from the pods, but does resolve outside the container (e.g. on my laptop) just fine.
Repro steps
Prerequisites
`alp.im` and the nameservers are pointed to CloudFlare.
`gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1`
Step 1: Domain does not exist, query from your laptop
Note `ANSWER: 0` and `status: NOERROR`.
Step 2: Domain does not exist, query from Pod on Kubernetes
Start a toolbelt/dig container with shell and run the same query:
Note the response is the same, `ANSWER: 0` and `NOERROR`. (Also note that `SERVER: 10.0.0.10#53`, which is kube-dns.)
Step 3: Create an A record for the domain
Here I use CloudFlare as it manages my DNS.
Step 4: Test DNS record from your laptop
Run `dig` on your laptop (note `;; ANSWER SECTION:` and the `8.8.8.8` answer):
Step 5: Test DNS record from Pod on Kubernetes
Run the same command again:
Note the diff:
`ANSWER: 0` and `status: NOERROR` (but it resolves just fine outside the cluster)
`;; AUTHORITY SECTION:` disappeared and `AUTHORITY:` changed to `0` from the previous time we ran this.
`;; Query time: 0 msec` (was 79 ms); I assume it means it's just a cached response.
What else I tried
Try it on GKE: I tried with k8s v1.5.x and v1.6.4. → Same issue. (cc: @bowei)
Query from a different pod on minikube: I started a new Pod and queried from there → Same issue.
Restart kube-dns Pod → This worked on GKE, but not on minikube.
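The signals called out in Step 5 (status, `ANSWER` count, `AUTHORITY` count, query time, and server) can be pulled out of `dig` output mechanically. The sketch below is only a heuristic under stated assumptions: `summarize_dig` and `looks_like_cached_nodata` are hypothetical helper names, and `SAMPLE` is a hand-written excerpt in `dig`'s usual layout that reuses the values quoted in this issue; it is not the author's actual output, and the `id` value is made up.

```python
import re
from typing import Dict, Optional

# Hand-written sample in dig's usual layout; NOT output captured from this issue.
SAMPLE = """\
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; Query time: 0 msec
;; SERVER: 10.0.0.10#53(10.0.0.10)
"""


def _find(pattern: str, text: str) -> Optional[str]:
    match = re.search(pattern, text)
    return match.group(1) if match else None


def _to_int(value: Optional[str]) -> Optional[int]:
    return int(value) if value is not None else None


def summarize_dig(output: str) -> Dict[str, object]:
    """Extract the header fields discussed in Step 5 from dig output."""
    return {
        "status": _find(r"status: (\w+)", output),
        "answer": _to_int(_find(r"ANSWER: (\d+)", output)),
        "authority": _to_int(_find(r"AUTHORITY: (\d+)", output)),
        "query_time_msec": _to_int(_find(r"Query time: (\d+) msec", output)),
        "server": _find(r"SERVER: ([^\s(]+)", output),
    }


def looks_like_cached_nodata(summary: Dict[str, object]) -> bool:
    """Heuristic only: an empty NOERROR answer returned in 0 msec."""
    return (summary["status"] == "NOERROR"
            and summary["answer"] == 0
            and summary["query_time_msec"] == 0)
```

A 0 msec query time does not prove the answer came from a cache; as noted above, it is only an assumption that fits the observed behavior.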
Impact
I am not sure why this has not been discovered before. I noticed this behavior while using kube-lego on GKE. Once kube-lego applies for a TLS certificate, it polls the domain name of the service (e.g. `example.com/.well-known/<token>`) before asking Let's Encrypt to validate it. Before I create an Ingress with the kube-lego annotation, I don't have the external IP yet, so I can't configure the domain, but the kube-lego Pod already picks it up and starts querying my domain in an infinite loop. It never succeeds because the first time it looked up the hostname, the A record didn't exist, so that result is cached forever. After I add the A record, it still can't resolve. The moment I delete the kube-dns Pods and they get recreated, it immediately starts working, resolves the hostname, and completes the kube-lego challenge.