HomeAssistant CoreDNS timeout settings #3

bentasker · 2021-11-05T08:59:43Z

As detailed here some users might still experience delays/lag if their external connectivity drops.

It'll depend on what the upstream timeouts on their local DNS is - obviously if they're using a router, then it may not be configurable (and the router might be a little too dumb to know that it's upstream link is down).

Back within the sphere of HomeAssistant, I was wondering if we can tweak things there so mitigate for users affected by that.

Unfortunately, CoreDNS uses a DialTimeout of 30s (here).

Actually, worse than that, it dynamically adjusts, which probably goes a long way to explain the packet storms that HomeAssistant users have reported when Cloudflare was unreachable.

Unfortunately, timeout isn't a configurable - we'd need to make changes in the source and then recompile coredns with those. Far from ideal.

The text was updated successfully, but these errors were encountered:

bentasker · 2021-11-05T09:44:26Z

The idea of having to rebuild coredns to change this is insane.

However, it is doable, and led me onto a much wider thought.

Actually, if push comes to shove, we've a fair bit of flexibility in terms of what's running on the system despite not having access to the underlying OS (that's achievable too, I've already popped the host from an add-on container, it's just not something I'm gonna productise).

We have access to the docker daemon, which means we can build and tag images to our heart's content. So rather than trying to patch things via exec we can actually go one further:

Include a dockerfile that reflects the official container (for example)
Make the necessary changes in that file
call docker build against it
Tag the built image with the official tag based on what's running (so for example ghcr.io/home-assistant/armv7-hassio-dns:2021.06.0)

Then, when supervisor restarts the add-on, we'll be running our modified version. Obviously, after any update the process would need to be re-run.

It also means that an alternative fix for #1 would be to build an entirely new supervisor container with auto-update patched out and calls to CodeNotary in https://github.com/home-assistant/supervisor/blob/main/supervisor/utils/codenotary.py patched out.

That's very much a nuclear option though, we could just as easily push across a rewrite to that file so that exceptions CodeNotaryError, CodeNotaryUntrusted, PwnedError can never be thrown.

Having dug into it, it really is the most pointless "protection" - the code is checking it's own homework, so anyone with the ability to patch the code being protected also has the ability to patch out the checks. It's effectively just security theatre that impacts users most.

alexdelprete · 2021-11-09T17:22:42Z

Gentlemen (@fenichelar - @Zixim): probably there's a solution to our neverending battle for a proper HA DNS implementation: https://github.com/bentasker/HomeAssistantAddons/tree/master/core-dns-override

@bentasker you have no idea how long we have been waiting for an auto-patcher, we've been patching that configuration manually every time. They didn't even approve a slight modification like this one, let alone all the ones (#56 to #61) that @fenichelar tried to push. It's incredible they're not even able to acknowledge they have a huge DNS problem.

I'm going to install your addon immediately and I hope it solves the problem. Thank you so much. :)

bentasker · 2021-11-09T17:51:24Z

@alexdelprete Glad I could help.

Let me know if you encounter any issues with it - I've tried to be relatively conservative in terms of what it does, but if that causes an issue then there's another more heavy-handed approach I can take (which does come with the benefit that it won't revert when the container is restarted).

It's incredible they're not even able to acknowledge they have a huge DNS problem.

I've done what I can to feed a detailed analysis of the issue and underlying cause back upstream (https://github.com/home-assistant/plugin-dns/issues/ 64) but based on how lightly other PRs/Issues have been discarded, I'm not going to hold my breath. The idea that they've ignored an issue of this scale for this long is more than a bit bewildering to be honest.

I also have supervisor auto-updates on my radar. #1 implements an initial blocker to them, but it means that the Supervisor "Add on store" doesn't display after a reboot (it blocks trying to pull information by the looks of it). There are a couple of ways I can go about fixing that, I've not quite settled on which I prefer yet.

alexdelprete · 2021-11-09T18:00:57Z

there's another more heavy-handed approach I can take (which does come with the benefit that it won't revert when the container is restarted).

I read your comments about that, I think it's a little bit too much. :)

I hope this smarter approach works fine, will let you know...thanks a lot.

I'm not going to hold my breath. The idea that they've ignored an issue of this scale for this long is more than a bit bewildering to be honest.

I doubt they will do anything. In our last attempt, I even engaged the big boss, and he said they were not even willing to discuss it anymore. Case closed. I won't even bother to comment anymore, they wanted to suspend me because I was simply showing them how naive their approach was vs the problem: a big elephant in the room, and they don't see it.

fenichelar · 2021-11-09T18:54:21Z

I started work on a PR to create a config option that allowed completely disabling Cloudflare DOT, however, it became pretty clear that I was not going to be able to get this merged so I gave up.

FYI, I was able to solve all of my problems by NATing 1.1.1.1 and 1.0.0.1 to my private DNS resolver. The NAT is setup for both TCP/UDP port 53 and TCP port 853. TLS certificate verification fails because my DOT server obviously does not have a valid TLS certificate for cloudflare-dns.com, but I do not see any more packet storms, have much faster DNS resolution, and do not see any issues with internal domains. I have been running this setup for at least a month without issue.

bentasker · 2021-11-09T20:02:47Z

I started work on a PR to create a config option that allowed completely disabling Cloudflare DOT

I think it was one of your PRs I first saw actually - lots of work had gone into it, but then the devs came back with a bunch of requirements no-one sane could accept: "do all this work, and ensure we mark anyone using it as unsupported"

fails because my DOT server obviously does not have a valid TLS certificate for cloudflare-dns.com, but I do not see any more packet storms

That makes sense - the storms happen because of CoreDNS's handling at the TCP level - if the TCP handshake completes then you avoid that (even if SSL verification then fails). The funny thing is that your PR to move it back to UDP would have fixed those (even without redirection to another server).

have much faster DNS resolution

Yep, I've been quite struck by the difference to be honest. I've a few devices ewelink and Tuya devices, and I'd always assumed they were slow/laggy because distant API and didn't bother troubleshooting. Turns out the issue was very much closer to home.

I read your comments about that, I think it's a little bit too much. :)

Agreed, @alexdelprete - the main reason I've been experimenting with it is to hedge against ending up being marked as unsupported/unhealthy. Whilst I've no desire to seek support on something I've customised, I do want to still be able to update addons etc.

Zixim · 2021-11-09T20:04:08Z

been using the auto-patcher since it was created. It's the bees knees.

alexdelprete · 2021-11-10T02:24:53Z

been using the auto-patcher since it was created. It's the bees knees.

and you didn't tell us...:)

alexdelprete · 2021-11-10T02:47:50Z

Agreed, @alexdelprete - the main reason I've been experimenting with it is to hedge against ending up being marked as unsupported/unhealthy.

I perfectly understand, it's the plan B in case they ban your addon or enforce something that neutralizes your fix. You prepared the nuke. I don't think they will bother, they perfectly know there are people like us that in some way will find workarounds. :)

I'm still puzzled by the choice of using CoreDNS vs unbound or dnsmasq, that are much more reliable and widely used...because I can clearly see there are two issues here: a bad configuration + a bad dns resolver/forwarder.

One question about the addon, I'm just installing it: your addon checks every 60s to enforce correct configuration, but that doesn't imply a restart every 60s, right? Because that could raise some issues. Restart needed only if changes are applied, if I understood correctly. That is also if dns-override-template is used, right?

alexdelprete · 2021-11-10T02:54:45Z

I can confirm the addon is working perfectly. Great job. :)

bentasker · 2021-11-10T15:01:47Z

One question about the addon, I'm just installing it: your addon checks every 60s to enforce correct configuration, but that doesn't imply a restart every 60s, right? Because that could raise some issues. Restart needed only if changes are applied, if I understood correctly. That is also if dns-override-template is used, right?

Exactly correct - it'll only restart coredns if the config needed to be changed - if everything's the same it won't do the restart.

I'm still puzzled by the choice of using CoreDNS vs unbound or dnsmasq, that are much more reliable and widely used...because I can clearly see there are two issues here: a bad configuration + a bad dns resolver/forwarder.

CoreDNS is actually pretty good where it's commonly used - in Kubernetes clusters with decent connectivity - it's a poor choice (IMO) for embedding in an appliance. As you say, Unbound and dnsmasq are much better candidates.

alexdelprete · 2021-11-10T18:02:15Z

CoreDNS is actually pretty good where it's commonly used - in Kubernetes clusters with decent connectivity

unbound/dnsmasq work ok in any condition I've had the possibility to see...but I'm no expert of this new containerized world...old-school guy here. :)

thanks a lot for everything, I'm really grateful for your contribution to this problem.

can I contact you privately for an advice on an issue I'm having for another project? let me know what's your preferred way...thanks in advance.

bentasker · 2021-11-11T09:20:05Z

Yeah, Unbound/DNSMasq are pretty mature and resilient.

There's almost an inherent assumption in coreDNS that the person configuring it has reasonable control over the network - that contract's effectively broken in HomeAssistant because the config was dropped in without the network operator's knowledge/permission.

can I contact you privately for an advice on an issue I'm having for another project?

Sure, do you wanna DM me on the HA community forum and then I'll give you a better route of contact (saves me dropping details into a public GH issue)

alexdelprete · 2021-11-11T13:39:19Z

Sure Ben, I'll contact you on the forum first. Thanks a lot.

bentasker changed the title ~~HomeAssistant CoreDNS timeouts~~ HomeAssistant CoreDNS timeout settings Nov 5, 2021

bentasker mentioned this issue Nov 5, 2021

Override of version.home-assistant.io #1

Closed

bentasker self-assigned this Nov 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HomeAssistant CoreDNS timeout settings #3

HomeAssistant CoreDNS timeout settings #3

bentasker commented Nov 5, 2021

bentasker commented Nov 5, 2021 •

edited

Loading

alexdelprete commented Nov 9, 2021

bentasker commented Nov 9, 2021

alexdelprete commented Nov 9, 2021

fenichelar commented Nov 9, 2021

bentasker commented Nov 9, 2021

Zixim commented Nov 9, 2021

alexdelprete commented Nov 10, 2021

alexdelprete commented Nov 10, 2021 •

edited

Loading

alexdelprete commented Nov 10, 2021

bentasker commented Nov 10, 2021

alexdelprete commented Nov 10, 2021

bentasker commented Nov 11, 2021

alexdelprete commented Nov 11, 2021

HomeAssistant CoreDNS timeout settings #3

HomeAssistant CoreDNS timeout settings #3

Comments

bentasker commented Nov 5, 2021

bentasker commented Nov 5, 2021 • edited Loading

alexdelprete commented Nov 9, 2021

bentasker commented Nov 9, 2021

alexdelprete commented Nov 9, 2021

fenichelar commented Nov 9, 2021

bentasker commented Nov 9, 2021

Zixim commented Nov 9, 2021

alexdelprete commented Nov 10, 2021

alexdelprete commented Nov 10, 2021 • edited Loading

alexdelprete commented Nov 10, 2021

bentasker commented Nov 10, 2021

alexdelprete commented Nov 10, 2021

bentasker commented Nov 11, 2021

alexdelprete commented Nov 11, 2021

bentasker commented Nov 5, 2021 •

edited

Loading

alexdelprete commented Nov 10, 2021 •

edited

Loading