Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HomeAssistant CoreDNS timeout settings #3

Open
bentasker opened this issue Nov 5, 2021 · 14 comments
Open

HomeAssistant CoreDNS timeout settings #3

bentasker opened this issue Nov 5, 2021 · 14 comments
Assignees

Comments

@bentasker
Copy link
Owner

As detailed here some users might still experience delays/lag if their external connectivity drops.

It'll depend on what the upstream timeouts on their local DNS is - obviously if they're using a router, then it may not be configurable (and the router might be a little too dumb to know that it's upstream link is down).

Back within the sphere of HomeAssistant, I was wondering if we can tweak things there so mitigate for users affected by that.

Unfortunately, CoreDNS uses a DialTimeout of 30s (here).

Actually, worse than that, it dynamically adjusts, which probably goes a long way to explain the packet storms that HomeAssistant users have reported when Cloudflare was unreachable.

Unfortunately, timeout isn't a configurable - we'd need to make changes in the source and then recompile coredns with those. Far from ideal.

@bentasker
Copy link
Owner Author

bentasker commented Nov 5, 2021

The idea of having to rebuild coredns to change this is insane.

However, it is doable, and led me onto a much wider thought.

Actually, if push comes to shove, we've a fair bit of flexibility in terms of what's running on the system despite not having access to the underlying OS (that's achievable too, I've already popped the host from an add-on container, it's just not something I'm gonna productise).

We have access to the docker daemon, which means we can build and tag images to our heart's content. So rather than trying to patch things via exec we can actually go one further:

  • Include a dockerfile that reflects the official container (for example)
  • Make the necessary changes in that file
  • call docker build against it
  • Tag the built image with the official tag based on what's running (so for example ghcr.io/home-assistant/armv7-hassio-dns:2021.06.0)

Then, when supervisor restarts the add-on, we'll be running our modified version. Obviously, after any update the process would need to be re-run.

It also means that an alternative fix for #1 would be to build an entirely new supervisor container with auto-update patched out and calls to CodeNotary in https://github.com/home-assistant/supervisor/blob/main/supervisor/utils/codenotary.py patched out.

That's very much a nuclear option though, we could just as easily push across a rewrite to that file so that exceptions CodeNotaryError, CodeNotaryUntrusted, PwnedError can never be thrown.

Having dug into it, it really is the most pointless "protection" - the code is checking it's own homework, so anyone with the ability to patch the code being protected also has the ability to patch out the checks. It's effectively just security theatre that impacts users most.

@bentasker bentasker changed the title HomeAssistant CoreDNS timeouts HomeAssistant CoreDNS timeout settings Nov 5, 2021
@bentasker bentasker self-assigned this Nov 9, 2021
@alexdelprete
Copy link

Gentlemen (@fenichelar - @Zixim): probably there's a solution to our neverending battle for a proper HA DNS implementation: https://github.com/bentasker/HomeAssistantAddons/tree/master/core-dns-override

@bentasker you have no idea how long we have been waiting for an auto-patcher, we've been patching that configuration manually every time. They didn't even approve a slight modification like this one, let alone all the ones (#56 to #61) that @fenichelar tried to push. It's incredible they're not even able to acknowledge they have a huge DNS problem.

I'm going to install your addon immediately and I hope it solves the problem. Thank you so much. :)

@bentasker
Copy link
Owner Author

@alexdelprete Glad I could help.

Let me know if you encounter any issues with it - I've tried to be relatively conservative in terms of what it does, but if that causes an issue then there's another more heavy-handed approach I can take (which does come with the benefit that it won't revert when the container is restarted).

It's incredible they're not even able to acknowledge they have a huge DNS problem.

I've done what I can to feed a detailed analysis of the issue and underlying cause back upstream (https://github.com/home-assistant/plugin-dns/issues/ 64) but based on how lightly other PRs/Issues have been discarded, I'm not going to hold my breath. The idea that they've ignored an issue of this scale for this long is more than a bit bewildering to be honest.

I also have supervisor auto-updates on my radar. #1 implements an initial blocker to them, but it means that the Supervisor "Add on store" doesn't display after a reboot (it blocks trying to pull information by the looks of it). There are a couple of ways I can go about fixing that, I've not quite settled on which I prefer yet.

@alexdelprete
Copy link

there's another more heavy-handed approach I can take (which does come with the benefit that it won't revert when the container is restarted).

I read your comments about that, I think it's a little bit too much. :)

I hope this smarter approach works fine, will let you know...thanks a lot.

I'm not going to hold my breath. The idea that they've ignored an issue of this scale for this long is more than a bit bewildering to be honest.

I doubt they will do anything. In our last attempt, I even engaged the big boss, and he said they were not even willing to discuss it anymore. Case closed. I won't even bother to comment anymore, they wanted to suspend me because I was simply showing them how naive their approach was vs the problem: a big elephant in the room, and they don't see it.

@fenichelar
Copy link

I started work on a PR to create a config option that allowed completely disabling Cloudflare DOT, however, it became pretty clear that I was not going to be able to get this merged so I gave up.

FYI, I was able to solve all of my problems by NATing 1.1.1.1 and 1.0.0.1 to my private DNS resolver. The NAT is setup for both TCP/UDP port 53 and TCP port 853. TLS certificate verification fails because my DOT server obviously does not have a valid TLS certificate for cloudflare-dns.com, but I do not see any more packet storms, have much faster DNS resolution, and do not see any issues with internal domains. I have been running this setup for at least a month without issue.

@bentasker
Copy link
Owner Author

I started work on a PR to create a config option that allowed completely disabling Cloudflare DOT

I think it was one of your PRs I first saw actually - lots of work had gone into it, but then the devs came back with a bunch of requirements no-one sane could accept: "do all this work, and ensure we mark anyone using it as unsupported"

fails because my DOT server obviously does not have a valid TLS certificate for cloudflare-dns.com, but I do not see any more packet storms

That makes sense - the storms happen because of CoreDNS's handling at the TCP level - if the TCP handshake completes then you avoid that (even if SSL verification then fails). The funny thing is that your PR to move it back to UDP would have fixed those (even without redirection to another server).

have much faster DNS resolution

Yep, I've been quite struck by the difference to be honest. I've a few devices ewelink and Tuya devices, and I'd always assumed they were slow/laggy because distant API and didn't bother troubleshooting. Turns out the issue was very much closer to home.

I read your comments about that, I think it's a little bit too much. :)

Agreed, @alexdelprete - the main reason I've been experimenting with it is to hedge against ending up being marked as unsupported/unhealthy. Whilst I've no desire to seek support on something I've customised, I do want to still be able to update addons etc.

@Zixim
Copy link

Zixim commented Nov 9, 2021

been using the auto-patcher since it was created. It's the bees knees.

@alexdelprete
Copy link

been using the auto-patcher since it was created. It's the bees knees.

and you didn't tell us...:)

@alexdelprete
Copy link

alexdelprete commented Nov 10, 2021

Agreed, @alexdelprete - the main reason I've been experimenting with it is to hedge against ending up being marked as unsupported/unhealthy.

I perfectly understand, it's the plan B in case they ban your addon or enforce something that neutralizes your fix. You prepared the nuke. I don't think they will bother, they perfectly know there are people like us that in some way will find workarounds. :)

I'm still puzzled by the choice of using CoreDNS vs unbound or dnsmasq, that are much more reliable and widely used...because I can clearly see there are two issues here: a bad configuration + a bad dns resolver/forwarder.

One question about the addon, I'm just installing it: your addon checks every 60s to enforce correct configuration, but that doesn't imply a restart every 60s, right? Because that could raise some issues. Restart needed only if changes are applied, if I understood correctly. That is also if dns-override-template is used, right?

@alexdelprete
Copy link

I can confirm the addon is working perfectly. Great job. :)

@bentasker
Copy link
Owner Author

One question about the addon, I'm just installing it: your addon checks every 60s to enforce correct configuration, but that doesn't imply a restart every 60s, right? Because that could raise some issues. Restart needed only if changes are applied, if I understood correctly. That is also if dns-override-template is used, right?

Exactly correct - it'll only restart coredns if the config needed to be changed - if everything's the same it won't do the restart.

I'm still puzzled by the choice of using CoreDNS vs unbound or dnsmasq, that are much more reliable and widely used...because I can clearly see there are two issues here: a bad configuration + a bad dns resolver/forwarder.

CoreDNS is actually pretty good where it's commonly used - in Kubernetes clusters with decent connectivity - it's a poor choice (IMO) for embedding in an appliance. As you say, Unbound and dnsmasq are much better candidates.

@alexdelprete
Copy link

CoreDNS is actually pretty good where it's commonly used - in Kubernetes clusters with decent connectivity

unbound/dnsmasq work ok in any condition I've had the possibility to see...but I'm no expert of this new containerized world...old-school guy here. :)

thanks a lot for everything, I'm really grateful for your contribution to this problem.

can I contact you privately for an advice on an issue I'm having for another project? let me know what's your preferred way...thanks in advance.

@bentasker
Copy link
Owner Author

Yeah, Unbound/DNSMasq are pretty mature and resilient.

There's almost an inherent assumption in coreDNS that the person configuring it has reasonable control over the network - that contract's effectively broken in HomeAssistant because the config was dropped in without the network operator's knowledge/permission.

can I contact you privately for an advice on an issue I'm having for another project?

Sure, do you wanna DM me on the HA community forum and then I'll give you a better route of contact (saves me dropping details into a public GH issue)

@alexdelprete
Copy link

Sure Ben, I'll contact you on the forum first. Thanks a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants