-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry DHCP option 57 default value when failure happens #767
Conversation
@starbops @mingshuoqiu I tested with dhcp option57 value 576, the installer works well. But to reduce risk, the code will only try value 576 when 1500 failed, give the dhcp client an additional try. |
As far as I know, option 57 specifies the maximum DHCP message size. The current default value insomniacslk/dhcp is using is 1500, which means the client expects the offer/ack message to be lower than 1500 octets. Do you know why a much larger upper limit would fail, but changing to a lower upper limit is successful? |
That first log message ("Network configuration is applied") comes from harvester-installer/pkg/console/install_panels.go Line 1395 in e767167
(Guessing) after leaving the network page, did you go back then forwards again? |
@starbops It is a good question. The second commit comes from following considration: What I can say is that it does no harm, if it fails on Windows DHCP server again, the root cause is further narrowed down. |
@tserong The tricky comes from setupNetwork which calls RestoreOriginalNetworkConfig
the Restore triggers an old & additional DHCP via wicked |
All I can see in here is writing various files under harvester-installer/pkg/config/cos.go Lines 405 to 436 in e767167
What is it that triggers wicked in this case? Note also that the duplicate DHCP requests in the screenshot above are about 7 minutes apart (18:36 then 18:43) - surely one call to setupNetworks didn't take that long? |
...and anyway (sorry, just thinking aloud) - the fact that the message "Network configuration is applied" appears twice must mean that |
sorry, the above screenshot is a bit misleading suppose you already have a Harvester node installed on a KVM VM, and start a new installation of a new ISO, and then run when you hit dhcp for mgmt network on the installation screen, the tcpdump will show you two times of dhcp request I did not dig all the details, but we can easily reproduce and observe it. |
Ah! That'll be this thing: harvester-installer/pkg/console/install_panels.go Line 1400 in e767167
It runs after Sorry for not realising that sooner! |
@tserong @starbops @mingshuoqiu I rebased the code to address merge conflicts, please take a new look, thanks. |
The dhcp client package bumping is split to PR #779 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed a couple of tiny little things (see comments below), and in general LGTM.
I'm wondering though, if msz=1500 might cause problems with some servers, why not just always set msz=576, rather than trying msz=1500 first? Then we'd only need to keep the implementation in getIPThroughDHCPWithOption57()
, and could drop the original version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion, do we want to upstream this enhancement to insomniacslk/dhcp? I saw you already created an issue there (insomniacslk/dhcp#542). If yes, perhaps we could first apply the enhancement to harvester/dhcp and file the PR to upstream at the same time. For harvester/harvester-installer, replace the dhcp module with ours. WDYT?
@w13915984028 sorry to chime in here, can you reproduce the issue? |
Signed-off-by: Jian Wang <[email protected]>
@bk201 Unfortunately, I don't have the exact windows server version to run and test that dhcp server, and did not find one from surrounding. @tserong The max value 1500 works in most cases, we only got this reported issue. Keep the original code and add a second try only when 1500 fails is the more reasonable solution for now. @starbops You are right, the better way is to get supported from upstream, the project looks to be a semi-active state now, I will try to add a PR to upstream. Thanks for all your kind reveiw. |
@starbops @tserong @mingshuoqiu Please take a new look, if any new concerns about this PR. When no, we can merge it and have more time for testing, thanks. |
@w13915984028 I suggest we should reproduce the issue first to verify the suspicion. |
@bk201 We can merge this PR, and if we have chance to get a Windows server & reproduce the issue, and know that the root cause is something else, we add new fix and revert this PR. WDYT? thanks. |
Add more backgrounds: (1) Unlike k8s/go, upstream can depricate features as long as he wants and the depricated features are removed quickly, user needs to bump and update daily. The internet protocols (e.g. IETF RFCs) are mostly keeping things, no one can force users to drop old devices or old systems. (2) The DHCP option 57 defines min value 576, but no upper limit, thus different DHCP servers (or even specific version) can have it's understanding. 1500 is generally accepted by most modern DHCP servers, but some legacy (versioned) servers may refuse it. Indeed, in a 1500 MTU L2 network, the DHCP option itself is hardly to reach 1500, but arguing this makes little sense. Why we add such a PR to retry 576 when 1500 fails? Back to #3428 and this PR, even if the root cause is not this PR, this PR still makes Harvester installer more robust. |
After reviewing the last few comments and thinking about this some more, I'm sorry, but I'm really not sure we should take this change:
I just checked, and evaluation versions of Windows Server ISOs can apparently be downloaded from https://www.microsoft.com/en-us/evalcenter. What about installing that in a virtual machine, configure it to be a DHCP server, then boot the harvester ISO on a VM in the same network and see what happens? |
@tserong Thanks for your investigation, I will try to set and reproduce. Hopefully, this issue is not related to only certain versions Windows DHCP servers. Meanwhile, it is not surprising that sometimes the defending codes are added, which are difficult to be tested on normal black box of view. |
Yeah, that would be really annoying! I just checked the original issue, which says they used "windows DHCP server 2016" (harvester/harvester#3428 (comment)) and I had a further look on the MS site - thankfully they do seem to have that version there: https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2016 |
This PR was replaced by #879 which solved the root cause. |
Problem:
Harvester installer reported DHCP failed when the DHCP server was running on a Windows server.
Solution:
The DHCP client used by Harvester, is using value 1500 for option 57. It is not wrong, but as RFC only defined the min value 576, did not define the max value limitation, 1500 may be refused by some DHCP servers.
Give DHCP client an addtional try with the generally used DHCP option 57 values.
Related Issue:
Partially related to harvester/harvester#3428
Test plan:
Install Harvester with dhcp mode IP, it should get IP successfully.
The DHCP client function on new cluster worked as expected