Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

B #6759: finalize the SG rules on sg rule error #6760

Closed

Conversation

atodorov-storpool
Copy link
Contributor

@atodorov-storpool atodorov-storpool commented Oct 22, 2024

instead of reseting the rules.

Description

Open Nebula will raise and log the rule error for further investigation, but the VM will not be left unprotected.

Branches to which this PR applies

  • master
  • one-6.10

  • Check this if this PR should not be squashed

instead of reseting the rules.
Open Nebula will raise and log the rule error for further investigation, but the VM will not be left unprotected.
@tinova tinova linked an issue Oct 22, 2024 that may be closed by this pull request
3 tasks
@rsmontero rsmontero added this to the Release 6.10.1 milestone Oct 22, 2024
@atodorov-storpool atodorov-storpool changed the title B #6759: finalise the SG rules on sg rule error B #6759: finalize the SG rules on sg rule error Oct 22, 2024
@rsmontero
Copy link
Member

Hi @atodorov-storpool

So the original code cleans up everything in case of error, the new patch just closes the processing but the chains can be inconsistent or partial.

So probably it is better to keep the original behavior?

@rsmontero
Copy link
Member

Also note that for a VM with multiple NICs only some will be processed en case of error, so the others will be left unprotected.

So yes probably having a deterministic state for the error condition is better

@atodorov-storpool
Copy link
Contributor Author

Hello @rsmontero ,
I wrote an answer yesterday but not submitted and got lost (obviously)

in short, the current behavior is, when the administrator trusts OpenNebula to have the VM protected with SG and there is an error introduced in the SG, the VM is left entirely unprotected. Such VMs are easily discovered via services/ports that the administrator "believes" are blocked.

So the patch tries to make OpenNebula better at protecting the VM when the SG is broken.

I missed the case when there are multiple nics, because the customer had single NIC VM hacked due to SG not applied. I'll try to find time to fix this case. probably stash the error and raise it after the loop so the driver will send a note to opennebula about the issues.

On SG error I believe it is better to set the VM "confined" instead of open to the internet. This way, enduser, monitoring, whatever will raise the issue of missing service that needs to be fixed which sounds much better than having a hacked VM due to SG error.

@rsmontero
Copy link
Member

rsmontero commented Nov 25, 2024

Problem is that the rules will be semi applied, which means any subsequent update (e.g. in SG ) maybe broken as well or could not be recovered.

So I still think that if the SG groups cannot be applied it is better to not apply some. The admin will not have any clue which port is protected or not. It may or may not be opened to the Internet, depending on the rule that failed. So the patch it self does not guarantee any protection, even worse it may lead to a false sense of security.

In case of error it should be communicated maybe with a failed deployed if you will or in Sunstone with a warning symbol... etc.

@atodorov-storpool
Copy link
Contributor Author

Problem is that the rules will be semi applied, which means any subsequent update (e.g. in SG ) maybe broken as well or could not be recovered.

Let's clarify how SG works so we are on the same page. When an SG change occurs, all iptables rules are dropped and rebuilt following the logic encoded in the SG rules. Also the SG definitions whitelist what should be permitted, assuming everything else is considered blocked—that is the last "system" rule applied at the end of the SG. So, there is no unrecoverable state as everything is flushed...

So I still think that if the SG groups cannot be applied it is better to not apply some. The admin will not have any clue which port is protected or not. It may or may not be opened to the Internet, depending on the rule that failed.

Well isn't the current situation worse - it is guaranteed that in case of an SG error, the VM is left with at least one NIC entirely open to the internet?

So the patch it self does not guarantee any protection, even worse it may lead to a false sense of security.

I agree that it is clear that the patch does not consider the multiple NICs case and needs improvement to address this case. But that does not change the spirit of the patch which is my understanding of 'the false sense of security' - Leaving the VM open and vulnerable to the internet (what the ONE currently do!) instead of "locked down" in case of an error in the firewall.

@rsmontero
Copy link
Member

Yes, let's look into the SG. The drivers generate a set of iptables/ipset commands to implement the defined SG rules. Those commands are then executed, if one of them failed exception is raised.

So, if we ignore this and leave the system with some of the rules applied it is clearly a worst situation that removing all the rules and role back:

  1. You don't know which rules where applied so the this patch simply does not prevent leaving a VM "vulnerable to the internet", it may or may not depending on the failure.
  2. Any manual recovery process is harder as the admin news to manually rollback the system and fix whatever

So IMHO idempotence here is more important, and hence this patch is in the wrong direction

We need to focus on the actual source of the problem, the error. Right now How is communicated to the admin, Can we improve that? Leaving the hypervisor in an inconsistent/non-deterministic state is simply not the solution for this.

try SG on all nick before raising an error
@atodorov-storpool
Copy link
Contributor Author

Lets try with examples:

Initial state, the VM has 2x NICs and the following SG
Screenshot from 2024-11-26 13-06-51
and the following iptables:
Screenshot from 2024-11-26 13-07-11
An operator adds port range 80..443 (but the wrong way...)
Screenshot from 2024-11-26 13-08-34
Screenshot from 2024-11-26 13-08-40
After hitting the green Update button
Screenshot from 2024-11-26 13-09-02

The VM have the following ipptables rules as a "Security Groups applied"...
That is a false sense of security for me
Screenshot from 2024-11-26 13-09-17
Here the support have a call from an angry client because it's VM got hacked...
The administrator goes to the VM list and see that indeed there is an error?!? (what a surprise!)
Screenshot from 2024-11-26 13-10-10
Screenshot from 2024-11-26 13-10-36
##########################################
Here is the same scenario but with the patch v2 applied (just added to this PR)
##########################################

Here the operator has a call from the customer because a service is not working. (which is slightly different than a compromised VM...)
Screenshot from 2024-11-26 13-11-59
Screenshot from 2024-11-26 13-12-09
The error is clearly visible in the VM log (nf_tables): invalid portrange specified ... Command Error: sudo -n iptables -w 3 -W 20000 -A one-1-0-i -p tcp -m multiport --dports 443:80 -j RETURN
Screenshot from 2024-11-26 13-12-18
And here are the iptables rules on the host, as one can see the SG are not leaving the VM unprotected.
Screenshot from 2024-11-26 13-12-38

There is no impact on the ipset rules, no "unfixable" state, and no probably secured or probably not state... Just a VM with all vulnerable holes(ports) closed.

As could be seen there error message is there, the administrator could take actions. Are the SGs not completely applied - Yes, no argue here. BUT the VM is not left running completely open to the internet.
(As I like to say it is a devil state, yes, but with the small devil/trouble at the head...)

@rsmontero
Copy link
Member

rsmontero commented Nov 26, 2024

So

  1. The VM has a clear warning sign
  2. If the VM get this error on poweroff the VM is not allowed to boot and leaves the host in a clean state, with this patch it is not

Sorry but for me this is the expected behavior

EDIT: extended point 2, and why this patch is not valid

@rsmontero rsmontero closed this Nov 26, 2024
@rsmontero
Copy link
Member

Just as a side note, all your scenario works because the wrong rule was the last one, if it was the first you will end up with an empty iptables set.

The PR needs to be reformulated to validate the rules.

(sorry hit the submit button before adding this)

@rsmontero
Copy link
Member

Updated #6759 with the right approach to this.

TL;DR:

The "bug" is actually in SecurityGroup::is_valid() . This needs to be extended when first port in range is greater than last, the range needs to be properly sorted.

@atodorov-storpool
Copy link
Contributor Author

Sigh...
Fine, just a three final notes on my side, though

First,

Just as a side note, all your scenario works because the wrong rule was the last one, if it was the first you will end up with an empty iptables set.

This assumption is not correct :) There will be -А one-1-0-i -j DROP iptables rule,effectively locking up the VM instead of leaving it open and unprotected to the internet...

Second,

  1. The VM has a clear warning sign

True, if after editing the SG rules someone goes to the VMs list and looks for the warnings...

and Third,

This needs to be extended when first port in range is greater than last, the range needs to be properly sorted.

This is not the only case of erroneous configuration...

p.s.
Why someone do not just spin up a testing VM and throw some configurations or read the iptables man page?
(joker: there are no ports accepted when defining an ICMP rule, for example...)

@rsmontero
Copy link
Member

So sorry, but you have to install like tens of VM any can fail, so yes ignoring the error simply DOES NOT guarantee any security

As you see, we read iptables documentation and we do A LOT of validations in code, to inform the user. We are missing the multiport order.

So the conclusion is the same, this PR DOES NOT solve any security issue and the proper solution is to validate the rules beforehand.

@atodorov-storpool
Copy link
Contributor Author

Hi @rsmontero,
Re-reading my last comment now I realise that it sounds too harsh. My apologies! (Noted to not write comments at the end of the workday...)

I agree that this patch is not the complete solution for the issue. What I believe is, that it is an additional layer, a safeguard, against the case for the running VMs. It is an improvement of the current OpenNebula behaviour. And not affecting the normal and expected OpenNebula operation.

So sorry, but you have to install like tens of VM any can fail, so yes ignoring the error simply DOES NOT guarantee any security

Here is the initial PR post: "Open Nebula will raise and log the rule error for further investigation, but the VM will not be left unprotected."

I've never said or written that any error should be ignored. If such an assumption exists, it is probably due to translation, as I am not a native English speaker. The patch does not remove the rise error part of the logic. The error is raised so as to be "heared" by the opennebula service. It is not about replacing the input validation. It does not silence the error. Definitely, every error should be investigated in and timely fixed.

The patch is about the impact in the case of error. It mitigates the effects if there is an erroneous case slipped through the rules validation.

A VM will not start if there is an error. We are on common ground here. The user is protected from starting a vulnerable VM.

The patch addresses the already running VMs, when the SG rule changed. In other words, we are discussing the case with the running VMs that are hit with the broken SG rule... about the time frame from the moment the error is introduced, until the issue is recognized and fixed by the operator. It could be a matter of minutes up to days, depending on various factors. I hope we can agree that in that time window, the running VM are currently left unprotected with no iptables applied... So the patch is doing the alternative to "powering off" the affected VMs - it is issuing the finalizing part of the SG rule algorithm and closing the gate to the open internet. (iptables -A <chain> -j DROP).

Regarding the iptables docs... please put my shoes on and look what I see - you can set an ICMP rule with a port(range) for example and it will pass the validation leading to an error raised by the iptables... So it is not the port range only... I realize that it is a big, enormous work to keep in sync a validation of something outside your control, like any change in the iptables tools. I am trying to point out that there are more cases that should be covered by the validation. Also, I understand that such a validation could have bugs, that should be fixed.

With Best Regards and no bad feelings,
Anton Todorov

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

VM is left unprotected when there is an error in the Security Group rules
4 participants