-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support transient /etc #3062
Support transient /etc #3062
Conversation
a0d9b81
to
1841d0b
Compare
Sorry I meant to comment earlier but IMO the best fix here is not to relabel |
@cgwalters Sure. That approach works too, although only with ostree commits built with a new rpm-ostree. |
@cgwalters And btw, we're not relabeing /etc on boot for this reason. We relabel /usr/etc during deploy, when we generate the composefs image. The on-boot stuff (the part in ostree-remount) is to fix up the labels for things that changed in upperdir during the initramfs. |
1841d0b
to
84632ca
Compare
Hmm; changing the labels just in the composefs flow is making it asymmetric with the legacy case though. I'm not strictly opposed to that but...it seems better to support setting up the labels in this way in both cases so we can test it more uniformly. |
@cgwalters Just so we're on the same page. Is this what you mean: With this PR, this is how transient-etc works:
And your proposal is:
That makes a lot of sense to me. |
In that case, we also don't need to change the composefs format, so lemme separate out the inline-small-files from this PR |
84632ca
to
47cb0a5
Compare
I'll have a look at the rpm-ostree side, then we can just drop the first two commits in this PR. I want to test the current code against the autosd images though, to ensure it works there. So far I've only tried against fcos. |
I started looking at rpm-ostree but I think it's actually mainly on this side, started to test out
|
47cb0a5
to
fb575ba
Compare
New version following this approach. To test it I just commented out the flag check, and it seems to work well. Hoever, the real fix is to make rpm-ostree pass this flag. Do we always pass it? |
src/switchroot/ostree-remount.c
Outdated
|
||
if (transient_etc) | ||
{ | ||
/* Systemd will create a /run/machine-id -> /etc/machine-id bind mount if /etc is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically ostree-remount.service
has just been a big source of race conditions.
In general the direction has been to ensure that the target /
is set up fully in the initramfs.
In particular the historical rationale for ostree-remount.service
was compatibility with the legacy behavior of remounting /
as writable only in the real root early boot.
If that's done in the initramfs, there is no reason to run ostree-remount.service
. And I think we should aim to do that uniformly - and actually disable ostree-remount.service
by default.
I'd need to dig in here into what point the /etc/machine-id
is being dealt with in systemd, but it really seems to me like the relevant code here should be run in the initramfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's done in the initramfs, there is no reason to run ostree-remount.service.
Ah wait, there is one load bearing thing left that it's doing, which is to reset the propagation mode of /sysroot
.
Blah...I wonder if we can work with systemd upstream to add some way to opt out of that for particular mounts.
Alternatively...I could imagine actually hiding the /sysroot
mount in a separate mount namespace entirely, held open by some process that we keep running from the initramfs.
That would be much stronger isolation, but it'd mean every tool that wants to operate on that state would need to nsenter -m -t $(pidof ostree-mountns-holder)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is impossible to do this relabeling in the initrd, becase at that point the selinux policy is not loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, for reasons described in #2972 (comment) it isn't even possible to manually fake the relabeling in the initrd. It has to be done after the policy is loaded.
selinux/ostree.te
Outdated
# To fix this we make the kernel context unconfined. It essentially is anyway, as | ||
# the kernel is the entity that validates the permissions anyway. | ||
|
||
unconfined_domain(kernel_t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think we need to debate this. I hope you'd agree with me that logically this change isn't really related to ostree.
I think if we're going to do this it should probably be just changed in the base SELinux policy package.
Isn't this effectively undoing some effort to confine kernel_t
actually? IIRC there was some...something for constraining usermode helpers that could be executed, etc.
I haven't dug in deep to this but could we do something like add an overlayfs option that disables LSM checks for it? And then when we switch to the real root and selinux policy is loaded, we remount it, setting the mounter credentials to unconfined_t
and re-enabling LSM checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, I could imagine building something like initramfs_t
into the kernel itself, and use it for things we run from the initramfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not a fan of this approach, or having it live here.
What kind of SELinux denials are you getting BTW? There was a related BZ to this a while back: https://bugzilla.redhat.com/show_bug.cgi?id=2154428.
At that time, it was the overlay created by ostree admin hotfix
causing issues (so on /usr
, not /etc
), but it lead to fedora-selinux/selinux-policy#1574 which greatly loosened what kernel_t
can do. I'm assuming you're running your tests with that patch there? If so, it might just need to tweaked a bit further.
Alternatively, I could imagine building something like
initramfs_t
into the kernel itself, and use it for things we run from the initramfs.
I like it. That's a similar (but better) idea to what I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2154428#c19 to be able to differentiate between pre and post-pivot.
Also /cc @WOnder93
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issues are that initramfs launches processes before the policy is loaded. When systemd loads the policy any process launched prior to systemd is going to be labeled kernel_t, if those processes do anything that is not currently defined, then they blow up. It becomes a big mess.
Since these processes exists before systemd load_policy, they can only be labeled kernel_t not initramfs_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to do something like
setexeccon(unconfined_t) before mounting
This is only possible after the policy is loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unconfined_t does not exists before policy is loaded, kernel_t also does not exists before policy is loaded.
When policy is loaded, the kernel is told that all procesess identifiers that existed before the policy load are labeled as kernel_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unconfined_domain(kernel_t) == ("kernel_t = unconfined_t - execute_no_trans")
No, it is not, because of files_unconfined_type
and filesystem_unconfined_type
.
Well not quite, but I don't think it is an issue.
allow unconfined_domain_type container_domain:process { dyntransition transition };
allow unconfined_domain_type container_runtime_t:process transition;
allow unconfined_domain_type qm_container_domain:process { dyntransition transition };
How is that relevant? Once you allow read + execute + execute_no_trans on all files, then SELinux won't prevent an attacker exploiting the modprobe path vulnerability from running a binary they control (most likely labeled user_tmp_t or user_home_t). When you eliminate execute_no_trans from kernel_t, then the upper bound of what file types it can execute is those that it has a type_transition defined for, which is easy to check and control. And it is best to avoid execute_no_trans even for selected file types, as you would get userspace processes running as root (always - systemd hardening features can't be used here) + kernel_t, which is risky even for binaries considered "safe" (system binaries). Plus the task being labeled kernel_t would be confusing as it's no longer running in kernel space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WOnder93 So, in practical terms, how would one go about creating rules for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that's somewhat tangentially related to this is I think usermode helper execution is a bit better constrained via other LSMs, like e.g. LoadPin etc...that's something we can more strongly push for in an image-based world.
|
Can we also get list of services started from initramfs which need to persist switchroot? |
This is not really a persistent process issue, it is more of what Colin said. If I am running in initramfs and I mount an overlay, from the kernel's point of view kernel_t mounted the file system. When a user tries to read/or write content in the overlay, the way the kernel works is it checks whether the "mounter" (kernel_t) is able to read/execute the file that the process in userspace is attempting to use. The AVC's are things like kernel_t trying to read /etc/foobar, and getting an AVC. We could write a policy that says the kernel_t can read/execute all file_types to fix this issue, but I think this ends up being a slippery path. Eventually we end up with an AVC saying the kernel is not allowed to do something, and user say WTH? |
I don't thing the solution to this is to change the kernel. to tell overlay to work differently from a security point of view. But to simply fix the SELinux policy to give broader permissions to kernel_t. We can work more closely on fixing the issues with init_t, if necessary. |
Yeah, I think my example from a comment above is a simple way to understand it. I log in as root, so I am However, then somewhere deep inside overlayfs it creates the backing directory in the upperdir (which is also
Now, I'm supposedly unconfined, so I should be able to do anything. However, for that to work, the overlayfs mount must also be unconfined. Anything more limited means that there are things unconfined_t cannot do in /etc. |
I don't think Is coming from the overlay, if it is, then I don't understand. The mounter as I understand it should only be requesting read/execute permissions on types, and the creation should be done by the unconfined_t process, since it should be created in the upper directory. |
See https://docs.kernel.org/filesystems/overlayfs.html#permission-model I saw at one point the maintainer of overlayfs once described it as a "just in time cp -a". Hence, the mounter credentials are captured and checked for each access in addition to the calling task. So what we're hitting is this:
|
@alexlarsson This AVC is sus, because it should be allowed since commit fedora-selinux/selinux-policy@f5438002d7... What was the selinux-policy RPM version on the system where you got this? In fact, I think that commit should fully suffice for the early boot overlayfs mounts to work. |
So, I looked into what it would mean to have kernel_t be a fully unconfined domain, and it seems way too much. Like its got the ability to talk to systemd and all sorts of stuff. Instead I tried to make the minimal policy that I got to boot the system, and let me create some files in /etc as unconfined_t root. This is what I came up with:
(Also in the latest commit here) Basically, this allows all the file operations that I see overlayfs doing while it is in a section between ovl_override_creds() and revert_creds(). I also had to add the excemptions, because the mount ends up running as system_u:system_r:kernel_t and /etc files may not have those roles or object ids. This seems a lot safer to me... |
@alexlarsson Please see my previous comment. This is all already in the policy since at least February... (except the |
@zpytela I'm pretty sure there are zero userspace processes which persist today; it's possible we'd add something in the future but I'd like to avoid it. (And if we do we can probably have it save its own state and re-exec after policy load or so?) The "thing that persists from the initramfs" here is the overlayfs credentials cached in the kernel. |
326e2eb
to
bc9d292
Compare
bc9d292
to
50de0ff
Compare
Unmarked this as draft, as the selinux part now works out of the box (although needs selinux policy master for now). It should however work as is with older selinux-policy that have a fully unconfined kernel_t including, i believe, rhel9. |
I don't undestand the "Adding composefs metadata: composefs is not supported in this ostree build" CI failures. These checks passed before, did we end up with an older ostree version in the CI? |
See #3067 |
FYI the selinux-policy rawhide build is now available at https://bodhi.fedoraproject.org/updates/FEDORA-2023-5be6d65113 |
50de0ff
to
186f210
Compare
After rebase it passed CI, @cgwalters can you review it now? |
I just merged #3063 - this will need to drop the other patch for it now. |
* are set before selinux policy is loaded. | ||
*/ | ||
static void | ||
relabel_dir_for_upper (const char *upper_path, const char *real_path, gboolean is_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this though if we land coreos/rpm-ostree#4640 and use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The problem is not the labels on the lowerdir, those are fixed by that commit. The problem is the labels in upperdir and in the overlayfs itself (i.e. in-memory overlayfs inodes cache the selinux label from the upper).
Basically, in the initramfs all the files created in the tmpfs upper will get their xattrs reset at selinux policy load. So, what we do is to look at upperdir (in tmpfs), and for any file that exists there (and there should be few, only for files created in initramfs), we relabel the corresponding overlayfs file via /etc. This will set the correct selinux label both on the in-memory overlayfs inode, and the backing upperdir file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: We still need that commit tho.
If the `prepare-root.conf` file contains: ``` [etc] transient=yes ``` Then during prepare-root, an overlayfs is mounted as /etc, with the upper dir being in /run. If composefs is used, the lower dir is `usr/etc` from the composefs image , or it is the deployed `$deploydir/usr/etc`. Note that for this to work with selinux, the commit must have been built with OSTREE_REPO_COMMIT_MODIFIER_FLAGS_USRETC_AS_ETC. Otherwise the lowerdir (/usr/etc) will have the wrong selinux contexts for the final location of the mount (/etc). We also set the transient-etc key in the ostree-booted file, pointing it to the directory that is used for the overlayfs. There are some additional work happening in ostree-remount, mostly related to selinux (as this needs to happen post selinux policy load): * Recent versions of selinux-poliy have issues with the overlayfs mount being kernel_t, and that is not allowed to manage files as needed. This is fixed in fedora-selinux/selinux-policy#1893 * Any /etc files created in the initramfs will not be labeled, because the selinux policy has not been loaded. In addition, the upper dir is on a tmpfs, and any manually set xattr-based selinux labels on those are reset during policy load. To work around this ostree-remount will relabel all files on /etc that have corresponding files in overlayfs upper dir. * During early boot, systemd mounts /run/machine-id on top of /etc/machine-id (as /etc is readonly). Later during boot, when etc is readwrite, systemd-machine-id-commit.service will remove the mount and update the real file under it with the right content. To ensure that this keeps working, we need to ensure that when we relabel /etc/machine-id we relabel the real (covered) file, not the temporary bind-mount. * ostree-remount no longer needs to remount /etc read-only in the transient-etc case. Signed-off-by: Alexander Larsson <[email protected]>
186f210
to
f617a34
Compare
Rebased on #3063 and retested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
I think we can add an installed test for this, but not gating.
@cgwalters The installed test will probably be easier once the selinux changes have propagated too. |
Which changes do you have in mind? Rawhide selinux-policy build is available since 3 days ago. |
@zpytela Yes, but it has to propagate to the fedora coreos images used in the CI here. |
I want to try spinning up testing against ELN across our repositories soon, which will fix that. |
Extend the abitlity to manage all files also to character & block device files and also allow relabeling any file. This is required for early boot overlay mounts to fully work, but may be needed for other legitimate oprations as well. See also: ostreedev/ostree#3062 Signed-off-by: Ondrej Mosnacek <[email protected]> Resolves: rhbz#2182033
Extend the abitlity to manage all files also to character & block device files and also allow relabeling any file. This is required for early boot overlay mounts to fully work, but may be needed for other legitimate oprations as well. See also: ostreedev/ostree#3062 Signed-off-by: Ondrej Mosnacek <[email protected]> Resolves: rhbz#2182033
Extend the abitlity to manage all files also to character & block device files and also allow relabeling any file. This is required for early boot overlay mounts to fully work, but may be needed for other legitimate oprations as well. See also: ostreedev/ostree#3062 Signed-off-by: Ondrej Mosnacek <[email protected]> Resolves: rhbz#2182033
Extend the abitlity to manage all files also to character & block device files and also allow relabeling any file. This is required for early boot overlay mounts to fully work, but may be needed for other legitimate oprations as well. See also: ostreedev/ostree#3062 Signed-off-by: Ondrej Mosnacek <[email protected]> Resolves: rhbz#2182033
This is a new version of #2972
There was a lot of complexities, so this is essentially a clean restart. So, rather than reusing that PR I created a new one.
This PR does multiple things:
etc.transient=true
option toprepare-root.conf
, which causes /etc to be and overlayfs mount with a tmpfs-based upper.The actual transient etc overlay mount details are quite complex due to selinux interactions between tmpfs, selinux policy loading, and overlayfs. See #2972 (comment) for some details. In the end, after a discussion with @rhatdan I ended up with a new ostree selinux module that will be needed for transient mode to work.