Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle syscalls via allowlist instead of denylist #4462

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

smcv
Copy link
Collaborator

@smcv smcv commented Oct 8, 2021

This is hopefully a better solution for CVE-2021-41133.

  • run: Rename variables to allow for addition of an allowlist

    • seccomp -> blocklist_ctx
    • seccomp_tmpf -> blocklist_tmpf
  • run: Add an allowlist of known system calls

    This should avoid future vulnerabilities similar to CVE-2021-41133.

    If a new VFS-manipulating system call is added to the kernel and
    libseccomp but not added to our list of known system calls, then it won't
    enter our allowlist, so it will fail with ENOSYS in the sandbox.

    If a new system call is added to our list of known system calls, but is
    not supported by libseccomp, then adding it to both our blocklist (if
    applicable) and our allowlist will fail with -EFAULT. Our blocklist will
    be unable to make it fail with EPERM or check its arguments, but our
    allowlist will make it fail with ENOSYS, so we're still protected.

  • run: Solve CVE-2021-41133 via allowlist, not blocklist

    This means we don't have to know the syscall numbers of recently-added
    system calls, which means we can drop common/flatpak-syscalls-private.h.

  • Update list of known syscalls to Linux v6.11.0-rc1

    • Add cachestat()
    • Add futex_*() (notably used by Wine/Proton)
    • Add fchmodat2()
    • Add listmount(), statmount()
    • Add llseek()
    • Add lsm_get_self_attr(), lsm_list_modules(), lsm_set_self_attr()
    • Add map_shadow_stack()
    • Add mseal()
    • Add process_mrelease()
    • Add riscv_hwprobe() (CPUID-like)
    • Add set_mempolicy_home_node()
    • Add uretprobe()

    None of these seem like something that needs to be blocked, so don't
    add them to the denylist.

    I've done this as a separate commit rather than squashing it into the
    commit that added common/known-syscalls.txt, to illustrate what we will
    need to do periodically.

@smcv smcv marked this pull request as draft October 8, 2021 18:13
@smcv smcv requested a review from alexlarsson October 8, 2021 18:13
@smcv
Copy link
Collaborator Author

smcv commented Oct 8, 2021

I still need to verify that this successfully blocks the syscalls we want to block.

@smcv
Copy link
Collaborator Author

smcv commented Oct 8, 2021

Ugh, this doesn't actually work. bwrap only accepts one --seccomp argument, and ignores subsequent arguments - so we can only turn these into an allowlist if we only have an allowlist, but that means we have to rework all our special-cased rules like the ones for clone() and ioctl().

@smcv
Copy link
Collaborator Author

smcv commented Oct 8, 2021

I'm sorry, I don't see a way to make this work without being able to add more than one seccomp program to bwrap. I'm way outside my understanding of libseccomp at this point, so I'm not sure what the semantics are of adding more than one non-mutually-exclusive rule for the same syscall number (first match wins? last match wins? most-permissive wins? least-permissive wins?) - but the semantics are not documented, so I'd be reluctant to rely on them as a security boundary in any case.

The key problem is that libseccomp doesn't have a MASKED_NEQ filter mode (seccomp/libseccomp#310), so we cannot turn the ioctl deny rule (used with default-allow) into an allow rule (to be used with default-deny).

If we are willing to change the semantics of the ioctl deny rule to only allow the "canonicalized" ioctl numbers, then we could allow every ioctl number from 0 up to some suitably large number, except for the single ioctl we want to block - but that would be a huge number of rules, hence a huge BPF program?

@mcatanzaro
Copy link
Collaborator

mcatanzaro commented Oct 9, 2021

Hmmmm, this is tricky and risky. Will definitely break stuff in the future. We tried this in WebKit once upon a time, and it worked nicely until some pesky kernel developers decided they should be allowed to create new syscalls. (Note for clarity: this is a joke. ;) Thus memfd_create was born. One day, libxshmfence decided it was going to start using memfd_create, and suddenly WebKit couldn't render anything anymore, because this syscall was not on its allowlist. It never hit users because that version of the sandbox was experimental (it failed at the proof of concept stage), but I noticed, and I've been scared of seccomp ever since. Firefox and Chrome have similar problems from time to time, not necessarily due to new syscalls (although I don't doubt those may sometimes cause problems), but because their seccomp filters are especially aggressive and fragile.

So the more secure we make the seccomp filters, the more fragile they become, the more unexpected breakage users will experience from time to time. Of course, you already know most of this, but it's still good to write it down here so others can see the context behind why this is a tough design decision to make. Personally, I think existing syscall denylist is a better sweet spot between security and robustness, but I understand you've come to the conclusion that the allowlist approach makes sense after thinking hard about it, and I acknowledge the allowlist is the only way to prevent new kernel versions from introducing new sandbox escapes without flatpak knowing about it. :/ So perhaps it's the right call, even if we occasionally come to regret it when it breaks software.

I'm sorry, I don't see a way to make this work without being able to add more than one seccomp program to bwrap. I'm way outside my understanding of libseccomp at this point

Me too. This stuff is hard. Sorry for being no help....

@smcv
Copy link
Collaborator Author

smcv commented Oct 9, 2021

OK, this part I think I do understand, if only because I've been researching it as fast as I could...

One day, libxshmfence decided it was going to start using memfd_create, and suddenly WebKit couldn't render anything anymore, because this syscall was not on its allowlist.

A seccomp program can do one of various actions when it sees a syscall it doesn't like. The ones that work without special support from user-space are that it can use SCMP_ACT_ERRNO(x) to make the syscall fail with errno set to x, or use SCMP_ACT_KILL, SCMP_ACT_KILL_PROCESS or SCMP_ACT_TRAP for various variations on killing the thread or process that invoked the syscall.

The idea of making "recent" syscalls fail with SCMP_ACT_ERRNO(ENOSYS) is that we want those syscalls to be exactly equivalent to syscalls your kernel doesn't implement yet, and the result of calling a syscall your kernel doesn't implement is that it fails with ENOSYS. Any software that doesn't hard-require a kernel strictly newer than your libseccomp and Flatpak should be fine with that. Applications realistically can't hard-require such new kernels for quite a long time anyway, because some people keep using "oldstable" LTS distributions like Debian 10, Ubuntu 18.04 and RHEL 8, even after Debian 11, Ubuntu 20.04 and RHEL 9 are available; and because some people will need to use old kernels to work around hardware-specific issues in newer kernels, or transitionally during upgrading, or whatever.

(For example libxshmfence falls back from memfd_create to open with O_TEMPFILE, or to mkstemp if that doesn't work either.)

This is more compatible than using SCMP_ACT_ERRNO(EPERM) for "recent" syscalls, which Docker has historically done (and has had a lot of trouble with, when glibc in the container is upgraded and starts using the new syscall preferentially).

It's also a lot more compatible than using SCMP_ACT_KILL, SCMP_ACT_KILL_PROCESS or SCMP_ACT_TRAP, which I suspect might be what the browsers you're thinking of were doing for sandboxed rendering - but that only really works for seccomp's original use-case, "secure computation", where your code is totally under your control down to the syscall level (no third-party shared library dependencies, not even glibc calls beyond trivial syscall wrappers), and is 99% computation (i.e. pure number crunching with no syscalls) and 1% reporting back the results via write() or similar.

I think Tracker uses SCMP_ACT_TRAP, and that's already really pushing it - I wouldn't want to risk that myself, if I maintained Tracker (which, thankfully, I don't).

The down side of using SCMP_ACT_ERRNO(ENOSYS) or SCMP_ACT_ERRNO(EPERM) is that it's a lot easier for an attacker (a malicious or compromised app) to probe the boundaries of the container if they can try as many syscalls as they want to, rather than being shot down in flames as soon as they try one thing that the policy forbids.

@mcatanzaro
Copy link
Collaborator

A seccomp program can do one of various actions when it sees a syscall it doesn't like. The ones that work without special support from user-space are that it can use SCMP_ACT_ERRNO(x) to make the syscall fail with errno set to x, or use SCMP_ACT_KILL, SCMP_ACT_KILL_PROCESS or SCMP_ACT_TRAP for various variations on killing the thread or process that invoked the syscall.

Hm, that's a pretty good point. I think WebKit was actually using SCMP_ACT_TRAP to pause the web process, then it let a broker process decide what to do. Maybe it was executing the kill. I don't remember for sure. But because your filters return ENOSYS, libxshmfence would clearly just fall back to its original code, and everything should be fine. So that libxshmfence example actually serves to illustrate that maybe this is OK after all.

I suppose you've convinced me. (Well, we still have to figure out if it's possible to solve the problems you've discovered....)

@smcv
Copy link
Collaborator Author

smcv commented Oct 9, 2021

The three most promising routes forward that I can see all require coordination with other packages:

  • Add --disable-userns switch containers/bubblewrap#452, and then depend on that version of bubblewrap, and use it in Flatpak to disable userns creation in a way that doesn't start failing when new kernels invent new ways to create a userns
  • Give bubblewrap an --add-seccomp-fd option that lets it stack up more than one seccomp program (unlike --seccomp, which is not useful to use more than once), and then use two separate seccomp programs, as I tried to do in this MR: one allowlist (default-deny via ENOSYS) to allow only the known syscalls, and one denylist (default-allow) to disable the TIOCSTI ioctl
  • Long term only: RFE: Inverse of MASKED_EQ (MASKED_NEQ?) seccomp/libseccomp#310, and then depend on that version of libseccomp, and use it to turn Flatpak's seccomp program into an allowlist (default-deny via ENOSYS) by inverting the sense of all the rules

@fweimer
Copy link

fweimer commented Oct 10, 2021

Hm, that's a pretty good point. I think WebKit was actually using SCMP_ACT_TRAP to pause the web process, then it let a broker process decide what to do. Maybe it was executing the kill.

I think this only works if the process does not make the system call while signals are blocked, otherwise the process will be terminated by SIGSYS. In certain contexts, it is necessary to disable signals for correctness, and use of clone3 to create new threads in glibc is such a case. So trap-based emulation won't work there. (We originally saw this with the openat system call during dlopen.)

@alexlarsson
Copy link
Member

I personally think that longterm the --disable-userns switch is better, because it hits the problem at the source. It is conceivable that there are other ways than syscalls (like files in /proc or whatever) to cause user namespace stuff, and those would also be neutered by that route. That said, it makes a lot of sense to also support multiple seccomp fds to bubblewrap, as that seems like a simple and generally useful thing to support.

@smcv
Copy link
Collaborator Author

smcv commented Oct 11, 2021

Yes, in an ideal world we'd use all three of the solutions I mentioned.

@alexlarsson
Copy link
Member

I wonder if we should also pre-compute the selinux rules when doing the release and ship them with the tarball. That way we could avoid problems with the libseccomp version on the host. In fact, we could avoid a libseccomp runtime dependency at all.

@smcv
Copy link
Collaborator Author

smcv commented Oct 11, 2021

I wonder if we should also pre-compute the selinux rules

seccomp rules, do you mean?

I don't think that works: they're architecture-specific. We'd have to precompute a rules blob for every architecture supported by libseccomp, and for every possible Flatpak configuration (e.g. --devel or not).

Pre-computing the seccomp rules is also bad from the point of view of building from the actual preferred form for modification, which is both legally desirable (because LGPL) and practically desirable (because distros can't apply security fixes or other important fixes if they can't build from source).

@alexlarsson
Copy link
Member

Yeah, its not ideal. We could compute them at build-time, which would get rid of the runtime dependency, but we'd still have the current example for older distros as it would build against an old libseccomp. Maybe we could vendor a recent version...

@mcatanzaro
Copy link
Collaborator

Why are you trying to avoid depending on libseccomp?

@smcv
Copy link
Collaborator Author

smcv commented Oct 11, 2021

Why are you trying to avoid depending on libseccomp?

Depending on libseccomp is fine, the problem is being able to depend on modern libseccomp. For example, Debian 10 has libseccomp 2.3.3 which doesn't know about any syscalls newer than v4.15-rc7, but the Debian 10 kernel is 4.19, and people might reasonably run Debian 10 on a backported 5.10.x kernel from Debian 11 for better hardware support.

Ubuntu has backported libseccomp 2.5.1 all the way back to their 16.04 branch (released in 2016) so this is clearly something that distros can do, and when the Debian security team get back to me about this issue, that's one option that I'm going to raise.

@Erick555
Copy link
Contributor

Not depending on host libseccomp should allow to get features like Inverse of MASKED_EQ right away instead of waiting few years after they land in upstream. Is it right?

@smcv
Copy link
Collaborator Author

smcv commented Oct 12, 2021

Maybe we could vendor a recent version [of libseccomp]

I think that's a non-starter. If distributions don't want to backport libseccomp itself, then they are absolutely not going to want to vendor an entire newer copy of libseccomp into flatpak; and as an upstream and downstream maintainer, I'm not willing to take responsibility for the security and correct functionality of the whole libseccomp codebase. It's quite stressful enough having to backport security fixes in code I do understand.

Not depending on host libseccomp should allow to get features like Inverse of MASKED_EQ right away instead of waiting few years after they land in upstream. Is it right?

If it was feasible to avoid depending on the host libseccomp, then yes, that would be the benefit; but for the reasons discussed above, I don't think this is sufficiently feasible that it's a useful direction to be pursuing.

common/flatpak-run.c Outdated Show resolved Hide resolved
@smcv
Copy link
Collaborator Author

smcv commented Feb 8, 2022

Rebased, with a submodule update to pick up containers/bubblewrap#459.

This is not suitable to be merged as-is, because for users of a system copy of bubblewrap, it requires a version that doesn't exist yet: we should have a bubblewrap release before proceeding. However, I wanted to get this to a testable state before releasing bubblewrap 0.5.1 (or 0.6.0 or whatever the next one should be).

I still need to verify that this successfully blocks the syscalls we want to block.

The test from #4505 still works on a modern OS (Debian unstable). Not tested on ye olde OSs yet, but our CI should help with this.

I personally think that longterm the --disable-userns switch is better

I think ideally we should do both, and if someone with more kernel knowledge reviews containers/bubblewrap#452 then I'm happy to have that too; but I don't feel that I'm qualified to review that one myself, sorry.

@smcv smcv force-pushed the syscall-allowlist branch 2 times, most recently from 30452df to de2b2c7 Compare February 19, 2022 16:39
@smcv
Copy link
Collaborator Author

smcv commented Feb 19, 2022

Not tested on ye olde OSs yet

I've now successfully tested this on Ubuntu 18.04 (with an unrelated test failure in tests/test-history.sh that I haven't investigated yet), Debian 10 (with #4764 to fix a recent regression), Ubuntu 20.04, Debian 11 and Debian unstable.

I'm leaving this marked as a draft because we should get bubblewrap 0.6.0 released first (for which I'd like to include containers/bubblewrap#475 if possible), but other than that I think this is ready for review.

smcv added a commit to smcv/flatpak that referenced this pull request Feb 27, 2022
* Add `--add-seccomp` (prerequisite for flatpak#4462)
* Add a warning when repeated options are ignored
* Add a Meson build system
* Invoke bash via `PATH`
* Exit early when `argc == 0`

Signed-off-by: Simon McVittie <[email protected]>
mwleeds pushed a commit that referenced this pull request Feb 27, 2022
* Add `--add-seccomp` (prerequisite for #4462)
* Add a warning when repeated options are ignored
* Add a Meson build system
* Invoke bash via `PATH`
* Exit early when `argc == 0`

Signed-off-by: Simon McVittie <[email protected]>
@smcv smcv added ready-for-review sandbox issue related to sandbox setup labels Mar 2, 2022
@smcv
Copy link
Collaborator Author

smcv commented Mar 8, 2022

Now that we have bubblewrap 0.6.1, I'd like to consider this for Flatpak 1.13.x.

The benefit of this is that if a new kernel with new and exciting syscalls triggers another vulnerability like CVE-2021-41133, we will only need to add those syscalls to a deny-list in older branches like 1.12.x (if we are still supporting those branches), and users of 1.13.x/1.14.x will already be protected from the vulnerability. As a philosophy, having a security mechanism based on "enumerating badness" (a deny-list) is fundamentally flawed, and what I hear from people who know about seccomp is that an allow-list is the only safe way.

The cost is that the syscall filter will be more complicated (potentially harming performance in syscall-heavy workloads like games), and Flatpak apps will be unable to use new syscalls until we have added them to our allow-list. I think that's a price we have to pay if we want to reduce kernel attack surface.

Because we're using ENOSYS to block new syscalls, we won't be abruptly killing newer binaries like web browsers' sandboxes do, and we won't be making syscalls unexpectedly fail with EPERM like Docker does, so we shouldn't have the same problems with updating glibc: instead, they should gracefully fall back to older syscalls, the same as they would if they were running on an older kernel that genuinely didn't support the syscall they tried first (which also results in ENOSYS).

@mwleeds mwleeds added this to the 1.14.0 milestone Mar 9, 2022
@smcv smcv requested a review from mwleeds March 10, 2022 13:05
@smcv
Copy link
Collaborator Author

smcv commented Aug 9, 2022

Any chance someone could review this, containers/bubblewrap#488 and/or containers/bubblewrap#452? I'd be more comfortable about Flatpak's resilience against vulnerabilities like CVE-2021-41133 if we could get one of those into 1.14.x.

@mwleeds mwleeds modified the milestones: 1.14.0, 1.16.0 Aug 23, 2022
@alexlarsson
Copy link
Member

Overall this looks good to me. I have a performance worry for the allow list though. There is a ton of syscalls in that list, and adding them one by one may lead to a ruleset that is slow. Is it possible to pre-compute a list of all the syscalls that are contiguous and add rules instead of the form "allow syscalls from zero to 50" instead of 50 separate rules? Or does libseccomp do some optimization like that?

@rusty-snake
Copy link

AFAIK libseccomp has no such implicit optimization.

If you know the code running with a filter, you can benchmark the syscalls and give hot syscalls a higher priority. That's produces a filter where syscalls are check in order of their priority (instead of an undefined order) letting hot syscalls return earlier.

For cases like this (not known code + large filter) you can enable generation of a btree optimized filter with O(log n) by setting SCMP_FLTATR_CTL_OPTIMIZE to 2 (see seccomp_attr_set(3)).

Maintainer of libseccomp-rs (Rust bindings for libseccomp)

@smcv
Copy link
Collaborator Author

smcv commented Jun 1, 2023

This change is less important than it used to be, now that we have --disable-userns, which would have prevented CVE-2021-41133 in a different way. However, using an allowlist instead of a denylist would still give us better protection against new syscalls bypassing other restrictions where we rely on seccomp, like --disallow=bluetooth and protecting the kernel keyring.

There is a ton of syscalls in that list, and adding them one by one may lead to a ruleset that is slow. Is it possible to pre-compute a list of all the syscalls that are contiguous and add rules instead of the form "allow syscalls from zero to 50" instead of 50 separate rules?

It is probably possible, but realistically I am not going to be able to implement that any time soon. My entire knowledge of libseccomp is what I needed to learn in order to fix past CVEs and write this branch, so I'd appreciate it if someone with better seccomp expertise could take this over.

@smcv
Copy link
Collaborator Author

smcv commented Jun 1, 2023

I said this back in 2022 and it's still true:

The cost is that the syscall filter will be more complicated (potentially harming performance in syscall-heavy workloads like games), and Flatpak apps will be unable to use new syscalls until we have added them to our allow-list. I think that's a price we have to pay if we want to reduce kernel attack surface.

For performance-critical, syscall-heavy workloads, perhaps we could consider having an --allow flag that goes back to using a denylist instead of an allowlist, like a weaker version of #5224?

@mcatanzaro
Copy link
Collaborator

For performance-critical, syscall-heavy workloads, perhaps we could consider having an --allow flag that goes back to using a denylist instead of an allowlist, like a weaker version of #5224?

I don't think this is even worth considering unless we know there exists a real performance problem for some real application.

@Erick555
Copy link
Contributor

Erick555 commented Jun 1, 2023

Such performance problem was reported for some games.

@alexlarsson
Copy link
Member

Such performance problem was reported for some games.

Yeah, and that was with the current small filters. But with an allowlist each syscall would have to pass a filter that compared it against all possible syscalls.

lf- pushed a commit to lix-project/lix that referenced this pull request Jul 26, 2024
Previously, system call filtering (to prevent builders from storing files with
setuid/setgid permission bits or extended attributes) was performed using a
blocklist. While this looks simple at first, it actually carries significant
security and maintainability risks: after all, the kernel may add new syscalls
to achieve the same functionality one is trying to block, and it can even be
hard to actually add the syscall to the blocklist when building against a C
library that doesn't know about it yet. For a recent demonstration of this
happening in practice to Nix, see the introduction of fchmodat2 [0] [1].

The allowlist approach does not share the same drawback. While it does require
a rather large list of harmless syscalls to be maintained in the codebase,
failing to update this list (and roll out the update to all users) in time has
rather benign effects; at worst, very recent programs that already rely on new
syscalls will fail with an error the same way they would on a slightly older
kernel that doesn't support them yet. Most importantly, no unintended new ways
of performing dangerous operations will be silently allowed.

Another possible drawback is reduced system call performance due to the larger
filter created by the allowlist requiring more computation [2]. However, this
issue has not convincingly been demonstrated yet in practice, for example in
systemd or various browsers. To the contrary, it has been measured that the the
actual filter constructed here has approximately the same overhead as a very
simple filter blocking only one system call.

This commit tries to keep the behavior as close to unchanged as possible. The
system call list is in line with libseccomp 2.5.5 and glibc 2.39, which are the
latest versions at the point of writing. Since libseccomp 2.5.5 is already a
requirement and the distributions shipping this together with older versions of
glibc are mostly not a thing any more, this should not lead to more build
failures any more.

[0] NixOS/nixpkgs#300635
[1] NixOS/nix#10424
[2] flatpak/flatpak#4462 (comment)

Change-Id: I541be3ea9b249bcceddfed6a5a13ac10b11e16ad
@smcv smcv removed this from the 1.16.0 milestone Aug 23, 2024
@smcv smcv marked this pull request as draft August 23, 2024 12:52
@smcv
Copy link
Collaborator Author

smcv commented Aug 23, 2024

This has been on hold for long enough that I no longer remember its status clearly, and the testing I previously did is no longer necessarily valid. I think we need to take this out of scope for 1.16.x if we want to release 1.16.x in a finite time.

The main concern about this change was performance. I would appreciate it if someone with more time than me could take over what needs to happen in this branch:

  • benchmarking a syscall-heavy workload (like a game) with seccomp filtering #if 0'd out (best case performance), seccomp denylist (status quo) and seccomp allowlist (this branch);
  • if there is a significant performance difference, benchmarking whether the suggestions in Handle syscalls via allowlist instead of denylist #4462 (comment) help

* seccomp -> blocklist_ctx
* seccomp_tmpf -> blocklist_tmpf

Signed-off-by: Simon McVittie <[email protected]>
This should avoid future vulnerabilities similar to CVE-2021-41133.

If a new VFS-manipulating system call is added to the kernel and
libseccomp but not added to our list of known system calls, then it won't
enter our allowlist, so it will fail with ENOSYS in the sandbox.

If a new system call is added to our list of known system calls, but is
not supported by libseccomp, then adding it to both our blocklist (if
applicable) and our allowlist will fail with -EFAULT. Our blocklist will
be unable to make it fail with EPERM or check its arguments, but our
allowlist will make it fail with ENOSYS, so we're still protected.

Signed-off-by: Simon McVittie <[email protected]>
This means we don't have to know the syscall numbers of recently-added
system calls, which means we can drop common/flatpak-syscalls-private.h.

Signed-off-by: Simon McVittie <[email protected]>
* Add cachestat()
* Add futex_*() (notably used by Wine/Proton)
* Add fchmodat2()
* Add listmount(), statmount()
* Add llseek()
* Add lsm_get_self_attr(), lsm_list_modules(), lsm_set_self_attr()
* Add map_shadow_stack()
* Add mseal()
* Add process_mrelease()
* Add riscv_hwprobe() (CPUID-like)
* Add set_mempolicy_home_node()
* Add uretprobe()

None of these seem like something that needs to be blocked, so don't
add them to the denylist.

I've done this as a separate commit rather than squashing it into the
commit that added common/known-syscalls.txt, to illustrate what we will
need to do periodically.

Signed-off-by: Simon McVittie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sandbox issue related to sandbox setup
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants