Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Bubblewrap implementation for sandboxing #153

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

RobertRosca
Copy link
Member

@RobertRosca RobertRosca commented Dec 10, 2023

Adds sandboxing via bubblewrap for context file execution, closes #150.

Main functionality is implemented via a Bubblewrap class which is a convenience class for building up the CLI arguments/flags/mounts required to sandbox a python process to only have access to the data from a single proposal directory.

There's also a new flag --no-sandbox to disable the sandboxing feature. This can be set at launch time for the listener and will be propagated to other subprocess commands down to the final relevant extract_in_subprocess call.

Summary

Bubblewrap implements:

  • __init__: initializes with some default flags and bind mounts for running the sandboxed process, these defaults are all security/isolation related (unshare/disable namespaces) or required binds/settings (e.g. share network, bind /bin, /lib, etc...).
  • add_bind: main method for adding some bind mount with a source, destination, and a flag to set the mount to read only or not.
  • add_bind_proposal which takes in a proposal number, finds the directory, bind mounts the directory, and resolves the top-level symlinks in the proposal directory, then bind mounts those in as well.
  • add_bind_venv which takes the path to a python executable and, if it is running in a venv, bind mounts all paths required by it to the sandbox.
  • build_command takes in the command to sandbox and prepends the full bubblewrap command to it, then returns a list that can be called by subprocess to start the sandboxed command.

There are some tests to check that the command is built (at least somewhat) correctly, but they're not that robust since I'm not sure how reliable testing with something bubblewrap is when running in a CI environment.

Example

An example of building up the commands would be:

b = Bubblewrap()

b.add_bind_proposal(3422)

b.add_bind_venv("/gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/bin/python")

b.build_command("/bin/bash")

Which creates:

bwrap \
  --unshare-all \
  --share-net \
  --dev /dev \
  --tmpfs /tmp \
  --dir /gpfs \
  --ro-bind /bin /bin \
  --ro-bind /etc/resolv.conf /etc/resolv.conf \
  --ro-bind /gpfs/exfel/sw/software /gpfs/exfel/sw/software \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /sbin /sbin \
  --ro-bind /usr /usr \
  --bind /gpfs/exfel/exp/MID/202304/p003422 /gpfs/exfel/exp/MID/202304/p003422 \
  --bind /gpfs/exfel/u/scratch/MID/202304/p003422 /gpfs/exfel/u/scratch/MID/202304/p003422 \
  --bind /pnfs/xfel.eu/exfel/archive/XFEL/raw/MID/202304/p003422 /pnfs/xfel.eu/exfel/archive/XFEL/raw/MID/202304/p003422 \
  --bind /gpfs/exfel/u/usr/MID/202304/p003422 /gpfs/exfel/u/usr/MID/202304/p003422 \
  --bind /gpfs/exfel/d/proc/MID/202304/p003422 /gpfs/exfel/d/proc/MID/202304/p003422 \
  --ro-bind /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/lib/python3.9 /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/lib/python3.9 \
  --ro-bind /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/lib/python3.9 /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/lib/python3.9 \
  --ro-bind /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/lib/python3.9/site-packages /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/lib/python3.9/site-packages \
  --ro-bind /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/lib/python3.9/site-packages /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/lib/python3.9/site-packages \
  --ro-bind /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/include/python3.9 /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/include/python3.9 \
  --ro-bind /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/include/python3.9 /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/include/python3.9 \
  --ro-bind /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/bin /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env/bin \
  --ro-bind /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env /gpfs/exfel/u/usr/MID/202304/p003422/Software/analysis_env \
  /bin/bash

Questions

Main one is... where should this go? - Went with making it part of `extract_in_subprocess`

It can be added anywhere there is a subprocess command, or even in the ctxrunner:

  1. extract_proc = subprocess.Popen([
  2. return subprocess.run(args, env=env, **kwargs)
    seems like a good option since it would also cover the case of slurm jobs, as far as I understand from just glancing at
    python_cmd = [sys.executable, '-m', 'damnit.backend.extract_data',
  3. def main(argv=None):
    by creating a new flag --sandbox which, if present, re-executes itself within the sandbox.

Other questions are:

  1. Right now I'm only setting read-only binds for the python venv (if there is one). Alternative would be to mount everything as read only by default, and explicitly pass a single output directory which is writable. - spoke to James, decided to leave proposal dir as normal mount instead of having everything read only and one output directory due to other tools potentially writing to other directories.
  2. --die-with-parent is set to kill the sandbox if the parent (DAMNIT) process dies, I assume that's desirable? - spoke to James, seems fine.
  3. --unshare-all and some of the binds break (on purpose) authentication, slurm, ssh, etc..., I assume people don't expect to be able to have some function in a context file that runs a subprocess with ssh or something? - spoke to James, mentioned that xwiz will run slurm commands on its own which would break, added in option to disable sandboxing when the listener is spawned.

TODO:

  • Test read write to db and context file
  • Test supervisor start with no sandbox

@RobertRosca RobertRosca self-assigned this Dec 10, 2023
@RobertRosca
Copy link
Member Author

RobertRosca commented Dec 12, 2023

Seeemmmss to work, at least to some degree. Tested it with the following context file:

import subprocess
import socket
from datetime import timedelta
from pathlib import Path

import numpy as np

from damnit_ctx import Variable

@Variable(title="Trains")
def n_trains(run):
    return len(run.train_ids)

@Variable(title="Proposals")
def list_proposals(run):
    return str(list(Path("/gpfs/exfel/exp/").glob("*")))

@Variable(title="Slurm")
def srun(run):
    return subprocess.run(
        ["sacct"],
        text=True,
        stdout = subprocess.PIPE,
        stderr = subprocess.STDOUT,
    ).stdout

@Variable(title="Web")
def ping(run):
    return subprocess.run(
        ["curl", "example.com"],
        text=True,
        stdout = subprocess.PIPE,
        stderr = subprocess.STDOUT,
    ).stdout

@Variable(title="SSH")
def ping(run):
    return subprocess.run(
        ["ssh", "max-exfl", "hostname"],
        text=True,
        stdout = subprocess.PIPE,
        stderr = subprocess.STDOUT,
    ).stdout

@Variable(title="Slurm Executed", cluster=True)
def ping(run):
    host = socket.getfqdn()

    proposals = len(list(Path("/gpfs/exfel/exp/").glob("*/*/*")))

    return f"{host} - {proposals}"

Which updates the sqlite database and creates the extracted data files successfully. The output is:

Full output
4507, 
1, 
1697763475.124875, 
1702324474.924, 
None, 
"[PosixPath('/gpfs/exfel/exp/FXE/202302/p004507')]", 
12373, 
'sacct: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host\nsacct: error: fetch_config: DNS SRV lookup failed\nsacct: error: _establish_config_source: failed to fetch config\nsacct: fatal: Could not establish a configuration source\n', 
'max-exfl093.desy.de - 1', 
'No user exists for uid 33392\n', 
'  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\n  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\n100  1256  100  1256    0     0   6197      0 --:--:-- --:--:-- --:--:--  6217\n<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'
  • u"[PosixPath('/gpfs/exfel/exp/FXE/202302/p004507')]" - only the relevant FXE data is mounted and available
  • 12373 - number of trains in some run
  • u'sacct: error: resolve_ctls_from_dns_srv...' - error caused by sandboxing
  • 'No user exists for uid...' - ssh broken due to lack of /etc mounts
  • '<title>Example D...' - normal internet connectivity is fine
  • 'max-exfl093.desy.de - 1' - slurm job allocated on node and still only saw one proposal directory

@RobertRosca RobertRosca marked this pull request as ready for review December 12, 2023 09:31
Copy link
Member

@JamesWrigley JamesWrigley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thing, could you write some docs about this in backend.md? Including how to enable/disable it.

Also, your code is outstanding :) I love all the documentation and type hints ❤️

damnit/backend/extract_data.py Show resolved Hide resolved
self.extract_procs_queue.put((proposal, run, extract_proc))

def listen():
def listen(sandbox: bool):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move the sandbox setting to the databases metameta table? That way we can set it relatively easily with amore-proto db-config and we wouldn't have to restart the listener after changing it. It would also make it simpler to move to a centralized listener in the future and have different sandbox settings for different proposals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah, that was my first idea, but it depends on how 'secure' the configuration should be, since if it's in the database then users could enable/disable the sandboxing as easily as we could, unless the db is read only to the DAMNIT user, which would break a lot of things.

Keeping the setting as part of how the process is started means that the user running the listener could have sandboxing on, while others can still modify the DB/run reprocessing themselves.

I think options are:

  1. Make it part of the table with tables as read-only, enforces sandboxing for processes, but breaks user write permissions to the table.
  2. Make it part of the table, still read-write, risk of users disabling sandboxing.
  3. Some extra configuration file in the proposal directory/for the listener which is read-only to the DAMNIT user. This kind of config may end up being required as part of the move to a central listener anyway.
  4. Something set as part of the supervisor config?
  5. Leave it as a flag.

With all options ending in "for now" 😛

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say lets go with option 2 for now, and later we can move it to the centralized listener settings (which should be inaccessible by users).

damnit/backend/sandboxing.py Show resolved Hide resolved


@pytest.fixture
def bubblewrap():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking, could the fixtures go into conftest.py with the others?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Member Author

@RobertRosca RobertRosca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thing, could you write some docs about this in backend.md? Including how to enable/disable it.

Yep, I'll add some stuff there, including notes on when it should be disabled.

Also, your code is outstanding :) I love all the documentation and type hints ❤️

Aha thanks, my IDE blinds me with warnings if I don't 😂 😂 good motivator

damnit/cli.py Outdated

elif args.subcmd == 'reprocess':
# Hide some logging from Kafka to make things more readable
logging.getLogger('kafka').setLevel(logging.WARNING)

from .backend.extract_data import reprocess
reprocess(args.run, args.proposal, args.match, args.mock)
reprocess(args.run, args.proposal, args.match, args.mock, args.no_sandbox)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀

damnit/backend/extract_data.py Show resolved Hide resolved
self.extract_procs_queue.put((proposal, run, extract_proc))

def listen():
def listen(sandbox: bool):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah, that was my first idea, but it depends on how 'secure' the configuration should be, since if it's in the database then users could enable/disable the sandboxing as easily as we could, unless the db is read only to the DAMNIT user, which would break a lot of things.

Keeping the setting as part of how the process is started means that the user running the listener could have sandboxing on, while others can still modify the DB/run reprocessing themselves.

I think options are:

  1. Make it part of the table with tables as read-only, enforces sandboxing for processes, but breaks user write permissions to the table.
  2. Make it part of the table, still read-write, risk of users disabling sandboxing.
  3. Some extra configuration file in the proposal directory/for the listener which is read-only to the DAMNIT user. This kind of config may end up being required as part of the move to a central listener anyway.
  4. Something set as part of the supervisor config?
  5. Leave it as a flag.

With all options ending in "for now" 😛

damnit/backend/sandboxing.py Show resolved Hide resolved


@pytest.fixture
def bubblewrap():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Comment on lines +114 to +115
if venv == "False":
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be surprising that this method is a complete no-op if the Python you point to isn't a venv. Especially if people use conda envs - they're technically not venvs, but I think people usually expect things to work in the same way.

Would it make sense to turn this into add_bind_python and try to do the right thing with the target Python, venv or not?

It could also start by adding the main env directory (i.e. the bit before bin/python) and only add extra paths if they're not under that; it should be the same, but a shorter command is easier to make sense of if we ever need to.

(Also, I think you'd actually end up with "False\n" here if the target is not a venv, so the check would fail. It might be worth using JSON to send details back, just to avoid fiddly details like this.)

@@ -106,6 +108,24 @@ def extract_in_subprocess(
for m in match:
args.extend(['--match', m])

if sandbox:
bubblewrap = Bubblewrap()
with contextlib.suppress(Exception):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with contextlib.suppress(Exception):
with contextlib.suppress(FileNotFoundError):

Is that what we're trying to catch? Do we expect it to come up in real use, or only in testing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sandboxing the extractor
3 participants