Running the UATs behind a proxy #27

Open
phoevos opened this issue Sep 22, 2023 · 0 comments
Labels: documentation (Improvements or additions to documentation)

phoevos commented Sep 22, 2023

An important part of the Kubeflow UATs effort is their integration into the bundle CI, so that we have automated checks in place whenever something changes in the Kubeflow bundle.

Due to the sheer size of this bundle, it can only be deployed on beefy self-hosted runners. Currently, our runners are configured to run behind a proxy meant to restrict access to external networks. Unfortunately, even if the proxy allows everything, configuring the UATs to work with it is not trivial. There are two main concerns, which we'll dive into below:

  • Ensuring that connections to internal network addresses don't go through the proxy
  • Ensuring that all deployed resources share the host configuration so that they can work with the proxy

Setup

The UATs are intended to run on a self-hosted VM with MicroK8s and Charmed Kubeflow deployed on top of it. As mentioned before, this VM runs behind a proxy, which is responsible for routing HTTP/S traffic to/from external networks. This is reflected in the HTTP_PROXY and HTTPS_PROXY environment variables (in our case both pointing to the same proxy), which most tools use when connecting to the internet.
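
For illustration, this is roughly what that host configuration looks like from Python; the proxy address matches the one used in the examples further down:

```python
# Illustration of the host proxy configuration the UATs inherit. Most HTTP
# tooling (pip, curl, httpx, ...) honours these variables by default.
import os

print(os.environ.get("HTTP_PROXY"))   # e.g. http://squid.internal:3128
print(os.environ.get("HTTPS_PROXY"))  # e.g. http://squid.internal:3128
```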

Access Internal Addresses

The UATs need to access the Kubernetes API server in order to create and manage the Profile and Job responsible for executing the test suite. To achieve that, we use lightkube. Connections to internal resources (such as the API server) should not go through the proxy, so we have to instruct lightkube to ignore it. There are two options:

  1. Add the internal cluster addresses to the NO_PROXY list
  2. Instruct lightkube to ignore the environment variables (among which are HTTP_PROXY and HTTPS_PROXY)

Although the first option looked promising, it's actually harder to work with, due to limitations in the way the httpx package used by lightkube interprets the list of CIDRs provided through NO_PROXY.
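
For context, here is a sketch of what option 1 would boil down to; the addresses below are placeholders, and the pain point is that individual hosts and DNS suffixes have to be spelled out one by one:

```python
# Sketch of option 1 (not the approach we went with): extend NO_PROXY so that
# httpx bypasses the proxy for internal destinations. The entries are
# placeholders; individual IPs and DNS suffixes have to be listed explicitly,
# which is exactly the CIDR-related limitation mentioned above.
import os

no_proxy = [
    "127.0.0.1",
    "localhost",
    "10.152.183.1",    # e.g. the K8s API server ClusterIP (placeholder)
    ".svc",            # in-cluster DNS suffixes
    ".cluster.local",
]
os.environ["NO_PROXY"] = ",".join(no_proxy)
os.environ["no_proxy"] = os.environ["NO_PROXY"]
```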

Going with the second option turned out to be more straightforward, since it only entails initialising the lightkube.Client with the trust_env option set to False. This is propagated to httpx, essentially instructing it to ignore the proxy environment variables. As a result, the lightkube client bypasses the configured proxy and accesses the K8s API server directly, which succeeds since the API server sits on the internal network.
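
A minimal sketch of what this looks like in practice (the namespace listing is just an illustrative call):

```python
# Create the lightkube client with trust_env=False so that the underlying
# httpx client ignores HTTP_PROXY/HTTPS_PROXY and reaches the K8s API server
# directly over the internal network.
from lightkube import Client
from lightkube.resources.core_v1 import Namespace

client = Client(trust_env=False)

# Any request now bypasses the proxy, e.g. listing namespaces:
for ns in client.list(Namespace):
    print(ns.metadata.name)
```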

Access External Addresses

Contrary to the situation described above, accessing external resources is only possible through the proxy, which imposes the following requirements:

  • Every address we attempt to reach must be included in the proxy allowlist
  • Every outbound request must be configured to go through the proxy

Given that the self-hosted runners are still under active development, the first point is not expected to be an issue: all destination addresses are currently allowed through the proxy. Later on, we might have to be more careful and deliberate about the resources we access.

On the other hand, ensuring that all requests go through the proxy can be an arduous task. As mentioned before, resources on the VM are configured to work with the proxy through the HTTP_PROXY and HTTPS_PROXY environment variables. Things become a bit more complicated, though, when we start deploying workloads on the MicroK8s cluster; these workloads are expected to have internet access (e.g. for installing Python dependencies) but do not share the host configuration by default. Our goal, therefore, is to propagate this configuration (essentially, the two environment variables) to any workload created on the cluster.

Propagate Host Environment to Workloads

When it comes to propagating the environment variables into the created workloads, there are both fine-grained and coarse-grained approaches, which we'll briefly explore below.

Using a ConfigMap

The best and most fine-grained approach to injecting environment variables into K8s Pods is to use a ConfigMap with the desired data and explicitly specify where it is consumed when creating each workload. An example ConfigMap could look like this:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: use-proxy
  namespace: default
data:
  HTTP_PROXY: http://squid.internal:3128
  HTTPS_PROXY: http://squid.internal:3128
```

Then, to have a specific workload pick up this configuration, we add the following to the corresponding ContainerSpec:

```yaml
envFrom:
- configMapRef:
    name: use-proxy
    optional: true
```

Note, however, that the ConfigMap has to be in the same namespace as the created workload.

We've already implemented this approach in this branch of the UATs, so that the driver Job that runs the test suite can access the internet. More specifically, if a file is provided through the --env option at invocation, the driver uses it to create a ConfigMap. The deployed Job then attempts to consume the data from that ConfigMap (provided it exists, hence the optional field) and sets it as environment variables in the Pod created to run the suite.
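
A rough sketch of that flow (the helper name and the file parsing below are illustrative, not the actual UATs code):

```python
# Illustrative only: read a KEY=VALUE file passed via --env and create the
# ConfigMap that the driver Job later consumes through envFrom.
from lightkube import Client
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.resources.core_v1 import ConfigMap


def create_proxy_configmap(env_file: str, namespace: str = "default") -> None:
    data = {}
    with open(env_file) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                data[key.strip()] = value.strip()

    # trust_env=False so that this call itself doesn't go through the proxy.
    client = Client(trust_env=False)
    client.create(
        ConfigMap(metadata=ObjectMeta(name="use-proxy", namespace=namespace), data=data)
    )


# With a params.env file containing:
#   HTTP_PROXY=http://squid.internal:3128
#   HTTPS_PROXY=http://squid.internal:3128
# the driver would effectively do: create_proxy_configmap("params.env")
```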

An example can be found in this branch of the CKF bundle, where we add the two environment variables to a params.env file and pass that to the UATs invocation.

Limitations

Although this approach works for the driver Job, we need to take into account that many of the notebook tests may deploy workloads themselves: Argo Workflow Pods when running Kubeflow Pipelines, Katib Trial Pods, or KServe/Seldon deployment Pods, all of which could require internet access. There are two issues here:

  • We don't directly manage all created workloads, so we don't have access to their ContainerSpec
  • Even if we did, this doesn't scale; the notebook tests themselves should not have to care about whether we're using a proxy or not.

Using a MutatingAdmissionWebhook

A more global, coarse-grained approach would be to implement and deploy a MutatingAdmissionWebhook that injects the required data into any workload deployed on the cluster. An example of this can be found in the Kubeflow Admission Webhook, which watches for submitted Pods and patches their environment based on the available PodDefaults. Although this could solve our issues, it is a major undertaking and is therefore unlikely to be prioritised if the Kubeflow team is the only one that ends up needing it.
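
To make the idea concrete, here is a bare-bones sketch of the mutation logic such a webhook could implement. It is purely illustrative: the actual Kubeflow Admission Webhook is built around PodDefaults, and a production webhook additionally needs TLS and a MutatingWebhookConfiguration registering it with the API server.

```python
# Minimal mutating admission webhook sketch: inject HTTP_PROXY/HTTPS_PROXY
# into every container of each incoming Pod via a JSONPatch response.
import base64
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Values to inject; in practice these would come from the host environment or
# a ConfigMap rather than being hard-coded.
PROXY_ENV = [
    {"name": "HTTP_PROXY", "value": "http://squid.internal:3128"},
    {"name": "HTTPS_PROXY", "value": "http://squid.internal:3128"},
]


class MutateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        review = json.loads(self.rfile.read(length))
        pod = review["request"]["object"]

        # Build a JSONPatch that appends the proxy variables to every container.
        patch = []
        for i, container in enumerate(pod["spec"]["containers"]):
            if "env" not in container:
                patch.append({"op": "add", "path": f"/spec/containers/{i}/env", "value": []})
            for var in PROXY_ENV:
                patch.append({"op": "add", "path": f"/spec/containers/{i}/env/-", "value": var})

        body = json.dumps({
            "apiVersion": "admission.k8s.io/v1",
            "kind": "AdmissionReview",
            "response": {
                "uid": review["request"]["uid"],
                "allowed": True,
                "patchType": "JSONPatch",
                "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
            },
        }).encode()

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # A real webhook must serve HTTPS with a certificate trusted by the API
    # server; plain HTTP is used here only to keep the sketch short.
    HTTPServer(("", 8443), MutateHandler).serve_forever()
```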
