Running the UATs behind a proxy #27

Open
phoevos opened this issue Sep 22, 2023 · 0 comments
Labels: documentation (Improvements or additions to documentation)

phoevos commented Sep 22, 2023

An important part of the Kubeflow UATs effort is their integration into the bundle CI, so that we have automated checks in place whenever something changes in the Kubeflow bundle.

Due to the sheer size of this bundle, it can only be deployed on beefy self-hosted runners. Currently, our runners are configured to run behind a proxy meant to restrict access to external networks. Unfortunately, even if the proxy allows everything, configuring the UATs to work with it is not trivial. There are two main concerns, which we'll dive into below:

  • Ensuring that connections to internal network addresses don't go through the proxy
  • Ensuring that all deployed resources share the host configuration so that they can work with the proxy

Setup

The UATs are intended to run on a self-hosted VM with MicroK8s and Charmed Kubeflow deployed on top of it. As mentioned before, this VM runs behind a proxy, which is responsible for routing HTTP/S traffic to/from external networks. This is reflected in the HTTP_PROXY and HTTPS_PROXY environment variables (in our case both pointing to the same proxy), which most tools use when connecting to the internet.
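
For illustration, this is roughly what that host configuration looks like from Python; the proxy address matches the one used in the examples further down:

```python
# Illustration of the host proxy configuration the UATs inherit. Most HTTP
# tooling (pip, curl, httpx, ...) honours these variables by default.
import os

print(os.environ.get("HTTP_PROXY"))   # e.g. http://squid.internal:3128
print(os.environ.get("HTTPS_PROXY"))  # e.g. http://squid.internal:3128
```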

Access Internal Addresses

The UATs need to access the Kubernetes API server in order to create and manage the Profile and Job responsible for executing the test suite. To achieve that, we use lightkube. Connections to internal resources (such as the API server) should not go through the proxy, so we have to instruct lightkube to ignore it. There are two options:

  1. Add the internal cluster addresses to the NO_PROXY list
  2. Instruct lightkube to ignore the environment variables (among which are HTTP_PROXY and HTTPS_PROXY)

Although the first option looked promising, it's actually harder to work with, due to limitations in the way the httpx package used by lightkube interprets the list of CIDRs provided through NO_PROXY.
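
For context, here is a sketch of what option 1 would boil down to; the addresses below are placeholders, and the pain point is that individual hosts and DNS suffixes have to be spelled out one by one:

```python
# Sketch of option 1 (not the approach we went with): extend NO_PROXY so that
# httpx bypasses the proxy for internal destinations. The entries are
# placeholders; individual IPs and DNS suffixes have to be listed explicitly,
# which is exactly the CIDR-related limitation mentioned above.
import os

no_proxy = [
    "127.0.0.1",
    "localhost",
    "10.152.183.1",    # e.g. the K8s API server ClusterIP (placeholder)
    ".svc",            # in-cluster DNS suffixes
    ".cluster.local",
]
os.environ["NO_PROXY"] = ",".join(no_proxy)
os.environ["no_proxy"] = os.environ["NO_PROXY"]
```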

Going with the second option turned out to be more straightforward, since it only entails initialising the lightkube.Client with the trust_env option set to False. This is propagated to httpx, essentially instructing it to ignore the proxy environment variables. As a result, the lightkube client bypasses the configured proxy and accesses the K8s API server directly, which succeeds since the API server sits on the internal network.
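
A minimal sketch of what this looks like in practice (the namespace listing is just an illustrative call):

```python
# Create the lightkube client with trust_env=False so that the underlying
# httpx client ignores HTTP_PROXY/HTTPS_PROXY and reaches the K8s API server
# directly over the internal network.
from lightkube import Client
from lightkube.resources.core_v1 import Namespace

client = Client(trust_env=False)

# Any request now bypasses the proxy, e.g. listing namespaces:
for ns in client.list(Namespace):
    print(ns.metadata.name)
```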

Access External Addresses

Contrary to the situation described above, accessing external resources is only possible through the proxy, which imposes the following requirements:

  • Every address we attempt to reach must be included in the proxy allowlist
  • Every outbound request must be configured to go through the proxy

Given that the self-hosted runners are still under active development, the first point is not expected to be an issue: all destination addresses are currently allowed through the proxy. Later on, we might have to be more careful and deliberate about the resources we access.

On the other hand, ensuring that all requests go through the proxy can be an arduous task. As mentioned before, resources on the VM are configured to work with the proxy through the HTTP_PROXY and HTTPS_PROXY environment variables. Things become a bit more complicated, though, when we start deploying workloads on the MicroK8s cluster; these workloads are expected to have internet access (e.g. for installing Python dependencies) but do not share the host configuration by default. Our goal, therefore, is to propagate this configuration (essentially, the two environment variables) to any workload created on the cluster.

Propagate Host Environment to Workloads

When it comes to propagating the environment variables into the created workloads, there are both fine-grained and coarse-grained approaches, which we'll briefly explore below.

Using a ConfigMap

The best and most fine-grained approach to injecting environment variables into K8s Pods is to use a ConfigMap with the desired data and explicitly specify where it is consumed when creating each workload. An example ConfigMap could look like this:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: use-proxy
  namespace: default
data:
  HTTP_PROXY: http://squid.internal:3128
  HTTPS_PROXY: http://squid.internal:3128
```

Then, to have a specific workload pick up this configuration, we add the following to the corresponding ContainerSpec:

```yaml
envFrom:
- configMapRef:
    name: use-proxy
    optional: true
```

Note, however, that the ConfigMap has to be in the same namespace as the created workload.

We've already implemented this approach in this branch of the UATs, so that the driver Job that runs the test suite can access the internet. More specifically, if a file is provided through the --env option at invocation, the driver uses it to create a ConfigMap. The deployed Job then attempts to consume the data from that ConfigMap (provided it exists, hence the optional field) and sets it as environment variables in the Pod created to run the suite.
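
A rough sketch of that flow (the helper name and the file parsing below are illustrative, not the actual UATs code):

```python
# Illustrative only: read a KEY=VALUE file passed via --env and create the
# ConfigMap that the driver Job later consumes through envFrom.
from lightkube import Client
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.resources.core_v1 import ConfigMap


def create_proxy_configmap(env_file: str, namespace: str = "default") -> None:
    data = {}
    with open(env_file) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                data[key.strip()] = value.strip()

    # trust_env=False so that this call itself doesn't go through the proxy.
    client = Client(trust_env=False)
    client.create(
        ConfigMap(metadata=ObjectMeta(name="use-proxy", namespace=namespace), data=data)
    )


# With a params.env file containing:
#   HTTP_PROXY=http://squid.internal:3128
#   HTTPS_PROXY=http://squid.internal:3128
# the driver would effectively do: create_proxy_configmap("params.env")
```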

An example can be found in this branch of the CKF bundle, where we add the two environment variables to a params.env file and pass that to the UATs invocation.

Limitations

Although this approach works for the driver Job, we need to take into account that many of the notebook tests may deploy workloads themselves: Argo Workflow Pods when running Kubeflow Pipelines, Katib Trial Pods, or KServe/Seldon deployment Pods, all of which could require internet access. There are two issues here:

  • We don't directly manage all created workloads, so we don't have access to their ContainerSpec
  • Even if we did, this doesn't scale; the notebook tests themselves should not have to care about whether we're using a proxy or not.

Using a MutatingAdmissionWebhook

A more global, coarse-grained approach would be to implement and deploy a MutatingAdmissionWebhook that injects the required data into any workload deployed on the cluster. An example of this can be found in the Kubeflow Admission Webhook, which watches for submitted Pods and patches their environment based on the available PodDefaults. Although this could solve our issues, it is a major undertaking and is therefore unlikely to be prioritised if the Kubeflow team is the only one that ends up needing it.
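
To make the idea concrete, here is a bare-bones sketch of the mutation logic such a webhook could implement. It is purely illustrative: the actual Kubeflow Admission Webhook is built around PodDefaults, and a production webhook additionally needs TLS and a MutatingWebhookConfiguration registering it with the API server.

```python
# Minimal mutating admission webhook sketch: inject HTTP_PROXY/HTTPS_PROXY
# into every container of each incoming Pod via a JSONPatch response.
import base64
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Values to inject; in practice these would come from the host environment or
# a ConfigMap rather than being hard-coded.
PROXY_ENV = [
    {"name": "HTTP_PROXY", "value": "http://squid.internal:3128"},
    {"name": "HTTPS_PROXY", "value": "http://squid.internal:3128"},
]


class MutateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        review = json.loads(self.rfile.read(length))
        pod = review["request"]["object"]

        # Build a JSONPatch that appends the proxy variables to every container.
        patch = []
        for i, container in enumerate(pod["spec"]["containers"]):
            if "env" not in container:
                patch.append({"op": "add", "path": f"/spec/containers/{i}/env", "value": []})
            for var in PROXY_ENV:
                patch.append({"op": "add", "path": f"/spec/containers/{i}/env/-", "value": var})

        body = json.dumps({
            "apiVersion": "admission.k8s.io/v1",
            "kind": "AdmissionReview",
            "response": {
                "uid": review["request"]["uid"],
                "allowed": True,
                "patchType": "JSONPatch",
                "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
            },
        }).encode()

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # A real webhook must serve HTTPS with a certificate trusted by the API
    # server; plain HTTP is used here only to keep the sketch short.
    HTTPServer(("", 8443), MutateHandler).serve_forever()
```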
