
POC: Skip kaniko build in serverless #941

Closed
kwiatekus opened this issue Apr 29, 2024 · 11 comments

@kwiatekus
Contributor

kwiatekus commented Apr 29, 2024

Description

Verify the following ideas:

  • Serverless does not need to trigger a kaniko build for Functions; instead, the code and dependencies could be mounted from a PV
    into a serverless-provided base image.
  • Do not make any assumptions based on the existing function controller code.
  • Assume that the Function CR remains backward compatible.

Document the outcome as an issue comment and discuss it with the team (derive recommendations for Git Functions).
If the ideas are validated, draw the diagrams needed to explain the required changes in the Function processing flow.

Reason
If the POC validates the idea, we could eliminate the dependencies on the kaniko build and the Docker registry. This could greatly speed up redeploying new versions of code and reduce the resources required to develop ideas as Functions (no extra memory, CPU time, or volume storage for images).
Additionally, the POC could help position the Serverless module as a development tool that lets users prove their ideas in Kyma runtimes as Functions and later "eject" them into manifests that declare production-grade workloads.

@kwiatekus kwiatekus changed the title from "POC: Skip function build when not required" to "POC: How to skip function build when not required" on May 9, 2024
@kwiatekus
Contributor Author

kwiatekus commented May 9, 2024

Additionally, @kwiatekus should estimate the gain:

  • How many Functions are there that potentially don't need a build?
  • Do customers need immediate redeployment of changed code?

@kwiatekus kwiatekus changed the title from "POC: How to skip function build when not required" to "POC: How to skip inline function build when not required" on May 9, 2024
@kwiatekus kwiatekus changed the title from "POC: How to skip inline function build when not required" to "POC: Skip kaniko build in serverless" on May 13, 2024
@ptesny

ptesny commented May 21, 2024

If I understand correctly, the whole idea is pretty much similar to the existing base function image override?
Power of serverless with SAP BTP, Kyma runtime. Base image override.

@pPrecel
Contributor

pPrecel commented Sep 16, 2024

POC v1 results

Current solution

The new concept is a bit simpler than the current solution, so let's start with what the current one requires. Right now we have to deploy an in-cluster docker-registry in kyma-system with all the resources it needs (ServiceAccount, Secrets, ConfigMaps, a Service with a NodePort) to fulfill the build phase. For example, we need a NodePort Service so that Kubernetes can pull images from our docker-registry, and we have to copy a few resources into every namespace to enable building images there and to let Deployments pull them later:

  • ConfigMaps with Dockerfile definitions for every runtime
  • Secrets with registry credentials for pulling and pushing images
  • a ServiceAccount used by Function Deployments, with RBAC and secrets

Additionally, we have to control three phases of the Function lifecycle (configuration, building, running), and each phase creates separate resources belonging to the Function:

  • configuration - validate the Function and create a ConfigMap with its sources
  • building - create a build job (kaniko) and wait until it completes
  • running - create a Deployment with a ServiceAccount that has access to the registry (credentials from a Secret) and wait until it is ready

The diagram below describes the current implementation in general:

[diagram: serverless]

New solution

Concept

The POC proved that it's possible to skip the building phase and mount Function sources directly into the Function's Pods. The idea was to register a central PVC, expose it to other namespaces using an NFS server, and then use a dedicated (per-Function) PV and PVC to mount the Function sources into the Function's Pods. The diagram below describes the idea in general:

[diagram: buildless-serverless]

Note: All resources on the blue background are created in kyma-system once by the KLM; all resources on the green background are created either by the user (actor) once or by the function-controller per Function, and are located in the Function's namespace.

In this proposition we can highlight two Function lifecycle phases:

  • configuration - validate the Function and copy its sources to the central PVC
  • running - create a Deployment, run npm install on the Function sources, and wait until it is ready
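
To make the idea concrete, below is a minimal sketch of what such a per-Function PV/PVC pair could look like. All names, the NFS Service address, and the storage size are assumptions for illustration, not the POC's actual definitions:

cat <<EOF | kubectl apply -f -
# hypothetical names and NFS address, for illustration only
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-function-sources
spec:
  capacity:
    storage: 10Mi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server.kyma-system.svc.cluster.local
    path: /exports/my-namespace/my-function
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-function-sources
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: my-function-sources
  resources:
    requests:
      storage: 10Mi
EOF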

Implementation

To prove the idea we decided to create our own simpler, not full-featured controller, to speed up the whole process a bit. The implementation is split into two folders:

  • components/buildless-serverless - controller code with all types
  • config/buildless-serverless - minimalistic chart with all needed resources

We decided to support only the happy path, without:

  • other runtimes than nodejs20
  • deletion process
  • git-functions
  • scaling
  • custom envs
  • resource configuration
  • custom labels
  • custom annotations
  • runtimeImageOverride

To test the POC, run the following script in a checked-out kyma-project/serverless repo:

docker build -t buildless-serverless:alpha1 -f components/buildless-serverless/Dockerfile .
k3d cluster create
k3d image import buildless-serverless:alpha1
kubectl create ns kyma-system
helm template --namespace kyma-system config/buildless-serverless | kubectl apply -f -
sleep 5
kubectl wait --for condition=ready pod -l app=buildless-serverless -n kyma-system --timeout 3m

Now we can create a Function and port-forward to it:

cat <<EOF | kubectl apply -f -
apiVersion: serverless.kyma-project.io/v1alpha2
kind: Function
metadata:
  name: test-function-nodejs-cjs
spec:
  runtime: nodejs20
  source:
    inline:
      dependencies: |
        {
          "name": "test-function-nodejs",
          "version": "1.0.0",
          "dependencies": {
            "lodash":"^4.17.20"
          }
        }
      source: |
        const _ = require('lodash')
        module.exports = {
          main: function(event, context) {
            return _.kebabCase('Hello World from Node.js 20 Function');
          }
        }
EOF
sleep 5
kubectl wait --for condition=ready pod -l function=test-function-nodejs-cjs --timeout 3m
kubectl port-forward svc/test-function-nodejs-cjs 8080:80

Now we can call the Function in a second terminal:

curl localhost:8080

result:

hello-world-from-node-js-20-function

Conclusions

  • It's possible to skip the build.

  • For the needs of this POC we did not implement all reconciliation operations and do not support all fields and use cases:

    • We can keep all Function features except those concerning build configuration (like build job resources, labels, ...)
    • We can still support Git-sourced Functions or even runtimeImageOverride without any changes on the users' side
    • We can keep almost all Serverless API features - we only have to drop the build-related ones, like build resources and labels
  • We have to decide if we want to refactor the function-controller or create a new one:

    • If we decide to keep the old Function API (CRD), which is possible, we won't need any API migration, but we still have to think about existing Functions and how to migrate them to the new solution
  • For the POC we used the NFS server image from cpuguy83, but in my opinion we have to create our own and maintain it:

    • Establish the details of the NFS PVC and the per-Function PV/PVC definitions (storage size, ...)
  • We can use unchanged runtime base images and run them in Pods with the Function's source code and dependencies mounted:

    • To simplify the process we decided to run npm install (fetch and install all dependencies) at runtime in every Function Pod. This is good enough for POC purposes, but it's not production-ready
    • We should hot-reload when the source code changes
    • We should hot-reload when dependencies change

@ptesny

ptesny commented Sep 23, 2024

We decided to support happy-path, without:

other runtimes than nodejs20
deletion process
git-functions
scaling
custom envs
resource configuration
custom labels
custom annotations
runtimeImageOverride

Given that you have removed all the real features that make Serverless what it is, what was the purpose of this POC?

@pPrecel
Contributor

pPrecel commented Sep 24, 2024

We decided to support happy-path, without:
other runtimes than nodejs20
deletion process
git-functions
scaling
custom envs
resource configuration
custom labels
custom annotations
runtimeImageOverride

Given that you have removed all the real features that make Serverless what it is, what was the purpose of this POC?

We have not removed them; we just do not support them in the POC implementation. That does not mean this POC intends to get rid of these features. We decided to prove that it's possible to skip the build phase, and we need the PVC to keep all features. It's described in the Conclusions in the last section:

- We can keep all Function features except those concerning build configuration (like build job resources, labels, ...)

@pPrecel
Contributor

pPrecel commented Sep 24, 2024

POC v2 results

Introduction

After presenting POC v1 we got tons of feedback, and together with the architects we decided to simplify the whole concept. The general feedback:

  • We should not implement an NFS server (we should think about a solution without any PVC)
  • We should get rid of custom dependencies in the Function CR
  • We should offer a few variants of our base runtimes with built-in dependencies, focused on more specific use cases. For example:
    • nodejs20 - simple and small nodejs
    • nodejs20-hanadb - nodejs with hanadb client
    • python13 - simple and small python13
    • python13-hanadb - python with hanadb client
    • ...

Result

The new serverless architecture is now much, much simpler compared to the current implementation and to POC v1:

[diagram: buildless-serverless-v2]

No NFS server

The first huge change is the removal of the NFS server. This change lets us run Functions much faster and react immediately to code changes, but the disadvantage is that there is no way to support user dependencies downloaded at runtime (it is possible in principle, but without an NFS server or a PVC for every Function, there is no way to synchronize dependencies between a Function's Pods and keep them for the next Function versions).

No PVC

The second change is that we decided to stop storing Function code and dependencies in one dedicated storage shared between a Function's Pods and its versions. This follows from the NFS server removal.

No dependencies

The first and second changes force us to stop letting users pass custom dependencies in the Function spec, because in POC v2, without an NFS server and a common PVC, there is no way to update a Function without changing the Pod's definition - for example, when a user updates the code but the dependencies stay the same.

More runtimes!

In the final shape of the POC we can't support user dependencies, so we have to prepare more runtimes with built-in dependencies. For example, it could look like this:

  • nodejs20 - simple and small nodejs
  • nodejs20-hanadb - nodejs with hanadb client
  • ...
  • nodejs20-all - nodejs containing all dependencies from other runtimes
  • python13 - simple and small python13
  • python13-hanadb - python with hanadb client
  • ...
  • python13-all - python containing all dependencies from other runtimes

We can still support fields like runtimeImageOverride, which lets users pass custom images with more specific dependencies. An example of a really simple Dockerfile that wraps a base runtime image with custom dependencies:

# import base nodejs20 image with all features
FROM europe-docker.pkg.dev/kyma-project/prod/function-runtime-nodejs20:2.0.0

# add custom dependencies
RUN npm install lodash
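
A hypothetical way to build and use such an image could look like this (the registry, image name, and tag are made up; runtimeImageOverride already exists in the v1alpha2 API):

docker build -t registry.example.com/nodejs20-custom:0.1 .
docker push registry.example.com/nodejs20-custom:0.1
cat <<EOF | kubectl apply -f -
apiVersion: serverless.kyma-project.io/v1alpha2
kind: Function
metadata:
  name: custom-runtime-function
spec:
  runtime: nodejs20
  runtimeImageOverride: registry.example.com/nodejs20-custom:0.1
  source:
    inline:
      source: |
        const _ = require('lodash')
        module.exports = {
          main: function(event, context) {
            return _.kebabCase('Hello from a custom runtime')
          }
        }
EOF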

Hot Deploy

In this case we can offer really simple and usable hot-reload functionality. If we offer Functions as simple code runners that are ready to go almost immediately after being applied to the cluster, then we can pass the code to the Pod as an argument in the command field. When the user changes the code, the Function's Pod definition changes, and Kubernetes immediately starts another Pod that is ready after a few seconds and removes the old one:

apiVersion: v1
kind: Pod
metadata:
  name: test-function
  namespace: default
spec:
  containers:
    - command: 
      - "bash"
      - "-c"
      - "echo <FUNCTION_CODE> > handler.js; npm run"
...

NOTE: This is only an example. We can pass the code in an env or think about another solution.

We also know that we could keep the code in a ConfigMap and mount it into every Function Pod, but this solution has a huge disadvantage: the time Kubernetes needs to propagate updates to files mounted in Pods grows from about 5 seconds to about 30 seconds. We aim for a solution that runs Functions in less than 5 seconds, so ConfigMaps are out.
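
For comparison, the rejected ConfigMap variant would look roughly like the sketch below (hypothetical names; the handler would be mounted into the Function Pod as a volume, and the ~30s kubelet sync delay applies to updates of that volume):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-function-sources
data:
  handler.js: |
    module.exports = {
      main: function(event, context) {
        return 'Hello World';
      }
    }
EOF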

Function v2

POC v1 aimed to be fully compatible with the current solution, so the resulting proposal does not require any changes in the Function API (CRD). POC v2 changes too many things and concepts, so we should prepare a totally new Function CR version, for example v2alpha1, containing only the necessary fields.

Unfortunately, migration from v1alpha2 to v2alpha1 will not be possible, so we should deprecate old Functions and help users migrate from the old to the new solution; we can't propose any automatic mechanism.

Conclusions

  • There will be no NFS server and no PVC per Function
    • We can't support user dependencies
    • We should prepare a few variants of our runtimes
    • We can support users in adding their own variants
  • We can provide hot-deploy functionality based on how every Pod behaves
  • We can't support the old Function v1alpha2 spec
    • We should propose a new v2alpha1 version
    • We can't migrate v1 to v2 automatically, but we can help users with the migration (create a tutorial or similar)

@pPrecel
Contributor

pPrecel commented Sep 24, 2024

Function v2alpha1

In this POC we mentioned that we should prototype and implement another version of our Function API. I would like to propose the very first shape of such an API. I prepared it based on the existing solution, but without the fields we can't support (build-related things like spec.resourceConfiguration.build, or old-fashioned things like spec.scaleConfig). The new Function may look like this:

apiVersion: serverless.kyma-project.io/v2alpha1
kind: Function
metadata:
  name: function
spec:
  replicas: 1
  runtime: nodejs20
  runtimeImageOverride: pprecel/nodejs20-custom:0.1
  source:
    inline: |- # or code:
      function main() {...
    # or
    # gitRepository: ...
  env:
    - name: MY_ENV
      value: MY_VALUE
  resourceConfiguration:
    profile: XL
    # or
    # requests...
    # limits...
  secretMounts:
  - secretName: my-secret
    mountPath: /tmp/secret
  labels:
    test-label: test-label-value
  annotations:
    test-annotation: test-annotation-value

New Horizon

These changes are the end of certain features, but the beginning of a new, better buildless-serverless life and a new horizon.

We can support users who have prepared custom runtime bases with additional functionality or dependencies and want to pull these runtime bases from private registries. Previously, we had to attach our ServiceAccount to every Function Pod, because that ServiceAccount allowed the Pod to pull images from the private in-cluster registry where we stored such images. Now we can let users set their own ServiceAccount for every Function, allowing them to pull images from private registries and, for example, to configure RBAC per Function.

For example:

A user may want to deploy a Function that has to update a ConfigMap or some other resource after receiving a request. The user can then define a custom ServiceAccount with a Role and RoleBinding that allow such operations on the cluster.

When a user wants to deploy a Function with a custom base runtime image located in a private registry, a ServiceAccount with imagePullSecrets configured does the job:

imagePullSecrets:
- name: private-registry-configuration
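
Put together, a minimal sketch (all names are hypothetical) of a ServiceAccount that a Function Pod could use to pull from a private registry:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-function-sa
  namespace: default
imagePullSecrets:
- name: private-registry-configuration
EOF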

New Challenges

Right now we keep a kind of hot potato in our serverless manager, called libgit2, which is really hard to maintain from our perspective. It gives us a lot of power, because we can easily handle connections with every kind of Git provider. The problem is that libgit2 faces memory-leak issues, described here, and I don't see a way to fix them, because git2go (the Go bindings for libgit2) is unmaintained (more info).

If we want to keep offering Git-sourced Functions, I would suggest thinking about alternatives. The catch is that the best alternative (from my perspective), go-git, expects the connection input data in a slightly different shape, as far as I know, so we would have to prototype and refactor our current API under the spec.source.gitRepository field.

New Possibilities

From our perspective, the current Serverless is a developer-only tool. That means it does not meet production-grade requirements like SAP SLC-29. The good news is that in our new architecture we may re-investigate this topic, and maybe we can meet such requirements with Git-sourced Functions.

This is only a quick thought and, as I said, it requires re-investigation in this area, but hope is our fuel.

@kwiatekus
Contributor Author

kwiatekus commented Sep 26, 2024

Having consulted @pbochynski and @varbanv, we decided that the possibility for users to define custom dependencies must be preserved.
The only thing that should be dropped is the kaniko build itself.

As a next step, please implement a prototype (v3) with the following assumptions:

  • Function code is injected into the runtime Pod without a build (consider injecting the handler code during container start or as a mounted ConfigMap).
  • Function dependencies are resolved at Function Pod start time (an additional npm install), for example:

    - command:
      - "bash"
      - "-c"
      - "echo <FUNCTION_CODE> > handler.js; npm install; npm run"

Please prove that the same assumptions apply to Python Functions as well.
Please measure the extra time added to the overall Function readiness time (caused by resolving dependencies at runtime). For this experiment, use the usual Function runtime presets (S, M, L, XL) to see how the extra time depends on the Function runtime Pod resources.

The possibility of skipping runtime dependency resolution in favor of using SAP-provided runtime images (with a set of commonly used libraries) will be considered in the next steps (as a path towards enabling SLC-29 compliance).

@pPrecel
Contributor

pPrecel commented Oct 1, 2024

POC v3 results

Introduction

As @kwiatekus mentioned, the goal of the next POC iteration is to prove that it's possible to implement a new serverless without the build phase and without the NFS server, but still compatible with the old functionality. The plan is to support the flow described in POC v2, but without dropping support for Function dependencies (in the Function CR spec).

Implementation

The POC implementation lands in this PR and can be run using these spells:

docker buildx build --platform linux/amd64 -t pprecel/buildless-serverless:alpha1 -f components/buildless-serverless/Dockerfile .
k3d cluster create
k3d image import pprecel/buildless-serverless:alpha1
kubectl create ns kyma-system
helm template --namespace kyma-system config/buildless-serverless | kubectl apply -f -
sleep 5
kubectl wait --for condition=ready pod -l app=buildless-serverless -n kyma-system --timeout 3m

Hot-Reload

In my implementation I decided to keep the sources (code and deps) inside the Pod definition, to perform the hot-reload for all of a Function's Pods at the same time. For my example Function with dependencies, the Pod definition looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: nodejs-deps-5b76877c74-jjkz6
  namespace: default
spec:
  containers:
  - command:
    - sh
    - -c
    - |2

      printf "${FUNC_HANDLER_SOURCE}" > handler.js;
      printf "${FUNC_HANDLER_DEPENDENCIES}" > package.json;
      npm install --prefer-offline --no-audit --progress=false;
      cd ..;
      npm start;
    env:
...
    - name: FUNC_HANDLER_SOURCE
      value: |
        const _ = require('lodash');
        var hana = require('@sap/hana-client');

        module.exports = {
          main: function(event, context) {
            return _.kebabCase('Hello World from Node.js 20 Function');
          }
        };
    - name: FUNC_HANDLER_DEPENDENCIES
      value: |
        {
          "name": "nodejs-deps",
          "version": "1.0.0",
          "dependencies": {
            "lodash":"^4.17.20",
            "@sap/xsenv":"5.3.0",
            "@sap/hana-client":"2.22.27"
          }
        }
    image: europe-docker.pkg.dev/kyma-project/prod/function-runtime-nodejs20:main
    imagePullPolicy: IfNotPresent
    name: function-container
    securityContext:
      capabilities:
        drop:
        - ALL
      privileged: false
      procMount: Default
      readOnlyRootFilesystem: true
    startupProbe:
      failureThreshold: 300
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      periodSeconds: 1
      successThreshold: 1
      timeoutSeconds: 1
...
    volumeMounts:
    - mountPath: /usr/src/app/function
      name: sources
    - mountPath: /.local
      name: local
    - mountPath: /tmp
      name: tmp-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-m8bkd
      readOnly: true
    workingDir: /usr/src/app/function
...
  securityContext:
    runAsGroup: 1001
    runAsUser: 1001
    seccompProfile:
      type: RuntimeDefault
...
  volumes:
  - emptyDir: {}
    name: sources
  - emptyDir: {}
    name: local
  - emptyDir:
      sizeLimit: 100M
    name: tmp-dir
...

As we can see in this example, the source code and dependencies are stored in the FUNC_HANDLER_SOURCE and FUNC_HANDLER_DEPENDENCIES envs, and every time the user changes the Function source, the controller updates the Deployment definition, which leads to new Pods starting and old ones terminating. If we reach the point where the user's code can be running within a few seconds, for example 3s, then we can say the user has to wait 3 seconds to hot-reload their code.
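
As a usage sketch (assuming the nodejs-deps example Function above and the function=&lt;name&gt; Pod label used in this POC), triggering and observing a hot-reload could look like this:

# update the Function source; the controller rolls out a new Pod
kubectl patch function nodejs-deps --type merge \
  -p '{"spec":{"source":{"inline":{"source":"module.exports = { main: () => \"hot reloaded\" }"}}}}'
# watch the new Pod replace the old one
kubectl get pods -l function=nodejs-deps -w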

Features

POC v3 is almost the same as v2, but we decided to check how resource- and time-consuming resolving dependencies at Function Pod startup (at runtime) would be. The second important difference is support for python3.12 Functions, to check whether there are any differences in behavior or further problems.

The third change is support for resource specification.

Measurements

This section describes how I tested the POC v3 implementation on a BTP cluster.

Functions

I decided to deploy four small Functions to check how fast they become fully operational (running):

  • nodejs20

    apiVersion: serverless.kyma-project.io/v1alpha2
    kind: Function
    metadata:
      name: nodejs
    spec:
      runtime: nodejs20
      source:
        inline:
          source: |
            module.exports = {
              main: function(event, context) {
                return 'Hello Serverless!';
              }
            }
  • nodejs20 with a few dependencies

    apiVersion: serverless.kyma-project.io/v1alpha2
    kind: Function
    metadata:
      name: nodejs-deps
    spec:
      runtime: nodejs20
      source:
        inline:
          dependencies: |
            {
              "name": "nodejs-deps",
              "version": "1.0.0",
              "dependencies": {
                "lodash":"^4.17.20",
                "@sap/xsenv":"5.3.0",
                "@sap/hana-client":"2.22.27"
              }
            }
          source: |
            const _ = require('lodash');
            var hana = require('@sap/hana-client');
    
            module.exports = {
              main: function(event, context) {
                return _.kebabCase('Hello World from Node.js 20 Function');
              }
            };
  • python3.12

    apiVersion: serverless.kyma-project.io/v1alpha2
    kind: Function
    metadata:
      name: python
    spec:
      runtime: python312
      source:
        inline:
          source: |
            def main(event, context):
                return "Hello Serverless!"
  • python3.12 with a few dependencies

    apiVersion: serverless.kyma-project.io/v1alpha2
    kind: Function
    metadata:
      name: python-deps
    spec:
      runtime: python312
      source:
        inline:
          dependencies: |
            requests==2.31.0
            camelcase==0.2
          source: |
            import requests
            import json
            from camelcase import CamelCase
            def main(event, context):
                r = requests.get('https://swapi.dev/api/people/13')
                c = CamelCase()
                return c.hump("response is:") + r.text

All Functions were deployed at once, with the resource configuration modified across the variants described below:

XS:
  requestCpu: "50m"
  requestMemory: "64Mi"
  limitCpu: "100m"
  limitMemory: "128Mi"
S:
  requestCpu: "100m"
  requestMemory: "128Mi"
  limitCpu: "200m"
  limitMemory: "256Mi"
M:
  requestCpu: "200m"
  requestMemory: "256Mi"
  limitCpu: "400m"
  limitMemory: "512Mi"
L:
  requestCpu: "400m"
  requestMemory: "512Mi"
  limitCpu: "800m"
  limitMemory: "1024Mi"
XL:
  requestCpu: "800m"
  requestMemory: "1024Mi"
  limitCpu: "1600m"
  limitMemory: "2048Mi"

Result

The measurement process was really simple and based on a startupProbe that looks like this:

startupProbe:
  failureThreshold: 300
  httpGet:
    path: /healthz
    port: 8080
    scheme: HTTP
  periodSeconds: 1
  successThreshold: 1
  timeoutSeconds: 1

This lets us use kubectl to check after what period of time the Function's Pod becomes ready, with an accuracy of 1 second. I know that Kubernetes needs some time to prepare resources for the container, pull images, and so on, so I decided to treat the time between Running 0/1 and Running 1/1 as the time needed for the Function to become operational.
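
A rough way to read that diff with kubectl (a sketch, assuming GNU date and the function=&lt;name&gt; Pod label; the timestamps have 1s granularity):

POD=$(kubectl get pod -l function=nodejs-deps -o name | head -n 1)
STARTED=$(kubectl get "$POD" -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}')
READY=$(kubectl get "$POD" -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}')
# the container start corresponds to Running 0/1, the Ready condition to Running 1/1
echo "ready after $(( $(date -d "$READY" +%s) - $(date -d "$STARTED" +%s) ))s"

The results of the test: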

                        XS    S     M     L    XL   No Resources
NodeJS 20               ~11s  ~5s   ~3s   ~2s  ~1s  ~2s
NodeJS 20 with deps     ~70s  ~29s  ~13s  ~7s  ~6s  ~5s
Python 3.12             ~10s  ~6s   ~3s   ~2s  ~1s  ~1s
Python 3.12 with deps   ~53s  ~33s  ~12s  ~7s  ~5s  ~5s

As we can see, running the package managers (npm, pip) adds time before the Function's code is fully running. But in my opinion this time is still satisfactory, since we are talking about seconds, not minutes. Of course, the dependencies I chose for the test are relatively light, and running Functions with, say, an AI toolset would require much more memory, CPU, and time, but that is expected and, in my opinion, not the use case for Kyma Functions.

Conclusions

  • We can merge POC v1 and v2 into v3 to collect all the pros and get rid of the disadvantages:
    • Skip the build
    • No NFS server
    • Support for old Functions (or easy migration to a new Function spec)
    • We can't support spec fields that configure the build phase
    • We can use the same runtime bases as we use right now
    • We can provide hot-deploy functionality based on built-in Kubernetes behavior
  • It is possible to support Python Functions without major changes:
    • We have to add a new emptyDir so that pip can download dependencies to the .local dir
    • We have to add a new emptyDir for the Function sources
    • We have to add a new emptyDir so that Functions can work with the /tmp dir
  • It is possible to resolve user dependencies at runtime:
    • It costs time, but not that much
    • The cost can be minimized by giving the Function more resources
    • It is still faster than building Function images and pushing them to the registry

@ptesny

ptesny commented Oct 1, 2024

Why k3d and not Kyma as a testbed?

@pPrecel
Contributor

pPrecel commented Oct 1, 2024

Why k3d and not Kyma as a testbed?

I only showed how to run the POC on k3d because it doesn't require any additional setup. Personally, I used BTP to run and test the POC implementation. To do this, switch to the sources from this PR, export the KUBECONFIG env with the path to your cluster's kubeconfig file, and run:

helm template --namespace default config/buildless-serverless | kubectl apply -f -
sleep 5
kubectl wait --for condition=ready pod -l app=buildless-serverless -n default --timeout 3m
