-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use controller-runtime to reconsturct spark operator #2072
Conversation
/assign @yuchaoran2011 |
Thanks @ChenYi015 for this major refactoring! One of the referenced issues was created by a former colleague of mine 5 years ago. Glad to see finally someone take a stab at it. Given the sheer size of the changes, more eyes are needed for the review. We can use help from fellow community members. Tagging some of them here @bnetzi @peter-mcclonski @jacobsalway but anyone is welcome to comment |
Happy to take a look but is there a natural way we could split this PR to make it easier to review? |
21cbf65
to
dc8cd4f
Compare
dc8cd4f
to
04e18a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the static code review, given the major refactoring, we need to ensure all tests pass. I am not certain about our current test coverage, but we should create a new image and test it locally if reviewers have the bandwidth. This would ensure it works with common examples without errors.
We could create a pre-release tagged as 2.0
to include other improvements as well. This will give users time to migrate to the newer version since it introduces breaking changes.
charts/spark-operator-chart/crds/sparkoperator.k8s.io_scheduledsparkapplications.yaml
Outdated
Show resolved
Hide resolved
charts/spark-operator-chart/crds/sparkoperator.k8s.io_scheduledsparkapplications.yaml
Outdated
Show resolved
Hide resolved
Not directly connected to this PR but it's probably a good idea to get some end to end tests running in CI since they're currently disabled |
Let me explain what I had did. First, I use kubebuilder to scaffold the project. mkdir spark-operator
cd spark-operator
kubebuilder init --domain sparkoperator.k8s.io --repo github.com/kubeflow/spark-operator Then, create new API version as follows: $ # Create v1beta1 version (create resource only)
$ kubebuilder create api --version v1beta1 --kind SparkApplication
INFO Create Resource [y/n] y
INFO Create Controller [y/n] n
$ kubebuilder create api --version v1beta1 --kind ScheduledSparkApplication
INFO Create Resource [y/n] y
INFO Create Controller [y/n] n
$ # Create v1beta2 version (create both resource and controller)
$ kubebuilder create api --version v1beta2 --kind SparkApplication
INFO Create Resource [y/n] y
INFO Create Controller [y/n] y
$ kubebuilder create api --version v1beta2 --kind ScheduledSparkApplication
INFO Create Resource [y/n] y
INFO Create Controller [y/n] y The structure of the project is like this: $ tree -L 1
.
├── Dockerfile
├── Makefile
├── PROJECT # Kubebuilder use PROJECT file to hold project matadata and scaffold the project.
├── README.md
├── api
├── bin
├── cmd
├── config
├── go.mod
├── go.sum
├── hack
├── internal
└── test Then I move the orignal API definitions from |
We will define controllers for
For every controller, we need to define event filter (predicate), event handler and reconciler, for detailed information, refer Architecture Concept Diagram. We will define webhooks for internal/webhook
├── doc.go
├── scheduledsparkapplication_defaulter.go
├── scheduledsparkapplication_validator.go
├── sparkapplication_defaulter.go
├── sparkapplication_validator.go
├── sparkpod_defaulter.go
├── sparkpod_defaulter_test.go
├── suite_test.go
└── webhook.go
|
I had did some tests as follows (go, docker, make are needed): # Build operator image, create a kind cluster and load image to the kind cluster
make docker-build kind-create-cluster kind-load-image Install the chart: helm install spark-operator charts/spark-operator-chart \
--namespace spark-operator \
--create-namespace Submit an example spark-pi sparkapplication and wait it to be completed: kubectl apply -f examples/spark-pi.yaml Check out controller logs:
Check out webhook logs:
Finally, we can use |
@vara-bonthu Now we can use $ make unit-test
Running unit tests...
github.com/kubeflow/spark-operator/api/v1beta1 coverage: 0.0% of statements
? github.com/kubeflow/spark-operator/hack/api-docs/template [no test files]
github.com/kubeflow/spark-operator/cmd/operator coverage: 0.0% of statements
github.com/kubeflow/spark-operator/cmd coverage: 0.0% of statements
github.com/kubeflow/spark-operator/cmd/operator/controller coverage: 0.0% of statements
github.com/kubeflow/spark-operator/internal/batchscheduler coverage: 0.0% of statements
github.com/kubeflow/spark-operator/cmd/operator/webhook coverage: 0.0% of statements
ok github.com/kubeflow/spark-operator/api/v1beta2 0.515s coverage: 5.0% of statements
? github.com/kubeflow/spark-operator/internal/controller [no test files]
github.com/kubeflow/spark-operator/internal/controller/mutatingwebhookconfiguration coverage: 0.0% of statements
github.com/kubeflow/spark-operator/internal/controller/validatingwebhookconfiguration coverage: 0.0% of statements
ok github.com/kubeflow/spark-operator/internal/batchscheduler/volcano 0.227s coverage: 12.7% of statements
github.com/kubeflow/spark-operator/pkg/client/clientset/versioned/typed/sparkoperator.k8s.io/v1beta2/fake coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/clientset/versioned coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/clientset/versioned/typed/sparkoperator.k8s.io/v1beta1/fake coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/clientset/versioned/typed/sparkoperator.k8s.io/v1beta2 coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/clientset/versioned/typed/sparkoperator.k8s.io/v1beta1 coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/clientset/versioned/fake coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/clientset/versioned/scheme coverage: 0.0% of statements
? github.com/kubeflow/spark-operator/pkg/client/informers/externalversions/internalinterfaces [no test files]
github.com/kubeflow/spark-operator/pkg/client/informers/externalversions coverage: 0.0% of statements
? github.com/kubeflow/spark-operator/pkg/common [no test files]
github.com/kubeflow/spark-operator/pkg/client/informers/externalversions/sparkoperator.k8s.io/v1beta2 coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/informers/externalversions/sparkoperator.k8s.io/v1beta1 coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/listers/sparkoperator.k8s.io/v1beta1 coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/informers/externalversions/sparkoperator.k8s.io coverage: 0.0% of statements
github.com/kubeflow/spark-operator/pkg/client/listers/sparkoperator.k8s.io/v1beta2 coverage: 0.0% of statements
github.com/kubeflow/spark-operator/sparkctl coverage: 0.0% of statements
ok github.com/kubeflow/spark-operator/internal/controller/scheduledsparkapplication 7.001s coverage: 3.7% of statements
ok github.com/kubeflow/spark-operator/internal/controller/sparkapplication 6.935s coverage: 5.1% of statements
ok github.com/kubeflow/spark-operator/internal/webhook 0.631s coverage: 63.2% of statements
ok github.com/kubeflow/spark-operator/internal/webhook/resourceusage 0.370s coverage: 1.0% of statements
ok github.com/kubeflow/spark-operator/pkg/cert 6.205s coverage: 12.9% of statements
ok github.com/kubeflow/spark-operator/pkg/util 0.741s coverage: 18.9% of statements
ok github.com/kubeflow/spark-operator/sparkctl/cmd 0.416s coverage: 5.2% of statements To test the operator and helm chart with kind cluster, just do as follows:
kind binary will be downloaded automatically to |
/hold for review |
I second @vara-bonthu's suggestion here. Given that this PR has completely reworked the implementation, it makes sense to mark it as the start of v2.0. And we can create a new branch for it and merge it there |
@yuchaoran2011 Maybe we can make a new branch named |
@yuchaoran2011 I had did some benchmarks to compare start latency (created state -> submitted state) between helm chart v1.4.5 and this one. Install the operator with 30 worker threads and webhook enabled, using 20 cpu cores and 16 Gi memory. Then submit 1000 spark-pi application concurrently. For chart v1.4.5, the average start latency is 3m53s, and the controller-runtime version is 3m35s, just a little bit faster. Maybe we can achieve better performance by tuning controller-runtime related parameters. |
6aa3fae
to
58c29f7
Compare
Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: Yi Chen <[email protected]>
58c29f7
to
1d4d4bf
Compare
/hold cancel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that #2089 is merged, this one is good to merge as well. Excited to see we officially kick off 2.0!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: yuchaoran2011 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* Use controller-runtime to reconstruct spark operator Signed-off-by: Yi Chen <[email protected]> * Update helm charts Signed-off-by: Yi Chen <[email protected]> * Update examples Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> (cherry picked from commit 0dc641b)
* Update helm docs (#2081) Signed-off-by: Carlos Sánchez Páez <[email protected]> (cherry picked from commit eca3fc8) * Update the process to build api-docs, generate CRD manifests and code (#2046) * Update .gitignore Signed-off-by: Yi Chen <[email protected]> * Update .dockerignore Signed-off-by: Yi Chen <[email protected]> * Update Makefile Signed-off-by: Yi Chen <[email protected]> * Update the process to generate api docs Signed-off-by: Yi Chen <[email protected]> * Update the workflow to generate api docs Signed-off-by: Yi Chen <[email protected]> * Use controller-gen to generate CRD and deep copy related methods Signed-off-by: Yi Chen <[email protected]> * Update helm chart CRDs Signed-off-by: Yi Chen <[email protected]> * Update workflow for building spark operator Signed-off-by: Yi Chen <[email protected]> * Update README.md Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> (cherry picked from commit 779ea3d) * Add topologySpreadConstraints (#2091) * Update README and documentation (#2047) * Update docs Signed-off-by: Yi Chen <[email protected]> * Remove docs and update README Signed-off-by: Yi Chen <[email protected]> * Add link to monthly community meeting Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> Signed-off-by: jbhalodia-slack <[email protected]> * Add PodDisruptionBudget to chart (#2078) * Add PodDisruptionBudget to chart Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> * PR comments Signed-off-by: Carlos Sánchez Páez <[email protected]> --------- Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: jbhalodia-slack <[email protected]> * Set topologySpreadConstraints Signed-off-by: jbhalodia-slack <[email protected]> * Update README and increase patch version Signed-off-by: jbhalodia-slack <[email protected]> * Revert replicaCount change Signed-off-by: jbhalodia-slack <[email protected]> * Update README after master merger Signed-off-by: jbhalodia-slack <[email protected]> * Update README Signed-off-by: jbhalodia-slack <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> Signed-off-by: jbhalodia-slack <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> Co-authored-by: Yi Chen <[email protected]> Co-authored-by: Carlos Sánchez Páez <[email protected]> (cherry picked from commit 4108f54) * Use controller-runtime to reconsturct spark operator (#2072) * Use controller-runtime to reconstruct spark operator Signed-off-by: Yi Chen <[email protected]> * Update helm charts Signed-off-by: Yi Chen <[email protected]> * Update examples Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> (cherry picked from commit 0dc641b) --------- Co-authored-by: Carlos Sánchez Páez <[email protected]> Co-authored-by: jbhalodia-slack <[email protected]>
* Use controller-runtime to reconstruct spark operator Signed-off-by: Yi Chen <[email protected]> * Update helm charts Signed-off-by: Yi Chen <[email protected]> * Update examples Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]>
* Use controller-runtime to reconstruct spark operator Signed-off-by: Yi Chen <[email protected]> * Update helm charts Signed-off-by: Yi Chen <[email protected]> * Update examples Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]>
@ChenYi015 Hello! |
@Viktor3434 For now, livenessProbe and readinessProbe are enabled by default and cannot be configured via helm values. I am wondering why do you want it to be configurable? |
@ChenYi015 Thanks, can you tell me if probes have been added for the SparkApplication resource? |
@Viktor3434 do you mean probes on the Spark driver and executor pods? If so, can you go into more detail on the reason for wanting to? |
@jacobsalway Hi! sorry for the long wait for a reply We have a SparkApplication that reads data from kafka and writes it to greenplum. |
@Viktor3434 no problem. Spark itself doesn't have a healthcheck port that you could configure a probe for. I would suggest you scrape metrics from the driver and set up separate alerting and monitoring based on streaming progress, Kafka consumer group lag or other metrics that could give you a proxy on whether the job is progressing or stuck. If you'd like, you can start a thread in the community Slack and tag me so we can discuss in more detail. |
…ubeflow#2108) * Update helm docs (kubeflow#2081) Signed-off-by: Carlos Sánchez Páez <[email protected]> (cherry picked from commit eca3fc8) * Update the process to build api-docs, generate CRD manifests and code (kubeflow#2046) * Update .gitignore Signed-off-by: Yi Chen <[email protected]> * Update .dockerignore Signed-off-by: Yi Chen <[email protected]> * Update Makefile Signed-off-by: Yi Chen <[email protected]> * Update the process to generate api docs Signed-off-by: Yi Chen <[email protected]> * Update the workflow to generate api docs Signed-off-by: Yi Chen <[email protected]> * Use controller-gen to generate CRD and deep copy related methods Signed-off-by: Yi Chen <[email protected]> * Update helm chart CRDs Signed-off-by: Yi Chen <[email protected]> * Update workflow for building spark operator Signed-off-by: Yi Chen <[email protected]> * Update README.md Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> (cherry picked from commit 779ea3d) * Add topologySpreadConstraints (kubeflow#2091) * Update README and documentation (kubeflow#2047) * Update docs Signed-off-by: Yi Chen <[email protected]> * Remove docs and update README Signed-off-by: Yi Chen <[email protected]> * Add link to monthly community meeting Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> Signed-off-by: jbhalodia-slack <[email protected]> * Add PodDisruptionBudget to chart (kubeflow#2078) * Add PodDisruptionBudget to chart Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> * PR comments Signed-off-by: Carlos Sánchez Páez <[email protected]> --------- Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: jbhalodia-slack <[email protected]> * Set topologySpreadConstraints Signed-off-by: jbhalodia-slack <[email protected]> * Update README and increase patch version Signed-off-by: jbhalodia-slack <[email protected]> * Revert replicaCount change Signed-off-by: jbhalodia-slack <[email protected]> * Update README after master merger Signed-off-by: jbhalodia-slack <[email protected]> * Update README Signed-off-by: jbhalodia-slack <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> Signed-off-by: jbhalodia-slack <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> Signed-off-by: Carlos Sánchez Páez <[email protected]> Co-authored-by: Yi Chen <[email protected]> Co-authored-by: Carlos Sánchez Páez <[email protected]> (cherry picked from commit 4108f54) * Use controller-runtime to reconsturct spark operator (kubeflow#2072) * Use controller-runtime to reconstruct spark operator Signed-off-by: Yi Chen <[email protected]> * Update helm charts Signed-off-by: Yi Chen <[email protected]> * Update examples Signed-off-by: Yi Chen <[email protected]> --------- Signed-off-by: Yi Chen <[email protected]> (cherry picked from commit 0dc641b) --------- Co-authored-by: Carlos Sánchez Páez <[email protected]> Co-authored-by: jbhalodia-slack <[email protected]>
Purpose of this PR
Use controller-runtime to reconsturct spark operator.
Close #547
Close #507
Close #2052
Proposed changes
Spark Operator Executable
Split the operator into two separate components i.e. controller and webhook.
Use cobra to implement
spark-operator
command, and it has two sub commandscontroller
andwebhook
, one can start controller/webhook byspark-operator controller start
andspark-operator webhook start
repectively.SparkApplication/ScheduledSparkController controller only watches resources in the spark job namespaces.
Add finalizer for SparkApplication to ensure all sub resources e.g. pods, services, ingress related to that app will be deleted when deleting the SparkApplication.
Helm Chart
imagePullSecrets
toimage.pullSecrets
.controller
e.g.controller.replicaCount
andcontroller.resyncInterval
webhook
e.g.webhook.replicaCount
and `webhook.failurePolicy.promethues
e.g.promethues.metrics
andpromethues.podMonitor
.timeoutSeconds
from30
to10
.failurePolicy
fromIgnore
toFail
.[]
to["default]
, thus the SparkApplication underexamples
directory can be running directly without creating rbac resources manually.controller.serviceAccount
,webhook.serviceAccount
andspark.serviceAccount
respectively.controller.rbac
,webhook.rbac
andspark.rbac
respectively.info
,debug
anderror
.Change Category
Indicate the type of change by marking the applicable boxes:
Rationale
Checklist
Before submitting your PR, please review the following:
Additional Notes