Update README and documentation (#2047)
* Update docs

Signed-off-by: Yi Chen <[email protected]>

* Remove docs and update README

Signed-off-by: Yi Chen <[email protected]>

* Add link to monthly community meeting

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
ChenYi015 authored Jun 27, 2024
1 parent 4bb4b5c commit 16cd35a
Showing 10 changed files with 71 additions and 1,690 deletions.
70 changes: 36 additions & 34 deletions docs/who-is-using.md → ADOPTERS.md
@@ -1,48 +1,50 @@
## Who Is Using the Kubernetes Operator for Apache Spark?
# Adopters of Kubeflow Spark Operator

Below are the adopters of the Spark Operator project. If you are using the Spark Operator, please add yourself to the list via a pull request, and keep the list in alphabetical order.

| Organization | Contact (GitHub User Name) | Environment | Description of Use |
| ------------- | ------------- | ------------- | ------------- |
| [Caicloud](https://intl.caicloud.io/) | @gaocegege | Production | Cloud-Native AI Platform |
| Microsoft (MileIQ) | @dharmeshkakadia | Production | AI & Analytics |
| Lightbend | @yuchaoran2011 | Production | Data Infrastructure & Operations |
| StackTome | @emiliauskas-fuzzy | Production | Data pipelines |
| Salesforce | @khogeland | Production | Data transformation |
| [Beeline](https://beeline.ru) | @spestua | Evaluation | ML & Data Infrastructure |
| Bringg | @EladDolev | Production | ML & Analytics Data Platform |
| [Siigo](https://www.siigo.com) | @Juandavi1 | Production | Data Migrations & Analytics Data Platform |
| [Caicloud](https://intl.caicloud.io/) | @gaocegege | Production | Cloud-Native AI Platform |
| Carrefour | @AliGouta | Production | Data Platform |
| CERN|@mrow4a| Evaluation | Data Mining & Analytics |
| Lyft |@kumare3| Evaluation | ML & Data Infrastructure |
| MapR Technologies |@sarjeet2013| Evaluation | ML/AI & Analytics Data Platform |
| Uber| @chenqin| Evaluation| Spark / ML |
| HashmapInc| @prem0132 | Evaluation | Analytics Data Platform |
| Tencent | @runzhliu | Evaluation | ML Analytics Platform |
| Exacaster | @minutis | Evaluation | Data pipelines |
| Riskified | @henbh | Evaluation | Analytics Data Platform |
| [CloudPhysics](https://www.cloudphysics.com) | @jkleckner | Production | ML/AI & Analytics |
| CloudZone | @iftachsc | Evaluation | Big Data Analytics Consultancy |
| Cyren | @avnerl | Evaluation | Data pipelines |
| Shell (Agile Hub) | @TomLous | Production | Data pipelines |
| Nielsen Identity Engine | @roitvt | Evaluation | Data pipelines |
| [C2FO](https://www.c2fo.com/) | @vanhoale | Production | Data Platform / Data Infrastructure |
| [Data Mechanics](https://www.datamechanics.co) | @jrj-d | Production | Managed Spark Platform |
| [PUBG](https://careers.pubg.com/#/en/) | @jacobhjkim | Production | ML & Data Infrastructure |
| [Beeline](https://beeline.ru) | @spestua | Evaluation | ML & Data Infrastructure |
| [Stitch Fix](https://multithreaded.stitchfix.com/) | @nssalian | Evaluation | Data pipelines |
| [Typeform](https://typeform.com/) | @afranzi | Production | Data & ML pipelines |
| incrmntal(https://incrmntal.com/) | @scravy | Production | ML & Data Infrastructure |
| [CloudPhysics](https://www.cloudphysics.com) | @jkleckner | Production | ML/AI & Analytics |
| [MongoDB](https://www.mongodb.com) | @chickenpopcorn | Production | Data Infrastructure |
| [MavenCode](https://www.mavencode.com) | @charlesa101 | Production | MLOps & Data Infrastructure |
| [Gojek](https://www.gojek.io/) | @pradithya | Production | Machine Learning Platform |
| Fossil | @duyet | Production | Data Platform |
| Carrefour | @AliGouta | Production | Data Platform |
| Scaling Smart | @tarek-izemrane | Evaluation | Data Platform |
| [Tongdun](https://www.tongdun.net/) | @lomoJG | Production | AI/ML & Analytics |
| [Totvs Labs](https://www.totvslabs.com) | @luizm | Production | Data Platform |
| [DiDi](https://www.didiglobal.com) | @Run-Lin | Evaluation | Data Infrastructure |
| [DeepCure](https://www.deepcure.ai) | @mschroering | Production | Spark / ML |
| [C2FO](https://www.c2fo.com/) | @vanhoale | Production | Data Platform / Data Infrastructure |
| [Timo](https://timo.vn) | @vanducng | Production | Data Platform |
| [DiDi](https://www.didiglobal.com) | @Run-Lin | Evaluation | Data Infrastructure |
| Exacaster | @minutis | Evaluation | Data pipelines |
| Fossil | @duyet | Production | Data Platform |
| [Gojek](https://www.gojek.io/) | @pradithya | Production | Machine Learning Platform |
| HashmapInc | @prem0132 | Evaluation | Analytics Data Platform |
| [incrmntal](https://incrmntal.com/) | @scravy | Production | ML & Data Infrastructure |
| [Inter&Co](https://inter.co/) | @ignitz | Production | Data pipelines |
| [Kognita](https://kognita.com.br/) | @andreclaudino | Production | MLOps, Data Platform / Data Infrastructure, ML/AI |
| Lightbend | @yuchaoran2011 | Production | Data Infrastructure & Operations |
| Lyft | @kumare3 | Evaluation | ML & Data Infrastructure |
| MapR Technologies | @sarjeet2013 | Evaluation | ML/AI & Analytics Data Platform |
| [MavenCode](https://www.mavencode.com) | @charlesa101 | Production | MLOps & Data Infrastructure |
| Microsoft (MileIQ) | @dharmeshkakadia | Production | AI & Analytics |
| [Molex](https://www.molex.com/) | @AshishPushpSingh | Evaluation/Production | Data Platform |
| [MongoDB](https://www.mongodb.com) | @chickenpopcorn | Production | Data Infrastructure |
| Nielsen Identity Engine | @roitvt | Evaluation | Data pipelines |
| [PUBG](https://careers.pubg.com/#/en/) | @jacobhjkim | Production | ML & Data Infrastructure |
| [Qualytics](https://www.qualytics.co/) | @josecsotomorales | Production | Data Quality Platform |
| Riskified | @henbh | Evaluation | Analytics Data Platform |
| [Roblox](https://www.roblox.com/) | @matschaffer-roblox | Evaluation | Data Infrastructure |
| [Rokt](https://www.rokt.com) | @jacobsalway | Production | Data Infrastructure |
| [Inter&Co](https://inter.co/) | @ignitz | Production | Data pipelines |
| Salesforce | @khogeland | Production | Data transformation |
| Scaling Smart | @tarek-izemrane | Evaluation | Data Platform |
| Shell (Agile Hub) | @TomLous | Production | Data pipelines |
| [Siigo](https://www.siigo.com) | @Juandavi1 | Production | Data Migrations & Analytics Data Platform |
| StackTome | @emiliauskas-fuzzy | Production | Data pipelines |
| [Stitch Fix](https://multithreaded.stitchfix.com/) | @nssalian | Evaluation | Data pipelines |
| Tencent | @runzhliu | Evaluation | ML Analytics Platform |
| [Timo](https://timo.vn) | @vanducng | Production | Data Platform |
| [Tongdun](https://www.tongdun.net/) | @lomoJG | Production | AI/ML & Analytics |
| [Totvs Labs](https://www.totvslabs.com) | @luizm | Production | Data Platform |
| [Typeform](https://typeform.com/) | @afranzi | Production | Data & ML pipelines |
| Uber | @chenqin | Evaluation | Spark / ML |
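Since the adopters table must stay alphabetical, ordering can be sanity-checked with `sort -c` before sending a pull request. A minimal sketch with sample rows; in practice you would pipe the actual table body from ADOPTERS.md (the `tail -n +6` offset below is an assumption about where the rows start):

```shell
# Case-insensitively verify that table rows are already in sorted order.
# For the real file, something like: tail -n +6 ADOPTERS.md | sort -cf
printf '%s\n' \
  '| Beeline | @spestua | Evaluation | ML & Data Infrastructure |' \
  '| Bringg | @EladDolev | Production | ML & Analytics Data Platform |' \
  '| CERN | @mrow4a | Evaluation | Data Mining & Analytics |' \
  | sort -cf && echo "rows are sorted"
```

`sort -c` exits non-zero at the first out-of-order line, so the same one-liner can serve as a lightweight CI check.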
81 changes: 35 additions & 46 deletions README.md
@@ -1,10 +1,15 @@
# Kubeflow Spark Operator

[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/spark-operator)](https://goreportcard.com/report/github.com/kubeflow/spark-operator)

## Overview
## What is Spark Operator?

The Kubernetes Operator for Apache Spark aims to make specifying and running [Spark](https://github.com/apache/spark) applications as easy and idiomatic as running other workloads on Kubernetes. It uses
[Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
for specifying, running, and surfacing status of Spark applications. For a complete reference of the custom resource definitions, please refer to the [API Definition](docs/api-docs.md). For details on its design, please refer to the [design doc](docs/design.md). It requires Spark 2.3 and above that supports Kubernetes as a native scheduler backend.
[Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) for specifying, running, and surfacing status of Spark applications.

## Overview

For a complete reference of the custom resource definitions, please refer to the [API Definition](docs/api-docs.md). For details on its design, please refer to the [Architecture](https://www.kubeflow.org/docs/components/spark-operator/overview/#architecture). It requires Spark 2.3 or above, which supports Kubernetes as a native scheduler backend.

The Kubernetes Operator for Apache Spark currently supports the following list of features:

@@ -28,69 +33,53 @@ The Kubernetes Operator for Apache Spark currently supports the following list o

**If you are currently using the `v1beta1` version of the APIs in your manifests, please update them to use the `v1beta2` version by changing `apiVersion: "sparkoperator.k8s.io/<version>"` to `apiVersion: "sparkoperator.k8s.io/v1beta2"`. You will also need to delete the `previous` version of the CustomResourceDefinitions named `sparkapplications.sparkoperator.k8s.io` and `scheduledsparkapplications.sparkoperator.k8s.io`, and replace them with the `v1beta2` version either by installing the latest version of the operator or by running `kubectl create -f manifest/crds`.**
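The manifest update described above can be scripted. A minimal sketch, assuming a hypothetical manifest at `/tmp/spark-pi.yaml`; the `kubectl` steps are shown as comments since they require cluster access:

```shell
# Hypothetical manifest for illustration; point this at your real files.
cat > /tmp/spark-pi.yaml <<'EOF'
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
EOF

# Rewrite any older sparkoperator.k8s.io API version to v1beta2.
sed 's|sparkoperator.k8s.io/v1beta[0-9]*|sparkoperator.k8s.io/v1beta2|' /tmp/spark-pi.yaml
# apiVersion now reads "sparkoperator.k8s.io/v1beta2"

# After updating the manifests, replace the old CRDs (needs a cluster):
#   kubectl delete crd sparkapplications.sparkoperator.k8s.io
#   kubectl delete crd scheduledsparkapplications.sparkoperator.k8s.io
#   kubectl create -f manifest/crds
```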

Customization of Spark pods, e.g., mounting arbitrary volumes and setting pod affinity, is implemented using a Kubernetes [Mutating Admission Webhook](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/), which became beta in Kubernetes 1.9. The mutating admission webhook is disabled by default if you install the operator using the Helm [chart](charts/spark-operator-chart). Check out the [Quick Start Guide](docs/quick-start-guide.md#using-the-mutating-admission-webhook) on how to enable the webhook.

## Prerequisites

* Version >= 1.13 of Kubernetes to use the [`subresource` support for CustomResourceDefinitions](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#subresources), which became beta in 1.13 and is enabled by default in 1.13 and higher.

* Version >= 1.16 of Kubernetes to use the `MutatingWebhook` and `ValidatingWebhook` of `apiVersion: admissionregistration.k8s.io/v1`.

## Installation
## Getting Started

The easiest way to install the Kubernetes Operator for Apache Spark is to use the Helm [chart](charts/spark-operator-chart/).
To get started with the Spark operator, please refer to [Getting Started](https://www.kubeflow.org/docs/components/spark-operator/getting-started/).

```bash
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator

$ helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace
```

## User Guide

For the detailed user guide and API documentation, please refer to the [User Guide](https://www.kubeflow.org/docs/components/spark-operator/user-guide/) and the [API Specification](docs/api-docs.md).

This will install the Kubernetes Operator for Apache Spark into the namespace `spark-operator`. By default, the operator watches and handles `SparkApplication`s in every namespace. If you would like to limit the operator to watching and handling `SparkApplication`s in a single namespace, e.g., `default`, add the following option to the `helm install` command:

```
--set "sparkJobNamespaces={default}"
```

For configuration options available in the Helm chart, please refer to the chart's [README](charts/spark-operator-chart/README.md).
If you are running the Spark operator on Google Kubernetes Engine (GKE) and want to use Google Cloud Storage (GCS) and/or BigQuery for reading/writing data, also refer to the [GCP guide](https://www.kubeflow.org/docs/components/spark-operator/user-guide/gcp/).

## Version Matrix

The following table lists the most recent few versions of the operator.

| Operator Version | API Version | Kubernetes Version | Base Spark Version | Operator Image Tag |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| `latest` (master HEAD) | `v1beta2` | 1.13+ | `3.0.0` | `latest` |
| `v1beta2-1.3.3-3.1.1` | `v1beta2` | 1.16+ | `3.1.1` | `v1beta2-1.3.3-3.1.1` |
| `v1beta2-1.3.2-3.1.1` | `v1beta2` | 1.16+ | `3.1.1` | `v1beta2-1.3.2-3.1.1` |
| `v1beta2-1.3.0-3.1.1` | `v1beta2` | 1.16+ | `3.1.1` | `v1beta2-1.3.0-3.1.1` |
| `v1beta2-1.2.3-3.1.1` | `v1beta2` | 1.13+ | `3.1.1` | `v1beta2-1.2.3-3.1.1` |
| `v1beta2-1.2.0-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` | `v1beta2-1.2.0-3.0.0` |
| `v1beta2-1.1.2-2.4.5` | `v1beta2` | 1.13+ | `2.4.5` | `v1beta2-1.1.2-2.4.5` |
| `v1beta2-1.0.1-2.4.4` | `v1beta2` | 1.13+ | `2.4.4` | `v1beta2-1.0.1-2.4.4` |
| `v1beta2-1.0.0-2.4.4` | `v1beta2` | 1.13+ | `2.4.4` | `v1beta2-1.0.0-2.4.4` |
| `v1beta1-0.9.0` | `v1beta1` | 1.13+ | `2.4.0` | `v2.4.0-v1beta1-0.9.0` |

When installing using the Helm chart, you can choose to use a specific image tag instead of the default one, using the following option:
| Operator Version | API Version | Kubernetes Version | Base Spark Version |
| ------------- | ------------- | ------------- | ------------- |
| `v1beta2-1.6.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.5.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.4.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.3.x-3.1.1` | `v1beta2` | 1.16+ | `3.1.1` |
| `v1beta2-1.2.3-3.1.1` | `v1beta2` | 1.13+ | `3.1.1` |
| `v1beta2-1.2.2-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.2.1-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.2.0-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.1.x-2.4.5` | `v1beta2` | 1.13+ | `2.4.5` |
| `v1beta2-1.0.x-2.4.4` | `v1beta2` | 1.13+ | `2.4.4` |
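Each release tag in the matrix encodes three coordinates, `<API version>-<operator version>-<base Spark version>`. A small shell sketch splitting a tag into its parts; the concrete `1.6.1` patch number is hypothetical, standing in for a release on the `1.6.x` line:

```shell
tag="v1beta2-1.6.1-3.5.0"  # hypothetical concrete release on the 1.6.x line

api="${tag%%-*}"            # API version: v1beta2
rest="${tag#*-}"
operator="${rest%%-*}"      # operator version: 1.6.1
spark="${rest#*-}"          # base Spark version: 3.5.0

echo "$api $operator $spark"  # prints: v1beta2 1.6.1 3.5.0
```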

```
--set image.tag=<operator image tag>
```
## Developer Guide

## Get Started
For developing with the Spark Operator, please refer to the [Developer Guide](https://www.kubeflow.org/docs/components/spark-operator/developer-guide/).

Get started quickly with the Kubernetes Operator for Apache Spark using the [Quick Start Guide](docs/quick-start-guide.md).
## Contributor Guide

If you are running the Kubernetes Operator for Apache Spark on Google Kubernetes Engine and want to use Google Cloud Storage (GCS) and/or BigQuery for reading/writing data, also refer to the [GCP guide](docs/gcp.md).
For contributing to the Spark Operator, please refer to the [Contributor Guide](CONTRIBUTING.md).

For more information, check the [Design](docs/design.md), [API Specification](docs/api-docs.md) and detailed [User Guide](docs/user-guide.md).

## Contributing
## Community

Please check out [CONTRIBUTING.md](CONTRIBUTING.md) and the [Developer Guide](docs/developer-guide.md).
* Join the [CNCF Slack Channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels) and then join the `#kubeflow-spark-operator` channel.
* Check out our blog post [Announcing the Kubeflow Spark Operator: Building a Stronger Spark on Kubernetes Community](https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator.html).
* Join our monthly community meeting [Kubeflow Spark Operator Meeting Notes](https://bit.ly/3VGzP4n).

## Community
## Adopters

* Join the [CNCF Slack Channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels) and then join ```#kubeflow-spark-operator``` Channel.
* Check out our blog post [Announcing the Kubeflow Spark Operator: Building a Stronger Spark on Kubernetes Community](https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator.html)
* Check out [who is using the Kubernetes Operator for Apache Spark](docs/who-is-using.md).
Check out [adopters of Spark Operator](ADOPTERS.md).
1 change: 0 additions & 1 deletion docs/_config.yml

This file was deleted.

Binary file removed docs/architecture-diagram.png
Binary file not shown.
