Add some instructions to use Spark EMR docker image #965

Merged 3 commits on Dec 6, 2024

Conversation

normanj-bitquill
Contributor

Description

Add instructions on how to use the Spark EMR docker image to test the OpenSearch Spark PPL extension. Also include some sample files to help a user get started.

Related Issues

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

* Also included some sample files

Signed-off-by: Norman Jordan <[email protected]>
Member

@YANG-DB YANG-DB left a comment

@normanj-bitquill this is very nice

  • I'd also like to see this running using the spark docker image

Create a local build or copy of the OpenSearch Spark PPL extension. Make a note of the
location of the Jar file as well as the name of the Jar file.

## Run the Spark EMR Image
Member

@YANG-DB YANG-DB Dec 3, 2024

Can you also please add a folder for an EMR-based docker-compose?

Contributor Author

Added a docker-compose.yml for EMR.

@@ -0,0 +1,8 @@
name := "MyApp"
Member

Let's create a simple, yet somewhat more complex, use-case application that runs multiple PPL queries and produces a report, similar to this HTML report.

Contributor Author

This PR is intended to provide instructions so that a developer could run queries against docker images of Spark.

I think that your comment here is asking to generate an HTML report of the integration test results. See my comment below; I consider integration tests out of scope for this PR.

I can add more to the app if it will help a developer better understand how to run their queries. Perhaps loading a simple table.

println("Deploy Mode :" + spark.sparkContext.deployMode);
println("Master :" + spark.sparkContext.master);

spark.sql("CREATE table foo (id int, name varchar(100))").show()
Member

we need a more generic mechanism of creating & loading table data and running a list of dedicated PPL queries based on the table's fields

  • we can use a simpler version of our IT tests for the first iteration

Contributor Author

You are correct, but I consider that out of scope for this PR. That work is planned for follow-up PRs. This PR is intended to provide instructions and minimal "code" for a developer to be able to run queries against docker images of Spark.

This PR is not intended to run the integration tests.

The follow-up PRs would be:

  1. Create docker files that are intended to be used by the integration tests. This would include containers for S3 and the dashboard server. Only focus on testing against the Apache Spark release (not EMR).
  2. Update integration tests to run against the docker containers. This includes loading data, running queries and verifying the output. Also includes running the integration tests from the SBT build.
  3. Create docker files for running the integration tests against Spark EMR. This includes changes so that the user can choose whether integration tests run against Apache Spark or Spark EMR.

Member

OK, sounds good!
Thanks!

    source: ./spark-defaults.conf
    target: /opt/bitnami/spark/conf/spark-defaults.conf
  - type: bind
    source: ../../ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
Member

@normanj-bitquill can you please use a docker .env file to parameterize the ppl-spark-assembly-*.jar and maybe the ports?

Contributor Author

Added a sample .env file for the host ports and the PPL Jar location. Note that a * wildcard doesn't work in the PPL Jar location in a .env file.
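
Since a `*` in a `.env` value is not expanded, one possible workaround is to let the shell resolve the wildcard and write the concrete path into `.env`. The following is a minimal sketch, not the sample file from this PR: the variable names `PPL_JAR` and `MASTER_UI_PORT`, the helper functions, and the example port 8080 are all assumptions.

```shell
# Print the first assembly Jar found in directory $1, or fail if there is none.
# The shell expands the wildcard here; a .env value would keep it literal.
find_ppl_jar() {
  for candidate in "$1"/ppl-spark-integration-assembly-*.jar; do
    if [ -e "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Write a .env file ($2) that points at the Jar found in directory $1.
# PPL_JAR and MASTER_UI_PORT are illustrative names, not taken from the PR.
write_env() {
  jar_path=$(find_ppl_jar "$1") || {
    echo "no assembly Jar found in $1" >&2
    return 1
  }
  {
    printf 'MASTER_UI_PORT=%s\n' "${MASTER_UI_PORT:-8080}"
    printf 'PPL_JAR=%s\n' "$jar_path"
  } > "$2"
}

# Example: write_env ../../ppl-spark-integration/target/scala-2.12 .env
```

Regenerating `.env` this way after each build avoids hard-coding the version number in the Jar name.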

The Bitnami Apache Spark image can be used to run a Spark cluster and also to run
`spark-shell` for running queries.

## Setup
Member

Please add the docker .env configuration step for ppl-spark-assembly-*.jar here.

Contributor Author

See the comment above. This was added.


## Prepare OpenSearch Spark PPL Extension

Create a local build or copy of the OpenSearch Spark PPL extension. Make a note of the
Member

add sbt assembly command here

Contributor Author

Added the sbt command.
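
For reference, the build step might look like the sketch below. It assumes `sbt` is on the PATH and that the repository root is passed as the first argument; `build_ppl_jar` and the `SBT` variable are hypothetical helpers that only exist to make the step easy to dry-run, and the exact command added to the documentation may differ.

```shell
# Command used to invoke sbt; overridable so the step can be dry-run.
SBT="${SBT:-sbt}"

# Run the clean and assembly tasks from the repository root given as $1.
# The subshell keeps the caller's working directory unchanged.
build_ppl_jar() {
  ( cd "$1" && "$SBT" clean assembly )
}

# Example: build_ppl_jar . && ls ppl-spark-integration/target/scala-2.12/*.jar
```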

for them to use.

```
docker network create spark-network
Member

Why can't this be done in the docker-compose file?

Contributor Author

It can, and this should be clearer in the documentation now. There are two sections, one for using docker compose and one for manual setup. docker network create ... is only needed for the manual setup.

docker exec -it spark /opt/bitnami/spark/bin/spark-shell
```

Within the Spark Shell, you can submit queries, including PPL queries.
Member

please give an example here and link to our example doc

Contributor Author

Added an example and a link to the example doc.
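
As an illustration of the kind of example requested, a PPL query can also be piped into `spark-shell` from the host instead of being typed interactively. The sketch below assumes the container is named `spark` (as in the `docker exec` command above), that the PPL extension is already configured for that Spark instance, and that a table such as `foo` from the sample app exists; `run_ppl` and the `DOCKER` variable are hypothetical helpers, not part of this PR.

```shell
# Docker CLI to use; overridable for dry runs.
DOCKER="${DOCKER:-docker}"

# Send one query ($1) to spark-shell in the "spark" container via stdin.
# The query must not contain double quotes, since it is embedded in a Scala string.
run_ppl() {
  printf 'spark.sql("%s").show()\n' "$1" |
    "$DOCKER" exec -i spark /opt/bitnami/spark/bin/spark-shell
}

# Example: run_ppl 'source = foo | fields id, name'
```

The same `spark.sql(...)` line can be entered directly at the interactive `spark-shell` prompt.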


Within the Spark Shell, you can submit queries, including PPL queries.

## Docker Compose Sample
Member

IMO this should be the first step and the actual specific detailed instruction for running each spark element should be last

Contributor Author

Moved up much higher in the file. There are two main sections, using docker compose and manual setup. Using docker compose comes first.

Member

@YANG-DB YANG-DB left a comment

@normanj-bitquill Mostly small comments.
Can you also add a docker-compose for EMR? If it's too much effort, let's add this in the next phase.
Thanks

@YANG-DB added the documentation, Lang:PPL, and 0.7 labels Dec 5, 2024
Member

YANG-DB commented Dec 5, 2024

@normanj-bitquill
can u plz add sign-off for the next commit ?
8a3155bbe89a7775a2551fcaa1914a8e65eb0b8e The sign-off is missing.
thanks

@normanj-bitquill
Contributor Author

> @normanj-bitquill Mostly small comments can u also add a docker-compose for emr ? if its too much effort lets add this in the next phase ... thanks

This has been added. I feel it fits in this PR, since it allows developers to make use of what I have found so far.

@normanj-bitquill
Contributor Author

> @normanj-bitquill can u plz add sign-off for the next commit ? 8a3155bbe89a7775a2551fcaa1914a8e65eb0b8e The sign-off is missing. thanks

Fixed the offending commit. All commits are now signed off.

@normanj-bitquill
Contributor Author

@YANG-DB I have updated this. Take another look when you have time.

Member

YANG-DB commented Dec 6, 2024

@LantaoJin @penghuo @dai-chen
could u plz also review and give some feedback here ?
thanks

@YANG-DB YANG-DB merged commit 16fbfea into opensearch-project:main Dec 6, 2024
4 checks passed
kenrickyap pushed a commit to Bit-Quill/opensearch-spark that referenced this pull request Dec 11, 2024
…ct#965)

* Add some instructions to use Spark EMR docker image

* Also included some sample files

Signed-off-by: Norman Jordan <[email protected]>

* Added instructions for using Bitnami Spark images

Signed-off-by: Norman Jordan <[email protected]>

* Added docker compose files for EMR

Signed-off-by: Norman Jordan <[email protected]>

---------

Signed-off-by: Norman Jordan <[email protected]>