Add some instructions to use Spark EMR docker image #965
Conversation
* Also included some sample files

Signed-off-by: Norman Jordan <[email protected]>
@normanj-bitquill This is very nice.
- I'd also like to see this running using the Spark Docker image.
```
Create a local build or copy of the OpenSearch Spark PPL extension. Make a note of the
location of the Jar file as well as the name of the Jar file.

## Run the Spark EMR Image
```
Can you also please add a folder for an EMR-based docker-compose?
Added a docker-compose.yml for EMR.
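For context, a minimal sketch of what such a compose file could look like — the image name, port, and mount paths below are illustrative assumptions, not the PR's actual contents:

```yaml
# Hypothetical sketch of a docker-compose.yml for a Spark EMR image.
# The image name, port, and Jar target path are assumptions; see the PR's file.
services:
  spark-emr:
    image: emr-spark-image:latest   # placeholder for the actual EMR image
    container_name: spark-emr
    ports:
      - "${SPARK_UI_PORT:-4040}:4040"
    volumes:
      - type: bind
        source: ${PPL_JAR}
        target: /usr/lib/spark/jars/ppl-spark-integration-assembly.jar
```

`PPL_JAR` and `SPARK_UI_PORT` would come from the `.env` file discussed elsewhere in this thread.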
```
@@ -0,0 +1,8 @@
name := "MyApp"
```
Let's create a simple yet more complete use-case application that runs multiple PPL queries and produces a report, similar to the HTML report shown next.
This PR is intended to provide instructions so that a developer could run queries against docker images of Spark.
I think your comment here is asking to generate an HTML report of the integration test results. See my comment below; I consider integration tests out of scope for this PR.
I can add more to the app if it will help a developer better understand how to run their queries. Perhaps loading a simple table.
```
println("Deploy Mode :" + spark.sparkContext.deployMode);
println("Master :" + spark.sparkContext.master);

spark.sql("CREATE table foo (id int, name varchar(100))").show()
```
We need a more generic mechanism for creating and loading table data and running a list of dedicated PPL queries based on the table's fields.
- We can use a simpler version of our IT tests for the first iteration.
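As a rough illustration of that idea (all names here are hypothetical and not code from this PR), a first iteration could derive a list of PPL queries from a table's field names and hand them to `spark.sql`:

```scala
// Hypothetical sketch: generate PPL queries from a table description.
// TableSpec and pplQueriesFor are illustrative names, not part of this PR.
case class TableSpec(name: String, fields: Seq[String])

def pplQueriesFor(table: TableSpec): Seq[String] = {
  val base = s"source = ${table.name}"
  // One simple aggregation query per field, plus the base scan.
  base +: table.fields.map(f => s"$base | stats count() by $f")
}

// A driver could then run each query against a Spark session:
//   pplQueriesFor(TableSpec("customers", Seq("age", "state")))
//     .foreach(q => spark.sql(q).show())
```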
You are correct, but I consider that out of scope for this PR. This is work that is planned for follow up PRs. This PR is intended to provide instructions and minimal "code" for a developer to be able to run queries against docker images of Spark.
This PR is not intended to run the integration tests.
The follow-up PRs would be:
- Create docker files that are intended to be used by the integration tests. This would include containers for S3 and the dashboard server. Only focus on testing against Apache Spark release (not EMR).
- Update integration tests to run against the docker containers. This includes loading data, running queries and verifying the output. Also includes running the integration tests from the SBT build.
- Create docker files for running the integration tests against Spark EMR. This includes changes so that the user can choose whether integration tests run against Apache Spark or Spark EMR.
OK, sounds good!
Thanks!
```
source: ./spark-defaults.conf
target: /opt/bitnami/spark/conf/spark-defaults.conf
- type: bind
source: ../../ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
```
@normanj-bitquill Can you please use a Docker .env file for parametrizing the ppl-spark-assembly-*.jar, and maybe the ports?
Added a sample .env file for the host ports and the PPL Jar location. Note that `*` does not work in the PPL Jar location in a .env file.
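A sketch of what such a `.env` file might contain — the variable names and port values are illustrative assumptions; the Jar path matches the one used in this PR's compose file:

```
# Hypothetical .env values; adjust to your local build.
SPARK_MASTER_PORT=7077
SPARK_UI_PORT=8080
# Globs like '*' do not work here, so the Jar name must be spelled out in full.
PPL_JAR=../../ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
```

Docker Compose reads a `.env` file sitting next to `docker-compose.yml` automatically, and the values can be referenced in the compose file as `${PPL_JAR}` and so on.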
docs/spark-docker.md (outdated):
```
The Bitnami Apache Spark image can be used to run a Spark cluster and also to run
`spark-shell` for running queries.

## Setup
```
Please add the Docker .env configuration step for ppl-spark-assembly-*.jar here.
See the comment above. This was added.
```
## Prepare OpenSearch Spark PPL Extension

Create a local build or copy of the OpenSearch Spark PPL extension. Make a note of the
```
Add the sbt assembly command here.
Added the sbt command.
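The build step being referred to is presumably along these lines (the exact task depends on the project's sbt setup; this is an assumption based on the Jar path quoted above):

```shell
# Build the PPL extension assembly Jar; this should produce
# ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-<version>.jar
sbt assembly
```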
```
for them to use.
```

```
docker network create spark-network
```
Why can't this be done in the docker-compose file?
It can, and the documentation should make this clearer now. There are two sections: one for using docker compose and one for manual setup. `docker network create ...` is only needed for the manual setup.
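For the docker compose path, the equivalent is letting compose manage the network itself; a sketch (service and network names here are illustrative):

```yaml
# A user-defined network in docker-compose.yml replaces the manual
# `docker network create spark-network` step. Names are illustrative.
networks:
  spark-network:

services:
  spark-master:
    image: bitnami/spark:latest
    networks:
      - spark-network
```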
docs/spark-docker.md (outdated):
```
docker exec -it spark /opt/bitnami/spark/bin/spark-shell
```

```
Within the Spark Shell, you can submit queries, including PPL queries.
```
Please give an example here and link to our example doc.
Added an example and a link to the example doc.
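The kind of example in question would presumably look something like this inside `spark-shell` (the table, data, and query below are illustrative, assuming the PPL extension Jar is on the classpath):

```scala
// Run inside spark-shell with the PPL extension Jar on the classpath.
// The table name and data are illustrative.
spark.sql("CREATE TABLE foo (id INT, name STRING)")
spark.sql("INSERT INTO foo VALUES (1, 'alice'), (2, 'bob')")

// A simple PPL query against the table:
spark.sql("source = foo | fields id, name").show()
```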
docs/spark-docker.md (outdated):
```
Within the Spark Shell, you can submit queries, including PPL queries.

## Docker Compose Sample
```
IMO this should be the first step, and the specific detailed instructions for running each Spark element should come last.
Moved it much higher in the file. There are two main sections: using docker compose and manual setup. Using docker compose comes first.
@normanj-bitquill Mostly small comments.
Can you also add a docker-compose for EMR? If it's too much effort, let's add this in the next phase.
Thanks!
@normanj-bitquill force-pushed from 8a3155b to 9ac1925
Signed-off-by: Norman Jordan <[email protected]>
This has been added. I feel it fits in this PR, since it allows developers to make use of what I have found so far.
Fixed the offending commit. All commits are now signed-off.
@YANG-DB I have updated this. Take another look when you have time.
@LantaoJin @penghuo @dai-chen |
…ct#965)

* Add some instructions to use Spark EMR docker image
* Also included some sample files
* Added instructions for using Bitnami Spark images
* Added docker compose files for EMR

Signed-off-by: Norman Jordan <[email protected]>
Description
Add some instructions on how to use the Spark EMR Docker image to test the OpenSearch Spark PPL extension. Also included are some sample files to help a user get started.
Related Issues
Check List
- Commits are signed per the DCO using `--signoff`

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.