
Created docker files for an integ test cluster (#601) #986

Merged
6 commits merged into opensearch-project:main on Dec 17, 2024

Conversation

normanj-bitquill (Contributor)

Description

Created a cluster that can later be used for integration tests. It contains a docker-compose.yml file that can be used to start the whole cluster.

Cluster contains:

  • Spark master
  • Spark worker
  • OpenSearch server
  • OpenSearch dashboards
  • Minio server

Currently the Minio server is unused.

Spark nodes are configured to include the Flint and PPL extensions as well as to be able to query the OpenSearch server.

The OpenSearch dashboards are configured to connect to the OpenSearch server.
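
As a quick way to sanity-check a locally started cluster, here is a minimal sketch (my own illustration; the localhost ports are assumptions based on each service's defaults, not values taken from this PR):

```python
import requests

# Assumed endpoints: OpenSearch on 9200, OpenSearch Dashboards on 5601,
# Spark master web UI on 8080, all published to localhost by docker-compose.
endpoints = {
    "OpenSearch": "http://localhost:9200",
    "OpenSearch Dashboards": "http://localhost:5601/api/status",
    "Spark master UI": "http://localhost:8080",
}

for name, url in endpoints.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.ConnectionError:
        print(f"{name}: not reachable at {url}")
```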

Related Issues

#601

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Cluster contains:
* Spark master
* Spark worker
* OpenSearch server
* OpenSearch dashboards
* Minio server

Signed-off-by: Norman Jordan <[email protected]>
@YANG-DB (Member) left a comment

@normanj-bitquill thanks!!
Let's try to use/utilize the existing IT Python scripts.

@normanj-bitquill (Contributor, Author)

@YANG-DB I am partway through altering the integ test script to run against the docker containers. I have been able to create the indices for http_logs and nested. Those two indices cover about half of the tests.

Some tests now pass when they were expected to fail. This could be caused by more recent changes.

Some tests fail when they were expected to pass. These fall into three categories.

I will continue to update the script for running the tests to also get the report at the end.
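
For context on the index-creation step mentioned above, here is a minimal sketch of creating one test index against the dockerized OpenSearch with the requests library (the endpoint, credentials, and mapping fields are illustrative assumptions, not the exact ones used by the script):

```python
import requests

# Assumptions: OpenSearch is reachable on localhost:9200 over plain HTTP; basic
# auth may or may not be required depending on whether the security plugin is enabled.
OPENSEARCH_URL = "http://localhost:9200"
AUTH = ("admin", "admin")

# Illustrative mapping only; the real http_logs mapping comes from the test data.
http_logs_index = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "clientip": {"type": "ip"},
            "request": {"type": "text"},
            "status": {"type": "integer"},
            "size": {"type": "integer"},
        }
    },
}

# Create the index only if it does not exist yet, so reruns are harmless.
if requests.head(f"{OPENSEARCH_URL}/http_logs", auth=AUTH).status_code == 404:
    resp = requests.put(f"{OPENSEARCH_URL}/http_logs", json=http_logs_index, auth=AUTH)
    resp.raise_for_status()
```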

@YANG-DB (Member) left a comment

@normanj-bitquill how would Spark Connect be used?
Will it be via Python? Scala?
Could you please describe the use case?

@normanj-bitquill (Contributor, Author)

@YANG-DB I have been repurposing the script:
https://github.com/opensearch-project/opensearch-spark/blob/main/integ-test/script/SanityTest.py

With that it is:
Python script -> Spark Connect -> Spark Master Node

This would be an initial phase in this PR. The follow-up PR would be to make use of the Scala integration test framework already in place, updating it to connect with Spark Connect and run tests.
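
To make that flow concrete, here is a minimal sketch of the Python side of the chain (the sc:// host and port are assumptions based on Spark Connect's defaults, and the catalog/table name is purely illustrative):

```python
from pyspark.sql import SparkSession

# Assumption: the Spark master container exposes a Spark Connect endpoint
# on the default port 15002, published to localhost.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The query is sent over Spark Connect and executed on the cluster, which has
# the Flint/PPL extensions and OpenSearch connectivity configured.
df = spark.sql("SELECT * FROM dev.default.http_logs LIMIT 10")
df.show()

spark.stop()
```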

@YANG-DB (Member) commented Dec 12, 2024

> @YANG-DB I have been repurposing the script:
> https://github.com/opensearch-project/opensearch-spark/blob/main/integ-test/script/SanityTest.py
>
> With that it is:
> Python script -> Spark Connect -> Spark Master Node
>
> This would be an initial phase in this PR. The follow-up PR would be to make use of the Scala integration test framework already in place, updating it to connect with Spark Connect and run tests.

I'm not sure EMR supports Spark Connect...

@normanj-bitquill (Contributor, Author)

I doubt that EMR would support Spark Connect. I am keeping that in mind, but I don't have an obvious solution for Spark EMR as yet. In the end the integration tests need to be able to run queries against either standard Spark containers or Spark EMR. The integration tests should not care which they are using.

When I get to creating docker files for integration tests with Spark EMR, I will find a solution to this problem. It may require altering how integration tests connect to run queries, but for now I'd like to get a starting point out.

The Python script for integration tests was updated to run queries against the docker cluster.
The required indices are created as part of the script. The queries for the Python script were
likely out of date; these have been updated where the fix was obvious.

There are still 6 tests that fail.

Signed-off-by: Norman Jordan <[email protected]>
@normanj-bitquill (Contributor, Author)

@YANG-DB I have updated this PR so that the Python script for integration tests will now run against the docker cluster.

Below is one idea for the long term solution of running integration tests. Let me know what you think and if we should discuss this elsewhere.

Proposal

Create a directory structure for the tests.

integ-test-data
  +- queries
  +- query-plans
  +- expected-results

queries - contains the queries. One query per file.
query-plans - expected query plans with names that correspond to filenames in queries
expected-results - expected results of the queries in queries, with names that correspond to filenames in queries

Create a Spark App that makes use of the integ-test-data directory. It runs each query and places the output into the query-results directory. It also calls EXPLAIN for each query and places the output into the explain-results directory.

The Spark node (either the master container or the EMR container) has the following directories mounted:

  • integ-test-data
  • query-results
  • explain-results

The integration tests (run from sbt) will start the docker cluster and then upload the Spark App by either calling spark-submit remotely or using docker to run spark-submit.

After the tests finish, the integration tests (run from sbt) examine the query results and explain results to verify if they match the expected results.
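
To illustrate the proposal, here is a rough sketch of what such a Spark App could look like (the directory paths match the mounts listed above; the PySpark form, file handling, and output format are my assumptions, not decisions made in this PR):

```python
import os
from pyspark.sql import SparkSession

# Assumed mount points inside the Spark container, per the proposal above.
QUERIES_DIR = "/integ-test-data/queries"
RESULTS_DIR = "/query-results"
EXPLAIN_DIR = "/explain-results"

spark = SparkSession.builder.appName("integ-test-runner").getOrCreate()

for filename in sorted(os.listdir(QUERIES_DIR)):
    with open(os.path.join(QUERIES_DIR, filename)) as f:
        query = f.read().strip()

    # Run the query and write its rows where the sbt-side checks can compare
    # them against expected-results.
    spark.sql(query).write.mode("overwrite").json(os.path.join(RESULTS_DIR, filename))

    # Capture the query plan via EXPLAIN and write it next to the results.
    plan = spark.sql(f"EXPLAIN {query}").collect()[0][0]
    with open(os.path.join(EXPLAIN_DIR, filename), "w") as out:
        out.write(plan)

spark.stop()
```

A script like this would be submitted with spark-submit (remotely or through docker) from the sbt-driven integration tests, as described above.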

@YANG-DB (Member) commented Dec 13, 2024

> @YANG-DB I have updated this PR so that the Python script for integration tests will now run against the docker cluster.
>
> [... the full proposal above, quoted ...]

@normanj-bitquill thanks for the feedback - let's take the discussion and create a dedicated issue (RFC) for that.

@normanj-bitquill (Contributor, Author)

@YANG-DB I have created issue #992 to continue discussion of how integration tests could be run on each of the Docker clusters.

@YANG-DB (Member) left a comment

@normanj-bitquill looks great! Can you please add a link to the ./script/README.md file from our main README.md file, right below this:

```shell
pip install requests pandas openpyxl
pip install requests pandas openpyxl pyspark setuptools pyarrow grpcio grpcio-status protobuf
```

(Member) left a review comment

Please also mention that both ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar and flint-spark-integration-assembly-0.7.0-SNAPSHOT.jar need to be built using:

  • sbt clean sparkSqlApplicationCosmetic/assembly
  • sbt clean sparkPPLCosmetic/assembly

before the docker cluster can run.

(Contributor, Author) replied

Added this section.

```
You need to replace the placeholders with your actual values of URL_ADDRESS, OPENSEARCH_URL and USERNAME, PASSWORD for authentication to your endpoint.

For more details of the command line parameters, you can see the help manual via command:

python SanityTest.py --help

usage: SanityTest.py [-h] --base-url BASE_URL --username USERNAME --password PASSWORD --datasource DATASOURCE --input-csv INPUT_CSV
```
(Member) left a review comment

What is an example value for ${URL_ADDRESS}? Is it the Spark URL?
Please mention that.

(Contributor, Author) replied

Fixed this up. It should actually be SPARK_URL. Also provided an example value.

@normanj-bitquill (Contributor, Author)

> @normanj-bitquill looks great! Can you please add a link to the ./script/README.md file from our main README.md file, right below this

Added a link in the top-level README.md

@normanj-bitquill (Contributor, Author)

@YANG-DB I have added a section to the integ test README.md to describe the test indices.

@YANG-DB merged commit 957de4e into opensearch-project:main on Dec 17, 2024
4 checks passed