
Implement FlintJob Logic for EMR-S #52

Merged: 2 commits into opensearch-project:main on Oct 4, 2023

Conversation

@kaituo (Collaborator) commented on Sep 29, 2023:

Description

This commit introduces FlintJob logic for EMR-S, mirroring the existing SQLJob implementation for EMR cluster. The key differences in FlintJob are:

  1. It reads OpenSearch host information from spark command parameters.
  2. It ensures the existence of a result index with the correct mapping in OpenSearch, creating it if necessary. This process occurs in parallel with SQL query execution (see the sketch after this list).
  3. It reports an error if the result index mapping is incorrect.
  4. It saves a failure status if the SQL execution fails.
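
To make the parallel flow concrete, here is a minimal sketch of the job's main path, mirroring the snippets discussed in the review below; checkAndCreateIndex is a hypothetical helper name, and the surrounding values (spark, query, dataSource, resultIndex) are assumed to come from the command-line parameters:

import scala.concurrent.duration._
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.util.ThreadUtils

// Run the result-index mapping check on its own daemon thread while the
// SQL query executes on the main thread.
val threadPool = ThreadUtils.newDaemonFixedThreadPool(1, "check-create-index")
implicit val executionContext = ExecutionContext.fromExecutor(threadPool)

val futureMappingCheck = Future {
  checkAndCreateIndex(resultIndex) // hypothetical: create the index if missing
}
val data = executeQuery(spark, query, dataSource)

val correctMapping = ThreadUtils.awaitResult(futureMappingCheck, Duration(10, MINUTES))
writeData(spark, data, resultIndex, correctMapping, dataSource)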

Testing:

  1. Manual testing was conducted using the EMR-S CLI.
  2. New unit tests were added to verify the functionality.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

StructField("jobRunId", StringType, nullable = true),
StructField("applicationId", StringType, nullable = true),
StructField("dataSourceName", StringType, nullable = true),
StructField("status", StringType, nullable = true)
Member:

Can we update the documentation with the status field as well?

Collaborator Author:

added
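
For reference, here is one plausible shape of the full result-row schema after this change; the result and schema fields are assumptions beyond the four fields visible in the snippet above:

import org.apache.spark.sql.types._

// Hypothetical complete schema for rows written to the result index.
val resultSchema = StructType(Seq(
  StructField("result", ArrayType(StringType, containsNull = true), nullable = true),
  StructField("schema", ArrayType(StringType, containsNull = true), nullable = true),
  StructField("jobRunId", StringType, nullable = true),
  StructField("applicationId", StringType, nullable = true),
  StructField("dataSourceName", StringType, nullable = true),
  StructField("status", StringType, nullable = true)))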

val data = executeQuery(spark, query, dataSource)

val correctMapping = ThreadUtils.awaitResult(futureMappingCheck, Duration(10, MINUTES))
writeData(spark, data, resultIndex, correctMapping, dataSource)
Member:

Can we write errors from executeQuery back to the result index? This way we can propagate some information to the user when the query fails.

Collaborator Author:

added errors in the result index
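
A minimal sketch of how that failure path could look; getFailedData is a hypothetical helper that builds a one-row DataFrame carrying the error message and a FAILED status:

try {
  val data = executeQuery(spark, query, dataSource)
  val correctMapping = ThreadUtils.awaitResult(futureMappingCheck, Duration(10, MINUTES))
  writeData(spark, data, resultIndex, correctMapping, dataSource)
} catch {
  case e: Exception =>
    // Hypothetical: persist the error so the user can see why the query failed.
    // Passing true for the mapping flag is a simplification for illustration.
    val failed = getFailedData(spark, dataSource, e.getMessage)
    writeData(spark, failed, resultIndex, true, dataSource)
    throw e
}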

val spark = createSparkSession(conf)

val threadPool = ThreadUtils.newDaemonFixedThreadPool(1, "check-create-index")
implicit val executionContext = ExecutionContext.fromExecutor(threadPool)
@vmmusings (Member) commented on Sep 30, 2023:

For my understanding: if we don't provide the above thread pool, I am assuming there is a default thread pool? If yes, with how many threads (the number of cores)? Does this overwrite the default thread pool?

I don't have much acquaintance with Scala, but we are not using this variable further down the file. Is this executionContext implicitly used when we submit a task to Future?

Collaborator Author:

Yes, the executionContext is implicitly used when submitting a task to Future. We have to use a new thread pool; using the global thread pool fails scalafmt. This does not overwrite the default thread pool, it creates a new one.
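
To illustrate the implicit wiring: Future.apply declares an implicit ExecutionContext parameter, so the implicit val above is resolved automatically at the call site. checkAndCreateIndex remains a hypothetical task name:

// def apply[T](body: => T)(implicit executor: ExecutionContext): Future[T]
val futureMappingCheck = Future {
  checkAndCreateIndex(resultIndex) // runs on the "check-create-index" pool
}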

}
val data = executeQuery(spark, query, dataSource)

val correctMapping = ThreadUtils.awaitResult(futureMappingCheck, Duration(10, MINUTES))
Member:

Shouldn't we reduce this to a minute? Ten minutes is too long to wait for an OpenSearch mapping query; OpenSearch itself times out after 60 seconds.

I agree with the approach, but do we really gain substantially by parallelizing the mapping creation part?

Collaborator Author:

Reduced to 1 minute.

Not sure how much we gain, but we are trying to keep the query under 10~20 seconds, so every few seconds saved is good.
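
The awaited call presumably then becomes (same line as above, with the shorter timeout):

val correctMapping = ThreadUtils.awaitResult(futureMappingCheck, Duration(1, MINUTES))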

build.sbt Outdated
@@ -114,6 +114,7 @@ lazy val flintSparkIntegration = (project in file("flint-spark-integration"))
"com.stephenn" %% "scalatest-json-jsonassert" % "0.2.5" % "test",
"com.github.sbt" % "junit-interface" % "0.13.3" % "test"),
libraryDependencies ++= deps(sparkVersion),
libraryDependencies += "com.typesafe.play" %% "play-json" % "2.9.2",
Collaborator:

what is it for?

Collaborator Author:

removed

"""{
"dynamic": false,
"properties": {
"result": {
Collaborator:

result contains plain JSON with no search requirement. Use "enabled": false?

Collaborator Author:

added "enabled": false

}
}
},
"schema": {
Collaborator:

schema contains plain JSON with no search requirement. Use "enabled": false?

Collaborator Author:

added "enabled": false
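
Combining both suggestions, the result-index mapping string could look roughly like this; everything beyond the result and schema entries is an assumption for illustration:

val resultIndexMapping = """{
  "dynamic": false,
  "properties": {
    "result": {
      "type": "object",
      "enabled": false
    },
    "schema": {
      "type": "object",
      "enabled": false
    },
    "jobRunId": { "type": "keyword" },
    "applicationId": { "type": "keyword" },
    "dataSourceName": { "type": "keyword" },
    "status": { "type": "keyword" }
  }
}"""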

val inputJson = Json.parse(input)
val mappingJson = Json.parse(mapping)
logInfo(s"inputJson $inputJson")
logInfo(s"mappingJson $mappingJson")
Collaborator:

Logging the mapping may have security concerns.

Collaborator Author:

removed

spark.createDataFrame(rows).toDF(schema.fields.map(_.name): _*)
}

def isSuperset(input: String, mapping: String): Boolean = {
Collaborator:

Why not simply check for an exact match?

Collaborator Author:

As discussed offline, there is a strange issue with getIndexMetadata output where booleans are converted to Strings but integers stay as integers. Will keep the current code.
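
For illustration, a recursive superset check along these lines sidesteps that quirk by comparing leaf values as strings; this sketch uses json4s (bundled with Spark) rather than whatever the PR actually uses, so treat it as an assumption:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

def isSupersetSketch(input: String, mapping: String): Boolean = {
  // Every field required by `mapping` must appear in `input` with a
  // compatible value; leaves compare as strings, so a boolean true and
  // the string "true" are treated as equal.
  def covers(in: JValue, required: JValue): Boolean = (in, required) match {
    case (JObject(inFields), JObject(reqFields)) =>
      val inMap = inFields.toMap
      reqFields.forall { case (key, reqValue) =>
        inMap.get(key).exists(covers(_, reqValue))
      }
    case (a, b) => a.values.toString == b.values.toString
  }
  covers(parse(input), parse(mapping))
}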

"_index" : ".query_execution_result",
"_id" : "A2WOsYgBMUoqCqlDJHrn",
"_score" : 1.0,
"_source" : {
Collaborator:

missing status field.

Collaborator Author:

added

throw e
} finally {
// Stop SparkSession
spark.stop()
Collaborator:

Create Skipping Index is a Spark Structured Streaming query, so the main thread should wait. If the plugin sets EMR-S JobTimeout = 0 for long streaming queries, it can also pass an isStreaming parameter to FlintJob:

if (isStreamingQuery) {
  spark.streams.awaitAnyTermination()
}

Collaborator:

For your reference: #53

Collaborator:

Plugin-related change.

opensearch-project/sql#2193

Collaborator Author:

changed

val input: DataFrame =
spark.createDataFrame(spark.sparkContext.parallelize(inputRows), inputSchema)

test("Test getFormattedData method") {
Collaborator:

Add tests for explain and ddl:

  • explain: explain select * from table
  • ddl: create table ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do it later to unblock testing
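
Once added, those tests could look roughly like this; the suite setup, table names, and getFormattedData's exact behavior are assumptions:

test("Test getFormattedData method with a DDL query") {
  // DDL commands return an empty DataFrame; formatting should still succeed.
  val ddl = spark.sql("CREATE TABLE IF NOT EXISTS t (id INT) USING parquet")
  assert(ddl.isEmpty)
}

test("Test getFormattedData method with an explain query") {
  // EXPLAIN returns a single row holding the textual plan.
  val explained = spark.sql("EXPLAIN SELECT * FROM t")
  assert(explained.count() == 1)
}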

Signed-off-by: Kaituo Li <[email protected]>
@penghuo merged commit 71d67a0 into opensearch-project:main on Oct 4, 2023
4 checks passed