Add support for Spark SQL #113

Open

wants to merge 18 commits into base: main

Conversation

jphalip commented Oct 27, 2023

No description provided.

jphalip commented Oct 28, 2023

/gcbrun

```java
this.jobDetails = jobDetails;
String taskAttemptID = UUID.randomUUID().toString();
```
Collaborator:

We rely on overwriting the task attempt file when there are multiple attempts in order to handle attempt failures (and cleanup?). If each attempt gets a different file, how are attempt failures handled? Also, preserving the task attempt info helps with troubleshooting.

Collaborator Author:

The reason I changed this is that there is no Hive query ID available in the Hadoop conf when using Spark SQL. Do you know how else we could handle this?
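For context, a minimal sketch of the fallback being discussed, assuming the standard `hive.query.id` conf key; the class and method names are illustrative, not the connector's actual API:

```java
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;

public final class TaskIdFallbackSketch {
  private TaskIdFallbackSketch() {}

  /** Returns Hive's query ID when present, otherwise a random ID (the Spark SQL path). */
  public static String resolveQueryId(Configuration conf) {
    String hiveQueryId = conf.get("hive.query.id");
    if (hiveQueryId != null && !hiveQueryId.isEmpty()) {
      return hiveQueryId;
    }
    // No Hive query ID in the conf (e.g. the job is driven by Spark SQL): fall back to a
    // random UUID. Note that a random value per attempt means retries no longer overwrite
    // the previous attempt's file, which is the concern raised above.
    return UUID.randomUUID().toString();
  }
}
```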

Collaborator Author:

@yigress Can you think of a way to reproduce this situation? It would be good to cover that in our test suite.

Collaborator:

The test would have the first few attempts fail and then a later attempt succeed. I can't think of a good way to do that. Maybe another thread could monitor the write location and switch permissions between the first and later attempts, or something similar?

Collaborator Author:

OK, so I've added some code to retrieve the task ID from Spark. Let me know what you think.

Collaborator Author:

Note: The Spark task ID now uses the Spark app ID and the partition ID. The temp output folder also uses the ".hive-staging-XXX" folder.

```java
Class<?> taskContextClass = Class.forName("org.apache.spark.TaskContext");
Method getMethod = taskContextClass.getMethod("get");
Object taskContext = getMethod.invoke(null);
Method taskAttemptIdMethod = taskContextClass.getMethod("taskAttemptId");
```
Collaborator:

Do we want the task ID here, not the task attempt ID?

Collaborator Author:

For Hive we are getting the task attempt ID. Isn't that right?

Collaborator:

We are only getting the task ID for Hive. This is a reminder that we need to note in the README that Spark/Hive speculative execution is not supported.

Collaborator:

We are writing the output file name as just the task ID (see JobUtils#getTaskWriterOutputFile). Here we can use the task attempt ID; we just need to make sure the task ID can still be derived from it.
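As a rough illustration of that suggestion (the helper names are hypothetical, not JobUtils' actual API), the writer output file could encode the attempt number after the task ID so the task ID remains recoverable:

```java
public final class TaskFileNameSketch {
  private TaskFileNameSketch() {}

  /** e.g. taskId="stage-3-partition-7", attempt=2 -> "stage-3-partition-7_2". */
  public static String toFileName(String taskId, int attemptNumber) {
    return taskId + "_" + attemptNumber;
  }

  /** Recovers the task ID by stripping the trailing "_<attempt>" suffix. */
  public static String toTaskId(String fileName) {
    int idx = fileName.lastIndexOf('_');
    return idx >= 0 ? fileName.substring(0, idx) : fileName;
  }
}
```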

Collaborator:

This works if the Spark application simply fails when the insert query fails. However, if the user chooses not to fail the application (for example by catching the query failure) or is using spark-shell, it is still the same Spark application. If the user then runs another query, inserting either into the same table or into another table, what would happen if the Spark job info file already exists?

yigress commented Nov 29, 2023

Could you add an acceptance or integration test for 'insert overwrite'? A test on dataproc-2.1 shows incorrect results:

```scala
scala> spark.sql("select * from ba").show
+---+----+
| id|name|
+---+----+
| 11| aaa|
| 22| bbb|
+---+----+

scala> spark.sql("insert overwrite table snoop select * from ba")
res8: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from snoop").show
+---+-------+
|num|str_val|
+---+-------+
| 11|    aaa|
| 22|    bbb|
|  1|    aaa|
|  2|    bbb|
| 11|    aaa|
| 22|    bbb|
|  3|    ccc|
|  4|    ddd|
| 11|    aaa|
| 22|    bbb|
+---+-------+
```

```java
Object partitionId = partitionIdMethod.invoke(taskContext);
Method stageIdMethod = taskContextClass.getMethod("stageId");
Object stageId = stageIdMethod.invoke(taskContext);
return String.format("stage-%s-partition-%s", stageId, partitionId);
```
Collaborator:

Right now all stream ref files are dumped into a single parent path, and we are assuming there is only one stage writing to the table? If there are multiple stages writing into the same table, then for each stage all the stream ref files will be picked up, which will cause a data correctness issue.

Collaborator Author:

I've updated the code to use the ".hive-staging-XXXX" folder name, when present.

Collaborator Author:

See the getQueryTempOutputPath() method. We check for the presence of the FileOutputFormat.OUTDIR (mapreduce.output.fileoutputformat.outputdir) property, which is set by Spark. If present, we extract the ".hive-staging-XXXX" part of that folder name and use it for our own temp output dir.
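For reference, the extraction described above could look roughly like the sketch below. This is only a sketch, assuming the standard `mapreduce.output.fileoutputformat.outputdir` key and a path segment starting with `.hive-staging`; the class and method names are illustrative, not the actual getQueryTempOutputPath() implementation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public final class StagingDirSketch {
  private StagingDirSketch() {}

  /** Returns the ".hive-staging-XXXX" segment of the configured output dir, or null if absent. */
  public static String extractHiveStagingSegment(Configuration conf) {
    String outDir = conf.get("mapreduce.output.fileoutputformat.outputdir");
    if (outDir == null) {
      return null; // Property not set: we are not on the Spark SQL write path.
    }
    // Walk the path components upward and return the first ".hive-staging" segment found.
    for (Path p = new Path(outDir); p != null; p = p.getParent()) {
      if (p.getName().startsWith(".hive-staging")) {
        return p.getName();
      }
    }
    return null;
  }
}
```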

jphalip commented Dec 8, 2023

@yigress INSERT OVERWRITE should now work. I've added an integration test for it as well.
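For anyone following along, the behaviour such a test needs to pin down is that a second INSERT OVERWRITE replaces rather than appends to the previous result. A rough sketch only; the table and class names are assumptions, not the actual test added in this PR:

```java
import org.apache.spark.sql.SparkSession;

public class InsertOverwriteSmokeTest {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("insert-overwrite-smoke-test")
        .enableHiveSupport()
        .getOrCreate();

    // Run the overwrite twice: if the connector appends instead of overwriting,
    // the target row count ends up at a multiple of the source row count.
    spark.sql("INSERT OVERWRITE TABLE target_table SELECT * FROM source_table");
    spark.sql("INSERT OVERWRITE TABLE target_table SELECT * FROM source_table");

    long sourceCount = spark.table("source_table").count();
    long targetCount = spark.table("target_table").count();
    if (targetCount != sourceCount) {
      throw new AssertionError(
          "Expected " + sourceCount + " rows after INSERT OVERWRITE but found " + targetCount);
    }
    spark.stop();
  }
}
```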

yigress commented Dec 18, 2023

Can you run some tests with a Dataproc cluster? I tried this:

```
spark-shell --conf spark.sql.hive.convertMetastoreBigquery=false --conf spark.sql.extensions=com.google.cloud.hive.bigquery.connector.sparksql.HiveBigQuerySparkSQLExtension --jars gs://yigress/test/hive-2-bigquery-connector-2.0.0-SNAPSHOT.jar

scala> spark.sql("select * from region").show
+-----------+-----------+--------------------+
|r_regionkey|     r_name|           r_comment|
+-----------+-----------+--------------------+
|          0|     AFRICA|lar deposits. bli...|
|          1|    AMERICA|hs use ironic, ev...|
|          2|       ASIA|ges. thinly even ...|
|          3|     EUROPE|ly final courts c...|
|          4|MIDDLE EAST|uickly special ac...|
+-----------+-----------+--------------------+

spark.sql("insert into region2 select * from region")
spark.sql("select * from region2")

scala> spark.sql("select * from region2").show
+-----------+------+---------+
|r_regionkey|r_name|r_comment|
+-----------+------+---------+
+-----------+------+---------+
```

There is no result, nor any error message.

It looks like there is a delay in the Spark query result when inserting into an empty BQ table; the results eventually show up when querying again at a later time.

yigress commented Dec 20, 2023

There are a bunch of errors in the integration test:

```
OK
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFNamedStruct@6afcb427. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFNamedStruct@6afcb427. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap@ad168b5. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap@ad168b5. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFNamedStruct@6afcb427. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFNamedStruct@6afcb427. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap@ad168b5. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap@ad168b5. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFNamedStruct@6afcb427. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFNamedStruct@6afcb427. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap@ad168b5. Return value unrecoginizable.
2023-12-19 00:35:24 ERROR Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap@ad168b5. Return value unrecoginizable.
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20231219003523_43efc817-d4b1-4170-9400-f70966da8691
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2023-12-19 00:35:24,904 Stage-3 map = 0%, reduce = 0%
2023-12-19 00:35:24 ERROR java.lang.NoClassDefFoundError: Could not initialize class io.grpc.netty.Utils
    at io.grpc.netty.UdsNettyChannelProvider.isAvailable(UdsNettyChannelProvider.java:34)
    at io.grpc.ManagedChannelRegistry$ManagedChannelPriorityAccessor.isAvailable(ManagedChannelRegistry.java:211)
    at io.grpc.ManagedChannelRegistry$ManagedChannelPriorityAccessor.isAvailable(ManagedChannelRegistry.java:207)
    at io.grpc.ServiceProviders.loadAll(ServiceProviders.java:68)
    at io.grpc.ManagedChannelRegistry.getDefaultRegistry(ManagedChannelRegistry.java:101)
    at io.grpc.ManagedChannelProvider.provider(ManagedChannelProvider.java:43)
    at io.grpc.ManagedChannelBuil
```


```java
@Override
public void commitJob(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException {
  SparkSQLUtils.cleanUpSparkJobFile(jobContext.getConfiguration());
```
Collaborator:

Can this Spark job file be moved to be per query? Right now it is per Spark application, so if a previous query fails or the committer is not called, the existing file could cause problems for the next query.

Collaborator Author (jphalip, Dec 22, 2023):

This Spark job file only contains one piece of information: which tables are "INSERT" and which are "INSERT OVERWRITE". So I think it's relevant for the whole Spark application. Does that make sense?
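To make the discussion concrete, here is a hypothetical sketch of such a per-application job file; the location, file format, and names below are illustrative assumptions, not the connector's actual implementation. It simply maps each output table to its write disposition and is rewritten for every query, so a later query in the same Spark application overwrites whatever a previous one left behind:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Properties;

public final class SparkJobFileSketch {
  private SparkJobFileSketch() {}

  /** Rewrites the per-application job file mapping each table to INSERT or INSERT_OVERWRITE. */
  public static void writeJobFile(String sparkAppId, Map<String, Boolean> overwriteByTable)
      throws IOException {
    Properties props = new Properties();
    overwriteByTable.forEach(
        (table, overwrite) -> props.setProperty(table, overwrite ? "INSERT_OVERWRITE" : "INSERT"));
    // Illustrative location only: keyed by the Spark application ID.
    Path file =
        Paths.get(System.getProperty("java.io.tmpdir"), "spark-job-" + sparkAppId + ".properties");
    // Files.newBufferedWriter truncates any existing file, so the file is always overwritten.
    try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
      props.store(out, "table -> write disposition for this Spark application");
    }
  }
}
```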

Collaborator:

This works if the Spark application fails when the insert query fails. But if the user chooses to handle the Spark SQL failure and continue the application (maybe with a try/catch), or runs it in spark-shell, the same Spark application can continue. If another insert query is later run (either into the same table or into a different one), what would happen?

Collaborator Author:

In that case I think the Spark job file would be overwritten when a new query is run.

BTW, if you run multiple queries as part of the same Spark shell session, would the Spark app ID be the same for all queries, or would each query get a unique one?

Collaborator:

They will have the same application ID.
The logic looks like it does a readOrElseCreate of the Spark job JSON file for each query?

Collaborator Author:

You're right, I was using some readOrElseCreate logic because somehow Spark calls the apply() method multiple times. I've just changed it to always write the file. That way it should always be overwritten. What do you think?

Collaborator:

Would that cause locking issues if there are concurrent writes to this file? Is there any difficulty in moving this under the temporary query directory (or is the temporary query directory not available at this point)?

Collaborator Author:

@yigress What do you mean by "temporary query directory"? Are you referring to the directory named after hive.query.id when using plain Hive? If so, as far as I can tell, there's no way to retrieve a unique ID for the query from the query plan object provided to the Spark extension strategy's apply() function. Or is there?

The challenge here is that we'd need some unique way of identifying the query that would be deterministically created or retrieved both from the Spark extension and from the Hive storage handler.

For now, the only thing I'm aware of is the Spark app ID. But as you mentioned, that's not enough to identify a specific query.

Another question for you: If you run multiple queries as part of the same Spark Shell session (i.e. the same Spark App ID), could multiple queries be run in parallel, or could they only be run in sequence? If in sequence, then perhaps the Spark App ID is enough. But if in parallel, then we're in trouble :)

jphalip commented Dec 28, 2023

@yigress The integration tests are now passing.
