Add support for Spark SQL #113
base: main
Conversation
/gcbrun
f347bdc to fc237ce
this.jobDetails = jobDetails;
String taskAttemptID = UUID.randomUUID().toString();
We are relying on overwriting the task attempt file when there are multiple attempts in order to handle attempt failures (and cleanup?). If each attempt has a different file, how are attempt failures handled? Also, preserving the task attempt info helps troubleshooting.
The reason I changed this is that there is no Hive query ID available in the Hadoop conf when using Spark SQL. Do you know how else we could handle this?
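(For context, a minimal illustration of the point above, assuming the standard hive.query.id property name; the helper below is hypothetical and not the connector's actual logic.)

import java.util.UUID;
import org.apache.hadoop.conf.Configuration;

final class QueryIdFallbackSketch {
  // Under plain Hive, "hive.query.id" is populated for each query; under Spark SQL it is not,
  // which is why the change above resorts to a random UUID for the task attempt ID.
  static String taskAttemptId(Configuration conf) {
    String hiveQueryId = conf.get("hive.query.id");
    return hiveQueryId == null ? UUID.randomUUID().toString() : hiveQueryId;
  }
}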
@yigress Can you think of a way to reproduce this situation? It would be good to cover that in our test suite.
The test would have the first few attempts fail and then a later attempt succeed. I can't think of a good way to do that. Maybe another thread could monitor the write location and switch permissions between the first and later attempts, or something similar?
OK, so I've added some code to retrieve the task ID from Spark. Let me know what you think.
Note: The Spark task ID now uses the Spark app ID and the partition ID. The temp output folder also uses the ".hive-staging-XXX" folder.
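For reference, here is a self-contained sketch of the reflection approach discussed in this thread, combining the snippets quoted further down. The TaskContext method names come from Spark's public API; the wrapper class is illustrative, not the connector's actual code.

final class SparkTaskIdSketch {
  /** Builds a per-task identifier from Spark's TaskContext via reflection, so the
   *  connector does not need a compile-time dependency on Spark. */
  static String getSparkTaskId() throws Exception {
    Class<?> taskContextClass = Class.forName("org.apache.spark.TaskContext");
    Object taskContext = taskContextClass.getMethod("get").invoke(null); // static TaskContext.get()
    Object stageId = taskContextClass.getMethod("stageId").invoke(taskContext);
    Object partitionId = taskContextClass.getMethod("partitionId").invoke(taskContext);
    return String.format("stage-%s-partition-%s", stageId, partitionId);
  }
}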
...ry-connector/src/test/java/com/google/cloud/hive/bigquery/connector/sparksql/SparkUtils.java (resolved, outdated)
fc237ce to 651fbb0
...ommon/src/main/java/com/google/cloud/hive/bigquery/connector/BigQueryStorageHandlerBase.java (resolved, outdated)
...mon/src/main/java/com/google/cloud/hive/bigquery/connector/output/MapReduceOutputFormat.java (resolved, outdated)
...ctor-common/src/main/java/com/google/cloud/hive/bigquery/connector/BigQueryMetaHookBase.java (2 resolved comments)
...ctor-common/src/main/java/com/google/cloud/hive/bigquery/connector/utils/hive/HiveUtils.java (resolved, outdated)
Class<?> taskContextClass = Class.forName("org.apache.spark.TaskContext");
Method getMethod = taskContextClass.getMethod("get");
Object taskContext = getMethod.invoke(null);
Method taskAttemptIdMethod = taskContextClass.getMethod("taskAttemptId");
Do we want the task ID here, not the task attempt ID?
For Hive we are getting the task attempt ID. Isn't that right?
We are only getting the task ID for Hive. This is a reminder that we need to note in the README that Spark/Hive speculative execution is not supported.
@yigress Are you sure? It looks like we're using the attempt id, not just the task id: https://github.com/GoogleCloudDataproc/hive-bigquery-connector/blob/main/hive-bigquery-connector-common/src/main/java/com/google/cloud/hive/bigquery/connector/utils/hive/HiveUtils.java#L63-L105
We are writing the output file name as just the task ID (see JobUtils#getTaskWriterOutputFile). Here we can use the task attempt ID; we just need to make sure the task ID can be fetched from it.
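A hypothetical sketch of the naming scheme suggested here (the file-name format and helper names are illustrative, not the actual JobUtils API): append the attempt number as a suffix so the task ID can still be recovered from the file name.

final class WriterFileNameSketch {
  // Illustrative format only: "<taskId>_<attemptNumber>".
  static String writerFileName(String taskId, int attemptNumber) {
    return taskId + "_" + attemptNumber;
  }

  // Recover the task ID by stripping the attempt suffix after the last underscore.
  static String taskIdFromWriterFileName(String fileName) {
    int sep = fileName.lastIndexOf('_');
    return sep < 0 ? fileName : fileName.substring(0, sep);
  }
}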
This works if the Spark application simply fails when the insert query fails. However, if the user chooses not to fail the application (for example by try/catching the query failure), or is using spark-shell, it will still be the same Spark application. If the user then runs another query, either inserting into the same table or into another table, what would happen if the Spark job info file already exists?
Add an acceptance test or integration test for 'insert overwrite'? A test on dataproc-2.1 shows incorrect results:
scala> spark.sql("insert overwrite table snoop select * from ba")
scala> spark.sql("select * from snoop").show
Object partitionId = partitionIdMethod.invoke(taskContext);
Method stageIdMethod = taskContextClass.getMethod("stageId");
Object stageId = stageIdMethod.invoke(taskContext);
return String.format("stage-%s-partition-%s", stageId, partitionId);
Right now all stream ref files are dumped into a parent path of .../[...]/, and we are assuming there is only one stage writing to the table? If there are multiple stages writing into the same table, then all the stream ref files from every stage will be picked up, which would cause a data correctness issue.
I've updated the code to use the ".hive-staging-XXXX" folder name, when present.
See the getQueryTempOutputPath() method. We check for the presence of the FileOutputFormat.OUTDIR (mapreduce.output.fileoutputformat.outputdir) property, which is set by Spark. If present, we extract the "hive-staging-XXXX" part of that folder name and then use that for our own temp output dir.
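A rough sketch of that approach (the actual getQueryTempOutputPath() implementation may differ; FileOutputFormat.OUTDIR is the standard Hadoop constant for "mapreduce.output.fileoutputformat.outputdir", and the helper class below is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

final class StagingDirSketch {
  /** Returns the ".hive-staging-XXXX" segment of the Spark-provided output dir, or null. */
  static String extractHiveStagingDir(Configuration conf) {
    String outDir = conf.get(FileOutputFormat.OUTDIR); // set by Spark SQL when writing through Hive
    if (outDir == null) {
      return null; // not running under Spark SQL; a Hive-query-id-based path can be used instead
    }
    for (String segment : outDir.split("/")) {
      if (segment.startsWith(".hive-staging")) {
        return segment;
      }
    }
    return null;
  }
}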
@yigress
Can you run some tests with a Dataproc cluster?
spark-shell --conf spark.sql.hive.convertMetastoreBigquery=false --conf spark.sql.extensions=com.google.cloud.hive.bigquery.connector.sparksql.HiveBigQuerySparkSQLExtension --jars gs://yigress/test/hive-2-bigquery-connector-2.0.0-SNAPSHOT.jar
scala> spark.sql("select * from region").show
spark.sql("insert into region2 select * from region")
scala> spark.sql("select * from region2").show
There is no result, nor an error message. It looks like there is a delay in the Spark query result when inserting into an empty BigQuery table; the results eventually show up when queried again at a later time.
...or-common/src/main/java/com/google/cloud/hive/bigquery/connector/sparksql/SparkSQLUtils.java (resolved, outdated)
...c/main/java/com/google/cloud/hive/bigquery/connector/sparksql/HiveBigQuerySparkStrategy.java (resolved, outdated)
There are a bunch of errors in the integration test.
@Override
public void commitJob(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException {
  SparkSQLUtils.cleanUpSparkJobFile(jobContext.getConfiguration());
Can this Spark job file be moved to be per query? Right now it is per Spark application; if a previous query fails or the committer is not called, won't this existing file cause a problem for the next query?
So this Spark job file only contains one piece of information: which tables are "INSERT" and which tables are "INSERT OVERWRITE". So I think it's relevant for the whole Spark application. Does that make sense?
The Spark job file basically mimics what Hive itself does here: https://github.com/apache/hive/blob/8190d2be7b7165effa62bd21b7d60ef81fb0e4af/ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java#L69-L70
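For illustration only (the actual file format used by the connector is not shown in this thread): the idea is a small per-application file mapping each destination table to its write mode, similar to Hive's own QBParseInfo bookkeeping. The table names and string constants below are hypothetical.

import java.util.HashMap;
import java.util.Map;

final class SparkJobFileSketch {
  static Map<String, String> exampleWriteModes() {
    Map<String, String> writeModes = new HashMap<>();
    writeModes.put("project.dataset.region2", "INSERT");          // hypothetical destination table
    writeModes.put("project.dataset.snoop", "INSERT_OVERWRITE");  // hypothetical destination table
    // The connector would serialize a map like this (e.g. as JSON) to a well-known path derived
    // from the Spark application ID, and the output committer would read it back at commit time.
    return writeModes;
  }
}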
This works if the Spark application fails when the insert query fails. But if the user chooses to handle the Spark SQL failure and continue the Spark application (maybe with a try/catch), or runs in spark-shell, the same Spark application can continue. So if another insert query is run later (either into the same table or a different table), what would happen?
In that case I think the Spark job file would be overwritten when a new query is run.
BTW, if you run multiple queries as part of the same Spark shell session, would the Spark app ID be the same for all queries, or would they be unique?
They will have the same application ID.
The logic looks like a readOrElseCreate of the Spark job JSON file for each query?
You're right, I was using some readOrElseCreate logic because somehow Spark calls the apply() method multiple times. I've just changed it to always write the file. That way it should always be overwritten. What do you think?
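A minimal sketch of the "always write" behavior described above (the path and method name are hypothetical; FileSystem.create(path, true) is the standard Hadoop way to overwrite an existing file):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class JobFileWriteSketch {
  static void writeJobFile(Configuration conf, Path jobFilePath, byte[] contents) throws IOException {
    FileSystem fs = jobFilePath.getFileSystem(conf);
    // overwrite = true: each new query replaces whatever a previous query left behind
    try (FSDataOutputStream out = fs.create(jobFilePath, true)) {
      out.write(contents);
    }
  }
}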
Would that cause a locking issue if there are concurrent writes to this file? Is there any difficulty in moving this under the temporary query directory (or is the temporary query directory not available at this point)?
@yigress What do you mean by "temporary query directory"? Are you referring to the directory named after hive.query.id when using plain Hive? If so, as far as I can tell, there's no way to retrieve a unique ID for the query from the query plan object provided to the Spark extension strategy's apply() function. Or is there?
The challenge here is that we'd need some unique way of identifying the query that would be deterministically created or retrieved both from the Spark extension and from the Hive storage handler.
For now, the only thing I'm aware of is the spark app ID. But as you mentioned, that's not enough to identify a specific query.
Another question for you: If you run multiple queries as part of the same Spark Shell session (i.e. the same Spark App ID), could multiple queries be run in parallel, or could they only be run in sequence? If in sequence, then perhaps the Spark App ID is enough. But if in parallel, then we're in trouble :)
@yigress The integration tests are now passing.