Create sample command fails in Google Dataproc Spark 2.11.8 #163
I even tried the same configuration given in the documentation, with the Dataproc 1.0 image and verdict-core-0.3.0-jar-with-dependencies.jar. When I run the create sample command, I get this error.
I tried building the master branch (verdict-spark-lib-0.4.11.jar) and ran it on a fresh instance of the Google Dataproc 1.2 image. Even on that instance, when I run import edu.umich.verdict.VerdictSpark2Context and the commands quoted below, I get the following error. What does this error mean?
This seems to be an HDFS (or Hive) permission issue. When I have observed similar errors, they were due to a lack of write permission on the spark-warehouse directory.
Can you check whether the regular SparkSession.sql("create schema myschema") works? If you are using the Spark interactive shell, the command is spark.sql("create schema myschema"); otherwise, replace the variable "spark" with your SparkSession instance.
Depending on the result of that command, our investigation will take a different direction.
Thanks,
Yongjoo
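For reference, a minimal sketch of the suggested check, assuming the Spark interactive shell (myschema is only a placeholder name):

// Run in the Spark shell (the variable `spark` is the active SparkSession).
println(spark.conf.get("spark.sql.warehouse.dir"))   // directory where the new schema would be created
spark.sql("create schema myschema")                  // should fail with a similar AnalysisException if that directory is not writable
spark.sql("show databases").show(false)              // myschema appears here if the create succeeded
spark.sql("drop schema myschema")                    // clean up the test schema

If this plain create schema fails with the same error, the problem lies in the warehouse directory permissions rather than in Verdict itself.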
On Tue, Jul 17, 2018 at 3:55 AM Sanjay Kumar wrote:
I tried building the master branch and ran it on a fresh instance of the Google Dataproc 1.2 image. Even on that instance, when I run
import edu.umich.verdict.VerdictSpark2Context
scala> val vc = new VerdictSpark2Context(sc)
scala> vc.sql("show databases").show(false)
scala> vc.sql("create sample of database_name.table_name").show(false)
I am getting the following error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to create database path file:/home/sanjay/spark-warehouse/default_verdict.db, failed to create database default_verdict);
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.doCreateDatabase(HiveExternalCatalog.scala:163)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.createDatabase(ExternalCatalog.scala:69)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:219)
  at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:66)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
  at edu.umich.verdict.dbms.DbmsSpark2.execute(DbmsSpark2.java:84)
  at edu.umich.verdict.dbms.DbmsSpark2.executeUpdate(DbmsSpark2.java:91)
  at edu.umich.verdict.dbms.Dbms.createCatalog(Dbms.java:192)
  at edu.umich.verdict.dbms.Dbms.createDatabase(Dbms.java:183)
  at edu.umich.verdict.query.CreateSampleQuery.buildSamples(CreateSampleQuery.java:93)
  at edu.umich.verdict.query.CreateSampleQuery.compute(CreateSampleQuery.java:64)
  at edu.umich.verdict.query.Query.computeDataset(Query.java:192)
  at edu.umich.verdict.VerdictSpark2Context.execute(VerdictSpark2Context.java:61)
  at edu.umich.verdict.VerdictContext.executeSpark2Query(VerdictContext.java:160)
  at edu.umich.verdict.VerdictSpark2Context.sql(VerdictSpark2Context.java:81)
@pyongjoo I think you are right. I am not able to create the schema either, and I get the same error when I try. How can I resolve this issue?
In my case, I used the regular hdfs command, for example "hdfs dfs -chmod 777 /.../spark-warehouse". This is usually possible when you have a separate installation of HDFS and Spark is using that HDFS installation. Here is a link with more hdfs commands:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
I am pretty sure there is plenty of other documentation for Google Dataproc, but I cannot test it right now.
FYI, we plan to update VerdictDB soon. In that version, you will be able to configure the schema used by Verdict directly.
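As an illustration only (not part of the original reply), the same inspection and permission change can also be done from the Spark shell through Hadoop's FileSystem API. This sketch assumes spark.sql.warehouse.dir points at the directory from the error message and that the directory already exists; the chmod 777 mirrors the hdfs command above:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Resolve the warehouse directory that Spark (and hence Verdict) writes to.
val warehouse = new Path(spark.conf.get("spark.sql.warehouse.dir"))
val fs = warehouse.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Inspect the current permission, then open it up (equivalent of hdfs dfs -chmod 777).
println(fs.getFileStatus(warehouse).getPermission)
fs.setPermission(warehouse, new FsPermission(Integer.parseInt("777", 8).toShort))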
I am getting the error
java.io.IOException: Mkdirs failed to create file:/home/sanjay/spark-warehouse/default_verdict.db/vt23_1/.hive-staging_hive_2018-07-17_03-03-28_842_6156432897141230125-1/-ext-10000/_temporary/0/_temporary/attempt_20180717030333_0002_m_000016_3
when I run the command vc.sql("create sample of default.advertiser_apr_orc").show(false). I am running on Dataproc image 1.2, with Spark 2.11.8 and verdict-spark-lib-0.4.8.jar. I am running this command as the root user and have done chmod 755 on the directory /home/sanjay/.
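One hint in the message above is the file: scheme: Spark is writing its warehouse to the local file system, not to HDFS, so an hdfs dfs -chmod would not affect that path. A small sketch, assuming the Spark shell, of how to confirm which directory is actually in use and whether it is writable by the Spark process (the printed path is only an example):

// Where is the warehouse, and is it on the local file system?
val warehouse = spark.conf.get("spark.sql.warehouse.dir")
println(warehouse)                                   // e.g. file:/home/sanjay/spark-warehouse

// For a file: URI, check that the local directory is writable by the user running Spark.
val dir = new java.io.File(new java.net.URI(warehouse).getPath)
println(s"exists=${dir.exists}, canWrite=${dir.canWrite}")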