Need help on next step : after trainMatch job Zingg on Databricks #575
Replies: 7 comments 3 replies
-
It seems like something is going wrong with the data. Can you please run only the train phase and report what happens? Please also share the cluster logs.
-
Nice to see the progress @iitbhumanish - congrats on running your match jobs successfully. The incremental job can be done using the link phase: https://docs.zingg.ai/zingg/stepbystep/link
-
To link two datasets, please add two pipes to the input: the first one the Parquet output from the previous match run, and the second the incremental CSV. Please drop the z columns from the first dataset (or make a copy) so that both datasets have the same schema. You can check the config in the doc I shared earlier to see how it is configured. Zingg Open Source does not link against the matched results directly; however, you can easily link two datasets of the same schema using an existing model trained on that schema.
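A minimal sketch of how the two input pipes described above might be wired up with the Zingg Python client (the `Pipe` format string, the `location` property name, and the paths are assumptions based on this thread, not a verified configuration; the `CsvPipe` signature follows the snippet posted later in this discussion):

```python
from zingg.client import Arguments
from zingg.pipes import Pipe, CsvPipe

# First pipe: the Parquet output of the previous match run,
# with the z_ columns dropped beforehand so both schemas match.
matchedPipe = Pipe("priorMatches", "parquet")
matchedPipe.addProperty("location", "/FileStore/ZinggPOC/matchOutput")

# Second pipe: the incremental CSV with the same schema.
incrementalPipe = CsvPipe("incremental", "/FileStore/ZinggPOC/IncrementalNewData.csv", schema)

args = Arguments()
# Both pipes are passed together so args carries two inputs.
args.setData(matchedPipe, incrementalPipe)
```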
-
Could you please tell me what the input pipe code should be for Parquet files? I use the code below for a CSV file, with the CsvPipe function and a schema. Is there a specific pipe for Parquet files?
inputPipe = CsvPipe("testFebrl", "/FileStore/ZinggPOC/IncrementalNewData.csv", schema)
-
There is no separate pipe for Parquet. You can use the base Pipe class with the format set to parquet and add a property for the location.
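For example, a Parquet input can be built from the base Pipe class as suggested above (the "parquet" format string and the "location" property name are assumptions; please verify them against the Zingg Python client version you are running):

```python
from zingg.pipes import Pipe

# Base Pipe with the format set to "parquet" and the
# location supplied as a property instead of a CsvPipe path.
inputPipe = Pipe("priorMatches", "parquet")
inputPipe.addProperty("location", "/FileStore/ZinggPOC/matchOutput")
```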
-
Hi Sonal, I tried to add both input pipes, but in the end only one input pipe shows up in args. Please let me know if I am missing something.
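One likely cause, assuming the Zingg Python client's setData replaces the configured data pipes rather than appending to them: calling it once per pipe would keep only the last one. Passing both pipes in a single call avoids that (a sketch under that assumption, not verified against your client version; pipe1 and pipe2 are placeholder names):

```python
# If setData overwrites on each call, this keeps only pipe2:
#   args.setData(pipe1)
#   args.setData(pipe2)

# Passing both pipes together registers two inputs in args.
args.setData(pipe1, pipe2)
```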
-
Kindly open separate issues/discussions for separate problems. Keeping a single question per thread makes it easier for others to find help; an uber issue is difficult to read through when looking for answers.
-
I have successfully run the trainMatch option and saved the matched records to the output in Delta format.
I can see that these four records of Chavez fall under the same cluster (id=1329), which is the expected result.
As a next step, I prepared some more test data with the same Chavez records and tried to run it through the saved model, expecting the new test data to fall into the existing Chavez cluster. I followed the steps below but got a null pointer exception in trainMatch:
- set the input/output parameters for the new records
- run the trainMatch job --> failed due to a null pointer exception. Please advise how to resolve this.
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o1999.execute.
: zingg.client.ZinggClientException: Job aborted due to stage failure: Task 1 in stage 4336.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4336.0 (TID 12133) (10.192.99.74 executor 3): java.lang.NullPointerException
at zingg.block.Block$BlockFunction.call(Block.java:403)
at zingg.block.Block$BlockFunction.call(Block.java:393)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at zingg.Matcher.execute(Matcher.java:159)
at zingg.TrainMatcher.execute(TrainMatcher.java:52)
at zingg.client.Client.execute(Client.java:242)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:750)
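A NullPointerException inside zingg.block.Block during trainMatch often points to schema drift between runs, for example leftover z-prefixed result columns from the earlier match output being fed back in. As a quick sanity check, here is a small helper to strip such columns before rerunning (the z_ prefix convention and the sample column names are assumptions for illustration; inspect your actual output columns):

```python
def drop_z_columns(columns):
    """Return the column names with Zingg's z_-prefixed
    result columns (e.g. z_cluster, z_score) removed."""
    return [c for c in columns if not c.lower().startswith("z_")]

# With a Spark DataFrame this could be applied as:
#   clean_df = df.select(*drop_z_columns(df.columns))
cols = ["rec_id", "given_name", "surname", "z_cluster", "z_score"]
print(drop_z_columns(cols))
```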