Need help on next step : after trainMatch job Zingg on Databricks #575
Replies: 7 comments 3 replies
-
It seems like something is going wrong with the data. Can you please run only the train phase and report what happens? Please also share the cluster logs.
-
Nice to see the progress @iitbhumanish - congrats on running your match jobs successfully. The incremental job can be done using the link phase: https://docs.zingg.ai/zingg/stepbystep/link
-
To link two datasets, please add two pipes to the input: the first one the Parquet output from the previous match run, and the second the incremental CSV. Please drop the z columns from the first dataset (or make a copy) so that both datasets have the same schema. You can check the config in the doc I shared earlier to see how it is configured. Zingg Open Source does not link against the matched results directly; however, you can easily link two datasets of the same schema using an existing model trained on that schema.
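A minimal sketch of how the two input pipes described above might be wired up with the Zingg Python client (the `Pipe` format string, the `location` property name, and the paths are assumptions based on this thread, not a verified configuration; the `CsvPipe` signature follows the snippet posted later in this discussion):

```python
from zingg.client import Arguments
from zingg.pipes import Pipe, CsvPipe

# First pipe: the Parquet output of the previous match run,
# with the z_ columns dropped beforehand so both schemas match.
matchedPipe = Pipe("priorMatches", "parquet")
matchedPipe.addProperty("location", "/FileStore/ZinggPOC/matchOutput")

# Second pipe: the incremental CSV with the same schema.
incrementalPipe = CsvPipe("incremental", "/FileStore/ZinggPOC/IncrementalNewData.csv", schema)

args = Arguments()
# Both pipes are passed together so args carries two inputs.
args.setData(matchedPipe, incrementalPipe)
```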
-
Could you please tell me what the input pipe code should be for Parquet files? I use the code below for a CSV file, with the CsvPipe function and a schema. Is there a specific pipe for Parquet files?
inputPipe = CsvPipe("testFebrl", "/FileStore/ZinggPOC/IncrementalNewData.csv", schema)
-
There is no separate pipe for Parquet. You can use the base Pipe class with the format set to parquet and add a property for the location.
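For example, a Parquet input can be built from the base Pipe class as suggested above (the "parquet" format string and the "location" property name are assumptions; please verify them against the Zingg Python client version you are running):

```python
from zingg.pipes import Pipe

# Base Pipe with the format set to "parquet" and the
# location supplied as a property instead of a CsvPipe path.
inputPipe = Pipe("priorMatches", "parquet")
inputPipe.addProperty("location", "/FileStore/ZinggPOC/matchOutput")
```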
-
Hi Sonal, I tried to add both input pipes, but in the end only one input pipe shows up in args. Please let me know if I am missing something.
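One likely cause, assuming the Zingg Python client's setData replaces the configured data pipes rather than appending to them: calling it once per pipe would keep only the last one. Passing both pipes in a single call avoids that (a sketch under that assumption, not verified against your client version; pipe1 and pipe2 are placeholder names):

```python
# If setData overwrites on each call, this keeps only pipe2:
#   args.setData(pipe1)
#   args.setData(pipe2)

# Passing both pipes together registers two inputs in args.
args.setData(pipe1, pipe2)
```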
-
Kindly open separate issues/discussions for separate problems. Keeping a single question per thread makes it easier for others to find help; an uber issue is difficult to read through when looking for answers.
-
I have successfully run the trainMatch option and saved the matched records to the output in Delta format.
I can see that these four records of Chavez fall under the same cluster (id=1329), which is the expected result.
As a next step, I prepared some more test data with the same Chavez records and tried to run it through the saved model, expecting the new test data to fall into the existing Chavez cluster. I followed the steps below but got a null pointer exception in trainMatch:
- set the input/output parameters for the new records
- run the trainMatch job --> failed due to a null pointer exception. Please advise how to resolve this.
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o1999.execute.
: zingg.client.ZinggClientException: Job aborted due to stage failure: Task 1 in stage 4336.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4336.0 (TID 12133) (10.192.99.74 executor 3): java.lang.NullPointerException
at zingg.block.Block$BlockFunction.call(Block.java:403)
at zingg.block.Block$BlockFunction.call(Block.java:393)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at zingg.Matcher.execute(Matcher.java:159)
at zingg.TrainMatcher.execute(TrainMatcher.java:52)
at zingg.client.Client.execute(Client.java:242)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:750)
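A NullPointerException inside zingg.block.Block during trainMatch often points to schema drift between runs, for example leftover z-prefixed result columns from the earlier match output being fed back in. As a quick sanity check, here is a small helper to strip such columns before rerunning (the z_ prefix convention and the sample column names are assumptions for illustration; inspect your actual output columns):

```python
def drop_z_columns(columns):
    """Return the column names with Zingg's z_-prefixed
    result columns (e.g. z_cluster, z_score) removed."""
    return [c for c in columns if not c.lower().startswith("z_")]

# With a Spark DataFrame this could be applied as:
#   clean_df = df.select(*drop_z_columns(df.columns))
cols = ["rec_id", "given_name", "surname", "z_cluster", "z_score"]
print(drop_z_columns(cols))
```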