Hi! Sorry for disturbing you with another bug report.
Recently, when I tried to run Dima on the com-LiveJournal dataset from SNAP Datasets, I met another exception.
The exception is:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4367 in stage 6.0 failed 4 times, most recent failure: Lost task 4367.3 in stage 6.0 (TID 61683, slave004): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
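For reference, the failing call is FileChannel.map inside DiskStore, and FileChannel.map refuses to map a region larger than Integer.MAX_VALUE bytes, which is exactly the IllegalArgumentException shown above; so some single cached block must have grown past this limit:
Integer.MAX_VALUE = 2^31 - 1 = 2,147,483,647 bytes ≈ 2 GiB per mapped block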
Details of my run:
I ran Dima with this Scala script to perform a Jaccard similarity self-join on the adjacency sets of the LiveJournal graph (an illustrative sketch of such preprocessing is given after the dataset details below).
spark.sql.joins.numSimialrityPartitions is 10240.
Threshold: 0.9.
The preprocessed dataset file can be downloaded from here with the access password KmSSjq.
Spark job configuration (sketched in code after this list):
Number of executors: 32;
Executor cores: 8;
Executor memory: 20 GB;
Driver memory: 10 GB;
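For reference, a minimal sketch of how these settings map onto a SparkConf and SQLContext. The executor and driver properties are the standard Spark ones; the actual submission command may differ, and driver memory normally has to be set at submit time rather than in code:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: standard Spark properties corresponding to the settings listed above.
val conf = new SparkConf()
  .setAppName("Dima Jaccard self-join on com-LiveJournal")
  .set("spark.executor.instances", "32")  // number of executors
  .set("spark.executor.cores", "8")       // cores per executor
  .set("spark.executor.memory", "20g")    // executor memory
  .set("spark.driver.memory", "10g")      // driver memory (usually set at submit time)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// The Dima-specific partition count used in this run:
sqlContext.setConf("spark.sql.joins.numSimialrityPartitions", "10240")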
Details of the dataset:
Number of records: 3997962;
Average record length: 17.3.
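For illustration only (the actual preprocessing script is the one referred to above), adjacency-set records like these could be built from the SNAP edge list roughly as follows, assuming the usual tab-separated format of com-lj.ungraph.txt:

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch, not the actual script: turn the undirected edge list into one
// whitespace-separated adjacency set per vertex.
val sc = new SparkContext(new SparkConf().setAppName("build-adjacency-sets"))
val edges = sc.textFile("com-lj.ungraph.txt")
  .filter(line => !line.startsWith("#"))           // skip SNAP header comments
  .map { line => val Array(u, v) = line.split("\\s+"); (u, v) }
val adjacencySets = edges
  .flatMap { case (u, v) => Seq((u, v), (v, u)) }  // keep both directions of each edge
  .groupByKey()
  .map { case (_, nbrs) => nbrs.toSeq.distinct.sorted.mkString(" ") }
adjacencySets.saveAsTextFile("livejournal-adjacency-sets")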
The number of records per partition on average is 390.4 (=3997962/10240), which is not very big.
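For a rough sense of scale: 3,997,962 records × 17.3 tokens per record ≈ 6.9 × 10^7 tokens in total, so the raw input itself is nowhere near the 2 GiB single-block limit noted above.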
It seems that some RDD partition becomes too large during the execution. Do I need to change some parameters to enable Dima to run on this dataset?
Thank you very much for looking into this bug report!