Dima throws "Size exceeds Integer.MAX_VALUE" exception when processing the LiveJournal dataset #6

Open
wangzk opened this issue Feb 5, 2018 · 0 comments

wangzk commented Feb 5, 2018

Hi! Sorry to bother you with another bug report.

Recently, when I tried to run Dima on the com-LiveJournal dataset from SNAP Datasets, I encountered another exception.

The exception is:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4367 in stage 6.0 failed 4 times, most recent failure: Lost task 4367.3 in stage 6.0 (TID 61683, slave004): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Details of my run:

  • I ran Dima with this Scala script to conduct a Jaccard similarity self-join on the adjacency sets of the LiveJournal graph.
  • spark.sql.joins.numSimialrityPartitions is 10240.
  • Threshold: 0.9.
  • The preprocessed dataset file can be downloaded from here with the access password KmSSjq.
  • Spark job configuration (a configuration sketch follows this list):
    • Number of executors: 32;
    • Executor cores: 8;
    • Executor memory: 20 GB;
    • Driver memory: 10 GB;
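
For reference, here is a minimal sketch of how these settings map onto a Spark configuration when the job is driven from a Scala script. The application name is made up, the executor and driver settings are actually passed on the spark-submit command line in my run (spark.driver.memory in particular only takes effect when set before the driver JVM starts), and the sketch documents only the Spark-level configuration, not Dima's join API.

    import org.apache.spark.{SparkConf, SparkContext}

    // Spark-level settings corresponding to the list above. In the real run they are
    // supplied via spark-submit flags; setting them here is purely illustrative.
    val conf = new SparkConf()
      .setAppName("Dima-Jaccard-LiveJournal-SelfJoin")           // hypothetical app name
      .set("spark.executor.instances", "32")                     // number of executors
      .set("spark.executor.cores", "8")                          // cores per executor
      .set("spark.executor.memory", "20g")                       // executor memory
      .set("spark.driver.memory", "10g")                         // driver memory (normally a spark-submit flag)
      .set("spark.sql.joins.numSimialrityPartitions", "10240")   // Dima similarity partition count (key as used above)
    val sc = new SparkContext(conf)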

Details of the dataset:

  • Number of records: 3997962;
  • Average record length: 17.3.

On average, each partition holds about 390.4 records (= 3997962 / 10240), which is not very large.
It seems that some RDD partition (cached block) grows too large during execution: the exception comes from FileChannelImpl.map, which cannot memory-map a block larger than 2 GB (Integer.MAX_VALUE bytes). Do I need to change some parameters to enable Dima to run on this dataset?
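
As a quick check for this kind of skew, the sketch below counts the records in each partition of an RDD; records is a placeholder for whatever RDD my script can reach (for example, the preprocessed adjacency-set input), since the oversized block in the stack trace belongs to one of Dima's intermediate RDDs that the script cannot access directly.

    import org.apache.spark.rdd.RDD

    // Count how many records land in each partition and list the heaviest partitions
    // first. iter.size walks the partition's iterator inside the task to count elements.
    def partitionSizes[T](records: RDD[T]): Array[(Int, Long)] =
      records
        .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size.toLong)) }
        .collect()
        .sortBy { case (_, count) => -count }

    // Example (adjacencySets is a hypothetical handle to the preprocessed input RDD):
    // partitionSizes(adjacencySets).take(10).foreach(println)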

Thank you very much for looking into this bug report!
