Dima throws "Size exceeds Integer.MAX_VALUE" exception when processing the LiveJournal dataset #6

Open
wangzk opened this issue Feb 5, 2018 · 0 comments

wangzk commented Feb 5, 2018

Hi! Sorry to bother you with another bug report.

Recently, when I tried to run Dima on the com-LiveJournal dataset from SNAP Datasets, I encountered another exception.

The exception is:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4367 in stage 6.0 failed 4 times, most recent failure: Lost task 4367.3 in stage 6.0 (TID 61683, slave004): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Details of my run:

  • I ran Dima with this Scala script to conduct a Jaccard similarity self-join on the adjacency sets of the LiveJournal graph.
  • spark.sql.joins.numSimialrityPartitions is 10240.
  • Threshold: 0.9.
  • The preprocessed dataset file can be downloaded from here with the access password KmSSjq.
  • Spark job configuration (a configuration sketch follows this list):
    • Number of executors: 32;
    • Executor cores: 8;
    • Executor memory: 20 GB;
    • Driver memory: 10 GB;
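
For reference, here is a minimal sketch of how these settings map onto a Spark configuration when the job is driven from a Scala script. The application name is made up, the executor and driver settings are actually passed on the spark-submit command line in my run (spark.driver.memory in particular only takes effect when set before the driver JVM starts), and the sketch documents only the Spark-level configuration, not Dima's join API.

    import org.apache.spark.{SparkConf, SparkContext}

    // Spark-level settings corresponding to the list above. In the real run they are
    // supplied via spark-submit flags; setting them here is purely illustrative.
    val conf = new SparkConf()
      .setAppName("Dima-Jaccard-LiveJournal-SelfJoin")           // hypothetical app name
      .set("spark.executor.instances", "32")                     // number of executors
      .set("spark.executor.cores", "8")                          // cores per executor
      .set("spark.executor.memory", "20g")                       // executor memory
      .set("spark.driver.memory", "10g")                         // driver memory (normally a spark-submit flag)
      .set("spark.sql.joins.numSimialrityPartitions", "10240")   // Dima similarity partition count (key as used above)
    val sc = new SparkContext(conf)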

Details of the dataset:

  • Number of records: 3997962;
  • Average record length: 17.3.

On average, each partition holds about 390.4 records (= 3997962 / 10240), which is not very large.
It seems that some RDD partition (cached block) grows too large during execution: the exception comes from FileChannelImpl.map, which cannot memory-map a block larger than 2 GB (Integer.MAX_VALUE bytes). Do I need to change some parameters to enable Dima to run on this dataset?
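
As a quick check for this kind of skew, the sketch below counts the records in each partition of an RDD; records is a placeholder for whatever RDD my script can reach (for example, the preprocessed adjacency-set input), since the oversized block in the stack trace belongs to one of Dima's intermediate RDDs that the script cannot access directly.

    import org.apache.spark.rdd.RDD

    // Count how many records land in each partition and list the heaviest partitions
    // first. iter.size walks the partition's iterator inside the task to count elements.
    def partitionSizes[T](records: RDD[T]): Array[(Int, Long)] =
      records
        .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size.toLong)) }
        .collect()
        .sortBy { case (_, count) => -count }

    // Example (adjacencySets is a hypothetical handle to the preprocessed input RDD):
    // partitionSizes(adjacencySets).take(10).foreach(println)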

Thank you very much for looking into this bug report!
