
Whenever I pass a BAM file as input, I get nothing out of this #375

ankushreddy opened this issue Jan 27, 2016 · 12 comments

@ankushreddy

Hi team,

When I submit the job, it executes successfully, but it does not call any genotypes.

Could you please help me with this issue?

       spark-submit --master yarn --deploy-mode client --driver-java-options -Dlog4j.configuration=/local/guacamole/scripts/logs4j.properties --executor-memory 4g --driver-memory 10g --num-executors 20 --executor-cores 10 --class org.hammerlab.guacamole.Guacamole --verbose /local/guacamole/target/guacamole-with-dependencies-0.0.1-SNAPSHOT.jar germline-threshold --reads hdfs:///shared/avocado_test/NA06984.454.MOSAIK.SRP000033.2009_11.bam --out hdfs:///user/asugured/guacamole/result.vcf

Please see the output of the spark-submit command I used.

16/01/27 16:14:02 INFO YarnScheduler: Adding task set 19.0 with 1 tasks
16/01/27 16:14:02 INFO TaskSetManager: Starting task 0.0 in stage 19.0 (TID 14, istb1-l2-b12-07.hadoop.priv, PROCESS_LOCAL, 1432 bytes)
16/01/27 16:14:02 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on istb1-l2-b12-07.hadoop.priv:38654 (size: 1808.0 B, free: 2.1 GB)
16/01/27 16:14:02 INFO DAGScheduler: Stage 19 (count at VariationRDDFunctions.scala:144) finished in 0.101 s
16/01/27 16:14:02 INFO TaskSetManager: Finished task 0.0 in stage 19.0 (TID 14) in 88 ms on istb1-l2-b12-07.hadoop.priv (1/1)
16/01/27 16:14:02 INFO YarnScheduler: Removed TaskSet 19.0, whose tasks have all completed, from pool
16/01/27 16:14:02 INFO DAGScheduler: Job 5 finished: count at VariationRDDFunctions.scala:144, took 0.115971 s
16/01/27 16:14:02 INFO VariantContextRDDFunctions: Write 0 records
16/01/27 16:14:02 INFO MapPartitionsRDD: Removing RDD 22 from persistence list
16/01/27 16:14:02 INFO BlockManager: Removing RDD 22
*** Delayed Messages ***
Called 0 genotypes.
Region counts: filtered 0 total regions to 0 relevant regions, expanded for overlaps by NaN% to 0
Regions per task: min=NaN 25%=NaN median=NaN (mean=NaN) 75%=NaN max=NaN. Max is NaN% more than mean.
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}

Thanks & Regards,
Ankush Reddy

@arahuja
Contributor

arahuja commented Jan 28, 2016

Hi @ankushreddy - thanks for checking out Guacamole. Most of the callers are still in progress, but hopefully we can help you test them out. For germline-threshold, there is also a --threshold parameter to be aware of, which is the lowest VAF (variant allele frequency) necessary to call a variant. It's unlikely you are hitting this, but I just wanted to make you aware.

Does your BAM file have MD tags? If not, germline-threshold currently requires the reads to already have MD tags; otherwise, unfortunately, it drops them. You can add MD tags with samtools or ADAM.

We do support computing MD tags in Guacamole as well, but looking over the code, this isn't configured correctly right now. I will file an issue for that and fix it.
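For example, adding MD tags with samtools typically looks like the following (a sketch only: the input BAM and reference paths reuse files mentioned earlier in this thread, the output filename is hypothetical, and samtools calmd requires the same reference FASTA the reads were aligned to):

```shell
# samtools calmd recomputes MD (and NM) tags against the alignment reference.
# -b emits BAM output; redirect it to a new file, then index it.
samtools calmd -b NA06984.454.MOSAIK.SRP000033.2009_11.bam human_b36_male.fa > with_md_tags.bam
samtools index with_md_tags.bam
```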

@ankushreddy
Author

Hi @arahuja, thanks for the quick reply. I have a few more questions; I am actually new to genomics, so I don't know exactly what is going on in the code.

I used adam-submit and added the tags with this command:

./adam-submit transform /shared/avocado_test/NA06984.454.ssaha.SRP000033.2009_10.bam /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam -add_md_tags /shared/avocado_test/human_b36_male.fa
This produced three outputs:

drwxr-xr-x - asugured hdfs 0 2016-01-28 12:37 /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam
-rw-r--r-- 3 asugured hdfs 1350 2016-01-28 12:32 /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam.rgdict
-rw-r--r-- 3 asugured hdfs 4513 2016-01-28 12:32 /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam.seqdict

I then passed /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam to Guacamole via spark-submit.
Please find the submit command and error message below.
spark-submit --master yarn --deploy-mode client --driver-java-options -Dlog4j.configuration=/local/guacamole/scripts/logs4j.properties --executor-memory 4g --driver-memory 10g --num-executors 20 --executor-cores 10 --class org.hammerlab.guacamole.Guacamole --verbose /local/guacamole/target/guacamole-with-dependencies-0.0.1-SNAPSHOT.jar germline-threshold --reads hdfs:///shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam --out hdfs:///user/asugured/guacamole/result2.vcf

It is throwing an Avro/Parquet schema error:

16/01/28 12:46:07 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, istb1-l2-b11-01.hadoop.priv): org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.
at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:128)
at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:89)
at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:64)
at org.apache.parquet.avro.AvroCompatRecordMaterializer.(AvroCompatRecordMaterializer.java:34)
at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:138)
at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:179)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:201)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/01/28 12:46:07 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 2, istb1-l2-b11-01.hadoop.priv, NODE_LOCAL, 1516 bytes)
16/01/28 12:46:07 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor istb1-l2-b11-01.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 1]
16/01/28 12:46:07 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 3, istb1-l2-b13-05.hadoop.priv, NODE_LOCAL, 1516 bytes)
16/01/28 12:46:07 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor istb1-l2-b12-09.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 2]
16/01/28 12:46:07 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 4, istb1-l2-b12-09.hadoop.priv, NODE_LOCAL, 1516 bytes)
16/01/28 12:46:08 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4) on executor istb1-l2-b12-09.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 3]
16/01/28 12:46:08 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 5, istb1-l2-b12-09.hadoop.priv, NODE_LOCAL, 1516 bytes)
16/01/28 12:46:08 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 5) on executor istb1-l2-b12-09.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 4]
16/01/28 12:46:08 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 6, istb1-l2-b12-09.hadoop.priv, NODE_LOCAL, 1516 bytes)
16/01/28 12:46:08 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor istb1-l2-b12-09.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 5]
16/01/28 12:46:08 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
16/01/28 12:46:08 INFO YarnScheduler: Cancelling stage 0
16/01/28 12:46:08 INFO YarnScheduler: Stage 0 was cancelled
16/01/28 12:46:08 INFO DAGScheduler: Job 0 failed: reduce at ADAMRDDFunctions.scala:127, took 4.545263 s
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/01/28 12:46:08 INFO SparkUI: Stopped Spark web UI at http://istb1-l2-b13-u35.hadoop.priv:4042
16/01/28 12:46:08 INFO DAGScheduler: Stopping DAGScheduler
16/01/28 12:46:08 INFO YarnClientSchedulerBackend: Shutting down all executors
16/01/28 12:46:08 INFO YarnClientSchedulerBackend: Asking each executor to shut down
16/01/28 12:46:08 INFO YarnClientSchedulerBackend: Stopped
16/01/28 12:46:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped!
16/01/28 12:46:08 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
16/01/28 12:46:08 INFO MemoryStore: MemoryStore cleared
16/01/28 12:46:08 INFO BlockManager: BlockManager stopped
16/01/28 12:46:08 INFO BlockManagerMaster: BlockManagerMaster stopped
16/01/28 12:46:08 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, istb1-l2-b12-09.hadoop.priv): org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.
at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:128)
at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:89)
at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:64)
at org.apache.parquet.avro.AvroCompatRecordMaterializer.(AvroCompatRecordMaterializer.java:34)
at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:138)
at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:179)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:201)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Jan 28, 2016 12:46:03 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 2
16/01/28 12:46:08 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/28 12:46:08 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/01/28 12:46:08 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

Could you please help me understand this?

Thanks & Regards,
Ankush Reddy.

@arahuja
Contributor

arahuja commented Jan 28, 2016

What version of ADAM are you using? If you have moved to a newer version of ADAM, the schema of the ADAM format may be different. If you can use the BAM output of ADAM, that may work better, but I have not tried that before.

@ankushreddy
Author

I am using the latest version of ADAM; I just cloned it from git and started using it.
When we use transform in adam-submit, does it produce ADAM format, or some Parquet format of the data?

Please let me know whether I am following the correct process.

@ryan-williams
Member

Hey @ankushreddy, HEAD of ADAM has different schemas than Guacamole expects; we depend on ADAM 0.18.1, and they've been doing big refactorings recently.

If you can try those steps again using that version of ADAM, guacamole should be able to read the .adam files correctly.

Or, as @arahuja said, you can try using .bam as your intermediate format instead of .adam.

@ankushreddy
Author

Hi @ryan-williams, thanks for guiding me. I used ADAM 0.18.1, and now I am getting a NullPointerException.
Please find the log below:

16/01/28 20:05:38 INFO MemoryStore: ensureFreeSpace(303352) called with curMem=0, maxMem=5556991426
16/01/28 20:05:38 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 296.2 KB, free 5.2 GB)
16/01/28 20:05:38 INFO MemoryStore: ensureFreeSpace(27127) called with curMem=303352, maxMem=5556991426
16/01/28 20:05:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 26.5 KB, free 5.2 GB)
16/01/28 20:05:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.107.18.34:36737 (size: 26.5 KB, free: 5.2 GB)
16/01/28 20:05:38 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at ADAMContext.scala:158
16/01/28 20:05:39 INFO FileInputFormat: Total input paths to process : 2
16/01/28 20:05:39 INFO SparkContext: Starting job: reduce at ADAMRDDFunctions.scala:127
16/01/28 20:05:39 INFO DAGScheduler: Got job 0 (reduce at ADAMRDDFunctions.scala:127) with 2 output partitions (allowLocal=false)
16/01/28 20:05:39 INFO DAGScheduler: Final stage: ResultStage 0(reduce at ADAMRDDFunctions.scala:127)
16/01/28 20:05:39 INFO DAGScheduler: Parents of final stage: List()
16/01/28 20:05:39 INFO DAGScheduler: Missing parents: List()
16/01/28 20:05:39 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at mapPartitions at ADAMRDDFunctions.scala:126), which has no missing parents
16/01/28 20:05:39 INFO MemoryStore: ensureFreeSpace(3496) called with curMem=330479, maxMem=5556991426
16/01/28 20:05:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.4 KB, free 5.2 GB)
16/01/28 20:05:39 INFO MemoryStore: ensureFreeSpace(1965) called with curMem=333975, maxMem=5556991426
16/01/28 20:05:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1965.0 B, free 5.2 GB)
16/01/28 20:05:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.107.18.34:36737 (size: 1965.0 B, free: 5.2 GB)
16/01/28 20:05:39 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
16/01/28 20:05:39 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at mapPartitions at ADAMRDDFunctions.scala:126)
16/01/28 20:05:39 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
16/01/28 20:05:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, istb1-l2-b14-09.hadoop.priv, NODE_LOCAL, 1589 bytes)
16/01/28 20:05:39 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, istb1-l2-b14-19.hadoop.priv, NODE_LOCAL, 1590 bytes)
16/01/28 20:05:40 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on istb1-l2-b14-19.hadoop.priv:54486 (size: 1965.0 B, free: 2.1 GB)
16/01/28 20:05:40 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on istb1-l2-b14-09.hadoop.priv:49118 (size: 1965.0 B, free: 2.1 GB)
16/01/28 20:05:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on istb1-l2-b14-19.hadoop.priv:54486 (size: 26.5 KB, free: 2.1 GB)
16/01/28 20:05:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on istb1-l2-b14-09.hadoop.priv:49118 (size: 26.5 KB, free: 2.1 GB)
16/01/28 20:05:46 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, istb1-l2-b14-19.hadoop.priv): java.lang.NullPointerException
at org.bdgenomics.adam.models.SequenceRecord$.fromADAMContig(SequenceDictionary.scala:268)
at org.bdgenomics.adam.models.SequenceRecord$.fromSpecificRecord(SequenceDictionary.scala:325)
at org.bdgenomics.adam.rdd.ADAMSpecificRecordSequenceDictionaryRDDAggregator.getSequenceRecordsFromElement(ADAMRDDFunctions.scala:153)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$mergeRecords$1(ADAMRDDFunctions.scala:108)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/01/28 20:05:46 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 2, istb1-l2-b14-19.hadoop.priv, NODE_LOCAL, 1590 bytes)
16/01/28 20:05:51 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor istb1-l2-b14-19.hadoop.priv: java.lang.NullPointerException (null) [duplicate 1]
16/01/28 20:05:51 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 3, istb1-l2-b14-07.hadoop.priv, NODE_LOCAL, 1590 bytes)
16/01/28 20:05:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 12732 ms on istb1-l2-b14-09.hadoop.priv (1/2)
16/01/28 20:05:53 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on istb1-l2-b14-07.hadoop.priv:56257 (size: 1965.0 B, free: 2.1 GB)
16/01/28 20:05:53 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on istb1-l2-b14-07.hadoop.priv:56257 (size: 26.5 KB, free: 2.1 GB)
16/01/28 20:06:00 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 3, istb1-l2-b14-07.hadoop.priv): java.lang.NullPointerException
at org.bdgenomics.adam.models.SequenceRecord$.fromADAMContig(SequenceDictionary.scala:268)
at org.bdgenomics.adam.models.SequenceRecord$.fromSpecificRecord(SequenceDictionary.scala:325)
at org.bdgenomics.adam.rdd.ADAMSpecificRecordSequenceDictionaryRDDAggregator.getSequenceRecordsFromElement(ADAMRDDFunctions.scala:153)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$mergeRecords$1(ADAMRDDFunctions.scala:108)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/01/28 20:06:00 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 4, istb1-l2-b14-07.hadoop.priv, NODE_LOCAL, 1590 bytes)
16/01/28 20:06:04 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 4) on executor istb1-l2-b14-07.hadoop.priv: java.lang.NullPointerException (null) [duplicate 1]
16/01/28 20:06:04 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job
16/01/28 20:06:04 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/28 20:06:04 INFO YarnScheduler: Cancelling stage 0
16/01/28 20:06:04 INFO DAGScheduler: ResultStage 0 (reduce at ADAMRDDFunctions.scala:127) failed in 25.492 s
16/01/28 20:06:04 INFO DAGScheduler: Job 0 failed: reduce at ADAMRDDFunctions.scala:127, took 25.594264 s
16/01/28 20:06:04 INFO SparkUI: Stopped Spark web UI at http://10.107.18.34:4041
16/01/28 20:06:04 INFO DAGScheduler: Stopping DAGScheduler
16/01/28 20:06:04 INFO YarnClientSchedulerBackend: Shutting down all executors
16/01/28 20:06:04 INFO YarnClientSchedulerBackend: Interrupting monitor thread
16/01/28 20:06:04 INFO YarnClientSchedulerBackend: Asking each executor to shut down
16/01/28 20:06:04 INFO YarnClientSchedulerBackend: Stopped
16/01/28 20:06:04 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/01/28 20:06:04 INFO Utils: path = /tmp/spark-86a917a9-9a0e-4d2e-9b6c-c0492c8a5ddc/blockmgr-b310de96-f340-4640-bdd8-6fa3c5d0e409, already present as root for deletion.
16/01/28 20:06:04 INFO MemoryStore: MemoryStore cleared
16/01/28 20:06:04 INFO BlockManager: BlockManager stopped
16/01/28 20:06:04 INFO BlockManagerMaster: BlockManagerMaster stopped
16/01/28 20:06:04 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 4, istb1-l2-b14-07.hadoop.priv): java.lang.NullPointerException
at org.bdgenomics.adam.models.SequenceRecord$.fromADAMContig(SequenceDictionary.scala:268)
at org.bdgenomics.adam.models.SequenceRecord$.fromSpecificRecord(SequenceDictionary.scala:325)
at org.bdgenomics.adam.rdd.ADAMSpecificRecordSequenceDictionaryRDDAggregator.getSequenceRecordsFromElement(ADAMRDDFunctions.scala:153)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$mergeRecords$1(ADAMRDDFunctions.scala:108)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
16/01/28 20:06:04 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
Jan 28, 2016 8:05:39 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 2
16/01/28 20:06:04 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/28 20:06:04 INFO Utils: Shutdown hook called
16/01/28 20:06:04 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/01/28 20:06:04 INFO Utils: Deleting directory /tmp/spark-86a917a9-9a0e-4d2e-9b6c-c0492c8a5ddc

Any help is appreciated.

Thanks & Regards,
Ankush Reddy

@ryan-williams
Member

Your NPE is coming from this line; some contig is null. I'm not sure why that would happen, at a glance.

@ankushreddy
Author

Hi @ryan-williams, I have tried different .bam files but I am hitting the same issue. Could you please let me know when Guacamole will be ready to handle both a .bam and a FASTA reference file?

Thanks & Regards,
Ankush Reddy.

@arahuja
Contributor

arahuja commented Mar 25, 2016

Hi @ankushreddy

Can you tell us more about the error you hit when using a BAM file? As we mentioned earlier, if the ADAM format differs from the one we support, we see issues, though that part of the code has seen little use/testing as well. We recently upgraded our ADAM input, if you want to retest.

However, if you are seeing a similar error with a BAM input, that would be good to know about.

@ankushreddy
Author

Hi @arahuja, I just want to check: which version of ADAM should I use, or is it enough to use a BAM or SAM file aligned to the reference? I will test it once again and let you know the results.

@arahuja
Contributor

arahuja commented Apr 11, 2016

@ankushreddy We now only support loading the reference explicitly and no longer rely on MD tags. Also, we aren't really supporting germline-threshold, if that is what you are using, and it will likely be removed. We have updated the README with new sample commands to try.

@ankushreddy
Author

@arahuja Thanks for the reply. I am actually testing it on a SAM file, but I see a lot of variants being called. Is there any way to reduce the variants based on quality or anything else?
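For what it's worth, one generic way to reduce a call set after the fact is to filter the output VCF on its QUAL column (tools like bcftools can do this too). A minimal stdlib-only sketch, assuming a standard tab-delimited VCF with QUAL in the sixth column; the threshold of 30 and the function name are arbitrary choices for illustration:

```python
# Keep only VCF records whose QUAL is at or above a chosen threshold.
# Header lines (starting with '#') are passed through unchanged.
def filter_vcf_by_qual(lines, min_qual=30.0):
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.split("\t")
        qual = fields[5]
        # '.' means QUAL is missing; such records are dropped here.
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept

records = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t12\tPASS\t.",
    "1\t300\t.\tG\tA\t.\tPASS\t.",
]
kept = filter_vcf_by_qual(records, min_qual=30.0)
```

Here only the two header lines and the QUAL=50 record survive; the QUAL=12 and missing-QUAL records are dropped.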
