Core dumped at herro inference #36

Open
sivico26 opened this issue May 30, 2024 · 4 comments

@sivico26

Hello there,

I am trying to run herro on quite a big dataset (a 4 Gb plant genome, ~72x depth). I already did the AvA step, but now I am struggling with the inference step.

The command I am using is:

herro="$herro_dir/herro.sif"   ## herro_dir -> /path/to/herro_cloned_repository
mnt_alns="/data/out_mappings"
mnt_reads="/data/ont_reads.fastq.gz"

singularity run --nv $herro inference -t 64 -m /herro/model_v0.1.pt --read-alns $mnt_alns -b 128 $mnt_reads /results/corrected_reads.fasta

But I am getting this error:

Error log
thread '<unnamed>' panicked at src/inference.rs:172:64:
called `Result::unwrap()` on an `Err` value: Torch("The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File \"code/__torch__/model.py\", line 31, in forward
    target_positions: List[Tensor]) -> Tuple[Tensor, Tensor]:
    embedding = self.embedding
    bases_embeds = (embedding).forward(bases, )
                    ~~~~~~~~~~~~~~~~~~ <--- HERE
    _0 = [bases_embeds, torch.unsqueeze(qualities, -1)]
    x = torch.cat(_0, -1)
  File \"code/__torch__/torch/nn/modules/sparse.py\", line 18, in forward
    _0 = __torch__.torch.nn.functional.embedding
    weight = self.weight
    _1 = _0(input, weight, 11, None, 2., False, False, )
         ~~ <--- HERE
    return _1
  File \"code/__torch__/torch/nn/functional.py\", line 37, in embedding
  else:
    input0 = input
  _3 = torch.embedding(weight, input0, padding_idx0, scale_grad_by_freq, sparse)
       ~~~~~~~~~~~~~~~ <--- HERE
  return _3
def batch_norm(input: Tensor,

Traceback of TorchScript, original code (most recent call last):
  File \"/raid/scratch/stanojevicd/projects/haec-BigBird/model.py\", line 118, in forward
        '''
        # (batch_size, sequence_length, num_alignment_rows, bases_embedding_size)
        bases_embeds = self.embedding(bases)
                       ~~~~~~~~~~~~~~ <--- HERE
    
        # concatenate base qualities to embedding vectors
  File \"/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/sparse.py\", line 162, in forward
    def forward(self, input: Tensor) -> Tensor:
        return F.embedding(
               ~~~~~~~~~~~ <--- HERE
            input, self.weight, self.padding_idx, self.max_norm,
            self.norm_type, self.scale_grad_by_freq, self.sparse)
  File \"/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/functional.py\", line 2233, in embedding
        # remove once script supports set_grad_enabled
        _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Given the final lines of the log, I thought it was a stochastic error, but I ran it again and got the same result, so it seems consistent. Do you have any idea of what could be happening?
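
In case it helps with the diagnosis: the "no kernel image is available for execution on the device" message usually means the libtorch build inside the image ships no kernels for the GPU's compute capability. A rough way to compare the two, assuming nvidia-smi from a recent driver and a PyTorch install on the host (illustrative only; the container bundles its own libtorch, which may support a different set of architectures):

# Compute capability of the GPU (an H100 reports 9.0)
nvidia-smi --query-gpu=name,compute_cap --format=csv

# CUDA architectures the host PyTorch build has kernels for
python3 -c "import torch; print(torch.cuda.get_arch_list())"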

Thanks in advance.

@dominikstanojevic
Member

Hello,

Sorry for the late response. Which GPU are you using?

Best,
Dominik

@sivico26
Author

sivico26 commented Jul 9, 2024

Hello Dominik. It is okay. That job was running on an H100.

@dominikstanojevic
Member

OK, maybe the CUDA version is too old for the H100. Is it possible for you to build the singularity image yourself (this requires elevated privileges)?

If it is, please change these lines:

  1. Line 2: From: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
  2. Line 22: wget -q -O libtorch.zip https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Bcu118.zip
  3. Line 33: From: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
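
For reference, a rough sketch of the rebuild, assuming the image is built from the Singularity definition file in the cloned repository (the definition filename and paths below are placeholders):

# After editing the three lines listed above in the definition file:
cd /path/to/herro_cloned_repository
sudo singularity build herro.sif herro.def   # .def filename is a placeholder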

Hopefully this helps.

Best,
Dominik

@sivico26
Copy link
Author

sivico26 commented Jul 9, 2024

Thank you for the suggestions, Dominik.

I ran dorado correct on the same data a while ago with successful results. After that, I removed the herro alignments to save disk space, so I cannot resume from where I left off. I would say it does not matter anymore.

I left the issue open in case it was something you should be aware of, but if you suspect it is just a CUDA version mismatch then it sounds like a minor thing.

I would say we can close this issue.
