Core dumped at herro inference #36

Open
sivico26 opened this issue May 30, 2024 · 4 comments

@sivico26

Hello there,

I am trying to run herro on quite a big dataset (a 4 Gb plant genome, ~72x depth). I already did the AvA step, but now I am struggling with the inference step.

The command I am using is:

herro="$herro_dir/herro.sif"   ## herro_dir -> /path/to/herro_cloned_repository
mnt_alns="/data/out_mappings"
mnt_reads="/data/ont_reads.fastq.gz"

singularity run --nv $herro inference -t 64 -m /herro/model_v0.1.pt --read-alns $mnt_alns -b 128 $mnt_reads /results/corrected_reads.fasta

But I am getting this error:

Error log
thread '<unnamed>' panicked at src/inference.rs:172:64:
called `Result::unwrap()` on an `Err` value: Torch("The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File \"code/__torch__/model.py\", line 31, in forward
    target_positions: List[Tensor]) -> Tuple[Tensor, Tensor]:
    embedding = self.embedding
    bases_embeds = (embedding).forward(bases, )
                    ~~~~~~~~~~~~~~~~~~ <--- HERE
    _0 = [bases_embeds, torch.unsqueeze(qualities, -1)]
    x = torch.cat(_0, -1)
  File \"code/__torch__/torch/nn/modules/sparse.py\", line 18, in forward
    _0 = __torch__.torch.nn.functional.embedding
    weight = self.weight
    _1 = _0(input, weight, 11, None, 2., False, False, )
         ~~ <--- HERE
    return _1
  File \"code/__torch__/torch/nn/functional.py\", line 37, in embedding
  else:
    input0 = input
  _3 = torch.embedding(weight, input0, padding_idx0, scale_grad_by_freq, sparse)
       ~~~~~~~~~~~~~~~ <--- HERE
  return _3
def batch_norm(input: Tensor,

Traceback of TorchScript, original code (most recent call last):
  File \"/raid/scratch/stanojevicd/projects/haec-BigBird/model.py\", line 118, in forward
        '''
        # (batch_size, sequence_length, num_alignment_rows, bases_embedding_size)
        bases_embeds = self.embedding(bases)
                       ~~~~~~~~~~~~~~ <--- HERE
    
        # concatenate base qualities to embedding vectors
  File \"/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/sparse.py\", line 162, in forward
    def forward(self, input: Tensor) -> Tensor:
        return F.embedding(
               ~~~~~~~~~~~ <--- HERE
            input, self.weight, self.padding_idx, self.max_norm,
            self.norm_type, self.scale_grad_by_freq, self.sparse)
  File \"/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/functional.py\", line 2233, in embedding
        # remove once script supports set_grad_enabled
        _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Given the final lines of the log, I thought it was a stochastic error, but I ran it again and got the same result, so it seems consistent. Do you have any idea of what could be happening?
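
In case it helps with the diagnosis: the "no kernel image is available for execution on the device" message usually means the libtorch build inside the image ships no kernels for the GPU's compute capability. A rough way to compare the two, assuming nvidia-smi from a recent driver and a PyTorch install on the host (illustrative only; the container bundles its own libtorch, which may support a different set of architectures):

# Compute capability of the GPU (an H100 reports 9.0)
nvidia-smi --query-gpu=name,compute_cap --format=csv

# CUDA architectures the host PyTorch build has kernels for
python3 -c "import torch; print(torch.cuda.get_arch_list())"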

Thanks in advance.

@dominikstanojevic
Member

Hello,

Sorry for the late response. Which GPU are you using?

Best,
Dominik

@sivico26
Author

sivico26 commented Jul 9, 2024

Hello Dominik. It is okay. That job was running on an H100.

@dominikstanojevic
Member

OK, maybe the CUDA version is too old for the H100. Is it possible for you to build the singularity image yourself (this requires elevated privileges)?

If it is, please change these lines:

  1. Line 2: From: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
  2. Line 22: wget -q -O libtorch.zip https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Bcu118.zip
  3. Line 33: From: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
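
For reference, a rough sketch of the rebuild, assuming the image is built from the Singularity definition file in the cloned repository (the definition filename and paths below are placeholders):

# After editing the three lines listed above in the definition file:
cd /path/to/herro_cloned_repository
sudo singularity build herro.sif herro.def   # .def filename is a placeholder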

Hopefully this helps.

Best,
Dominik

@sivico26
Copy link
Author

sivico26 commented Jul 9, 2024

Thank you for the suggestions, Dominik.

I ran dorado correct on the same data a while ago with successful results. After that, I removed the herro alignments to save disk space, so I cannot resume from where I left off. I would say it does not matter anymore.

I left the issue open in case it was something you should be aware of, but if you suspect it is just a CUDA version mismatch then it sounds like a minor thing.

I would say we can close this issue.
