TransientError Condition? OSError: [Errno 5] Input/output error #591

Open
nkemnitz opened this issue Dec 24, 2023 · 20 comments

@nkemnitz
Collaborator

nkemnitz commented Dec 24, 2023

Part of a longer run that crashed last night with:

2023-12-24 03:28:09.049 ERROR    mazepa /home/nkemnitz/zetta_utils/zetta_utils/mazepa/execution_state.py: 129
                                 Task traceback: Traceback (most recent call last):
                                   File "/opt/zetta_utils/zetta_utils/mazepa/worker.py", line 45, in run_worker
                                     task_msgs = task_queue.pull(max_num=max_pull_num)
                                   File "/opt/zetta_utils/zetta_utils/message_queues/sqs/queue.py", line 89, in pull
                                     payload = serialization.deserialize(tq_task.task_ser)
                                   File "/opt/zetta_utils/zetta_utils/message_queues/serialization.py", line 31, in deserialize
                                     result = _deserialize(s, pickle)
                                   File "/opt/zetta_utils/zetta_utils/message_queues/serialization.py", line 25, in _deserialize
                                     result = module.loads(zlib.decompress(codecs.decode(s.encode(), "base64")))
                                   File "/opt/conda/lib/python3.10/encodings/__init__.py", line 99, in search_function
                                   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
                                   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
                                   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
                                   File "<frozen importlib._bootstrap_external>", line 879, in exec_module
                                   File "<frozen importlib._bootstrap_external>", line 1016, in get_code
                                   File "<frozen importlib._bootstrap_external>", line 1074, in get_data
                                 OSError: [Errno 5] Input/output error

Not sure what causes it, but restarting the same flow worked without a problem, so this must be another retriable error?

Update: 2024-07-30
I have not seen this error in a while, which makes me think it is indeed related to Image Streaming on GKE. It never happened before we enabled it, and now that Image Streaming doesn't seem to work anymore, the error is gone, too.

@supersergiy
Member

Looks like it!

@nkemnitz
Collaborator Author

nkemnitz commented Feb 6, 2024

Getting this now extremely often when cold-starting a k8s cluster:

(/home/nkemnitz/zetta/zetta_utils/venv-3.11) nkemnitz@Eriador:~/zetta/zetta_utils$ kubectl logs --previous hissing-piquant-bear-of-honeydew-554dfdd88f-zqjl9
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/bin/zetta:5 in <module>                                           │
│                                                                              │
│   2 # -*- coding: utf-8 -*-                                                  │
│   3 import re                                                                │
│   4 import sys                                                               │
│ ❱ 5 from zetta_utils.cli.main import cli                                     │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│   8 │   sys.exit(cli())                                                      │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/__init__.py:3 in <module>                       │
│                                                                              │
│    1 # pylint: disable=unused-import, import-outside-toplevel                │
│    2 """Zetta AI Computational Connectomics Toolkit."""                      │
│ ❱  3 from . import log, typing, parsing, builder, common                     │
│    4 from . import geometry, distributions, layer, ng                        │
│    5                                                                         │
│    6 builder.registry.MUTLIPROCESSING_INCOMPATIBLE_CLASSES.add("mazepa")     │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/parsing/__init__.py:2 in <module>               │
│                                                                              │
│   1 from . import cue                                                        │
│ ❱ 2 from . import ngl_state                                                  │
│   3 from . import json                                                       │
│   4                                                                          │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/parsing/ngl_state.py:16 in <module>             │
│                                                                              │
│    13 │   make_layer,                                                        │
│    14 )                                                                      │
│    15                                                                        │
│ ❱  16 from zetta_utils.geometry import BBox3D, Vec3D                         │
│    17 from zetta_utils.log import get_logger                                 │
│    18                                                                        │
│    19 logger = get_logger("zetta_utils")                                     │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/geometry/__init__.py:2 in <module>              │
│                                                                              │
│   1 from .vec import Vec3D, IntVec3D, RawVec3D                               │
│ ❱ 2 from .bbox import BBox3D                                                 │
│   3 from .bbox_strider import BBoxStrider                                    │
│   4                                                                          │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/geometry/bbox.py:10 in <module>                 │
│                                                                              │
│     7 import attrs                                                           │
│     8 from typeguard import typechecked                                      │
│     9                                                                        │
│ ❱  10 from zetta_utils import builder                                        │
│    11 from zetta_utils.geometry.vec import VEC3D_PRECISION                   │
│    12                                                                        │
│    13 from . import Vec3D                                                    │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/builder/__init__.py:11 in <module>              │
│                                                                              │
│    8 │   get_initial_builder_spec,                                           │
│    9 │   UnpicklableDict,                                                    │
│   10 )                                                                       │
│ ❱ 11 from . import built_in_registrations                                    │
│   12                                                                         │
│   13 PARALLEL_BUILD_ALLOWED: bool = False                                    │
│   14                                                                         │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/builder/built_in_registrations.py:5 in <module> │
│                                                                              │
│    2                                                                         │
│    3 from typing import Any, Callable, Optional                              │
│    4                                                                         │
│ ❱  5 import torch  # pylint: disable=unused-import                           │
│    6                                                                         │
│    7 from .building import BuilderPartial                                    │
│    8 from .registry import register                                          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/__init__.py:1465 in <module>   │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_meta_registrations.py:7 in    │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_decomp/__init__.py:169 in     │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_decomp/decompositions.py:10   │
│ in <module>                                                                  │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_prims/__init__.py:33 in       │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_subclasses/__init__.py:3 in   │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py:13  │
│ in <module>                                                                  │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_guards.py:14 in <module>      │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/__init__.py:73 in <module>     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/__init__.py:75 in        │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/polyfuncs.py:11 in       │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/specialpolys.py:297 in   │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/rings.py:30 in <module>  │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/printing/__init__.py:25 in     │
│ <module>                                                                     │
│ in _find_and_load:1027                                                       │
│ in _find_and_load_unlocked:1006                                              │
│ in _load_unlocked:688                                                        │
│ in exec_module:879                                                           │
│ in get_code:1016                                                             │
│ in get_data:1074                                                             │
╰──────────────────────────────────────────────────────────────────────────────╯
OSError: [Errno 5] Input/output error
Bus error (core dumped)

@trivoldus28
Contributor

How do we fix this? It is very annoying.

@supersergiy
Member

A fix would be to add a corresponding transient error condition to mazepa.

@trivoldus28
Contributor

@supersergiy Is this the list? https://github.com/ZettaAI/zetta_utils/blob/9295d11ce0d53c48fe1ce4b49b5901ce4f8b5838/zetta_utils/mazepa/transient_errors.py

Should it be something like this?

TransientErrorCondition(
    exception_type=OSError,
    text_signature="[Errno 5] Input/output error",
),
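
For readers unfamiliar with the pattern, here is a minimal sketch of how a condition like the one proposed above could be matched against a raised exception. The class below is a hypothetical stand-in written for illustration, not the actual definition in zetta_utils/mazepa/transient_errors.py; the field names are taken from the snippet above, while the `does_match` method and the substring-matching rule are assumptions.

```python
from dataclasses import dataclass
from typing import Type


@dataclass(frozen=True)
class TransientErrorCondition:
    # Hypothetical stand-in for illustration; the real class may differ.
    exception_type: Type[BaseException]
    text_signature: str = ""

    def does_match(self, exc: BaseException) -> bool:
        # Assumed rule: the type must match and the signature must be
        # a substring of the exception message.
        return isinstance(exc, self.exception_type) and self.text_signature in str(exc)


io_error_condition = TransientErrorCondition(
    exception_type=OSError,
    text_signature="[Errno 5] Input/output error",
)

# OSError(errno, strerror) renders as "[Errno <errno>] <strerror>".
print(io_error_condition.does_match(OSError(5, "Input/output error")))
print(io_error_condition.does_match(OSError(2, "No such file or directory")))
```

Under this assumed rule the first call matches and the second does not, so unrelated OSError instances such as a missing file would still fail the task immediately.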

@supersergiy
Member

Yup, that's the one!

@supersergiy
Member

Sorry, I should've linked it in my comment to begin with.

@trivoldus28 trivoldus28 self-assigned this Feb 21, 2024
@trivoldus28
Contributor

Made a branch for the fix: https://github.com/ZettaAI/zetta_utils/tree/tri/transient-errors. Need to test though before making a PR.

@nkemnitz
Collaborator Author

nkemnitz commented Jul 2, 2024

Another one:
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

Full traceback:

Task traceback: Traceback (most recent call last):          
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 61, in get_connection                    
    conn = self.pool.get(block=False)                       
           ^^^^^^^^^^^^^^^^^^^^^^^^^^                       
  File "/usr/lib/python3.11/queue.py", line 168, in get     
    raise Empty            
_queue.Empty               
                           
During handling of the above exception, another exception occurred:                          
                           
Traceback (most recent call last):                          
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__                  
    return_value = self._call_task_fn(debug=debug)          
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^          
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn             
    return_value = self.fn(*self.args, **self.kwargs)       
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       
  File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/volumetric_callable_operation.py", line 90, in __call__   
    task_kwargs = _process_callable_kwargs(idx_input_padded, kwargs)                         
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                         
  File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/callable_operation.py", line 23, in _process_callable_kwargs                               
    result[k] = v.read_with_procs(idx)                      
                ^^^^^^^^^^^^^^^^^^^^^^                      
  File "/opt/zetta_utils/zetta_utils/layer/layer_base.py", line 53, in read_with_procs       
    data_backend = self.backend.read(idx=idx_proced)        
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^        
  File "/opt/zetta_utils/zetta_utils/layer/volumetric/cloudvol/backend.py", line 198, in read
    data_raw = cvol[idx.to_slices()]                        
               ~~~~^^^^^^^^^^^^^^^^^                        
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/frontends/precomputed.py", line 551, in __getitem__               
    img = self.download(requested_bbox, self.mip)           
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^           
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/frontends/precomputed.py", line 731, in download                  
    tup = self.image.download(                              
          ^^^^^^^^^^^^^^^^^^^^                              
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/__init__.py", line 200, in download  
    return rx.download(    
           ^^^^^^^^^^^^    
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 295, in download        
    download_chunks_threaded(                               
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 599, in download_chunks_threaded                         
    schedule_jobs(         
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 150, in schedule_jobs                         
    return schedule_threaded_jobs(fns, concurrency, progress, total)                         
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                         
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 37, in schedule_threaded_jobs                 
    with ThreadedQueue(n_threads=concurrency) as tq:        
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 257, in __exit__                         
    self.wait(progress=self.with_progress)                  
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 227, in wait                             
    self._check_errors()   
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 191, in _check_errors                    
    raise err              
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 153, in _consume_queue                   
    self._consume_queue_execution(fn)                       
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 180, in _consume_queue_execution         
    fn()                   
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 32, in realupdatefn                           
    res = fn()             
          ^^^^             
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 554, in process         
    labels, bbox = download_chunk(                          
                   ^^^^^^^^^^^^^^^                          
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 510, in download_chunk  
    ).get([ filename ], raw=True)                           
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^                           
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 100, in inner_decor                           
    return fn(*args, **kwargs)                              
           ^^^^^^^^^^^^^^^^^^^                              
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 441, in get  
    ret = download(first(paths))                            
          ^^^^^^^^^^^^^^^^^^^^^^                            
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 426, in download                              
    raise error            
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 410, in download                              
    with self._get_connection() as conn:                    
         ^^^^^^^^^^^^^^^^^^^^^^                             
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 311, in _get_connection                       
    return self._interface_cls(                             
           ^^^^^^^^^^^^^^^^^^^^                             
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/interfaces.py", line 503, in __init__                              
    self._bucket = GC_POOL[GCloudBucketPoolParams(self._path.bucket, self._request_payer)].get_connection(secrets, None)      
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^      
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 64, in get_connection                    
    conn = self._create_connection(secrets, endpoint)       
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 326, in wrapped_f
    return self(f, *args, **kw)                             
           ^^^^^^^^^^^^^^^^^^^^                             
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 406, in __call__ 
    do = self.iter(retry_state=retry_state)                 
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                 
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 362, in iter     
    raise retry_exc.reraise()                               
          ^^^^^^^^^^^^^^^^^^^                               
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 195, in reraise  
    raise self.last_attempt.result()                        
          ^^^^^^^^^^^^^^^^^^^^^^^^^^                        
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result                
    return self.__get_result()                              
           ^^^^^^^^^^^^^^^^^^^                              
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result          
    raise self._exception  
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 409, in __call__ 
    result = fn(*args, **kwargs)                            
             ^^^^^^^^^^^^^^^^^^^                            
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 150, in _create_connection               
    client = Client(       
             ^^^^^^^       
  File "/usr/local/lib/python3.11/dist-packages/google/cloud/storage/client.py", line 235, in __init__                        
    if self._credentials.universe_domain != self.universe_domain:                            
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                    
  File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/credentials.py", line 154, in universe_domain      
    self._universe_domain = _metadata.get_universe_domain(  
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  
  File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/_metadata.py", line 284, in get_universe_domain    
    universe_domain = get( 
                      ^^^^ 
  File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/_metadata.py", line 217, in get                    
    raise exceptions.TransportError(                        
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

@nkemnitz
Collaborator Author

And another one:

mazepa /home/nkemnitz/zetta_utils/zetta_utils/mazepa/execution_state.py: 138
Task traceback: Traceback (most recent call last):
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__
    return_value = self._call_task_fn(debug=debug)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
    return_value = self.fn(*self.args, **self.kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/subchunkable_apply_flow.py", line 73, in __call__
    mazepa.Executor(
  File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 47, in __call__
    return execute(
           ^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 129, in execute
    _execute_from_state(
  File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 179, in _execute_from_state
    submit_ready_tasks(
  File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 216, in submit_ready_tasks
    task_outcomes = outcome_queue.pull(max_num=100)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/autoexecute_task_queue.py", line 43, in pull
    results.append(execute_task(task, self.debug, self.handle_exceptions))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/autoexecute_task_queue.py", line 56, in execute_task
    finished_processing, outcome = process_task_message(
                                   ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/worker.py", line 107, in process_task_message
    outcome = task(debug=debug, handle_exceptions=handle_exceptions)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 110, in __call__
    return_value = self._call_task_fn(debug=debug)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
    return_value = self.fn(*self.args, **self.kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/compute_field_flow.py", line 147, in __call__
    src_data, src_field_data, src_translation = translation_adjusted_download(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/common.py", line 45, in translation_adjusted_download
    xy_translation_raw = alignment.field.profile_field2d_percentile(field_data)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/internal/alignment/field.py", line 26, in profile_field2d_percentile
    if nonzero_field.sum() == 0 or len(nonzero_field.shape) == 1:
       ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@nkemnitz
Collaborator Author

This one caused a crash tonight, but the job finished without problems after a restart. I do think, though, that this could be a legitimate error, so maybe it should get retried just a few times:

2024-10-15 21:53:38.871 ERROR    mazepa /home/nkemnitz/zetta_utils/zetta_utils/mazepa/execution_state.py: 138
                                 Task traceback: Traceback (most recent call last):
                                   File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__
                                     return_value = self._call_task_fn(debug=debug)
                                   File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
                                     return_value = self.fn(*self.args, **self.kwargs)
                                   File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/compute_field_flow.py", line 101, in __call__
                                     src_data, src_field_data, src_translation = translation_adjusted_download(
                                   File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/common.py", line 45, in translation_adjusted_download
                                     xy_translation_raw = alignment.field.profile_field2d_percentile(field_data)
                                   File "/opt/zetta_utils/zetta_utils/internal/alignment/field.py", line 26, in profile_field2d_percentile
                                     if nonzero_field.sum() == 0 or len(nonzero_field.shape) == 1:
                                 RuntimeError: CUDA error: device-side assert triggered
                                 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
                                 For debugging consider passing CUDA_LAUNCH_BLOCKING=1
                                 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions
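
The "retry just a few times" idea can be sketched as a bounded retry loop. This is only an illustration of the counting logic: `is_maybe_transient` and `run_with_limited_retries` are hypothetical names, not part of mazepa. It also glosses over one practical point: after a CUDA device-side assert the CUDA context in that process is typically unusable, so a real retry would probably need to happen in a fresh worker process rather than in-process as shown here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def is_maybe_transient(exc: BaseException) -> bool:
    # Hypothetical predicate: match the RuntimeError message from the log above.
    return (
        isinstance(exc, RuntimeError)
        and "CUDA error: device-side assert triggered" in str(exc)
    )


def run_with_limited_retries(
    fn: Callable[[], T], max_retries: int = 3, delay_sec: float = 0.0
) -> T:
    """Call fn; retry at most max_retries times on a possibly transient error."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            # Give up on unrelated errors, or once the retry budget is spent.
            if not is_maybe_transient(exc) or attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(delay_sec)
```

With a budget like this, an error that really is legitimate still surfaces after `max_retries + 1` attempts instead of being retried forever.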

@supersergiy
Member

I think unless we write our own CUDA code, nothing we run would trigger a device-side assert.

@trivoldus28
Contributor

trivoldus28 commented Oct 16, 2024 via email

@supersergiy
Member

Yeah, I agree. What I meant to say is that it's never a legitimate error, so it should be safe to just add it to the transient list.

@trivoldus28
Contributor

trivoldus28 commented Oct 16, 2024 via email

@nkemnitz
Collaborator Author

Do you prefer adding these errors to the exception list over the "treat everything as transient" option?

@nkemnitz
Collaborator Author

  • OSError: [Errno 5] Input/output error
  • google.auth.exceptions.TransportError: Failed to retrieve
  • RuntimeError: CUDA error: an illegal memory access was encountered
  • RuntimeError: CUDA error: device-side assert triggered

Maybe we can combine the last two... but I'm not sure what other "RuntimeError: CUDA error:" messages there might be.
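
To make the "combine the last two" option concrete, here is a rough sketch of an explicit signature list in which a single "CUDA error:" prefix covers both CUDA messages. The list format, `TRANSIENT_SIGNATURES`, `qualified_names`, and `is_transient` are all hypothetical and chosen for illustration. Exception types are matched by fully qualified class name only so that the example runs without google-auth installed; a real implementation would more likely reference the exception classes directly.

```python
from typing import List, Set, Tuple

# (fully qualified exception class name, substring of the message)
TRANSIENT_SIGNATURES: List[Tuple[str, str]] = [
    ("builtins.OSError", "[Errno 5] Input/output error"),
    ("google.auth.exceptions.TransportError", "Failed to retrieve"),
    # One prefix instead of two entries: covers "an illegal memory access was
    # encountered" and "device-side assert triggered", plus any other message
    # containing this prefix.
    ("builtins.RuntimeError", "CUDA error:"),
]


def qualified_names(exc: BaseException) -> Set[str]:
    # Names of the exception's class and all of its base classes.
    return {f"{cls.__module__}.{cls.__qualname__}" for cls in type(exc).__mro__}


def is_transient(exc: BaseException) -> bool:
    names = qualified_names(exc)
    return any(
        name in names and text in str(exc) for name, text in TRANSIENT_SIGNATURES
    )


print(is_transient(RuntimeError("CUDA error: an illegal memory access was encountered")))
print(is_transient(RuntimeError("CUDA error: device-side assert triggered")))
print(is_transient(RuntimeError("shape mismatch")))
```

The trade-off is the one raised above: the shorter signature is easier to maintain, but it also retries CUDA errors nobody has looked at yet, which may or may not be desirable.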

@supersergiy
Member

Explicit error retries do seem better than implicit ones. The list looks good!

@nkemnitz
Collaborator Author

Also, I think the OSError pretty much went away together with the Image Streaming speedup...

@trivoldus28
Contributor

trivoldus28 commented Oct 16, 2024 via email
