TransientError Condition? OSError: [Errno 5] Input/output error #591
Looks like it! |
Getting this now extremely often when cold-starting a k8s cluster:
|
How do we fix this? It is very annoying. |
A fix would be to add a corresponding transient error condition to mazepa |
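As a rough illustration of what such a condition could look like, here is a minimal, self-contained sketch. The names `TransientCondition`, `TRANSIENT_CONDITIONS`, and `is_transient` are hypothetical and are not taken from mazepa's `transient_errors.py`; the sketch only assumes that a condition pairs an exception type with a substring of the error message.

```python
from dataclasses import dataclass
from typing import Type


@dataclass(frozen=True)
class TransientCondition:
    """Hypothetical stand-in for a transient error condition (not mazepa's class)."""

    exception_type: Type[BaseException]
    text_signature: str = ""  # substring that must appear in str(exc)

    def matches(self, exc: BaseException) -> bool:
        # Both the exception type and the message substring must match.
        return isinstance(exc, self.exception_type) and self.text_signature in str(exc)


# Explicit list: only errors named here are treated as transient.
TRANSIENT_CONDITIONS = (
    TransientCondition(OSError, "Input/output error"),
)


def is_transient(exc: BaseException) -> bool:
    """Return True if any listed condition matches the exception."""
    return any(cond.matches(exc) for cond in TRANSIENT_CONDITIONS)
```

Matching on both the type and the message keeps unrelated `OSError`s (for example a missing file) from being retried silently.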
@supersergiy Is this the list? https://github.com/ZettaAI/zetta_utils/blob/9295d11ce0d53c48fe1ce4b49b5901ce4f8b5838/zetta_utils/mazepa/transient_errors.py Should it be something like?
|
Yup, that's the one! |
Sorry I should've linked it in my comment to begin with |
Made a branch for the fix: https://github.com/ZettaAI/zetta_utils/tree/tri/transient-errors. Need to test though before making a PR. |
Another one. Full traceback:
Task traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 61, in get_connection
conn = self.pool.get(block=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/queue.py", line 168, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__
return_value = self._call_task_fn(debug=debug)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
return_value = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/volumetric_callable_operation.py", line 90, in __call__
task_kwargs = _process_callable_kwargs(idx_input_padded, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/callable_operation.py", line 23, in _process_callable_kwargs
result[k] = v.read_with_procs(idx)
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/layer/layer_base.py", line 53, in read_with_procs
data_backend = self.backend.read(idx=idx_proced)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/layer/volumetric/cloudvol/backend.py", line 198, in read
data_raw = cvol[idx.to_slices()]
~~~~^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/frontends/precomputed.py", line 551, in __getitem__
img = self.download(requested_bbox, self.mip)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/frontends/precomputed.py", line 731, in download
tup = self.image.download(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/__init__.py", line 200, in download
return rx.download(
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 295, in download
download_chunks_threaded(
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 599, in download_chunks_threaded
schedule_jobs(
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 150, in schedule_jobs
return schedule_threaded_jobs(fns, concurrency, progress, total)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 37, in schedule_threaded_jobs
with ThreadedQueue(n_threads=concurrency) as tq:
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 257, in __exit__
self.wait(progress=self.with_progress)
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 227, in wait
self._check_errors()
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 191, in _check_errors
raise err
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 153, in _consume_queue
self._consume_queue_execution(fn)
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 180, in _consume_queue_execution
fn()
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 32, in realupdatefn
res = fn()
^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 554, in process
labels, bbox = download_chunk(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 510, in download_chunk
).get([ filename ], raw=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 100, in inner_decor
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 441, in get
ret = download(first(paths))
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 426, in download
raise error
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 410, in download
with self._get_connection() as conn:
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 311, in _get_connection
return self._interface_cls(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/interfaces.py", line 503, in __init__
self._bucket = GC_POOL[GCloudBucketPoolParams(self._path.bucket, self._request_payer)].get_connection(secrets, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 64, in get_connection
conn = self._create_connection(secrets, endpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 326, in wrapped_f
return self(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 406, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 362, in iter
raise retry_exc.reraise()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 195, in reraise
raise self.last_attempt.result()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 409, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 150, in _create_connection
client = Client(
^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/google/cloud/storage/client.py", line 235, in __init__
if self._credentials.universe_domain != self.universe_domain:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/credentials.py", line 154, in universe_domain
self._universe_domain = _metadata.get_universe_domain(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/_metadata.py", line 284, in get_universe_domain
universe_domain = get(
^^^^
File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/_metadata.py", line 217, in get
raise exceptions.TransportError(
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable |
And another one: mazepa /home/nkemnitz/zetta_utils/zetta_utils/mazepa/execution_state.py: 138
Task traceback: Traceback (most recent call last):
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__
return_value = self._call_task_fn(debug=debug)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
return_value = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/subchunkable_apply_flow.py", line 73, in __call__
mazepa.Executor(
File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 47, in __call__
return execute(
^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 129, in execute
_execute_from_state(
File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 179, in _execute_from_state
submit_ready_tasks(
File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 216, in submit_ready_tasks
task_outcomes = outcome_queue.pull(max_num=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/autoexecute_task_queue.py", line 43, in pull
results.append(execute_task(task, self.debug, self.handle_exceptions))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/autoexecute_task_queue.py", line 56, in execute_task
finished_processing, outcome = process_task_message(
^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/worker.py", line 107, in process_task_message
outcome = task(debug=debug, handle_exceptions=handle_exceptions)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 110, in __call__
return_value = self._call_task_fn(debug=debug)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
return_value = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/compute_field_flow.py", line 147, in __call__
src_data, src_field_data, src_translation = translation_adjusted_download(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/common.py", line 45, in translation_adjusted_download
xy_translation_raw = alignment.field.profile_field2d_percentile(field_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/internal/alignment/field.py", line 26, in profile_field2d_percentile
if nonzero_field.sum() == 0 or len(nonzero_field.shape) == 1:
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. |
This one caused a crash tonight, but the job finished without problems after a restart. I do think, though, that this could be a legitimate error, so maybe it should only get retried a few times:
|
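A capped retry of the kind suggested above could be sketched as follows. This is only an illustration under stated assumptions: `run_with_limited_retries`, `max_retries`, and `CUDA_SIGNATURE` are hypothetical names, not part of mazepa, and the sketch retries in the same process, which may not be enough if the CUDA context is left in a bad state after the error.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Substring taken from the error message in the traceback above.
CUDA_SIGNATURE = "CUDA error: an illegal memory access"


def run_with_limited_retries(fn: Callable[[], T], max_retries: int = 3) -> T:
    """Call fn, retrying at most max_retries times on the matching RuntimeError."""
    attempt = 0
    while True:
        try:
            return fn()
        except RuntimeError as exc:
            # Re-raise unrelated errors, or give up once the retry budget is spent.
            if CUDA_SIGNATURE not in str(exc) or attempt >= max_retries:
                raise
            attempt += 1
```

With a small budget, a one-off hardware or driver hiccup is absorbed, while an error that keeps recurring still surfaces after `max_retries + 1` attempts.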
I think unless we write our own CUDA code, nothing we run would trigger a device-side assert |
I think it certainly can just be a stochastic hardware/sw error
|
Yeah I agree. What I meant to say is that it's never a legitimate error, so should be safe to just add to the transient list |
Ah gotcha yea 100% agree
|
Do you prefer adding these errors to the exception list over the "treat everything as transient" option? |
Maybe can combine the last two... but not sure what other |
Explicit error retries do seem better than implicit ones. The list looks good! |
Also, I think the OSError pretty much went away together with the Image Streaming speedup... |
Ah yea, that OSError is probably due to the Image Streaming server being overloaded
|
Part of a longer run that crashed last night with:
Not sure what causes it, but restarting the same flow worked without a problem, so must be another retriable error?
Update: 2024-07-30
Have not seen this error in a while, which makes me think it is indeed related to Image Streaming on GKE. It never happened before we enabled it, and now that Image Streaming doesn't seem to work anymore, the error is gone, too.