Verify issue running tensorflow/tensorflow:latest-gpu on dual RTX-A4500 with NVLink, not on dual RTX-4090 PCIe x8: tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence #16

Open
obriensystems opened this issue Mar 10, 2024 · 0 comments
obriensystems commented Mar 10, 2024

The Docker image was rebuilt after the tensorflow/tensorflow:latest-gpu base image changed. So far the issue appears only in containers newly built within the last week.

FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]

#RUN pip install -U jupyterlab pandas matplotlib
#EXPOSE 8888
#ENTRYPOINT ["jupyter", "lab","--ip=0.0.0.0","--allow-root","--no-browser"]

[+] Building 137.8s (9/9) FINISHED                                                                           docker:default
 => [internal] load build definition from Dockerfile                                                                   0.0s
 => => transferring dockerfile: 285B                                                                                   0.0s
 => [internal] load .dockerignore                                                                                      0.0s
 => => transferring context: 2B                                                                                        0.0s
 => [internal] load metadata for docker.io/tensorflow/tensorflow:latest-gpu                                            1.2s
 => [auth] tensorflow/tensorflow:pull token for registry-1.docker.io                                                   0.0s
 => [1/3] FROM docker.io/tensorflow/tensorflow:latest-gpu@sha256:4ab9ffddd6ffacc9251ac6439f431eb38d66200d3f52397b5d  135.7s

2024-03-10 04:23:07.220266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 17782 MB memory:  -> device: 1, name: NVIDIA RTX A4500, pci bus id: 0000:02:00.0, compute capability: 8.6
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 ━━━━━━━━━━━━━━━━━━━━ 3s 0us/step
Epoch 1/100
2024-03-10 04:24:13.940526: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8906
2024-03-10 04:24:13.966146: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8906
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 300ms/step - accuracy: 0.0478 - loss: 11.3363
2024-03-10 04:24:26.203680: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:26.203757: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2024-03-10 04:24:26.206671: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:26.206710: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 75s 417ms/step - accuracy: 0.0485 - loss: 11.0441
Epoch 2/100
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 317ms/step - accuracy: 0.1615 - loss: 8.3548
2024-03-10 04:24:37.038815: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:37.038866: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2024-03-10 04:24:37.039859: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:37.039936: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 316ms/step - accuracy: 0.1601 - loss: 8.1802
Epoch 3/100
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 310ms/step - accuracy: 0.2835 - loss: 7.4143
2024-03-10 04:24:44.885256: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.885302: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2024-03-10 04:24:44.896965: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.897036: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 307ms/step - accuracy: 0.2795 - loss: 7.2638
Epoch 4/100
19/25 ━━━━━━━━━━━━━━━━━━━━ 1s 309ms/step - accuracy: 0.3960 - loss: 6.7741
Traceback (most recent call last):
  File "/src/tflow.py", line 81, in <module>
    parallel_model.fit(x_train, y_train, epochs=100, batch_size=2048)#7168)#7168)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 118, in error_handler
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 323, in fit
    logs = self.train_function(iterator)
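
For reference when reading the log, the 25 steps per epoch follow from the batch size in the traceback. The sketch below is only arithmetic, not part of tflow.py: it assumes the full 50,000-image CIFAR-100 training split is passed to fit(), and the helper names are hypothetical. It shows that the 25th batch is a partial one, which is the point in each epoch where the "End of sequence" warnings are logged.

```python
import math

NUM_TRAIN_SAMPLES = 50_000  # CIFAR-100 training split size (assumed to be used in full)
GLOBAL_BATCH_SIZE = 2048    # batch_size passed to fit() in the traceback


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover the dataset once (hypothetical helper)."""
    return math.ceil(num_samples / batch_size)


def last_batch_size(num_samples: int, batch_size: int) -> int:
    """Size of the final batch in an epoch; equals batch_size when it divides evenly."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size


steps = steps_per_epoch(NUM_TRAIN_SAMPLES, GLOBAL_BATCH_SIZE)
tail = last_batch_size(NUM_TRAIN_SAMPLES, GLOBAL_BATCH_SIZE)

print(f"steps per epoch: {steps}")   # matches the 25/25 shown in the progress bar
print(f"last batch size: {tail}")    # 24 full batches of 2048, then the remainder
```

Whether the partial final batch is related to the warnings or to the epoch 4 traceback is not established by this log; the calculation only explains the step count.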

@obriensystems obriensystems self-assigned this Mar 10, 2024
@obriensystems obriensystems changed the title from "verify issue on dual RTX-A4500 with nvlink not on dual RTX-4090 PCIeX8 : tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence" to "verify issue running tensorflow/tensorflow:latest-gpu on dual RTX-A4500 with nvlink not on dual RTX-4090 PCIeX8 : tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence" on Mar 10, 2024