tensorflow metal on M4 Pro #30

Open
obriensystems opened this issue Nov 17, 2024 · 5 comments
obriensystems commented Nov 17, 2024

Issues with the installed TensorFlow version when using the tensorflow-metal plugin:
https://developer.apple.com/metal/tensorflow-plugin/

python -m pip show tensorflow-metal
python -m pip show tensorflow
python -m pip show keras
python tflow.py
python -m pip install --upgrade tensorflow
python -m pip install --upgrade tensorflow-metal
python -m pip install --upgrade tensorflow
python tflow.py
python -m pip uninstall tensorflow
python -m pip uninstall tensorflow-metal
python -m pip uninstall keras
python -m pip uninstall ml-dtypes
python -m pip uninstall tensorboard
python -m pip install tensorflow==2.14.0
python -m pip install tensorflow-metal==1.1.0


michaelobrien@mini08s-Mini src % python tflow.py                              
Traceback (most recent call last):
  File "/Users/michaelobrien/wse_github/obrienlabsdev/machine-learning/environments/windows/src/tflow.py", line 21, in <module>
    strategy = tf.distribute.OneDeviceStrategy(device="/gpu")
AttributeError: module 'tensorflow' has no attribute 'distribute'
michaelobrien@mini08s-Mini src % history
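
One possible cause of `module 'tensorflow' has no attribute 'distribute'` (an assumption, not confirmed in this session) is that the repeated uninstall/install cycle left a `tensorflow` directory without an `__init__.py`, which Python still imports as an empty namespace package. A minimal sketch for checking how a name resolves without importing it; `describe_module` is a hypothetical helper, not part of TensorFlow:

```python
import importlib.util


def describe_module(name):
    """Report how `name` would be resolved, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "origin": None, "namespace_only": False}
    is_package = spec.submodule_search_locations is not None
    return {
        "found": True,
        "origin": spec.origin,
        # A package with no __init__.py has origin None: a namespace package.
        "namespace_only": is_package and spec.origin is None,
    }


if __name__ == "__main__":
    print(describe_module("tensorflow"))
```

If `namespace_only` comes back `True`, deleting the leftover directory from site-packages and reinstalling is the usual remedy.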
@obriensystems obriensystems self-assigned this Nov 17, 2024
obriensystems commented Nov 17, 2024

michaelobrien@mini08s-Mini src % python3 version
/usr/local/bin/python3: can't open file '/Users/michaelobrien/wse_github/obrienlabsdev/machine-learning/environments/windows/src/version': [Errno 2] No such file or directory
michaelobrien@mini08s-Mini src % python3 --version
Python 3.9.6
michaelobrien@mini08s-Mini src % python3 -m venv ~/venv-metal
michaelobrien@mini08s-Mini src % source ~/venv-metal/bin/activate
(venv-metal) michaelobrien@mini08s-Mini src % python -m pip install -U pip
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pip in /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages (21.2.4)
Collecting pip
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
|████████████████████████████████| 1.8 MB 3.6 MB/s
Installing collected packages: pip
WARNING: The scripts pip, pip3 and pip3.9 are installed in '/Users/michaelobrien/Library/Python/3.9/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-24.3.1
WARNING: You are using pip version 21.2.4; however, version 24.3.1 is available.
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
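
The `(venv-metal)` prompt shows an activated environment, yet pip reports "Defaulting to user installation" and resolves packages under `/Library/Developer/CommandLineTools/...`. That suggests `python` may not be the venv's interpreter here, for example because of a shell alias (the cause is an assumption). A minimal standard-library sketch to check which interpreter is actually running; `in_virtualenv` is a hypothetical helper name:

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv and differs from base_prefix.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("executable:", sys.executable)
    print("prefix:    ", sys.prefix)
    print("base:      ", sys.base_prefix)
    print("in venv:   ", in_virtualenv())
```

Running `type python` in the shell would additionally show whether `python` is an alias rather than the venv binary.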
(venv-metal) michaelobrien@mini08s-Mini src % python -m pip install tensorflow
Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow
Downloading tensorflow-2.18.0-cp39-cp39-macosx_12_0_arm64.whl.metadata (4.0 kB)
Requirement already satisfied: absl-py>=1.0.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (2.1.0)
Requirement already satisfied: astunparse>=1.6.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (1.6.3)
Requirement already satisfied: flatbuffers>=24.3.25 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (24.3.25)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (0.6.0)
Requirement already satisfied: google-pasta>=0.1.1 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (0.2.0)
Requirement already satisfied: libclang>=13.0.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (18.1.1)
Requirement already satisfied: opt-einsum>=2.3.2 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (3.4.0)
Requirement already satisfied: packaging in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (24.2)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (4.25.5)
Requirement already satisfied: requests<3,>=2.21.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (2.32.3)
Requirement already satisfied: setuptools in /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages (from tensorflow) (58.0.4)
Requirement already satisfied: six>=1.12.0 in /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages (from tensorflow) (1.15.0)
Requirement already satisfied: termcolor>=1.1.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (2.5.0)
Requirement already satisfied: typing-extensions>=3.6.6 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (4.12.2)
Requirement already satisfied: wrapt>=1.11.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (1.14.1)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (1.68.0)
Collecting tensorboard<2.19,>=2.18 (from tensorflow)
Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Collecting keras>=3.5.0 (from tensorflow)
Downloading keras-3.6.0-py3-none-any.whl.metadata (5.8 kB)
Requirement already satisfied: numpy<2.1.0,>=1.26.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (2.0.2)
Requirement already satisfied: h5py>=3.11.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (3.12.1)
Collecting ml-dtypes<0.5.0,>=0.4.0 (from tensorflow)
Downloading ml_dtypes-0.4.1-cp39-cp39-macosx_10_9_universal2.whl.metadata (20 kB)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorflow) (0.37.1)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages (from astunparse>=1.6.0->tensorflow) (0.37.0)
Requirement already satisfied: rich in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from keras>=3.5.0->tensorflow) (13.9.4)
Requirement already satisfied: namex in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from keras>=3.5.0->tensorflow) (0.0.8)
Requirement already satisfied: optree in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from keras>=3.5.0->tensorflow) (0.13.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from requests<3,>=2.21.0->tensorflow) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from requests<3,>=2.21.0->tensorflow) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from requests<3,>=2.21.0->tensorflow) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from requests<3,>=2.21.0->tensorflow) (2024.8.30)
Requirement already satisfied: markdown>=2.6.8 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorboard<2.19,>=2.18->tensorflow) (3.7)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorboard<2.19,>=2.18->tensorflow) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from tensorboard<2.19,>=2.18->tensorflow) (3.1.3)
Requirement already satisfied: importlib-metadata>=4.4 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from markdown>=2.6.8->tensorboard<2.19,>=2.18->tensorflow) (8.5.0)
Requirement already satisfied: MarkupSafe>=2.1.1 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from werkzeug>=1.0.1->tensorboard<2.19,>=2.18->tensorflow) (3.0.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from rich->keras>=3.5.0->tensorflow) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from rich->keras>=3.5.0->tensorflow) (2.18.0)
Requirement already satisfied: zipp>=3.20 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard<2.19,>=2.18->tensorflow) (3.21.0)
Requirement already satisfied: mdurl~=0.1 in /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages (from markdown-it-py>=2.2.0->rich->keras>=3.5.0->tensorflow) (0.1.2)
Downloading tensorflow-2.18.0-cp39-cp39-macosx_12_0_arm64.whl (239.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.4/239.4 MB 101.9 MB/s eta 0:00:00
Downloading keras-3.6.0-py3-none-any.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 72.2 MB/s eta 0:00:00
Downloading ml_dtypes-0.4.1-cp39-cp39-macosx_10_9_universal2.whl (396 kB)
Downloading tensorboard-2.18.0-py3-none-any.whl (5.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 100.8 MB/s eta 0:00:00
Installing collected packages: ml-dtypes, tensorboard, keras, tensorflow
Attempting uninstall: ml-dtypes
Found existing installation: ml-dtypes 0.2.0
Uninstalling ml-dtypes-0.2.0:
Successfully uninstalled ml-dtypes-0.2.0
WARNING: The script tensorboard is installed in '/Users/michaelobrien/Library/Python/3.9/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts import_pb_to_tensorboard, saved_model_cli, tensorboard, tf_upgrade_v2, tflite_convert, toco and toco_from_protos are installed in '/Users/michaelobrien/Library/Python/3.9/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed keras-3.6.0 ml-dtypes-0.4.1 tensorboard-2.18.0 tensorflow-2.18.0
(venv-metal) michaelobrien@mini08s-Mini src % python -m pip install tensorflow-metal
Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow-metal
Downloading tensorflow_metal-1.1.0-cp39-cp39-macosx_12_0_arm64.whl.metadata (1.2 kB)
Requirement already satisfied: wheel~=0.35 in /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages (from tensorflow-metal) (0.37.0)
Requirement already satisfied: six>=1.15.0 in /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages (from tensorflow-metal) (1.15.0)
Downloading tensorflow_metal-1.1.0-cp39-cp39-macosx_12_0_arm64.whl (1.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 10.2 MB/s eta 0:00:00
Installing collected packages: tensorflow-metal
Successfully installed tensorflow-metal-1.1.0
(venv-metal) michaelobrien@mini08s-Mini src % python -c "import tensorflow as tf; print(tf.version)"
/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/tensorflow/__init__.py", line 437, in <module>
    _ll.load_library(_plugin_dir)
  File "/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/tensorflow/python/framework/load_library.py", line 151, in load_library
    py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/tensorflow-plugins/libmetal_plugin.dylib, 0x0006): Symbol not found: __ZN3tsl8internal10LogMessageC1EPKcii
  Referenced from: /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/tensorflow-plugins/libmetal_plugin.dylib
  Expected in: <2A053E7E-6DBA-37C2-A28E-1D52A5836870> /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
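
The missing symbol `__ZN3tsl8internal10LogMessageC1EPKcii` demangles to the C++ constructor `tsl::internal::LogMessage::LogMessage(char const*, int, int)`, so the Metal plugin expects a symbol that the installed tensorflow 2.18.0 library does not export. This looks like a version mismatch between tensorflow and tensorflow-metal. A minimal sketch that compares installed versions against the combination that ran successfully in this thread (tensorflow 2.14.0, tensorflow-metal 1.1.0, numpy 1.24.1); `KNOWN_GOOD` reflects this session only and is not an official compatibility matrix:

```python
from importlib import metadata

# Combination observed working in this thread; an assumption for other machines.
KNOWN_GOOD = {
    "tensorflow": "2.14.0",
    "tensorflow-metal": "1.1.0",
    "numpy": "1.24.1",
}


def installed_version(dist):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def compare(expected):
    """Map each distribution to (installed, expected, matches)."""
    report = {}
    for dist, want in expected.items():
        have = installed_version(dist)
        report[dist] = (have, want, have == want)
    return report


if __name__ == "__main__":
    for dist, (have, want, ok) in compare(KNOWN_GOOD).items():
        print(f"{dist}: installed={have} expected={want} {'OK' if ok else 'MISMATCH'}")
```

Reading versions through `importlib.metadata` avoids importing tensorflow, which is exactly the step that fails here.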

(venv-metal) michaelobrien@mini08s-Mini src % python -m pip list
Package                      Version
---------------------------- ---------
absl-py                      2.1.0
altgraph                     0.17.2
astunparse                   1.6.3
cachetools                   5.5.0
certifi                      2024.8.30
charset-normalizer           3.4.0
flatbuffers                  24.3.25
future                       0.18.2
gast                         0.6.0
google-auth                  2.36.0
google-auth-oauthlib         1.0.0
google-pasta                 0.2.0
grpcio                       1.68.0
h5py                         3.12.1
idna                         3.10
importlib_metadata           8.5.0
keras                        3.6.0
libclang                     18.1.1
macholib                     1.15.2
Markdown                     3.7
markdown-it-py               3.0.0
MarkupSafe                   3.0.2
mdurl                        0.1.2
ml-dtypes                    0.4.1
namex                        0.0.8
numpy                        2.0.2
oauthlib                     3.2.2
opt_einsum                   3.4.0
optree                       0.13.1
packaging                    24.2
pip                          24.3.1
protobuf                     4.25.5
pyasn1                       0.6.1
pyasn1_modules               0.4.1
Pygments                     2.18.0
requests                     2.32.3
requests-oauthlib            2.0.0
rich                         13.9.4
rsa                          4.9
setuptools                   58.0.4
six                          1.15.0
tensorboard                  2.18.0
tensorboard-data-server      0.7.2
tensorflow                   2.18.0
tensorflow-estimator         2.14.0
tensorflow-io-gcs-filesystem 0.37.1
tensorflow-metal             1.1.0
termcolor                    2.5.0
typing_extensions            4.12.2
urllib3                      2.2.3
Werkzeug                     3.1.3
wheel                        0.37.0
wrapt                        1.14.1
zipp                         3.21.0
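
The listing shows tensorflow 2.18.0 next to tensorflow-estimator 2.14.0, which looks like a leftover from the earlier 2.14 install. A small sketch that parses `pip list` text and flags companion packages whose major.minor version differs from tensorflow's; the helper names and the choice of companions are assumptions for illustration:

```python
def parse_pip_list(text):
    """Turn `pip list` output into {package: version}, skipping header rows."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        packages[parts[0]] = parts[1]
    return packages


def minor_mismatches(packages, core="tensorflow",
                     companions=("tensorboard", "tensorflow-estimator")):
    """List companions whose major.minor differs from the core package's."""
    def major_minor(version):
        return tuple(version.split(".")[:2])

    if core not in packages:
        return []
    want = major_minor(packages[core])
    return [name for name in companions
            if name in packages and major_minor(packages[name]) != want]


if __name__ == "__main__":
    sample = """Package Version
tensorboard 2.18.0
tensorflow 2.18.0
tensorflow-estimator 2.14.0
"""
    print(minor_mismatches(parse_pip_list(sample)))
```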

python -m pip uninstall tensorflow-metal
python -m pip uninstall tensorflow
python -m pip install tensorflow==2.14.0
python -m pip install tensorflow-metal==1.1.0

(venv-metal) michaelobrien@mini08s-Mini src % python -c "import tensorflow as tf; print(tf.version)"

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
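
The message above says the imported module was compiled against NumPy 1.x. A small standard-library sketch that reads the installed NumPy version without importing NumPy and flags a 2.x install; the helper names are hypothetical, and the "downgrade" rule is only meant for this tensorflow 2.14.0 setup:

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a version string such as '2.0.2'."""
    return int(version.split(".")[0])


def numpy_needs_downgrade(version):
    """True when NumPy is 2.x or newer; None means NumPy is not installed."""
    return version is not None and major_version(version) >= 2


if __name__ == "__main__":
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        installed = None
    print("numpy:", installed, "needs downgrade:", numpy_needs_downgrade(installed))
```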

Reverting numpy from 2.0.2 to 1.24.1 (m1max):

(venv-metal) michaelobrien@mini08s-Mini src % python -m pip uninstall numpy
Found existing installation: numpy 2.0.2
Uninstalling numpy-2.0.2:
Would remove:
/Users/michaelobrien/Library/Python/3.9/bin/f2py
/Users/michaelobrien/Library/Python/3.9/bin/numpy-config
/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/numpy-2.0.2.dist-info/*
/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/numpy/*
Proceed (Y/n)? y
Successfully uninstalled numpy-2.0.2
(venv-metal) michaelobrien@mini08s-Mini src % python -m pip install numpy==1.24.1
Defaulting to user installation because normal site-packages is not writeable
Collecting numpy==1.24.1
Downloading numpy-1.24.1-cp39-cp39-macosx_11_0_arm64.whl.metadata (5.6 kB)
Downloading numpy-1.24.1-cp39-cp39-macosx_11_0_arm64.whl (13.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.9/13.9 MB 25.0 MB/s eta 0:00:00
Installing collected packages: numpy
WARNING: The scripts f2py, f2py3 and f2py3.9 are installed in '/Users/michaelobrien/Library/Python/3.9/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed numpy-1.24.1

(venv-metal) michaelobrien@mini08s-Mini src % python -c "import tensorflow as tf; print(tf.version)"
/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
<module 'tensorflow._api.v2.version' from '/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/tensorflow/_api/v2/version/__init__.py'>
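
The output above is a module repr because `tf.version` is a submodule; the version string itself is available as `tf.__version__` (or `tf.version.VERSION`). A minimal sketch that returns the string for any importable module; `version_string` is a hypothetical helper, and it only guards against `ImportError`, so a broken plugin load would still raise:

```python
import importlib


def version_string(module_name="tensorflow"):
    """Return module.__version__, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


if __name__ == "__main__":
    print("tensorflow:", version_string("tensorflow"))
```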

@obriensystems

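Partially working: TensorFlow 2.14.0 loads and the Metal device is detected, but placing the graph on the GPU produces colocation warnings: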

(venv-metal) michaelobrien@mini08s-Mini src % python tflow.py
/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2.14.0
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 4s 0us/step
2024-11-17 14:40:54.695498: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M4 Pro
2024-11-17 14:40:54.695578: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 24.00 GB
2024-11-17 14:40:54.695586: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 8.00 GB
2024-11-17 14:40:54.695842: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-17 14:40:54.695922: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
WARNING:tensorflow:From /Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/keras/src/backend.py:7375: StrategyBase.configure (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
use update_config_proto instead.
2024-11-17 14:40:55.489448: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-17 14:40:55.489464: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2024-11-17 14:40:55.534136: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
2024-11-17 14:40:55.567593: W tensorflow/core/common_runtime/colocation_graph.cc:1213] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/replica:0/task:0/device:GPU:0' assigned_device_name_='' resource_device_name_='/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ReadVariableOp: GPU CPU
VarIsInitializedOp: GPU CPU
VarHandleOp: GPU CPU
Mul: GPU CPU
AddV2: GPU CPU
Sub: GPU CPU
AssignVariableOp: GPU CPU
StatelessRandomGetKeyCounter: CPU
StatelessRandomUniformV2: GPU CPU
Const: GPU CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
conv1_conv/kernel/Initializer/stateless_random_uniform/shape (Const)
conv1_conv/kernel/Initializer/stateless_random_uniform/min (Const)
conv1_conv/kernel/Initializer/stateless_random_uniform/max (Const)
conv1_conv/kernel/Initializer/stateless_random_uniform/StatelessRandomGetKeyCounter/seed (Const)
conv1_conv/kernel/Initializer/stateless_random_uniform/StatelessRandomGetKeyCounter (StatelessRandomGetKeyCounter)
conv1_conv/kernel/Initializer/stateless_random_uniform/StatelessRandomUniformV2/alg (Const)
conv1_conv/kernel/Initializer/stateless_random_uniform/StatelessRandomUniformV2 (StatelessRandomUniformV2)
conv1_conv/kernel/Initializer/stateless_random_uniform/sub (Sub)
conv1_conv/kernel/Initializer/stateless_random_uniform/mul (Mul)
conv1_conv/kernel/Initializer/stateless_random_uniform (AddV2)
conv1_conv/kernel (VarHandleOp) /replica:0/task:0/device:GPU:0
conv1_conv/kernel/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /replica:0/task:0/device:GPU:0
conv1_conv/kernel/Assign (AssignVariableOp) /replica:0/task:0/device:GPU:0
conv1_conv/kernel/Read/ReadVariableOp (ReadVariableOp) /replica:0/task:0/device:GPU:0
conv1_conv/Conv2D/ReadVariableOp (ReadVariableOp)
VarIsInitializedOp_294 (VarIsInitializedOp)

2024-11-17 14:40:55.567733: W tensorflow/core/common_runtime/colocation_graph.cc:1213] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/replica:0/task:0/device:GPU:0' assigned_device_name_='' resource_device_name_='/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ReadVariableOp: GPU CPU
VarIsInitializedOp: GPU CPU
VarHandleOp: GPU CPU
Mul: GPU CPU
AddV2: GPU CPU
Sub: GPU CPU
AssignVariableOp: GPU CPU
StatelessRandomGetKeyCounter: CPU
StatelessRandomUniformV2: GPU CPU
Const: GPU CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/shape (Const)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/min (Const)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/max (Const)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/StatelessRandomGetKeyCounter/seed (Const)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/StatelessRandomGetKeyCounter (StatelessRandomGetKeyCounter)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/StatelessRandomUniformV2/alg (Const)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/StatelessRandomUniformV2 (StatelessRandomUniformV2)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/sub (Sub)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform/mul (Mul)
conv2_block1_0_conv/kernel/Initializer/stateless_random_uniform (AddV2)
conv2_block1_0_conv/kernel (VarHandleOp) /replica:0/task:0/device:GPU:0
conv2_block1_0_conv/kernel/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /replica:0/task:0/device:GPU:0
conv2_block1_0_conv/kernel/Assign (AssignVariableOp) /replica:0/task:0/device:GPU:0
conv2_block1_0_conv/kernel/Read/ReadVariableOp (ReadVariableOp) /replica:0/task:0/device:GPU:0
conv2_block1_0_conv/Conv2D/ReadVariableOp (ReadVariableOp)
VarIsInitializedOp_266 (VarIsInitializedOp)

2024-11-17 14:40:55.567853: W tensorflow/core/common_runtime/colocation_graph.cc:1213] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/replica:0/task:0/device:GPU:0' assigned_device_name_='' resource_device_name_='/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ReadVariableOp: GPU CPU
VarIsInitializedOp: GPU CPU
VarHandleOp: GPU CPU
Mul: GPU CPU
AddV2: GPU CPU
Sub: GPU CPU
AssignVariableOp: GPU CPU
StatelessRandomGetKeyCounter: CPU
StatelessRandomUniformV2: GPU CPU
Const: GPU CPU

Colocation members, user-requested devices, and framework assigned devices, if any:

@obriensystems

Fix: remove (or comment out) this line in tflow.py:
#tf.compat.v1.disable_eager_execution()

(venv-metal) michaelobrien@mini08s-Mini src % python tflow.py
/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2024-11-17 14:43:47.584195: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M4 Pro
2024-11-17 14:43:47.584224: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 24.00 GB
2024-11-17 14:43:47.584230: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 8.00 GB
2024-11-17 14:43:47.584277: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-17 14:43:47.584437: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2024-11-17 14:43:48.912318: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
Epoch 1/25
98/98 [==============================] - 34s 286ms/step - loss: 4.3060 - accuracy: 0.0824
Epoch 2/25
14/98 [===>..........................] - ETA: 22s - loss: 3.5158 - accuracy: 0.1726
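
The colocation warnings appear to have come from running in TF1-style graph mode, since they went away once `tf.compat.v1.disable_eager_execution()` was removed. A minimal sketch to confirm which mode a script is in before building the model; `eager_status` is a hypothetical helper, and it returns `None` when TensorFlow cannot be imported:

```python
def eager_status():
    """Return True/False for eager mode, or None if TensorFlow is unavailable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # tf.executing_eagerly() is True by default in TensorFlow 2.x.
    return tf.executing_eagerly()


if __name__ == "__main__":
    print("eager execution:", eager_status())
```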

@obriensystems

(venv-metal) michaelobrien@mini08s-Mini src % python tflow.py
/Users/michaelobrien/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2024-11-17 15:22:32.586541: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M4 Pro
2024-11-17 15:22:32.586563: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 24.00 GB
2024-11-17 15:22:32.586569: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 8.00 GB
2024-11-17 15:22:32.586597: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-17 15:22:32.586616: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2024-11-17 15:22:33.453838: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
Epoch 1/25
98/98 [==============================] - 28s 262ms/step - loss: 4.4098 - accuracy: 0.0749
Epoch 2/25
98/98 [==============================] - 25s 258ms/step - loss: 3.6834 - accuracy: 0.1519
Epoch 3/25
98/98 [==============================] - 25s 258ms/step - loss: 3.3197 - accuracy: 0.2083
Epoch 4/25
98/98 [==============================] - 25s 258ms/step - loss: 3.1031 - accuracy: 0.2527
Epoch 5/25
98/98 [==============================] - 25s 258ms/step - loss: 3.1260 - accuracy: 0.2715
Epoch 6/25
98/98 [==============================] - 25s 258ms/step - loss: 3.3128 - accuracy: 0.2484
Epoch 7/25
98/98 [==============================] - 25s 258ms/step - loss: 3.0062 - accuracy: 0.2838
Epoch 8/25
98/98 [==============================] - 25s 258ms/step - loss: 3.4348 - accuracy: 0.2313
Epoch 9/25
98/98 [==============================] - 25s 258ms/step - loss: 3.0541 - accuracy: 0.2689
Epoch 10/25
98/98 [==============================] - 25s 258ms/step - loss: 2.6588 - accuracy: 0.3362
Epoch 11/25
98/98 [==============================] - 25s 258ms/step - loss: 2.4382 - accuracy: 0.3793
Epoch 12/25
98/98 [==============================] - 25s 258ms/step - loss: 2.6936 - accuracy: 0.3512
Epoch 13/25
98/98 [==============================] - 25s 258ms/step - loss: 3.1090 - accuracy: 0.2870
Epoch 14/25
98/98 [==============================] - 25s 258ms/step - loss: 2.7951 - accuracy: 0.3398
Epoch 15/25
98/98 [==============================] - 25s 258ms/step - loss: 2.6840 - accuracy: 0.3469
Epoch 16/25
98/98 [==============================] - 25s 258ms/step - loss: 2.6941 - accuracy: 0.3444
Epoch 17/25
98/98 [==============================] - 25s 258ms/step - loss: 2.3126 - accuracy: 0.4194
Epoch 18/25
98/98 [==============================] - 25s 258ms/step - loss: 2.0555 - accuracy: 0.4756
Epoch 19/25
98/98 [==============================] - 25s 258ms/step - loss: 1.7615 - accuracy: 0.5343
Epoch 20/25
98/98 [==============================] - 25s 258ms/step - loss: 1.9138 - accuracy: 0.5108
Epoch 21/25
98/98 [==============================] - 25s 258ms/step - loss: 1.6491 - accuracy: 0.5607
Epoch 22/25
98/98 [==============================] - 25s 258ms/step - loss: 1.3774 - accuracy: 0.6221
Epoch 23/25
98/98 [==============================] - 25s 258ms/step - loss: 1.0176 - accuracy: 0.7145
Epoch 24/25
98/98 [==============================] - 25s 258ms/step - loss: 0.8265 - accuracy: 0.7677
Epoch 25/25
98/98 [==============================] - 25s 258ms/step - loss: 0.8689 - accuracy: 0.7612
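
The numbers in this log can be cross-checked with a little arithmetic: the CIFAR-100 training set has 50,000 images, so a batch size of 512 gives 98 steps per epoch, and 258 ms/step works out to roughly 2,000 images per second. A small sketch of that calculation; the function names are hypothetical:

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover the dataset once."""
    return math.ceil(num_samples / batch_size)


def samples_per_second(batch_size, ms_per_step):
    """Throughput implied by the per-step time that Keras reports."""
    return batch_size / (ms_per_step / 1000.0)


if __name__ == "__main__":
    steps = steps_per_epoch(50_000, 512)
    print("steps per epoch:", steps)
    print("epoch seconds:  ", round(steps * 0.258, 1))
    print("images/second:  ", round(samples_per_second(512, 258)))
```

The same helpers make it easy to compare this M4 Pro run against logs from other machines with different batch sizes.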

import tensorflow as tf
#tf.compat.v1.disable_eager_execution()
#print(tf.version)
#import keras
#from keras.utils import multi_gpu_model
#import keras.backend as k
#microsoft/tensorflow-directml#352

https://www.tensorflow.org/guide/distributed_training

https://www.tensorflow.org/tutorials/distribute/keras

https://keras.io/guides/distributed_training/

#strategy = tf.distribute.MirroredStrategy()
#print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

#NUM_GPUS = 2
#strategy = tf.contrib.distribute.MirroredStrategy()#num_gpus=NUM_GPUS)

working on dual RTX-4090

#strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
#WARNING:tensorflow:Some requested devices in tf.distribute.Strategy are not visible to TensorFlow: /replica:0/task:0/device:GPU:1,/replica:0/task:0/device:GPU:0
#Number of devices: 2

strategy = tf.distribute.OneDeviceStrategy(device="/gpu")

#central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()
#strategy = tf.distribute.MultiWorkerMirroredStrategy() # not in tf 1.5
#print("mirrored_strategy: ",mirrored_strategy)
#strategy = tf.distribute.OneDeviceStrategy(device="/gpu:1")
#mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"],cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(all_reduce_alg="hierarchical_copy"))
#mirrored_strategy = tf.distribute.MirroredStrategy(devices= ["/gpu:0","/gpu:1"],cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

#print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

https://learn.microsoft.com/en-us/windows/ai/directml/gpu-faq

#a = tf.constant([1.])
#b = tf.constant([2.])
#c = tf.add(a, b)

#gpu_config = tf.GPUOptions()
#gpu_config.visible_device_list = "1"#"0,1"
#gpu_config.visible_device_list = "0,1"
#gpu_config.allow_growth=True

#session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_config))
#print(session.run(c))
#tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (DML:0) is being mapped to multiple DML devices (0 now, and 1 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
#from keras import backend as K
#K.set_session(session)

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

with strategy.scope():
    # https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
    # https://keras.io/api/models/model/
    parallel_model = tf.keras.applications.ResNet50(
    #model = tf.keras.applications.ResNet50(
        include_top=True,
        weights=None,
        input_shape=(32, 32, 3),
        classes=100,)
    # https://saturncloud.io/blog/how-to-do-multigpu-training-with-keras/
    #parallel_model = multi_gpu_model(model, gpus=2)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    # https://keras.io/api/models/model_training_apis/
    parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
parallel_model.fit(x_train, y_train, epochs=25, batch_size=512)#5120)#7168)#7168)
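
The script above hard-codes `device="/gpu"`, and the traceback at the top of this issue shows what happens when the TensorFlow install is incomplete. A minimal sketch of a guard that could run before building the strategy is below. The helper names `pick_device` and `visible_gpus` are hypothetical and not part of tflow.py, and the fallback to `/cpu:0` is an assumption about the desired behaviour.

```python
def pick_device(gpu_names):
    """Return a device string for tf.distribute.OneDeviceStrategy.

    gpu_names is a list of GPU device names visible to TensorFlow.
    An empty list falls back to the CPU (assumed fallback, not from tflow.py).
    """
    return "/gpu:0" if gpu_names else "/cpu:0"


def visible_gpus():
    """List the GPU names TensorFlow reports, or [] if TensorFlow is unusable."""
    try:
        import tensorflow as tf
        return [d.name for d in tf.config.list_physical_devices("GPU")]
    except (ImportError, AttributeError):
        # AttributeError covers a partially installed package, as in the
        # "module 'tensorflow' has no attribute 'distribute'" traceback.
        return []


# Intended use inside the script (hypothetical):
#   strategy = tf.distribute.OneDeviceStrategy(device=pick_device(visible_gpus()))
```

With tensorflow-metal installed, the Apple GPU should appear in `tf.config.list_physical_devices("GPU")`, so the same script could run unchanged on the RTX and Apple silicon machines.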

@obriensystems

M1 Pro example (6 performance + 2 efficiency CPU cores, 14 GPU cores)

(venv-metal) mpb6@mbp6 wse_github % history 1
    1  pwd
    2  history
    3  python
    4  pwd
    5  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    6  echo >> /Users/mpb6/.zprofile
    7  echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/mpb6/.zprofile
    8  eval "$(/opt/homebrew/bin/brew shellenv)"
    9  brew install go
   10  go version
   11  history
   12  git version
   13  ollama run llama3.2
   14  ollama run llama3.2
   15  ollama run llama2:13b
   16  ollama run llama2:13b
   17  history
   18  hostname mbp6
   19  brew
   20  python --version
   21  brew install python
   22  ls
   23  cd wse_github
   24  ls
   25  python --version
   26  pip --version
   27  brew --versin
   28  brew --version
   29  history
   30  history -n 1000
   31  history 100
   32  history 1
   33  python --version
   34  brew install python==3.9.6
   35  brew install [email protected]
   36  brew install python@3.9
   37  python --version
   38  python3.9 --version
   39  python3.9 -m venv ~/venv-metal
   40  source ~/venv-metal/bin/activate
   41  python -m pip install -U pip
   42  python -m pip install tensorflow==2.14.0
   43  python -m pip install tensorflow-metal
   44  ls
   45  vi tflow.py
   46  python tflow.py
numpy 2.0 issue: pin numpy to 1.24.1
   47  python -m pip install numpy==1.24.1
   48  python tflow.py
   49  history
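
The history above pins tensorflow to 2.14.0 and then numpy to 1.24.1 after the first run failed. A small sketch that reports the installed versions and flags that pairing is below. It uses only the standard library; the function names are hypothetical, and the rule "TensorFlow 2.14.x with NumPy 2.x is a conflict" is an assumption drawn from this thread rather than from an official compatibility table.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major(version):
    """Leading integer of a version string such as '1.24.1'."""
    return int(version.split(".")[0])


def numpy_conflict(tf_version, np_version):
    """True when TensorFlow 2.14.x is paired with NumPy 2.x or later.

    Assumption taken from this thread, not an official support matrix.
    """
    if tf_version is None or np_version is None:
        return False
    return tf_version.startswith("2.14.") and major(np_version) >= 2


if __name__ == "__main__":
    for name in ("tensorflow", "tensorflow-metal", "numpy", "keras"):
        print(name, installed_version(name))
    if numpy_conflict(installed_version("tensorflow"), installed_version("numpy")):
        print("numpy is 2.x; try: python -m pip install numpy==1.24.1")
```

Running this inside the venv before `python tflow.py` gives the same information as the `pip show` calls earlier in the thread in one step.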

model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=25, batch_size=512)

(venv-metal) mpb6@mbp6 wse_github % python tflow.py
2024-11-19 12:14:39.488488: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2024-11-19 12:14:39.488521: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2024-11-19 12:14:39.488526: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB

Epoch 1/25
98/98 [==============================] - 51s 478ms/step - loss: 4.3992 - accuracy: 0.0751
Epoch 2/25
98/98 [==============================] - 45s 454ms/step - loss: 3.5465 - accuracy: 0.1652
Epoch 3/25
98/98 [==============================] - 44s 454ms/step - loss: 3.3181 - accuracy: 0.2175
Epoch 4/25
98/98 [==============================] - 44s 454ms/step - loss: 3.4754 - accuracy: 0.2090
Epoch 5/25
98/98 [==============================] - 44s 454ms/step - loss: 3.7438 - accuracy: 0.1903
Epoch 6/25
98/98 [==============================] - 45s 455ms/step - loss: 3.3383 - accuracy: 0.2322
Epoch 7/25
98/98 [==============================] - 45s 455ms/step - loss: 3.1099 - accuracy: 0.2691
Epoch 8/25
98/98 [==============================] - 44s 454ms/step - loss: 3.0315 - accuracy: 0.2818
Epoch 9/25
98/98 [==============================] - 44s 454ms/step - loss: 2.7005 - accuracy: 0.3270
Epoch 10/25
98/98 [==============================] - 44s 454ms/step - loss: 2.4656 - accuracy: 0.3762
Epoch 11/25
98/98 [==============================] - 44s 454ms/step - loss: 2.3153 - accuracy: 0.4093
Epoch 12/25
98/98 [==============================] - 45s 454ms/step - loss: 2.2461 - accuracy: 0.4316
Epoch 13/25
98/98 [==============================] - 44s 454ms/step - loss: 2.0798 - accuracy: 0.4669
Epoch 14/25
98/98 [==============================] - 44s 454ms/step - loss: 1.7628 - accuracy: 0.5307
Epoch 15/25
98/98 [==============================] - 44s 454ms/step - loss: 1.5368 - accuracy: 0.5834
Epoch 16/25
98/98 [==============================] - 44s 454ms/step - loss: 1.2676 - accuracy: 0.6469
Epoch 17/25
98/98 [==============================] - 45s 454ms/step - loss: 1.0732 - accuracy: 0.7012
Epoch 18/25
98/98 [==============================] - 44s 454ms/step - loss: 0.8437 - accuracy: 0.7591
Epoch 19/25
98/98 [==============================] - 44s 454ms/step - loss: 0.7039 - accuracy: 0.8057
Epoch 20/25
98/98 [==============================] - 44s 454ms/step - loss: 0.6416 - accuracy: 0.8197
Epoch 21/25
98/98 [==============================] - 45s 455ms/step - loss: 0.4794 - accuracy: 0.8636
Epoch 22/25
98/98 [==============================] - 45s 455ms/step - loss: 0.6930 - accuracy: 0.8065
Epoch 23/25
98/98 [==============================] - 44s 454ms/step - loss: 0.5356 - accuracy: 0.8616
Epoch 24/25
98/98 [==============================] - 44s 454ms/step - loss: 0.7293 - accuracy: 0.8196
Epoch 25/25
98/98 [==============================] - 44s 454ms/step - loss: 0.3607 - accuracy: 0.9041
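
To compare runs like the one above across machines, the per-epoch summary lines can be parsed rather than read by eye. The sketch below is an illustration only: the regular expression assumes the exact Keras progress-bar format shown in this thread, and the 50,000-image CIFAR-100 training set size is what makes `batch_size=512` come out to the 98 steps per epoch seen in the logs.

```python
import math
import re

# Matches completed-epoch lines such as:
# "98/98 [=====] - 44s 454ms/step - loss: 0.3607 - accuracy: 0.9041"
EPOCH_LINE = re.compile(
    r"^(\d+)/(\d+) \[=+\] - (\d+)s (\d+)ms/step - loss: ([\d.]+) - accuracy: ([\d.]+)"
)


def parse_epochs(log_text):
    """Return one dict per completed epoch found in the log text."""
    rows = []
    for line in log_text.splitlines():
        m = EPOCH_LINE.match(line.strip())
        if m:
            rows.append({
                "steps": int(m.group(2)),
                "seconds": int(m.group(3)),
                "ms_per_step": int(m.group(4)),
                "loss": float(m.group(5)),
                "accuracy": float(m.group(6)),
            })
    return rows


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover the dataset once."""
    return math.ceil(num_samples / batch_size)


sample = """Epoch 24/25
98/98 [==============================] - 44s 454ms/step - loss: 0.7293 - accuracy: 0.8196
Epoch 25/25
98/98 [==============================] - 44s 454ms/step - loss: 0.3607 - accuracy: 0.9041
"""

rows = parse_epochs(sample)
mean_seconds = sum(r["seconds"] for r in rows) / len(rows)
print("epochs parsed:", len(rows), "mean seconds per epoch:", mean_seconds)
print("steps per epoch:", steps_per_epoch(50000, 512))  # 98, matching "98/98"
```

As a cross-check, 98 steps at 454 ms/step is about 44.5 s, which agrees with the 44s to 45s per epoch reported on the M1 Pro.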
