Unable to train a Centered Instance model during training data (m1 mac SLEAP) #913

shk1013 · 2022-08-16T14:32:50Z

When training on 100 labeled frames, SLEAP trains the centroid model successfully, but once it moves on to the centered instance model, a pop-up window reports an error training centered_instance and says the terminal can provide more detail.
Here is what it says in the terminal (including the initial config):

Start training centered_instance...
['sleap-train', '/var/folders/29/16hkd1xj5cx859zxsh72yzzw0000gn/T/tmpdwpr6zrp/220815_175510_training_job.json', '/Users/shreyaashok/Downloads/sleap/edited100framescopy.slp', '--zmq', '--save_viz']
INFO:sleap.nn.training:Versions:
SLEAP: 1.2.0a6
TensorFlow: 2.7.0
Numpy: 1.21.5
Python: 3.8.13
OS: macOS-12.4-arm64-arm-64bit
INFO:sleap.nn.training:Training labels file: /Users/shreyaashok/Downloads/sleap/edited100framescopy.slp
INFO:sleap.nn.training:Training profile: /var/folders/29/16hkd1xj5cx859zxsh72yzzw0000gn/T/tmpdwpr6zrp/220815_175510_training_job.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
"training_job_path": "/var/folders/29/16hkd1xj5cx859zxsh72yzzw0000gn/T/tmpdwpr6zrp/220815_175510_training_job.json",
"labels_path": "/Users/shreyaashok/Downloads/sleap/edited100framescopy.slp",
"video_paths": [
""
],
"val_labels": null,
"test_labels": null,
"tensorboard": false,
"save_viz": true,
"zmq": true,
"run_name": "",
"prefix": "",
"suffix": "",
"cpu": false,
"first_gpu": false,
"last_gpu": false,
"gpu": 0
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
"data": {
"labels": {
"training_labels": null,
"validation_labels": null,
"validation_fraction": 0.1,
"test_labels": null,
"split_by_inds": false,
"training_inds": null,
"validation_inds": null,
"test_inds": null,
"search_path_hints": [],
"skeletons": []
},
"preprocessing": {
"ensure_rgb": false,
"ensure_grayscale": false,
"imagenet_mode": null,
"input_scaling": 1.0,
"pad_to_stride": null,
"resize_and_pad_to_target": true,
"target_height": null,
"target_width": null
},
"instance_cropping": {
"center_on_part": null,
"crop_size": null,
"crop_size_detection_padding": 16
}
},
"model": {
"backbone": {
"leap": null,
"unet": {
"stem_stride": null,
"max_stride": 16,
"output_stride": 4,
"filters": 24,
"filters_rate": 2.0,
"middle_block": true,
"up_interpolate": true,
"stacks": 1
},
"hourglass": null,
"resnet": null,
"pretrained_encoder": null
},
"heads": {
"single_instance": null,
"centroid": null,
"centered_instance": {
"anchor_part": null,
"part_names": null,
"sigma": 2.5,
"output_stride": 4,
"offset_refinement": false
},
"multi_instance": null
}
},
"optimization": {
"preload_data": true,
"augmentation_config": {
"rotate": true,
"rotation_min_angle": -15.0,
"rotation_max_angle": 15.0,
"translate": false,
"translate_min": -5,
"translate_max": 5,
"scale": false,
"scale_min": 0.9,
"scale_max": 1.1,
"uniform_noise": false,
"uniform_noise_min_val": 0.0,
"uniform_noise_max_val": 10.0,
"gaussian_noise": false,
"gaussian_noise_mean": 5.0,
"gaussian_noise_stddev": 1.0,
"contrast": false,
"contrast_min_gamma": 0.5,
"contrast_max_gamma": 2.0,
"brightness": false,
"brightness_min_val": 0.0,
"brightness_max_val": 10.0,
"random_crop": false,
"random_crop_height": 256,
"random_crop_width": 256,
"random_flip": false,
"flip_horizontal": true
},
"online_shuffling": true,
"shuffle_buffer_size": 128,
"prefetch": true,
"batch_size": 4,
"batches_per_epoch": null,
"min_batches_per_epoch": 200,
"val_batches_per_epoch": null,
"min_val_batches_per_epoch": 10,
"epochs": 200,
"optimizer": "adam",
"initial_learning_rate": 0.0001,
"learning_rate_schedule": {
"reduce_on_plateau": true,
"reduction_factor": 0.5,
"plateau_min_delta": 1e-06,
"plateau_patience": 5,
"plateau_cooldown": 3,
"min_learning_rate": 1e-08
},
"hard_keypoint_mining": {
"online_mining": false,
"hard_to_easy_ratio": 2.0,
"min_hard_keypoints": 2,
"max_hard_keypoints": null,
"loss_scale": 5.0
},
"early_stopping": {
"stop_training_on_plateau": true,
"plateau_min_delta": 1e-08,
"plateau_patience": 10
}
},
"outputs": {
"save_outputs": true,
"run_name": "220815_175510.centered_instance.n=199",
"run_name_prefix": "",
"run_name_suffix": "",
"runs_folder": "/Users/shreyaashok/Downloads/sleap/models",
"tags": [
""
],
"save_visualizations": true,
"delete_viz_images": true,
"zip_outputs": false,
"log_to_csv": true,
"checkpointing": {
"initial_model": false,
"best_model": true,
"every_epoch": false,
"latest_model": false,
"final_model": false
},
"tensorboard": {
"write_logs": false,
"loss_frequency": "epoch",
"architecture_graph": false,
"profile_graph": false,
"visualizations": true
},
"zmq": {
"subscribe_to_controller": true,
"controller_address": "tcp://127.0.0.1:9000",
"controller_polling_timeout": 10,
"publish_updates": true,
"publish_address": "tcp://127.0.0.1:9001"
}
},
"name": "",
"description": "",
"sleap_version": "1.2.0a6",
"filename": "/var/folders/29/16hkd1xj5cx859zxsh72yzzw0000gn/T/tmpdwpr6zrp/220815_175510_training_job.json"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Using GPU 0 for acceleration.
INFO:sleap.nn.training:Disabled GPU memory pre-allocation.
INFO:sleap.nn.training:System:
GPUs: 1/1 available
Device: /physical_device:GPU:0
Available: True
Initalized: False
Memory growth: True
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: /Users/shreyaashok/Downloads/sleap/edited100framescopy.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1
INFO:sleap.nn.training: Splits: Training = 179 / Validation = 20.
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB

2022-08-15 17:55:16.028473: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-08-15 17:55:16.028643: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
2022-08-15 17:55:16.243856: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
INFO:sleap.nn.training:Loaded test example. [1.548s]
INFO:sleap.nn.training: Input shape: (848, 848, 1)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=24, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=2, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training: Max stride: 16
INFO:sleap.nn.training: Parameters: 4,311,057
INFO:sleap.nn.training: Heads:
INFO:sleap.nn.training: [0] = CenteredInstanceConfmapsHead(part_names=['m_nose', 'm_leftEar', 'm_rightEar', 'm_neck', 'm_torso', 'm_waist', 'm_tailbase', 'm_tailMid', 'm_tailEnd'], anchor_part=None, sigma=2.5, output_stride=4, loss_weight=1.0)
INFO:sleap.nn.training: Outputs:
INFO:sleap.nn.training: [0] = KerasTensor(type_spec=TensorSpec(shape=(None, 212, 212, 9), dtype=tf.float32, name=None), name='CenteredInstanceConfmapsHead_0/BiasAdd:0', description="created by layer 'CenteredInstanceConfmapsHead_0'")
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 179
INFO:sleap.nn.training:Validation set: n = 20
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-08, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training: ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training: ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: /Users/shreyaashok/Downloads/sleap/models/220815_175510.centered_instance.n=199
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [3.4s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [20.1s]
INFO:sleap.nn.training:Starting training loop...
Epoch 1/200
2022-08-15 17:55:39.951937: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
Traceback (most recent call last):
File "/Users/shreyaashok/miniconda3/envs/sleap_m1/bin/sleap-train", line 33, in
sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 1625, in main
trainer.train()
File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 889, in train
self.keras_model.fit(
File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) INTERNAL: Invalid input shapes
[[node loss_fn/mean_squared_error/SquaredDifference
(defined at /Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py:1204)
]]
[[gradient_tape/model/stack0_dec1_s8_to_s4_skip_concat/Slice_1/_116]]
(1) INTERNAL: Invalid input shapes
[[node loss_fn/mean_squared_error/SquaredDifference
(defined at /Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py:1204)
]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_13871]

Errors may have originated from an input operation.
Input Source operations connected to node loss_fn/mean_squared_error/SquaredDifference:
In[0] model/CenteredInstanceConfmapsHead_0/BiasAdd (defined at /Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/layers/convolutional.py:264)
In[1] IteratorGetNext (defined at /Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py:866)

Operation defined at: (most recent call last)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/bin/sleap-train", line 33, in
sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())

File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 1625, in main
trainer.train()

File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 889, in train
self.keras_model.fit(

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 1216, in fit
tmp_logs = self.train_function(iterator)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function
return step_function(self, iterator)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step
outputs = model.train_step(data)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 809, in train_step
loss = self.compiled_loss(

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/compile_utils.py", line 201, in call
loss_value = loss_obj(y_t, y_p, sample_weight=sw)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 141, in call
losses = call_fn(y_true, y_pred)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 245, in call
return ag_fn(y_true, y_pred, **self._fn_kwargs)

File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 284, in loss_fn
for loss_fn in losses:

File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 285, in loss_fn
loss += loss_fn(y_gt, y_pr)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 141, in call
losses = call_fn(y_true, y_pred)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 245, in call
return ag_fn(y_true, y_pred, **self._fn_kwargs)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 1204, in mean_squared_error
return backend.mean(tf.math.squared_difference(y_pred, y_true), axis=-1)

Input Source operations connected to node loss_fn/mean_squared_error/SquaredDifference:
In[0] model/CenteredInstanceConfmapsHead_0/BiasAdd (defined at /Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/layers/convolutional.py:264)
In[1] IteratorGetNext (defined at /Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py:866)

Operation defined at: (most recent call last)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/bin/sleap-train", line 33, in
sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())

File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 1625, in main
trainer.train()

File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 889, in train
self.keras_model.fit(

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 1216, in fit
tmp_logs = self.train_function(iterator)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function
return step_function(self, iterator)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step
outputs = model.train_step(data)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/training.py", line 809, in train_step
loss = self.compiled_loss(

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/engine/compile_utils.py", line 201, in call
loss_value = loss_obj(y_t, y_p, sample_weight=sw)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 141, in call
losses = call_fn(y_true, y_pred)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 245, in call
return ag_fn(y_true, y_pred, **self._fn_kwargs)

File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 284, in loss_fn
for loss_fn in losses:

File "/Users/shreyaashok/sleap_m1/sleap/nn/training.py", line 285, in loss_fn
loss += loss_fn(y_gt, y_pr)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 141, in call
losses = call_fn(y_true, y_pred)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 245, in call
return ag_fn(y_true, y_pred, **self._fn_kwargs)

File "/Users/shreyaashok/miniconda3/envs/sleap_m1/lib/python3.8/site-packages/keras/losses.py", line 1204, in mean_squared_error
return backend.mean(tf.math.squared_difference(y_pred, y_true), axis=-1)

Function call stack:
train_function -> train_function

INFO:sleap.nn.callbacks:Closing the reporter controller/context.
INFO:sleap.nn.callbacks:Closing the training controller socket/context.
Run Path: /Users/shreyaashok/Downloads/sleap/models/220815_175510.centered_instance.n=199
qt.qpa.drawing: Layer-backing is always enabled. QT_MAC_WANTS_LAYER/_q_mac_wantsLayer has no effect.

Thank you for your help!

talmo (Maintainer) · 2022-08-16T16:59:34Z

Hi @shk1013,

That's strange... Everything looks ok from the logs, so I'm not super sure what's going on. Do you mind doing a couple of things:

Update to the newest version (git pull or re-download it, then conda env remove -n sleap_m1, then conda env create -f environment.yml).
Send us your package file (Predict -> Export -> Recommended) so we can try to reproduce it ([email protected]).

Thanks!

Talmo

shk1013 Aug 16, 2022
Author

Hi,
Just sent the package file to you!

Thanks!

talmo (Maintainer) · 2022-08-16T19:48:32Z

Alright so the problem seems to be related to having a bunch of duplicated skeletons in your labels file. I was able to get it working by setting all instances to have the same skeleton:

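import sleap

# Load the labels file that fails to train.
labels = sleap.load_file("edited100framescopy.slp")
# Keep only the last skeleton in the project.
skeleton = labels.skeletons[-1]
labels.skeletons = [skeleton]
# Point every instance at that single shared skeleton.
for instance in labels.instances():
    instance.skeleton = skeleton
# Save to a new file so the original stays untouched.
labels.save("fixed.slp")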

You can run that in a Python terminal, or save it as a fix.py file and run it with python fix.py.
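
To check whether a labels file has this problem before applying the fix, one option is to group the skeletons by their node names and count the copies. This is only a sketch: `summarize_skeletons` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the entries of `labels.skeletons`, assuming each skeleton exposes a `node_names` list.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_skeletons(skeletons):
    """Count skeletons grouped by their ordered node names.

    Assumes every item has a `node_names` attribute (a list of strings).
    """
    return Counter(tuple(s.node_names) for s in skeletons)


# Stand-in skeletons for illustration only; a real project would pass
# the skeleton list from the loaded labels file instead.
demo_skeletons = [
    SimpleNamespace(node_names=["m_nose", "m_neck", "m_tailbase"]),
    SimpleNamespace(node_names=["m_nose", "m_neck", "m_tailbase"]),
    SimpleNamespace(node_names=["m_nose", "m_neck"]),
]

summary = summarize_skeletons(demo_skeletons)
# More than one skeleton in total means instances may disagree on which to use.
needs_fix = len(demo_skeletons) > 1

for node_names, count in summary.items():
    print(f"{count} skeleton(s) with {len(node_names)} nodes: {node_names}")
print("Multiple skeletons present:", needs_fix)
```

With a real project you would pass the skeletons from the loaded file in place of `demo_skeletons`. If more than one skeleton shows up, the fix above is probably worth trying.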

Not sure how you ended up with a bunch of conflicting skeletons, but I'm guessing it's related to the M1 branch being on an older version of SLEAP (until recently).
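
As a rough illustration of why conflicting skeletons could surface as an "Invalid input shapes" error in the loss: the log shows an 848 x 848 input, an output stride of 4 and 9 parts, which gives the (212, 212, 9) model output. If the ground truth confidence maps were built from a skeleton with a different node count, the two tensors passed to the squared difference would not line up. The 8-node target below is purely hypothetical (the log does not show the actual target shape), and `confmap_shape` is an illustrative helper, not a SLEAP function.

```python
def confmap_shape(input_hw, output_stride, n_parts):
    """Return (height, width, channels) of a confidence map tensor."""
    height, width = input_hw
    return (height // output_stride, width // output_stride, n_parts)


# Values taken from the log: 848 x 848 input, output stride 4, 9 body parts.
pred_shape = confmap_shape((848, 848), output_stride=4, n_parts=9)

# Hypothetical target built from a skeleton with a different node count.
hypothetical_target_shape = confmap_shape((848, 848), output_stride=4, n_parts=8)

shapes_match = pred_shape == hypothetical_target_shape
print("prediction:", pred_shape)
print("hypothetical target:", hypothetical_target_shape)
print("shapes match:", shapes_match)
```

Forcing every instance onto one skeleton makes the number of channels in the targets agree with the model head, which fits with the fix working.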

Give it a go and let us know if you're having any issues. I also highly recommend updating to the newest M1 version as we've fixed a bunch of bugs since the previous one.

Cheers,

Talmo
