This repository has been archived by the owner on Aug 19, 2022. It is now read-only.

Inconsistent with the original Blazepose full model #3

Open
jizhu1023 opened this issue Dec 22, 2020 · 2 comments

Comments

@jizhu1023

Thanks for your great work! I have some questions about the network structure. Comparing blazepose_full.py with the visualization of the original tflite model, I found some differences. First, your implementation omits the "identity_1" output of the original tflite model. Second, the "identity_2" output size is 156, i.e. 4 * (33+6), but the corresponding output size in your implementation is 99, i.e. 3 * 33. Why is your implementation inconsistent with the original model in these respects? And why is the joints output size 156 in the original model? Many thanks in advance.

@vietanhdev
Owner

Hello,
Our implementation is a modified version of the original model.
First, for identity_1, we don't know the exact purpose of this branch. Since this architecture is designed for tracking, we guess that this branch predicts whether there is a person in the image. I verified this assumption by running the pre-trained model with the following code:

import tensorflow as tf
import cv2
import numpy as np

# Load the pre-trained full-body pose model (39 keypoints)
model = tf.keras.models.load_model('saved_model_full_pose_landmark_39kp')
cap = cv2.VideoCapture(0)

while True:
    _, origin = cap.read()
    # The model expects a 256x256 input image
    img = cv2.resize(origin, (256, 256))
    img = img.astype(float)
    img = (img - 127) / 255  # roughly normalize pixel values to [-0.5, 0.5]
    img = np.array([img])    # add a batch dimension: (1, 256, 256, 3)

    heatmap, classify, regress = model.predict(img)
    # The identity_1 branch ("classify") outputs a single scalar,
    # which appears to be a person-presence confidence score
    confidence = np.reshape(classify, (1,))[0]
    print(confidence)

For identity_2, as explained here, the model regresses 4 values for each keypoint:

x and y: Landmark coordinates normalized to [0.0, 1.0] by the image width and height respectively.
z: Should be discarded as currently the model is not fully trained to predict depth, but this is something on the roadmap.
visibility: A value in [0.0, 1.0] indicating the likelihood of the landmark being visible (present and not occluded) in the image.

That's why their output size is 4 * number_of_keypoints. The pre-trained model we used to implement this repo has number_of_keypoints = 39, so there are 4 * 39 = 156 outputs. I removed the z dimension from the keypoints, so the output shape is 3 * number_of_keypoints.
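Concretely, the 156-value regression vector can be reshaped into a (39, 4) array and the z column dropped, leaving 3 values per keypoint. A minimal sketch, assuming the column order x, y, z, visibility described above (the actual tensor layout in the tflite model may differ):

```python
import numpy as np

num_keypoints = 39  # the pre-trained model in this repo regresses 39 keypoints

# Hypothetical regression output: 4 values (x, y, z, visibility) per keypoint
regress = np.arange(4 * num_keypoints, dtype=np.float32)  # shape (156,)
kps = regress.reshape(num_keypoints, 4)

# Drop the z column (index 2), keeping x, y and visibility
kps_no_z = kps[:, [0, 1, 3]]  # shape (39, 3), i.e. 3 * 39 = 117 values
```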

Another difference between our model and the original one is that the heatmap output of our model has shape (128, 128, number_of_keypoints), while the original model's heatmap has shape (128, 128, 1). We use the output from the heatmap for the keypoints. In the future, we will revisit this design.
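For reference, a per-keypoint heatmap of shape (128, 128, K) is commonly decoded by taking the argmax of each channel as that keypoint's location. This is a sketch of that common scheme, not necessarily the exact decoding used in this repo:

```python
import numpy as np

def decode_heatmap(heatmap):
    """Decode a (H, W, K) heatmap by taking the per-channel argmax.

    Returns a (K, 2) array of (x, y) coordinates normalized to [0, 1].
    """
    h, w, k = heatmap.shape
    coords = np.zeros((k, 2), dtype=np.float32)
    for i in range(k):
        flat_idx = np.argmax(heatmap[:, :, i])
        y, x = np.unravel_index(flat_idx, (h, w))
        coords[i] = (x / w, y / h)  # normalize to the [0, 1] range
    return coords

# Synthetic example: a single peak for keypoint 0 at row 64, column 32
heatmap = np.zeros((128, 128, 39), dtype=np.float32)
heatmap[64, 32, 0] = 1.0
coords = decode_heatmap(heatmap)  # keypoint 0 -> (0.25, 0.5)
```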

@jizhu1023
Author

jizhu1023 commented Dec 23, 2020

@vietanhdev Thanks for your reply, which addresses my issues well! The other thing I am confused about is why there are 39 keypoints rather than the 33 or 35 keypoints mentioned in the paper. Looking into the Mediapipe code, I found that keypoints 34-35 are auxiliary_landmarks used for ROI generation, and keypoints 36-39 are not used. I also visualized the locations of keypoints 36-39 and found that they coincide with some keypoints on the hands.
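Under that interpretation, the 39 regressed keypoints split into three groups. A small sketch, where `keypoints` is a hypothetical (39, 3) output array and the index ranges are taken from the description above:

```python
import numpy as np

# Hypothetical (39, 3) keypoint array (x, y, visibility) from the full model
keypoints = np.random.rand(39, 3)

main = keypoints[:33]         # the 33 body landmarks described in the paper
auxiliary = keypoints[33:35]  # auxiliary landmarks used for ROI generation
unused = keypoints[35:39]     # coincide with hand keypoints; not used downstream
```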
