VideoMAE visualization based on Vision Transformers. #529

Open
pooyafayyaz opened this issue Sep 8, 2024 · 0 comments

Comments


pooyafayyaz commented Sep 8, 2024

Thanks for the awesome repo!

I've been visualizing the attention maps from VideoMAE, but it seems the results aren't accurate. VideoMAE takes inputs in the form (B, C, T, H, W) for video classification.
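For context, this is the input layout I'm assuming; the tensor below is just a placeholder clip, not my actual preprocessed data:

import torch

# (B, C, T, H, W): one clip of 16 RGB frames at 224x224
video_tensor = torch.randn(1, 3, 16, 224, 224)
# logits = model(video_tensor)  # expected shape: (1, num_classes)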

Here's how I'm applying GradCAM:


import cv2
import numpy as np
from pytorch_grad_cam import HiResCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

target_layers = [model.blocks[-1].norm]

with HiResCAM(model=model, target_layers=target_layers, reshape_transform=reshape_transform) as cam:
    grayscale_cam = cam(input_tensor=video_tensor, targets=targets)
    grayscale_cam = grayscale_cam[0, :]  # batch size is 1

    # (T, H, W, C) video for the first (and only) batch element
    video_tensor_numpy = video_tensor[0].permute(1, 2, 3, 0).cpu().numpy()

    # Loop over each frame of the CAM output
    for frame_idx in range(grayscale_cam.shape[0]):
        grayscale_frame = grayscale_cam[frame_idx, :]  # CAM for the current frame

        video_frame = video_tensor_numpy[frame_idx]  # current video frame
        video_frame = (video_frame * 255).astype(np.uint8)  # rescale to [0, 255]
        video_frame = cv2.cvtColor(video_frame, cv2.COLOR_RGB2BGR)  # BGR for OpenCV

        # Overlay the CAM on the current frame
        cam_image = show_cam_on_image(video_frame / 255, grayscale_frame, use_rgb=True)
        ## SAVE IMAGE ......

    return grayscale_cam

For reshaping I used the function below. The input is 3×16×224×224, and I used a patch size of 16×16. VideoMAE does not use a CLS token. The sequence length is equal to (num_frames // tubelet_size) * num_patches_per_frame, with num_patches_per_frame = (image_size // patch_size) ** 2.

Hence, in this case: (16//2) * (224 // 16)**2 = 1568.


def reshape_transform(tensor, height=14, width=14):
    # (B, 1568, C) -> (B, T=8, H=14, W=14, C)
    result = tensor.reshape(tensor.size(0), 8, height, width, tensor.size(2))
    # Move the channel dimension: (B, T, C, H, W)
    result = result.permute(0, 1, 4, 2, 3)
    return result
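
As a quick sanity check on the token count and the reshape (a rough sketch, assuming an embedding dimension of 768):

import torch

tokens = (16 // 2) * (224 // 16) ** 2  # 8 * 196 = 1568, no CLS token
dummy = torch.randn(1, tokens, 768)    # (B, sequence_length, embed_dim)
print(reshape_transform(dummy).shape)  # torch.Size([1, 8, 768, 14, 14])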

Am I missing something here? The GradCAM visualization seems scattered all over the place, even though the model is correctly classifying the input.
