
Dataset section #2

Open
jpainam opened this issue Jul 31, 2024 · 5 comments

jpainam commented Jul 31, 2024

Hi. Thanks for releasing the code.

Can you provide details in the README about the dataset preparation? I see a get_dataset function that generates a toy_dataset with shape (10000, 2), while extracting features from UCF_Crime will likely give me (N, 16, 1152), where N is the number of frames.

jakubmicorek (Owner) commented Aug 6, 2024

Hi,

For the object-centric approach we used the features provided by Accurate-Interpretable-VAD.

For the frame-centric approach we use the Hiera backbone. For 16 consecutive RGB frames of shape [1, 3, 16, 224, 224] we extract a d-dimensional feature vector just before the classification head. In the case of Hiera-Large we obtain a feature vector of shape [1, 1152] for the 16 frames. We use the label of the center frame as the ground-truth label for the window. To get the features and ground-truth labels for the whole video we extract the features in a rolling-window fashion.
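For illustration, here is a minimal sketch of that rolling-window extraction, assuming the whole video is available as a [T, 3, 224, 224] tensor with per-frame ground-truth labels, and using a generic backbone callable in place of Hiera-Large; extract_clip_features and the dummy backbone are hypothetical names for this sketch, not part of the repo:

```python
import torch

def extract_clip_features(frames, frame_labels, backbone, clip_len=16, stride=1):
    """Rolling-window feature extraction.

    frames:       float tensor [T, 3, 224, 224], the whole video as RGB frames
    frame_labels: tensor [T], per-frame ground truth
    backbone:     callable mapping [1, 3, clip_len, 224, 224] -> [1, D]
    Returns one [D] feature and one label per window; each window is labeled
    with the ground truth of its center frame.
    """
    feats, labels = [], []
    num_frames = frames.shape[0]
    for start in range(0, num_frames - clip_len + 1, stride):
        clip = frames[start:start + clip_len]                # [clip_len, 3, 224, 224]
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)         # [1, 3, clip_len, 224, 224]
        with torch.no_grad():
            feats.append(backbone(clip))                     # [1, D]
        labels.append(frame_labels[start + clip_len // 2])   # center-frame label
    return torch.cat(feats, dim=0), torch.stack(labels)

# Dummy backbone standing in for Hiera-Large (D = 1152); in practice this would
# be the real model's forward pass truncated just before the classification head.
dummy_backbone = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool3d(1),    # [1, 3, 1, 1, 1]
    torch.nn.Flatten(start_dim=1),    # [1, 3]
    torch.nn.Linear(3, 1152),         # [1, 1152]
)

video = torch.rand(64, 3, 224, 224)   # 64 RGB frames
gt = torch.randint(0, 2, (64,))       # per-frame 0/1 labels
X, y = extract_clip_features(video, gt, dummy_backbone)
print(X.shape, y.shape)               # torch.Size([49, 1152]) torch.Size([49])
```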

jpainam (Author) commented Aug 6, 2024

Can you be more explicit about what you mean by "rolling window fashion"? Given 64 consecutive frames, do you build your windows as

  • [0, 16], [16, 32], [32, 48], [48, 64] (non-overlapping), or
  • [0, 16], [1, 17], [2, 18], [3, 19], ..., [48, 64] (stride 1)?

Both are windowing approaches.
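For concreteness, a tiny snippet contrasting the two indexings for 64 frames and a window length of 16 (plain Python, not from the repo):

```python
num_frames, clip_len = 64, 16

# Option 1: non-overlapping windows -> starts 0, 16, 32, 48
non_overlapping = [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]

# Option 2: stride-1 rolling windows -> starts 0, 1, 2, ..., 48
rolling = [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, 1)]

print(non_overlapping)           # [(0, 16), (16, 32), (32, 48), (48, 64)]
print(rolling[:3], rolling[-1])  # [(0, 16), (1, 17), (2, 18)] (48, 64)
```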

@Haifu-Ye


Hello! Have you solved your problem? I'm also trying to reproduce the results on the Avenue dataset, but I'm stuck because I don't have the appropriate processing code for it.

jpainam (Author) commented Aug 13, 2024

@Haifu-Ye I decided to go with the first approach, non-overlapping windows
[0, 16], [16, 32], [32, 48], [48, 64], and use the label of the middle frame as the clip (window) label, i.e., the label of the frame at start_frame + 8.

I'm using UCF-Crime.

But the performance I get is far from the numbers reported in the paper.
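A minimal NumPy sketch of that labeling scheme, assuming a per-frame 0/1 ground-truth array; window_labels is a name made up for illustration, not a helper from the repo:

```python
import numpy as np

def window_labels(frame_labels, clip_len=16):
    """Non-overlapping windows [0, 16), [16, 32), ...; each window takes the
    label of its middle frame, i.e. the frame at start_frame + clip_len // 2."""
    starts = np.arange(0, len(frame_labels) - clip_len + 1, clip_len)
    return frame_labels[starts + clip_len // 2]

gt = np.zeros(64, dtype=int)
gt[20:48] = 1                 # frames 20..47 are anomalous
print(window_labels(gt))      # [0 1 1 0] -- windows starting at 0, 16, 32, 48
```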

@Haifu-Ye


Hi! I want to try the ShanghaiTech dataset, but it seems that the dataset format expected by extract_shanghaitech_frames.py is not the same as that of the official ShanghaiTech dataset. Moreover, the download link for the ShanghaiTech dataset in the script doesn't work, so I'd like to know how other people have solved this problem.
