Added pseudolabel_frames.py #19

Open: yahya010 wants to merge 5 commits into main
Conversation

yahya010

Pseudolabeling code:

Goes through each tar file, checks the mp4 files, and pseudolabels each frame in the video.
Here is an example of a frame before and after pseudolabeling:
Before:
image
After pseudolabeling the frame:
image

@markus583

Thanks, Yahya! Logic looks fine but I feel this script lacks some configurability.
It would be very useful to have the dirs & potentially some other options configurable, as in #14 and #15.
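
A rough sketch of what that could look like with argparse (argument names here are illustrative, not taken from #14/#15):

import argparse

def parse_args():
    # Hypothetical argument names; align them with the conventions used in #14/#15.
    parser = argparse.ArgumentParser(description="Pseudolabel video frames with YOLO.")
    parser.add_argument("--shards", type=str, required=True,
                        help="Path to the input .tar shard containing the mp4 files.")
    parser.add_argument("--output_dir", type=str, default="root/data/video_det",
                        help="Directory to write the pseudolabel outputs to.")
    parser.add_argument("--model_path", type=str, default="yolov8n.pt",
                        help="Path to the pretrained YOLO checkpoint.")
    parser.add_argument("--save_images", action="store_true",
                        help="Also save labeled .jpg frames (useful for debugging only).")
    return parser.parse_args()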

@markus583

Moreover, crucially, you save the result as .jpg (here). I assume this is what you show above. For visualization purposes, this is just fine; we could leave it in as an optional argument (but default: False). But as input into any model we need the BB coordinates, right?
Currently, it seems as if this is not provided. See #11.
Please clarify/add this.

@kdu4108
Collaborator

kdu4108 commented Jul 22, 2024

Hey Yahya, yes thanks for making the PR and including the pictures! :)

Before merging, +1 to Markus's comments and suggestions. To consolidate our suggestions, can you make the following changes?

Thanks! And let me know if you're unsure about how to tackle any of these :)

import cv2
from ultralytics import YOLO

SHARDS = "/cluster/work/cotterell/mm_swissai/datasets/hdvila/1000_hd_vila_shuffled/0000000000.tar"

This should be a configurable input

from ultralytics import YOLO

SHARDS = "/cluster/work/cotterell/mm_swissai/datasets/hdvila/1000_hd_vila_shuffled/0000000000.tar"
OUTPUT_DIR = "bbox-yolo/extracted_frames"

This should be root/data/video_det

video.release()

# Apply pseudolabeling to the extracted frames
results = model(frame_paths, project=LABELED_OUTPUT_DIR, name=file[:-4])

Can you extract the bounding box representations as JSONs?
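
One way this could look, pulling the coordinates out of the ultralytics results objects (output path and naming follow the snippet above and are placeholders):

import json

for i, result in enumerate(results):
    json_data = []
    for box in result.boxes:
        json_data.append({
            "bbox": box.xyxy[0].tolist(),   # [x1, y1, x2, y2] in pixels
            "confidence": box.conf.item(),  # detection score
            "class": int(box.cls.item()),   # class id
            "class_name": model.names[int(box.cls.item())],
        })
    with open(f"{file[:-4]}_frame_{i}_boxes.json", "w") as f:
        json.dump(json_data, f, indent=2)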

@markus583

Plus, I think it would be useful to only select every nth frame here already, similar to how it is done in #14. In general, you can take big inspiration from that PR; it should be easy to adapt. @yahya010
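
Roughly, the selection could happen at read time so that only every nth frame is kept for pseudolabeling (a sketch; the function name is hypothetical):

import cv2

def iter_every_nth_frame(video_path, n):
    """Yield (frame_index, frame) for every nth frame of the video."""
    video = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if idx % n == 0:
            yield idx, frame
        idx += 1
    video.release()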

@yahya010
Author

Made the changes, so now the shards path is an argument. (The save paths will be changed on Todi; I kept them as they are on Euler for now because the YOLO model is a bit weird about where it puts the save directory.)

I also did the nth-frame selection, and the script now saves both the bounding-box image and the JSON output, like this:

[
    {
        "bbox": [
            621.873046875,
            300.680419921875,
            1166.7474365234375,
            656.718017578125
        ],
        "confidence": 0.8789449334144592,
        "class": 42,
        "class_name": "fork"
    },
    {
        "bbox": [
            0.431396484375,
            1.4971923828125,
            1157.58642578125,
            703.078857421875
        ],
        "confidence": 0.3820949196815491,
        "class": 60,
        "class_name": "dining table"
    },
    {
        "bbox": [
            261.07958984375,
            34.11528015136719,
            786.9713134765625,
            492.86407470703125
        ],
        "confidence": 0.2623625695705414,
        "class": 45,
        "class_name": "bowl"
    }
]

image

@markus583

Great! Thanks! It is also tested on Euler, right?

I realize you extract the tar file to a directory. I am not sure we want this. Maybe it is better not to keep the extracted files and just use a temp dir/temp file instead. I am doing the same in tokenization (#14).

Another thing we need to keep in mind is if we have different fps for different videos. In this case, should we normalize the --nth_frame argument so all videos behave in the same way? This also becomes relevant with the temporal embedding later; if we do not do this here (and also in tokenization, where it is not implemented either), we should ensure proper time offsets later.


# Load the YOLO model
model = YOLO('bbox-yolo/yolov8n.pt') # pretrained YOLOv8n model
model = YOLO('/cluster/work/cotterell/yemara/ml-4m/bbox-yolo/yolov8n.pt') # pretrained YOLOv8n model

super-nit: for more configuration, maybe even make this path an arg in the parseargs (and set this as the default)?


# Extract the tar file
with tarfile.open(SHARDS, "r") as tar:
    tar.extractall(path="bbox-yolo/extracted_files")
    tar.extractall(path="extracted_files")

+1 to Markus's comment -- better perhaps to do everything within a tempdir like this https://stackoverflow.com/questions/3223604/how-do-i-create-a-temporary-directory-in-python
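
A minimal sketch of that tempdir approach (the extracted files then exist only for the duration of the with block and are cleaned up automatically):

import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    with tarfile.open(SHARDS, "r") as tar:
        tar.extractall(path=tmpdir)
    # ... iterate over the extracted mp4s under tmpdir and pseudolabel them ...
# tmpdir and everything in it is deleted here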

conf = box.conf.item() # get confidence score
cls = int(box.cls.item()) # get class id
json_data.append({
"bbox": xyxy,

Can we use key-names consistent with the og 4M's bounding box json key names? (I'm 90% sure it's the ones in the example here - #11 (comment))

So for one frame that'd be like:

{
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                {
                    "boxes": [
                        0.4229210317134857,
                        0.00020096010121051222,
                        0.5715101361274719,
                        0.13699540495872498
                    ],
                    "score": 0.9029952883720398,
                    "class_id": 74,
                    "class_name": "clock",
                    "segmentation": [
                        [
                            0.5055187637969095,
                            0.1337890625,
                            ...
                        ]
                    ]
                },
                {
                    "boxes": [
                        ...
                    ],
                    ...
                },
                    ...
            ]
        },
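
For reference, converting one ultralytics result into that structure could look roughly like this (a sketch only: segmentation is omitted since the detection model does not produce masks, and it assumes result.orig_shape gives (height, width)):

def to_4m_format(result, model):
    # Box coordinates are normalized to [0, 1], as in the example above.
    h, w = result.orig_shape
    instances = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        instances.append({
            "boxes": [x1 / w, y1 / h, x2 / w, y2 / h],
            "score": box.conf.item(),
            "class_id": int(box.cls.item()),
            "class_name": model.names[int(box.cls.item())],
        })
    return {
        "num_instances": len(instances),
        "image_height": h,
        "image_width": w,
        "instances": instances,
    }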

video.release()

# Apply pseudolabeling to the extracted frames
results = model(frame_paths, project=LABELED_OUTPUT_DIR, name=file[:-4])

for i, result in enumerate(results):
    # Save labeled image
    result.save(filename=f'{file[:-4]}_labeled_frame_{i}.jpg')

Can you actually make this optional in an arg (default false)? While this is useful for debugging, when we run at scale I think we won't want to save every image.

})

# Save JSON file
json_filename = os.path.join(JSON_OUTPUT_DIR, f"{file[:-4]}_frame_{i}_boxes.json")

Can we actually do 2 more things for saving the results?

  1. Aggregate the results in a list of jsons and save it using jsonl (https://jsonlines.readthedocs.io/en/latest/). Each video should be saved with the same name as the mp4 (so if a file is named 00004.mp4, it should be saved as 00004.jsonl).
  2. Repackage the jsonl's back into tar files corresponding in name to the tar files containing those mp4s. For example, all videos extracted from video_rgb/00043.tar should have their corresponding jsonls in video_det/00043.tar. See "Transform from video_rgb format into video_det format and save in video_det/ directory" (#11 (comment)) for more deets! A rough sketch of both steps follows below.
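
A sketch of both steps (file layout and helper names are hypothetical; writing one JSON object per line with the stdlib is equivalent to using the jsonlines package):

import json
import os
import tarfile

def save_video_jsonl(per_frame_dicts, out_dir, video_name):
    # e.g. video_name "00004.mp4" -> "00004.jsonl", one JSON object per frame per line
    jsonl_path = os.path.join(out_dir, os.path.splitext(video_name)[0] + ".jsonl")
    with open(jsonl_path, "w") as f:
        for frame_dict in per_frame_dicts:
            f.write(json.dumps(frame_dict) + "\n")
    return jsonl_path

def repackage_shard(jsonl_paths, shard_name, det_dir="video_det"):
    # e.g. shard_name "00043.tar": videos from video_rgb/00043.tar -> video_det/00043.tar
    os.makedirs(det_dir, exist_ok=True)
    with tarfile.open(os.path.join(det_dir, shard_name), "w") as tar:
        for path in jsonl_paths:
            tar.add(path, arcname=os.path.basename(path))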

@kdu4108
Collaborator

kdu4108 commented Jul 30, 2024

Hey Yahya, looks much better! I just have a few more requests, left in-line, regarding how the outputs of YOLO should be saved; can you take a look please?

@kdu4108
Collaborator

kdu4108 commented Jul 30, 2024

> Another thing we need to keep in mind is if we have different fps for different videos. In this case, should we normalize the --nth_frame argument so all videos behave in the same way? This also becomes relevant with the temporal embedding later; if we do not do this here (and also in tokenization, where it is not implemented either), we should ensure proper time offsets later.

This is a good point. I think we have a couple of options here?

(1) we keep the same FPS for all videos for a given modality.
Pro: easy to deal with
Con: Maybe less flexible. Possible for some videos we want bounding boxes more or less frequently than others.

(2) we mark the FPS of the video for each modality in metadata.
We could add keys like "video_det_fps", "video_tok_rgb_fps", etc. to the metadata and hope the model learns to leverage that info usefully.
Pro: more flexible
Con: a little more work, a little more learning burden on the model, but hopefully doable?

I'd be a proponent of (2) but also would like to get a take from Ali/other 4M experts.

@markus583

> [quoting the FPS question and options (1)/(2) above]

Thx for coming up with the options. (1) may be enough for a start but could restrict us later, so it may indeed be better to implement this more flexibly already. So I agree that (2) is better.
However, I wonder if we need/want to rely on the metadata for this. Would it not be better to keep this as part of the temporal encoding? Hence the --every_nth_frame argument also becomes problematic: taking every 5th frame from a video with 25 fps results in different timestamps than every 5th frame from a video with 30 fps. So the temporal encoding should also change.

@kdu4108
Collaborator

kdu4108 commented Jul 30, 2024

Another TODO: instead of doing every-nth-frame, keep a fixed FPS for each modality. While less flexible, it's easier to implement and it doesn't seem necessary now to engineer for differing FPS for different videos.
We might want that in the future for e.g. video_tok_rgb for videos with drastically different lengths, but for now oh well.

The cost of changing this later is just needing to re-pseudolabel everything, but while we work with small amounts of data to start this shouldn't be a concern. Also do the same for #14
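
A sketch of sampling at a fixed target FPS instead of every nth frame (function name is hypothetical; assumes OpenCV reports a usable native fps for the video):

import cv2

def iter_frames_at_fps(video_path, target_fps):
    """Yield (timestamp_seconds, frame) at roughly target_fps, regardless of the video's native fps."""
    video = cv2.VideoCapture(video_path)
    native_fps = video.get(cv2.CAP_PROP_FPS) or target_fps  # fall back if fps is reported as 0
    step = max(1, round(native_fps / target_fps))  # take every `step`-th frame
    idx = 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if idx % step == 0:
            yield idx / native_fps, frame
        idx += 1
    video.release()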

@markus583

markus583 commented Jul 30, 2024

For the fixed fps, see b5f902a
This should be done in the same way for all modalities!
@yahya010

@markus583

@yahya010 thanks, please also move to fps instead of every_nth_frame (see comment above)
