Add modality info and transforms for raw RGB, tokens, video descriptions, video transcripts, and video bounding boxes. #1

Open
wants to merge 9 commits into base: main

Conversation

kdu4108 (Collaborator) commented Jun 21, 2024

Goal: Inputs from each modality, after being loaded from their paths, need to be transformed so that they can be used downstream by the model. These transforms are defined in modality_transforms.py. modality_info.MODALITY_TRANSFORMS defines the mapping from each modality ID (e.g., "tok_rgb") to its corresponding transformation. In this PR, we need to define new transformations so that each video modality is represented as described in https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144.
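For concreteness, here's a rough sketch of what the new MODALITY_TRANSFORMS entries could look like; the video modality ID strings are placeholder names (not the final IDs), and the transform classes are the ones added in modality_transforms.py in this PR:

```python
# Sketch of the mapping in modality_info.py. The video keys below are placeholder IDs;
# the transform classes are the ones defined in modality_transforms.py.
MODALITY_TRANSFORMS = {
    "tok_rgb": TokTransform(),                          # existing image-token transform
    "video_tok": VideoTokTransform(),                   # tokenized video frames
    "video_description": VideoDescriptionTransform(),   # free-text descriptions of the video
    "video_transcript": VideoTranscriptTransform(),     # spoken-word transcripts
    "video_det": VideoDetTransform(),                   # per-frame bounding boxes
}
```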

Summary: So far, I've implemented but NOT tested (not at all, since I haven't done any of the data setup yet; this is entirely me spitballing on what the code should maybe look like, and it probably has a million little runtime errors, sorry!):

(a) the modality info mapping (I feel pretty good about this, but @garjania please take a look to make sure it's okay!), and
(b) tentative implementations of the transforms VideoTokTransform, VideoDescriptionTransform, VideoTranscriptTransform, and VideoDetTransform. Of these, I feel best about VideoDetTransform and VideoTokTransform. I have several TODOs in the code indicating open questions. Most of these revolve around what kind of data augmentations (the image_augment function) we should keep for the different video transforms. The input format and desired output shapes for some of these are also not entirely clear to me (e.g., should the outputs already be unrolled from (num_frames, num_tokens_per_frame) to (num_frames * num_tokens_per_frame,)?), so any input there would be great!

Remaining TODOs: implement the raw RGB transform, clarify/confirm the open questions, update the implementation accordingly, fix the runtime errors, and test that all the transforms run correctly.
Reviewers: @garjania @vesteinn @yahya

kdu4108 (Collaborator, Author) commented Jun 21, 2024

Oops, I apologize for the extra files. Everything except modality_transforms.py and modality_info.py should just be comments, so feel free to ignore them.

"""

# raise NotImplementedError("I'm not ")
def load(self, path):

Note: webd formats are not loaded using this function. This is only applicable for non-webd data storage formats (Ali).


datapipe = wds.DataPipeline(...) is the pipeline for loading webd and running the preprocess/augment/postprocess steps.

def wds_decoder(key, value) is the function that actually loads the specific files from the webd tars.
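For context, a minimal sketch of what such a pipeline could look like; the shard pattern and the wds_decoder/preprocess_sample bodies are placeholders, not the actual configuration used here:

```python
import webdataset as wds

def wds_decoder(key, value):
    # Placeholder decoder: this is the hook that loads the specific files from the
    # webd tars, dispatching on the extension inside the tar (e.g. ".npy" token
    # arrays, ".json" metadata). Returning None lets other handlers try the entry.
    return None

def preprocess_sample(sample):
    # Placeholder for the per-modality preprocess/augment/postprocess chain.
    return sample

datapipe = wds.DataPipeline(
    wds.SimpleShardList("shards/{000000..000099}.tar"),  # placeholder shard pattern
    wds.tarfile_to_samples(),
    wds.decode(wds_decoder),
    wds.map(preprocess_sample),
)
```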


All modalities should use webd and be compressed into webd/tar files.


This means the load function will (probably) never be called for webd/our video-based modalities

class VideoTokTransform(AbstractTransform):
"""
Assume input tokens is an ndarray of shape (num_frames, num_tokens_per_frame).
Transform the tokens to a torch tensor of shape (num_frames * num_tokens_per_frame,).

Maybe don't flatten it here, because that will make it a little more annoying to add the temporal embedding. Instead we can flatten after adding the temporal embedding.
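To make the shape handling concrete, a minimal sketch of that suggestion, written as a standalone function standing in for VideoTokTransform's postprocess step (the .long() cast is an assumption):

```python
import numpy as np
import torch

def postprocess_video_tokens(v: np.ndarray) -> torch.Tensor:
    # Keep the 2D (num_frames, num_tokens_per_frame) layout here so the temporal
    # embedding can be added per frame downstream; only flatten to
    # (num_frames * num_tokens_per_frame,) after that step, e.g. with tokens.reshape(-1).
    tokens = torch.from_numpy(v).long()
    return tokens
```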

def preprocess(self, sample):
return sample

def image_augment(

Role of this: the augmentation is done during the tokenization step, before this transform runs.


TokTransform's v input is actually a list of tokenized inputs which are different augmentations of the same input. So that transform requires an index to select one of the augmented inputs.


So here, we should assume no augmentations but keep the input v as a list of augmentations; since in this case there are none, just take v[0] always.
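A minimal sketch of that behavior as a VideoTokTransform method; the argument list is illustrative and should mirror whatever TokTransform actually takes:

```python
def image_augment(self, v, crop_coords, flip, orig_size, target_size, rand_aug_idx=None):
    # No augmentations are pre-computed for video tokens, so v is a one-element list
    # of token arrays and we always return the single entry.
    return v[0]
```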

def image_augment(
self,
bboxes_per_frame: List[List[Tuple]],
crop_coords: Tuple,

No rand_aug_idx for this because the other parameters actually specify which aug to apply


so we have the same crop for each frame already!
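A rough sketch of the VideoDetTransform method this implies; bbox_crop_and_resize is a hypothetical per-frame helper standing in for whatever box-adjustment logic the existing detection transform uses:

```python
from typing import List, Tuple

def image_augment(
    self,
    bboxes_per_frame: List[List[Tuple]],
    crop_coords: Tuple,
    flip: bool,
    orig_size: Tuple,
    target_size: Tuple,
):
    # The crop/flip parameters arrive already resolved, so the same augmentation is
    # applied to every frame's boxes, keeping the clip temporally consistent.
    return [
        self.bbox_crop_and_resize(bboxes, crop_coords, flip, orig_size, target_size)  # hypothetical helper
        for bboxes in bboxes_per_frame
    ]
```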

The innermost tuple is the actual representation of a bounding box instance (of the form [xmin, ymin, xmax, ymax, class_name, score]).
"""
instances = [frame["instances"] for frame in sample]
return self.convert_detection_instance(instances)

bugfix: wrap this in a list

for i, bboxes in enumerate(bboxes_per_frame):
bboxes_str = self.convert_bboxes_to_string(bboxes)  # per-frame boxes, not the whole list
output_str += bboxes_str
output_str += f"<frame_{i+1}_token>"

these new special tokens need to be added to the tokenizer jsons -- https://github.com/swiss-ai/ml-4m/blob/main/fourm/utils/tokenizer/trained/text_tokenizer_4m_wordpiece_30k.json (we can manually add these tokens to the json file)

bboxes_str = self.convert_bboxes_to_string(bboxes)  # per-frame boxes, not the whole list
output_str += bboxes_str
output_str += f"<frame_{i+1}_token>"
output_str += "<eos_token>" # TODO: Do we need to explicitly add an EOS token here? If yes, we need to add the model's actual eos_token string rather than this hardcoded one.

let's reuse the eos token from the text tokenizer
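A hedged sketch of the string-building loop inside VideoDetTransform with the tokenizer's own EOS token; text_tokenizer and its eos_token attribute are assumptions standing in for however the text tokenizer is actually exposed here:

```python
# Build the per-frame bbox string, separating frames with special frame tokens (which
# still need to be added to the tokenizer vocabulary, per the comment above), and end
# with the text tokenizer's EOS token rather than a hardcoded "<eos_token>" string.
eos_token = text_tokenizer.eos_token  # assumed attribute; adapt to the real tokenizer API
output_str = ""
for i, bboxes in enumerate(bboxes_per_frame):
    output_str += self.convert_bboxes_to_string(bboxes)
    output_str += f"<frame_{i+1}_token>"
output_str += eos_token
```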


print(
"WARNING: no augmentations implemented for transcripts yet. Decide whether to augment/what these should be and then remove this warning."
)
return val

No augmentations for now. val is actually the caption.

def preprocess(self, sample):
return sample

def image_augment(

Precondition: assume val does not have any augmentations (so it is a list[dict]).
Postcondition: the output is also of type list[dict].
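A minimal pass-through sketch matching that contract; the extra arguments are illustrative and only there for interface parity:

```python
from typing import Dict, List

def image_augment(self, val: List[Dict], crop_coords, flip, orig_size, target_size, rand_aug_idx=None) -> List[Dict]:
    # Transcripts carry no augmentations for now, so the list of per-segment transcript
    # dicts passes through unchanged: list[dict] in, list[dict] out.
    return val
```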
