Add transcript + metadata processing #15

Open
wants to merge 14 commits into
base: main


@markus583 markus583 commented Jul 16, 2024

Implements #16 and #12.

Transcript:

Output format (as jsonl for each video, with multiple videos in a tarfile):

[
    {
        "transcript": "take your ribbon and cut out five pieces",
        "start_frame_index": 430,
        "end_frame_index": 430
    },
    {
        "transcript": "these pieces are cut at 3 inches mom and",
        "start_frame_index": 541,
        "end_frame_index": 541
    },
    {
        "transcript": "then after you're done cutting them be",
        "start_frame_index": 600,
        "end_frame_index": 600
    },
...
]
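
For reference, a minimal sketch of how entries like the ones above could be stored as one JSONL member per video inside a tarfile. The helper names `write_transcripts_tar` / `read_transcripts_tar` and the `<key>.jsonl` member naming are assumptions made for illustration; they are not taken from the code in this PR.

```python
import io
import json
import tarfile


def write_transcripts_tar(tar_path, transcripts_by_key):
    """Write one JSONL member per video key into a tar archive.

    `transcripts_by_key` maps a video key to a list of dicts with
    "transcript", "start_frame_index" and "end_frame_index".
    The "<key>.jsonl" member naming is a hypothetical convention.
    """
    with tarfile.open(tar_path, "w") as tar:
        for key, entries in transcripts_by_key.items():
            # JSONL: one JSON object per line.
            payload = "".join(json.dumps(e) + "\n" for e in entries).encode("utf-8")
            info = tarfile.TarInfo(name=f"{key}.jsonl")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def read_transcripts_tar(tar_path):
    """Read the archive back into a {key: [entries]} dict."""
    result = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            if not member.name.endswith(".jsonl"):
                continue
            text = tar.extractfile(member).read().decode("utf-8")
            key = member.name[: -len(".jsonl")]
            result[key] = [json.loads(line) for line in text.splitlines() if line]
    return result
```

Keeping one member per video means a single video can be read without parsing the whole shard.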

Open issues (also see TODO/FIXME in the code):
(Outdated now; see discussion below)

  • What about errored videos? Why/when does this happen?
  • Videos with no subtitles?
  • Non-English videos? (Seemingly not available in the howto100m/v2d_40k subset?)
  • Check if timestep --> frame mapping is correct. Rounding is sensible? Or rather use floor/ceil for start/end? Also, the timestamps seem weirdly short within the transcripts...
  • Some dir stuff, but not problematic

Metadata:

Very similar to the transcript structure, but saved to json instead of jsonl.
TO DOs:

@markus583
Author

markus583 commented Jul 16, 2024

Works fine on todi using /store/swissai/a08/data/4m-data/train/DEBUG/v2d_40k/train/

@kdu4108
Collaborator

kdu4108 commented Jul 16, 2024

What about errored videos? Why/when does this happen?

Can you share the error? Likely it's that the video is now private or no longer exists and so can't be downloaded.

Videos with no subtitles?

How often does this happen?

Non-English videos? (Seemingly not available in the howto100m/v2d_40k subset?)

We intentionally decided to ignore non-English videos for now to keep the scope smaller.

Check if timestep --> frame mapping is correct. Rounding is sensible? Or rather use floor/ceil for start/end? Also, the timestamps seem weirdly short within the transcripts...

The timestep should be left inclusive, right exclusive. So if a clip is from timestamp 1m30 to 1m39 and frame A is at 1m29.9, frame B is 1m30.0, ..., frame Y is 1m39.9, frame Z is 1m40.0, then we would want start_frame_index=A and end_frame_index=Z

@markus583
Author

What about errored videos? Why/when does this happen?

Can you share the error? Likely it's that the video is now private or no longer exists and so can't be downloaded.
The json looks like this:
{'url': 'https://www.youtube.com/watch?v=maixx6u6WSM', 'key': '0000000045', 'status': 'failed_to_download', 'error_message': "[Errno 2] No such file or directory: '/tmp/3e5ec4c0-42c2-4784-9f87-4025a4bec186.m4a'", 'yt_meta_dict': None}
So, yes, seems like it could not download the video.

What is a bit more common is that yt_meta_dict is empty even though the status is 'success':
{'url': 'https://www.youtube.com/watch?v=ng_ELNno0A4', 'key': '0000000005', 'status': 'success', 'error_message': None, 'yt_meta_dict': {}}

Any idea when/why this occurs, @kdu4108? The yt_meta_dict contains not only the transcripts but also other info.
This happens in 5-10% of videos.
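
To make the two cases above easier to count, here is a small sketch that sorts download records into three buckets. The field names `status` and `yt_meta_dict` come from the JSON shown above; the function name `classify_record` and the bucket labels are assumptions for illustration only.

```python
from collections import Counter


def classify_record(record):
    """Return "failed", "no_metadata", or "ok" for one download record.

    The three labels are hypothetical; only the field names are taken
    from the records quoted in this thread.
    """
    if record.get("status") != "success":
        return "failed"
    if not record.get("yt_meta_dict"):
        # Covers both None and the empty dict {}.
        return "no_metadata"
    return "ok"


def summarize(records):
    """Count how many records fall into each bucket."""
    return Counter(classify_record(r) for r in records)
```

Running `summarize` over all records of a shard would give the share of videos without metadata directly.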

The timestep should be left inclusive, right exclusive. So if a clip is from timestamp 1m30 to 1m39 and frame A is at 1m29.9, frame B is 1m30.0, ..., frame Y is 1m39.9, frame Z is 1m40.0, then we would want start_frame_index=A and end_frame_index=Z

Seems sensible. Let's hold off on the implementation until we have metadata containing transcripts with proper timestamps.

--> TODO: adapt frame calculation

@markus583 markus583 changed the title Add transcript processing Add transcript + metadata processing Jul 18, 2024

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process tarfiles contati JSONs and convert to structured JSONL format."
Collaborator


contati?

hrs, mins, secs = map(float, timestamp.split(":"))
total_seconds = timedelta(hours=hrs, minutes=mins, seconds=secs).total_seconds()
# TODO: is round the right way of doing this? Most transcripts are assigned to only 1-2 frames...
return round(total_seconds * fps)
Collaborator


We should be left inclusive and right exclusive.
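
One possible reading of "left inclusive, right exclusive" is sketched below: floor the start, ceil the end, and treat the end index as exclusive. The function names `timestamp_to_seconds` and `frame_span`, the rounding guard, and the rule that every subtitle covers at least one frame are all assumptions, not what the PR implements.

```python
import math
from datetime import timedelta


def timestamp_to_seconds(timestamp):
    """Convert "HH:MM:SS(.fff)" to seconds, as in the snippet above."""
    hrs, mins, secs = map(float, timestamp.split(":"))
    return timedelta(hours=hrs, minutes=mins, seconds=secs).total_seconds()


def frame_span(start_ts, end_ts, fps):
    """Return a half-open frame range [start, end) for a subtitle.

    Assumption: start is floored (inclusive), end is ceiled (exclusive).
    """
    # Round away float noise (e.g. 33.00000000000001) before floor/ceil.
    start_pos = round(timestamp_to_seconds(start_ts) * fps, 6)
    end_pos = round(timestamp_to_seconds(end_ts) * fps, 6)
    start = math.floor(start_pos)
    end = math.ceil(end_pos)
    # Assumption: a subtitle always covers at least one frame.
    return start, max(end, start + 1)
```

With `round`, a subtitle whose start and end fall between the same two frames collapses to a single index; floor/ceil widens it to the frames it actually touches.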

@kdu4108
Collaborator

kdu4108 commented Jul 19, 2024

Looking very clean overall! Maybe what could be helpful is to include here an example of the jsons that are being converted into our format for both metadata and transcripts.

@markus583
Author

Examples

Metadata:

{
    "title": "HOW TO PAINT YOUR MOTORCYCLE PART#5 BASECOAT PREP",
    "duration": 435,
    "channel": "Troy Kane Vtwins to v8s",
    "fps": 30,
    "tags": [
        "Vtwinstov8s.com",
        "troy kane",
        "harley",
        "sportster",
        "how to video",
        "how to paint",
        "custom paint",
        "motorcycle",
        "basecoat",
        "clear coat"
    ],
    "resolution": "1280x720",
    "aspect_ratio": 1.78,
    "dataset": "howto100m"
}
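
As a quick sanity check on a metadata record like the one above, `aspect_ratio` can be recomputed from `resolution`. This is only a sketch: the helper names are hypothetical, and rounding to two decimals is an assumption inferred from the example value 1.78.

```python
def aspect_ratio_from_resolution(resolution):
    """Parse "WIDTHxHEIGHT" and return width / height rounded to 2 decimals."""
    width, height = (int(v) for v in resolution.lower().split("x"))
    return round(width / height, 2)


def aspect_ratio_is_consistent(metadata, tol=0.01):
    """Check that the stored aspect_ratio matches the stored resolution."""
    expected = aspect_ratio_from_resolution(metadata["resolution"])
    return abs(expected - metadata["aspect_ratio"]) <= tol
```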

Transcripts:
(DEPRECATED, wait for WhisperX)

[
    {
        "transcript": "all right everybody this is where what's",
        "start_frame_index": 84,
        "end_frame_index": 84
    },
    {
        "transcript": "going on today",
        "start_frame_index": 156,
        "end_frame_index": 156
    },
    {
        "transcript": "got this fender here",
        "start_frame_index": 537,
        "end_frame_index": 538
    },
    {
        "transcript": "it's got to go black just a gloss black",
        "start_frame_index": 648,
        "end_frame_index": 648
    },
    {
        "transcript": "but before i could spray it i gotta prep",
        "start_frame_index": 736,
        "end_frame_index": 737
    },
    {
        "transcript": "it",
        "start_frame_index": 737,
        "end_frame_index": 777
    },
    {
        "transcript": "i'm gonna take this pinstripe off both",
        "start_frame_index": 837,
        "end_frame_index": 838
    },
    {
        "transcript": "sides it's got some scratches",
        "start_frame_index": 955,
        "end_frame_index": 955
    },
    {
        "transcript": "a little surface rust here and there",
        "start_frame_index": 1010,
        "end_frame_index": 1010
    },
    {
        "transcript": "but it's just gonna be a quick wham jam",
        "start_frame_index": 1281,
        "end_frame_index": 1282
    },
    {
        "transcript": "job because i'm only charging the guy 50",
        "start_frame_index": 1380,
        "end_frame_index": 1380
    },
    {
        "transcript": "bucks to do it",
        "start_frame_index": 1447,
        "end_frame_index": 1447
    },
    {
        "transcript": "so it's gonna be a quickie",
        "start_frame_index": 1543,
        "end_frame_index": 1543
    },
    {
        "transcript": "huh kids just gonna be a quickie yeah",
        "start_frame_index": 1646,
        "end_frame_index": 1646
    },
    {
        "transcript": "in and out job",
        "start_frame_index": 1742,
        "end_frame_index": 1742
    },
    {
        "transcript": "gunned it down like three four times",
        "start_frame_index": 1850,
        "end_frame_index": 1850
    },
    {
        "transcript": "and get all the nasty nasty off of it",
        "start_frame_index": 1956,
        "end_frame_index": 1956
    },
    {
        "transcript": "to get the gun residue off",
        "start_frame_index": 2313,
        "end_frame_index": 2314
    },
    {
        "transcript": "just to get her clean before i start",
        "start_frame_index": 2404,
        "end_frame_index": 2405
    },
    {
        "transcript": "sanding and yeah i'm gonna get on it",
        "start_frame_index": 2512,
        "end_frame_index": 2513
    },
    {
        "transcript": "ah oh well",
        "start_frame_index": 3252,
        "end_frame_index": 3252
    },
    {
        "transcript": "yeah that's what's going on i'll bring",
        "start_frame_index": 3312,
        "end_frame_index": 3312
    },
    {
        "transcript": "it back",
        "start_frame_index": 3367,
        "end_frame_index": 3367
    },
    {
        "transcript": "after i get this thing ready to paint",
        "start_frame_index": 3462,
        "end_frame_index": 3462
    },
    {
        "transcript": "peace",
        "start_frame_index": 3462,
        "end_frame_index": 3552
    }
]

@kdu4108
Collaborator

kdu4108 commented Jul 22, 2024

@kdu4108
Collaborator

kdu4108 commented Jul 30, 2024

Thanks Markus, looks great! Two nits: (1) is there a reason it's called merge_data.py instead of train_val_test_split.py or something like that? and (2) can you add an example command showing how we would run that script, in a comment? In particular, I want to clarify: does that splitting script take as input a modality folder like video_rgb or video_det, as opposed to the raw video folder?

@markus583
Author

Sure, renamed the script.
The script can work either way. If I run it like this:
python pseudolabeling/merge_data.py --source_dir /store/swissai/a08/data/4m --output_dir /store/swissai/a08/data/4m/splits
(NB: the splits dir is not really necessary; it is only there to make the diff visible)

I get this as output:

Move /store/swissai/a08/data/4m/video_rgb_tok -----------> /store/swissai/a08/data/4m/splits/video_rgb_tok/train
Move /store/swissai/a08/data/4m/video_rgb_tok -----------> /store/swissai/a08/data/4m/splits/video_rgb_tok/val
Move /store/swissai/a08/data/4m/video_rgb_tok -----------> /store/swissai/a08/data/4m/splits/video_rgb_tok/test
/store/swissai/a08/data/4m/video_rgb_tok_full
Move /store/swissai/a08/data/4m/video_rgb_tok_full -----------> /store/swissai/a08/data/4m/splits/video_rgb_tok_full/train
Move /store/swissai/a08/data/4m/video_rgb_tok_full -----------> /store/swissai/a08/data/4m/splits/video_rgb_tok_full/val
Move /store/swissai/a08/data/4m/video_rgb_tok_full -----------> /store/swissai/a08/data/4m/splits/video_rgb_tok_full/test
/store/swissai/a08/data/4m/video_metadata
Move /store/swissai/a08/data/4m/video_metadata -----------> /store/swissai/a08/data/4m/splits/video_metadata/train
Move /store/swissai/a08/data/4m/video_metadata -----------> /store/swissai/a08/data/4m/splits/video_metadata/val
Move /store/swissai/a08/data/4m/video_metadata -----------> /store/swissai/a08/data/4m/splits/video_metadata/test
/store/swissai/a08/data/4m/video_rgb
Move /store/swissai/a08/data/4m/video_rgb -----------> /store/swissai/a08/data/4m/splits/video_rgb/train
Move /store/swissai/a08/data/4m/video_rgb -----------> /store/swissai/a08/data/4m/splits/video_rgb/val
Move /store/swissai/a08/data/4m/video_rgb -----------> /store/swissai/a08/data/4m/splits/video_rgb/test
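
The source/destination pairs in the log above can be reproduced with a few lines. This sketch only plans the paths. How files are assigned to train/val/test is not shown in this thread, so it is left out. The function name `plan_split_moves` is hypothetical, and the real script may discover modality folders differently.

```python
from pathlib import PurePosixPath

SPLITS = ("train", "val", "test")


def plan_split_moves(source_dir, output_dir, modalities):
    """Return (source, destination) pairs, one per modality and split.

    Pure path arithmetic: nothing is read from or written to disk.
    """
    source_dir = PurePosixPath(source_dir)
    output_dir = PurePosixPath(output_dir)
    return [
        (source_dir / modality, output_dir / modality / split)
        for modality in modalities
        for split in SPLITS
    ]


if __name__ == "__main__":
    for src, dst in plan_split_moves(
        "/store/swissai/a08/data/4m",
        "/store/swissai/a08/data/4m/splits",
        ["video_rgb_tok", "video_metadata", "video_rgb"],
    ):
        print(f"Move {src} -----------> {dst}")
```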

@markus583
Author

I think this is ready to merge now. One nice-to-have would be to integrate @kdu4108's logger, but let's get moving now and do that later. @kdu4108 @yahya010
