Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline Modifications for variable Audio and Sampler Generalisation to different Download Tasks #110

Merged
merged 12 commits into from
Nov 18, 2021

Conversation

khumairraj
Copy link
Contributor

No description provided.

@khumairraj khumairraj changed the title Pipeline Modifications for variable Audio and Sampler Generalisation to different Download Tasks [WIP] - Pipeline Modifications for variable Audio and Sampler Generalisation to different Download Tasks Nov 11, 2021
@khumairraj khumairraj changed the title [WIP] - Pipeline Modifications for variable Audio and Sampler Generalisation to different Download Tasks Pipeline Modifications for variable Audio and Sampler Generalisation to different Download Tasks Nov 16, 2021
hearpreprocess/sampler.py Show resolved Hide resolved
necessary_keys = sampler_config["necessary_keys"]

def requires(self):
return _get_download_and_extract_tasks(self.task_config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrmmm really? I mean this is the sampler so it's not core code, it's just for testing, but this seems wrong and smells bad.

requires() in luigi is supposed to return a Luigi task. Not do work. run() is where work should occur. Otherwise you might have weird luigi bugs that are hard to debug

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_download_and_extract_tasks returns the tasks which will download and extract. So technically, the requires is still returning a list of tasks. This is how our main pipeline is working where we use this function to build the task and then pass it in the ExtractMetadata as luigiParameter and put it in the requires. As discussed in the previous comment, the main reason to do this here, is that the _get_download_and_extract_tasks is different for different tasks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"get_download_and_extract_tasks"
]

class RandomSampleOriginalDataset(_RandomSampleOriginalDataset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have RandomSampleOriginalDataset and _RandomSampleOriginalDataset? Why can't we just have only RandomSampleOriginalDataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason to do this here like this is - The RandomSampleOriginalDataset requires the get_download_and_extract_task, which is a function and is specific to the task. This returns the task to download and extract the task. So the _RandomSampleOriginalDataset is overridden and named as RandomSampleOriginalDataset and the tasks are added. I cannot refer to both the global variable _RandomSampleOriginalDataset and local variable RandomSampleOriginalDataset inside the get_sampler_task function with the same name. So I had to make different names for them.

@@ -160,6 +160,9 @@ def get_audio_dir_stats(
all_file_paths,
)
)
if len(audio_paths) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait why would this happen? Why is this okay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no audio in the downloaded directory, we should return nothing. This can happen, for example, if a downloaded directory has just metadata files, we still need to put that in the requires of the ExtractMetadata, so that the task can actually run. This function, the one here, runs on all the downloaded directories in the requires. So, this is not a problem as we still have another assert below, So that if audio files are found and the stats are not calculated, the assert will throw, if audio files are not found, which is in this case, it will return the empty dict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# should also be None and no subsampling will be done
if task_config["sample_duration"] is None:
schema["max_task_duration_by_split"] = Schema(
{split: Or(int, float, None) for split in SPLITS}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None, not int, float, None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@turian turian merged commit 943439c into main Nov 18, 2021
@turian turian deleted the pipeline_changes branch November 18, 2021 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants