
Chime4 #423

Open · JorisCos wants to merge 10 commits into master
Conversation

@JorisCos (Collaborator) commented Feb 2, 2021

About this PR

This PR introduces a recipe intended for evaluation only, on real noisy data.
It uses data from the CHiME 4 challenge (an adaptation of the CHiME 3 challenge).

Design

I am unsure about the design of the dataset and the recipe:

  • The dataset mimics the behaviour of the other datasets by returning a mixture and sources (here, fake sources that are actually a copy of the mixture).
  • The recipe only contains scripts for evaluation; it might be worth making it more general rather than limiting it to a single model.
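As a rough sketch of the first point (the constructor arguments here are hypothetical; the actual class in the PR reads its metadata from CSV files), the dataset could look like:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class CHiME4(Dataset):
    """CHiME-4 real noisy utterances. There is no clean reference, so the
    "sources" returned are a copy of the mixture, kept only so that the
    (mixture, sources) interface matches the other Asteroid datasets."""

    def __init__(self, rows, return_id=False):
        # `rows`: one dict per utterance with a waveform and an utterance ID
        self.rows = rows
        self.return_id = return_id

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        mixture = torch.from_numpy(np.asarray(row["wav"], dtype=np.float32))
        # Fake source: a stacked copy of the mixture, shape (1, n_samples)
        mock_source = torch.vstack([mixture])
        if self.return_id:
            return mixture, mock_source, [row["ID"]]
        return mixture, mock_source
```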

@mpariente (Collaborator)

I'm undecided about this.
On the one hand, having data prep and eval scripts for Chime4 is cool, and it's an example for other models, but we could probably make it more modular than this.

But to make it more modular, we should attach the data prep to the dataloader (needs refactoring of the data part of Asteroid, which we want to do), and we should enable evaluating any model on any dataset (also needs refactoring).

I'd vote in favour of merging it, because it diversifies the examples that we have (both for users and as things to keep in mind when we refactor).

@jonashaag @popcornell

@jonashaag (Collaborator)

Yes, I'd say merge it, as it's not introducing any new way of doing things, so it doesn't add much work to our refactoring plate.

Did you forget to rename the "chime3" variables and CLI flag or is that on purpose?

@JorisCos (Collaborator, Author) commented Feb 4, 2021

> Yes, I'd say merge it, as it's not introducing any new way of doing things, so it doesn't add much work to our refactoring plate.
>
> Did you forget to rename the "chime3" variables and CLI flag or is that on purpose?

That's on purpose: CHiME 4 uses data from CHiME 3, and when you download the data you actually get a directory named CHiME3. But maybe it's confusing.

@mpariente (Collaborator)

I guess it will be clear once people have the data from CHiME3 downloaded.

@mpariente (Collaborator) left a comment


Mainly good to do but needs a last polishing.

I feel the create_metadata.py file is not so readable, mainly because of the function flow. I'd have to think about what exactly I'd change in the design.

egs/chime4/README.md (resolved)
import os


class CHiME4(Dataset):

Let's call this CHiME4Dataset.

Comment on lines 67 to 72
mixture = torch.from_numpy(mixture)
mock_source = torch.vstack([mixture])
if self.return_id:
    id1 = row.ID
    return mixture, mock_source, [id1]
return mixture, mock_source

Let's not return the mock_source.

Comment on lines 98 to 104
mix, sources, ids = test_set[idx]
mix, sources = tensors_to_device([mix, sources], device=model_device)
est_sources = model(mix.unsqueeze(0))
mix_np = mix.cpu().data.numpy()
sources_np = sources.cpu().data.numpy()
est_sources_np = est_sources.squeeze(0).cpu().data.numpy()
est_sources_np *= np.max(np.abs(mix_np)) / np.max(np.abs(est_sources_np))

Remove the code related to sources.

utt_metrics.update(
**wer_tracker(
mix=mix_np,
clean=sources_np,

Extend WerTracker to work without clean reference. And remove clean.
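One possible shape for that change (a sketch with hypothetical names; `asr_decode` and `wer_fn` are stand-ins, not the recipe's actual interface): make `clean` an optional keyword and only compute the reference-side WER when it is given.

```python
class WerTracker:
    """Track WER of enhanced speech; the clean reference is optional."""

    def __init__(self, asr_decode, wer_fn):
        self.asr_decode = asr_decode  # callable: waveform -> hypothesis text
        self.wer_fn = wer_fn          # callable: (truth, hypothesis) -> WER

    def __call__(self, mix, estimate, clean=None, transcript=""):
        metrics = {"wer_est": self.wer_fn(transcript, self.asr_decode(estimate))}
        if clean is not None:
            # Only possible on simulated data, where clean speech exists
            metrics["wer_clean"] = self.wer_fn(transcript, self.asr_decode(clean))
        return metrics
```

On real CHiME-4 data the caller would simply omit `clean`, and the tracker returns only the estimate-side WER.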

# Whether the ASR model was trained on clean or noisy data. One of: clean, noisy
asr_type=noisy

test_dir=data/test

Does this need to be overwritten?
If it doesn't, then add it after utils/parse_options.sh

Comment on lines 39 to 65
if [[ $stage -le 0 ]]; then
  echo "Stage 0: Generating CHiME-4 dataset"
  $python_path local/create_metadata.py --chime3_dir $storage_dir/CHiME3/
fi

if [[ $stage -le 1 ]]; then
  echo "Stage 1: Evaluation"
  echo "Results from the following experiment will be stored in $exp_dir/chime4/$asr_type"

  if [[ $compute_wer -eq 1 ]]; then
    # Install espnet if not installed
    if ! python -c "import espnet" &> /dev/null; then
      echo 'This recipe requires espnet. Installing requirements.'
      $python_path -m pip install espnet_model_zoo
      $python_path -m pip install jiwer
      $python_path -m pip install tabulate
    fi
  fi

  $python_path eval.py \
    --exp_dir $exp_dir \
    --test_dir $test_dir \
    --use_gpu $eval_use_gpu \
    --compute_wer $compute_wer \
    --asr_type $asr_type
fi

Fix indents, but otherwise OK

As the channel to use for the training set wasn't defined by
the challenge's rules, we will set it randomly.

NOTE :

Suggested change
NOTE :
**Note :**


NOTE :
This dataset uses real noisy data. This means the clean speech from the noisy
utterances is not available. This makes it not suitable for the usual training

Suggested change
utterances is not available. This makes it not suitable for the usual training
utterances is not available. This makes it unsuitable for the usual training

egs/chime4/README.md (resolved)
@jonashaag (Collaborator) left a comment

Manuel asked for some review of create_metadata.py so here we go :)

f for f in glob(os.path.join(chime3_dir, "data", "annotations", "*real*.list"))
]
for c3_annot_file_path in c3_annot_files:
# Read CHiME-3 annotation file

I think it could be helpful to move the entire for loop body into a separate function so as to make it obvious that all loop iterations are entirely independent.
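Concretely, the refactor could look something like this (`process_annot_file` and `create_all_metadata` are hypothetical names, and the per-file body is elided to a placeholder):

```python
import os
from glob import glob


def process_annot_file(chime3_dir, c3_annot_file_path):
    """Handle a single CHiME-3 annotation file. Each call is independent
    of the others, which the per-file function makes explicit."""
    # ... read the annotation file and build its metadata here ...
    return os.path.basename(c3_annot_file_path)  # placeholder result


def create_all_metadata(chime3_dir):
    pattern = os.path.join(chime3_dir, "data", "annotations", "*real*.list")
    # sorted() keeps the processing order deterministic
    return [process_annot_file(chime3_dir, p) for p in sorted(glob(pattern))]
```

A per-file function would also make the loop trivially parallelisable later, if metadata creation ever becomes slow.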

c3_annot_file_path.replace(".json", "_1ch_track.list"), header=None, names=["path"]
)
else:
c4_annot_file = None

This case isn't handled in the create_dataframe function

# Read CHiME-3 annotation file
c3_annot_file = pd.read_json(c3_annot_file_path)
# subset: "tr", "dt" or "et"; origin: "real" or "simu"
subset, origin = os.path.split(c3_annot_file_path)[1].replace(".json", "").split("_")

I think the entire code could benefit from using the pathlib API. E.g. here, `c3_annot_file_path.with_suffix("").name.split("_")`

Also I think it could be helpful to add a comment with a filename example everywhere filenames are parsed or constructed, for example

# Extract subset and origin from /foo/bar/<subset>_<origin>.json
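Putting both suggestions together, the parsing could become (a sketch; `parse_annot_name` is a hypothetical helper name, and "dt05_real.json" an illustrative filename):

```python
from pathlib import Path


def parse_annot_name(c3_annot_file_path):
    # Extract subset and origin from /foo/bar/<subset>_<origin>.json,
    # e.g. "dt05_real.json" -> ("dt05", "real")
    subset, origin = Path(c3_annot_file_path).with_suffix("").name.split("_")
    return subset, origin
```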

)
else:
c4_annot_file = None
df, df_2 = create_dataframe(chime3_dir, c3_annot_file, c4_annot_file, subset, origin)

Could there be better names than df and df_2?


def create_dataframe(chime3_dir, c3_annot_file, c4_annot_file, subset, origin):
# Empty list for DataFrame creation
row_list = []

Again, is there a more helpful name than "row" and "row 2" here?

env = row.environment
# if we are not dealing with tr subset
if "tr" not in subset:
mixture_path = c4_annot_file[c4_annot_file["path"].str.contains(ID + "_" + env)].values[

Here, and also in the else case below, it could be helpful to add a comment with an example of what we're looking for in the annotation data frame, like I suggested with the path splitting above.

JorisCos and others added 3 commits February 9, 2021 17:53
reformat create_metadata.py
fix typo indent
fix eval