
MemoryError when computing WER metric #2078

Closed
diego-fustes opened this issue Mar 18, 2021 · 11 comments · Fixed by #2111
Assignees
Labels
metric bug A bug in a metric script

Comments

@diego-fustes

diego-fustes commented Mar 18, 2021

Hi, I'm trying to follow the ASR example to try Wav2Vec. This is the code that I use for WER calculation:

from datasets import load_metric

wer = load_metric("wer")
print(wer.compute(predictions=result["predicted"], references=result["target"]))

However, I receive the following exception:

Traceback (most recent call last):
  File "/home/diego/IpGlobal/wav2vec/test_wav2vec.py", line 51, in <module>
    print(wer.compute(predictions=result["predicted"], references=result["target"]))
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/datasets/metric.py", line 403, in compute
    output = self._compute(predictions=predictions, references=references, **kwargs)
  File "/home/diego/.cache/huggingface/modules/datasets_modules/metrics/wer/73b2d32b723b7fb8f204d785c00980ae4d937f12a65466f8fdf78706e2951281/wer.py", line 94, in _compute
    return wer(references, predictions)
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 81, in wer
    truth, hypothesis, truth_transform, hypothesis_transform, **kwargs
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 192, in compute_measures
    H, S, D, I = _get_operation_counts(truth, hypothesis)
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 273, in _get_operation_counts
    editops = Levenshtein.editops(source_string, destination_string)
MemoryError

My system has more than 10GB of available RAM. Looking at the code, I think it could be related to the way jiwer does the calculation, as it concatenates all the sentences into a single string before calling the Levenshtein editops function.
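For illustration, accumulating per-sentence edit counts avoids ever building one giant string. This is a hypothetical pure-Python sketch of the iterative idea, not jiwer's actual code:

```python
def word_edit_distance(ref_words, hyp_words):
    # Classic dynamic-programming Levenshtein distance over word lists,
    # keeping only one previous row to stay memory-light.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[-1]

def corpus_wer(references, predictions):
    # Accumulate errors and reference lengths sentence by sentence,
    # so memory use is bounded by the longest single sentence pair.
    errors = total = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += word_edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return errors / total
```

With this shape, each call to `word_edit_distance` only sees one sentence pair, so the quadratic edit-distance table never covers the whole corpus at once.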

@lhoestq
Member

lhoestq commented Mar 24, 2021

Hi ! Thanks for reporting.
We're indeed using jiwer to compute the WER.

Maybe instead of calling jiwer.wer once for all the predictions/references we can compute the WER iteratively to avoid memory issues? I'm not too familiar with jiwer but this must be possible.

Currently the code to compute the WER is defined here:

https://github.com/huggingface/nlp/blob/349ac4398a3bcae6356f14c5754483383a60e8a4/metrics/wer/wer.py#L93-L94

@diego-fustes
Author

diego-fustes commented Apr 5, 2021

Hi,

I've just pushed a pull request that is related to this issue: #2169. It's not iterative, but it should avoid memory errors. It's based on the editdistance Python library. An iterative implementation should be as easy as accumulating scores and word counts stepwise and dividing at the end.

@diego-fustes
Author

I see, this was solved in another thread. OK, let me know if you want to switch the implementation for any reason :)

@lhoestq
Member

lhoestq commented Apr 6, 2021

Thanks for diving into this anyway ^^'
As you said, this actually got solved a few days ago.

@nikvaessen

Someone created an issue jitsi/jiwer#40 at jiwer which shows that this is still a problem in the current version. Would be curious to figure out how this can be fixed by jiwer... :) I assume that it runs out of memory because it's trying to compute the WER over (too many) test samples?

@lhoestq
Member

lhoestq commented Apr 29, 2021

Hi !

It's computed iteratively, so I'm not sure what could go wrong:

datasets/metrics/wer/wer.py

Lines 100 to 106 in 8afd0ba

incorrect = 0
total = 0
for prediction, reference in zip(predictions, references):
    measures = compute_measures(reference, prediction)
    incorrect += measures["substitutions"] + measures["deletions"] + measures["insertions"]
    total += measures["substitutions"] + measures["deletions"] + measures["hits"]
return incorrect / total

@NiklasHoltmeyer what version of datasets are you running ?

@albertvillanova
Member

albertvillanova commented Apr 30, 2021

One possible explanation might be that the user is passing all the sentences as a single element to wer.compute.

As the current implementation iterates over the elements of predictions and references, this can be problematic if predictions and references contain a single huge element each.

This could be the case, for example, with a single string with all sentences:

result["predicted"] = "One sentence. Other sentence."

or with a double nested list of sentence lists

result["predicted"] = [[ ["One sentence."], ["Other sentence"] ]]

The user should check the dimensions of the data structure passed to predictions and references.
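For illustration, a small helper that normalizes either of those shapes into the flat list of sentence strings the metric expects. This is a hypothetical sketch, not part of the datasets library:

```python
def as_sentence_list(x):
    # Accept a single string, a flat list of strings, or an arbitrarily
    # nested list of strings, and return a flat list of sentence strings.
    if isinstance(x, str):
        return [x]
    flat = []
    for item in x:
        flat.extend(as_sentence_list(item))  # recurse into nested lists
    return flat
```

Passing the flattened result to wer.compute ensures each element is one sentence, so the per-element iteration stays cheap.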

@diego-fustes
Author

diego-fustes commented Apr 30, 2021

Hi all,

in my case I was using an older version of datasets and, as @albertvillanova points out, passing the full list of sentences for the metric calculation. The problem was in the way jiwer implements WER, as it tries to compute WER for the full list at once instead of doing it element-wise. I think that with the latest implementation of datasets, or by using the alternative WER function that I've contributed in this pull request, there shouldn't be memory errors.

@NiklasHoltmeyer

@lhoestq I was using Datasets==1.5.0; with 1.6.1 it worked (at least on the first run), but 1.5.0 is not compatible with my preprocessing. I can't save my dataset to a Parquet file while using the latest datasets version:

  File "../preprocess_dataset.py", line 132, in <module>
    pq.write_table(train_dataset.data, f'{resampled_data_dir}/{data_args.dataset_config_name}.train.parquet')
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1674, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 588, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
TypeError: Argument 'table' has incorrect type (expected pyarrow.lib.Table, got ConcatenationTable)

This happens if I do

import pyarrow.parquet as pq
...
...
pq.write_table(train_dataset.data, 'train.parquet')
pq.write_table(eval_dataset.data, 'eval.parquet')

while using 1.6.1; it works with 1.5.0.

@lhoestq
Member

lhoestq commented Apr 30, 2021

Hi ! You can pass dataset.data.table instead of dataset.data to pq.write_table

@NiklasHoltmeyer

This seems to be working so far! Thanks!
