
MemoryError when computing WER metric #2078

Closed
diego-fustes opened this issue Mar 18, 2021 · 11 comments · Fixed by #2111
Assignees
Labels
metric bug A bug in a metric script

Comments

@diego-fustes

diego-fustes commented Mar 18, 2021

Hi, I'm trying to follow the ASR example to try Wav2Vec. This is the code that I use for WER calculation:

from datasets import load_metric

wer = load_metric("wer")
print(wer.compute(predictions=result["predicted"], references=result["target"]))

However, I receive the following exception:

Traceback (most recent call last):
  File "/home/diego/IpGlobal/wav2vec/test_wav2vec.py", line 51, in <module>
    print(wer.compute(predictions=result["predicted"], references=result["target"]))
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/datasets/metric.py", line 403, in compute
    output = self._compute(predictions=predictions, references=references, **kwargs)
  File "/home/diego/.cache/huggingface/modules/datasets_modules/metrics/wer/73b2d32b723b7fb8f204d785c00980ae4d937f12a65466f8fdf78706e2951281/wer.py", line 94, in _compute
    return wer(references, predictions)
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 81, in wer
    truth, hypothesis, truth_transform, hypothesis_transform, **kwargs
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 192, in compute_measures
    H, S, D, I = _get_operation_counts(truth, hypothesis)
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 273, in _get_operation_counts
    editops = Levenshtein.editops(source_string, destination_string)
MemoryError

My system has more than 10GB of available RAM. Looking at the code, I think it could be related to the way jiwer does the calculation, as it concatenates all the sentences into a single string before calling the Levenshtein editops function.
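For illustration, accumulating per-sentence edit counts avoids ever building one giant string. This is a hypothetical pure-Python sketch of the iterative idea, not jiwer's actual code:

```python
def word_edit_distance(ref_words, hyp_words):
    # Classic dynamic-programming Levenshtein distance over word lists,
    # keeping only one previous row to stay memory-light.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[-1]

def corpus_wer(references, predictions):
    # Accumulate errors and reference lengths sentence by sentence,
    # so memory use is bounded by the longest single sentence pair.
    errors = total = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += word_edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return errors / total
```

With this shape, each call to `word_edit_distance` only sees one sentence pair, so the quadratic edit-distance table never covers the whole corpus at once.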

@lhoestq
Member

lhoestq commented Mar 24, 2021

Hi ! Thanks for reporting.
We're indeed using jiwer to compute the WER.

Maybe instead of calling jiwer.wer once for all the predictions/references we can compute the WER iteratively to avoid memory issues? I'm not too familiar with jiwer but this must be possible.

Currently the code to compute the WER is defined here:

https://github.com/huggingface/nlp/blob/349ac4398a3bcae6356f14c5754483383a60e8a4/metrics/wer/wer.py#L93-L94

@diego-fustes
Author

diego-fustes commented Apr 5, 2021

Hi,

I've just pushed a pull request that is related to this issue: #2169. It's not iterative, but it should avoid memory errors. It's based on the editdistance Python library. An iterative implementation should be as easy as accumulating scores and word counts stepwise and dividing at the end.

@diego-fustes
Author

I see, this was solved in another thread. OK, let me know if you want to switch the implementation for any reason :)

@lhoestq
Member

lhoestq commented Apr 6, 2021

Thanks for diving into this anyway ^^'
As you said, this actually got solved a few days ago.

@nikvaessen

Someone created an issue jitsi/jiwer#40 at jiwer which shows that this is still a problem in the current version. Would be curious to figure out how this can be fixed by jiwer... :) I assume that it runs out of memory because it's trying to compute the WER over (too many) test samples?

@lhoestq
Member

lhoestq commented Apr 29, 2021

Hi !

It's computed iteratively, so I'm not sure what could go wrong:

datasets/metrics/wer/wer.py

Lines 100 to 106 in 8afd0ba

incorrect = 0
total = 0
for prediction, reference in zip(predictions, references):
    measures = compute_measures(reference, prediction)
    incorrect += measures["substitutions"] + measures["deletions"] + measures["insertions"]
    total += measures["substitutions"] + measures["deletions"] + measures["hits"]
return incorrect / total

@NiklasHoltmeyer what version of datasets are you running ?

@albertvillanova
Member

albertvillanova commented Apr 30, 2021

One possible explanation might be that the user is passing all the sentences as a single element to wer.compute.

As the current implementation iterates over the elements of predictions and references, this can be problematic if predictions and references contain a single huge element each.

This could be the case, for example, with a single string with all sentences:

result["predicted"] = "One sentence. Other sentence."

or with a double nested list of sentence lists

result["predicted"] = [[ ["One sentence."], ["Other sentence"] ]]

The user should check the dimensions of the data structure passed to predictions and references.
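For illustration, a small helper that normalizes either of those shapes into the flat list of sentence strings the metric expects. This is a hypothetical sketch, not part of the datasets library:

```python
def as_sentence_list(x):
    # Accept a single string, a flat list of strings, or an arbitrarily
    # nested list of strings, and return a flat list of sentence strings.
    if isinstance(x, str):
        return [x]
    flat = []
    for item in x:
        flat.extend(as_sentence_list(item))  # recurse into nested lists
    return flat
```

Passing the flattened result to wer.compute ensures each element is one sentence, so the per-element iteration stays cheap.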

@diego-fustes
Author

diego-fustes commented Apr 30, 2021

Hi all,

in my case I was using an older version of datasets and, as @albertvillanova points out, passing the full list of sentences for the metric calculation. The problem was in the way jiwer implements WER, as it tries to compute WER for the full list at once instead of doing it element-wise. I think that with the latest implementation of datasets, or by using the alternative WER function that I've contributed in this pull request, there shouldn't be memory errors.

@NiklasHoltmeyer

@lhoestq I was using Datasets==1.5.0; with 1.6.1 it worked (at least on the first run), but 1.5.0 is not compatible with my preprocessing. I can't save my dataset to a Parquet file while using the latest datasets version:

  File "../preprocess_dataset.py", line 132, in <module>
    pq.write_table(train_dataset.data, f'{resampled_data_dir}/{data_args.dataset_config_name}.train.parquet')
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1674, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 588, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
TypeError: Argument 'table' has incorrect type (expected pyarrow.lib.Table, got ConcatenationTable)

This happens if I do

import pyarrow.parquet as pq
...
...
pq.write_table(train_dataset.data, 'train.parquet')
pq.write_table(eval_dataset.data, 'eval.parquet')

while using 1.6.1; it works with 1.5.0.

@lhoestq
Member

lhoestq commented Apr 30, 2021

Hi ! You can pass dataset.data.table instead of dataset.data to pq.write_table

@NiklasHoltmeyer

This seems to be working so far! Thanks!
