MemoryError when computing WER metric #2078
Hi! Thanks for reporting. Maybe instead of calling `jiwer.wer` once on the full lists of sentences, the score could be accumulated iteratively. Currently the code to compute the WER is defined here:
Hi, I've just pushed a pull request that is related to this issue: #2169. It's not iterative, but it should avoid memory errors. It's based on the `editdistance` Python library. An iterative implementation should be as easy as storing scores and word counts stepwise and dividing at the end.
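An iterative implementation along those lines might look like this (a hedged sketch in pure Python; this is not the code from #2169, which uses the `editdistance` library, and the function names are illustrative):

```python
def word_edit_distance(ref_words, hyp_words):
    # Classic dynamic-programming Levenshtein distance over word tokens,
    # keeping only one row of the cost matrix at a time.
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def iterative_wer(predictions, references):
    # Accumulate edit errors and reference word counts pair by pair,
    # dividing once at the end, so memory stays bounded by the longest
    # single sentence rather than the whole corpus.
    total_errors = total_words = 0
    for pred, ref in zip(predictions, references):
        ref_words = ref.split()
        total_errors += word_edit_distance(ref_words, pred.split())
        total_words += len(ref_words)
    return total_errors / total_words
```

Because each sentence pair is scored independently, the peak memory is proportional to one sentence, not to the concatenation of all of them.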
I see, this was solved in another thread. OK, let me know if you want to switch the implementation for any reason :)
Thanks for diving into this anyway ^^'
Someone created an issue at jiwer (jitsi/jiwer#40) which shows that this is still a problem in the current version. Would be curious to figure out how this can be fixed by jiwer... :) I assume that it runs out of memory because it's trying to compute the WER over (too many) test samples?
Hi! It's computed iteratively, so I'm not sure what could go wrong (see lines 100 to 106 in 8afd0ba).
@NiklasHoltmeyer what version of `datasets` are you using?
One possible explanation might be that it is the user who is passing all the sentences in a single element to `wer.compute`. As the current implementation iterates over the elements of `predictions` and `references`, this could be the case, for example, with a single string containing all the sentences:

```python
result["predicted"] = "One sentence. Other sentence."
```

or with a doubly nested list of sentence lists:

```python
result["predicted"] = [[["One sentence."], ["Other sentence"]]]
```

The user should check the dimensions of the data structure passed to `wer.compute`.
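A quick check along these lines might look as follows (a hedged sketch; the helper name and the `result` dict are illustrative, not part of the `datasets` API):

```python
def check_flat_strings(name, seq):
    # Each element should be a single sentence string; a nested list, or one
    # blob containing every sentence, makes per-pair WER meaningless or
    # blows up memory downstream.
    if not isinstance(seq, list):
        raise TypeError(f"{name} should be a list of strings")
    for i, item in enumerate(seq):
        if not isinstance(item, str):
            raise TypeError(f"{name}[{i}] is {type(item).__name__}, expected str")

result = {"predicted": ["One sentence.", "Other sentence"]}
check_flat_strings("predictions", result["predicted"])  # passes silently
```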
Hi all, in my case I was using an older version of `datasets` and, as @albertvillanova points out, passing the full list of sentences for the metric calculation. The problem was in the way jiwer implements WER, as it tries to compute WER for the full list at once instead of doing it element-wise. I think that with the latest implementation of `datasets`, or by using the alternative WER function that I've contributed in this pull request, there shouldn't be memory errors.
@lhoestq I was using datasets==1.5.0; with 1.6.1 it worked (at least the first run), but 1.5.0 is not compatible with my preprocessing. I can't save my dataset to a parquet file while using the latest datasets version ->
if I do
while using 1.6.1, and it's working with 1.5.0
Hi! You can pass `dataset.data.table` instead of `dataset.data` to `pq.write_table`.
This seems to be working so far! Thanks!
Hi, I'm trying to follow the ASR example to try Wav2Vec. This is the code that I use for WER calculation:
However, I receive the following exception:
```
Traceback (most recent call last):
  File "/home/diego/IpGlobal/wav2vec/test_wav2vec.py", line 51, in <module>
    print(wer.compute(predictions=result["predicted"], references=result["target"]))
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/datasets/metric.py", line 403, in compute
    output = self._compute(predictions=predictions, references=references, **kwargs)
  File "/home/diego/.cache/huggingface/modules/datasets_modules/metrics/wer/73b2d32b723b7fb8f204d785c00980ae4d937f12a65466f8fdf78706e2951281/wer.py", line 94, in _compute
    return wer(references, predictions)
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 81, in wer
    truth, hypothesis, truth_transform, hypothesis_transform, **kwargs
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 192, in compute_measures
    H, S, D, I = _get_operation_counts(truth, hypothesis)
  File "/home/diego/miniconda3/envs/wav2vec3.6/lib/python3.6/site-packages/jiwer/measures.py", line 273, in _get_operation_counts
    editops = Levenshtein.editops(source_string, destination_string)
MemoryError
```
My system has more than 10GB of available RAM. Looking at the code, I think that it could be related to the way jiwer does the calculation, as it pastes all the sentences into a single string before calling the Levenshtein `editops` function.
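To see why this fails even with more than 10 GB of free RAM: an edit-distance computation over one concatenated string needs memory quadratic in its length. A back-of-the-envelope sketch (the corpus sizes and per-cell cost are illustrative assumptions, not measurements from this report):

```python
# Rough memory estimate for a full edit-distance matrix over one giant
# concatenated string, versus per-sentence computation. Sizes below are
# illustrative assumptions.
n_sentences = 20_000
avg_chars = 100  # average characters per sentence

concat_len = n_sentences * avg_chars           # one 2,000,000-char string
cell_bytes = 4                                 # e.g. 32-bit cost entries
full_matrix_gb = concat_len**2 * cell_bytes / 1e9
per_sentence_mb = avg_chars**2 * cell_bytes / 1e6

print(f"full matrix: ~{full_matrix_gb:,.0f} GB")   # quadratic blow-up
print(f"per sentence: ~{per_sentence_mb:.2f} MB")  # trivial
```

Even if `Levenshtein.editops` is smarter than a dense matrix, anything quadratic in the concatenated length dwarfs 10 GB, while the per-sentence cost is negligible.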