layout | title | parent |
---|---|---|
default | Bicleaner | Data cleaning |
Bicleaner AI is a tool that detects noisy sentence pairs in a parallel corpus. The classifier scores each sentence pair from 0 to 1, where 0 means a very noisy translation and 1 means a good translation. If a specialized model for a language pair is not available, it will fall back to downloading a multilingual en-xx model.
For supported languages see:
The configuration specifies a default threshold and a per-dataset threshold. A sentence pair will be kept if its score is above the given threshold.
- `0.5` should be a good default value.
- Increase the threshold for noisier datasets.
- Set the threshold to `0` to skip cleaning entirely.
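The keep/drop rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: `filter_pairs` and the `(score, source, target)` tuple layout are hypothetical, and treating a threshold of `0` as "keep everything" is an assumption based on the description above.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical layout: (score, source sentence, target sentence)
ScoredPair = Tuple[float, str, str]


def filter_pairs(pairs: Iterable[ScoredPair], threshold: float) -> Iterator[ScoredPair]:
    """Yield the pairs whose score is above the threshold."""
    for pair in pairs:
        score = pair[0]
        # Assumption: a threshold of 0 disables filtering, so every pair is kept.
        if threshold == 0 or score > threshold:
            yield pair


scored = [
    (0.93, "Good morning.", "Bom dia."),
    (0.12, "Click here to subscribe", "Bom dia."),
    (0.61, "Thank you.", "Obrigado."),
]

kept = list(filter_pairs(scored, threshold=0.5))
print([score for score, _, _ in kept])  # [0.93, 0.61]
```

Whether the real comparison is strict (`>`) or inclusive (`>=`) is not stated here, so the strict form is only one reading of "above the given threshold".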
Data set | Threshold | Reason |
---|---|---|
OpenSubtitles | 0.8 | This is a noisier dataset |
ParaCrawl | 0 | This dataset has already been cleaned by Bicleaner. See Bicleaner AI: Bicleaner Goes Neural, section 4.2.2 |
bicleaner:
  default-threshold: 0.5
  dataset-thresholds:
    opus_CCAligned/v1: 0.7
    opus_OpenSubtitles/v2018: 0.8
    opus_ParaCrawl/v9: 0
    ...
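The lookup implied by this configuration, a per-dataset value that falls back to the default, might look like the following sketch. The dictionary mirrors the configuration above; `resolve_threshold` is a hypothetical helper for illustration, not a function from the pipeline.

```python
# Plain-dict equivalent of the configuration shown above.
config = {
    "bicleaner": {
        "default-threshold": 0.5,
        "dataset-thresholds": {
            "opus_CCAligned/v1": 0.7,
            "opus_OpenSubtitles/v2018": 0.8,
            "opus_ParaCrawl/v9": 0,
        },
    }
}


def resolve_threshold(config: dict, dataset: str) -> float:
    """Return the dataset-specific threshold, or the default if none is set."""
    bicleaner = config["bicleaner"]
    per_dataset = bicleaner.get("dataset-thresholds", {})
    # Explicit membership test, so a configured 0 is not mistaken for "missing".
    if dataset in per_dataset:
        return per_dataset[dataset]
    return bicleaner["default-threshold"]


print(resolve_threshold(config, "opus_OpenSubtitles/v2018"))  # 0.8
print(resolve_threshold(config, "opus_ParaCrawl/v9"))         # 0
# "opus_Example/v1" is a made-up dataset name that has no override.
print(resolve_threshold(config, "opus_Example/v1"))           # 0.5
```

The membership test matters because `0` is a meaningful setting; a shortcut such as `per_dataset.get(dataset) or default` would silently replace it with the default.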
In the current implementation, an appropriate model is downloaded for a language pair on demand and then cached.
For example, for `en-ru` the multilingual `en-xx` model will be downloaded, since a dedicated model is not available for this pair.
For `pt-en`, the `en-pt` model will be downloaded, since all the models have English as the first language.
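The selection logic described above can be sketched as follows. This is an illustration under stated assumptions, not the code in the pipeline: `select_model` and the `available` set are hypothetical, and the sketch assumes that one side of the requested pair is always English.

```python
from typing import Set


def select_model(src: str, trg: str, available: Set[str]) -> str:
    """Pick a model name for the requested pair (illustrative names only)."""
    if "en" not in (src, trg):
        # Assumption: pairs without English are out of scope for this sketch.
        raise ValueError(f"expected English on one side, got {src}-{trg}")
    # All models have English first, so look up the pair in en-xx order.
    other = trg if src == "en" else src
    candidate = f"en-{other}"
    if candidate in available:
        return candidate
    # No dedicated model: fall back to the multilingual one.
    return "en-xx"


# Hypothetical set of dedicated models; the real list lives in the model repository.
available = {"en-pt"}

print(select_model("en", "ru", available))  # en-xx
print(select_model("pt", "en", available))  # en-pt
```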
The downloaded model will be cached in Taskcluster under the requested language pair (`en-ru`, `pt-en`).
If a new model is added to the Hugging Face repo, it would be a good idea to invalidate the caches manually by editing `pipeline/bicleaner/download_pack.py`.
We do not do this automatically in the current implementation. We will rethink this strategy if this happens often.