layout: default
title: Bicleaner
parent: Data cleaning

Bicleaner

Bicleaner AI is a tool that detects noisy sentence pairs in a parallel corpus. The classifier scores each sentence pair from 0 to 1, where 0 means a very noisy translation and 1 means a good translation. If a specialized model for a language pair is not available, it falls back to downloading a multilingual en-xx model.

For supported languages see:

How to configure for training

The configuration specifies a default threshold and a per-dataset threshold. A sentence pair will be kept if its score is above the given threshold.

  • 0.5 should be a good default value.
  • Increase the threshold for noisier datasets.
  • Set the threshold to 0 to skip cleaning entirely.

Recommendations for specific datasets

| Dataset | Threshold | Reason |
| --- | --- | --- |
| OpenSubtitles | 0.8 | This is a noisier dataset. |
| ParaCrawl | 0 | This dataset has already been cleaned by Bicleaner. See Bicleaner AI: Bicleaner Goes Neural, section 4.2.2. |

Example config:

  bicleaner:
    default-threshold: 0.5
    dataset-thresholds:
      opus_CCAligned/v1: 0.7
      opus_OpenSubtitles/v2018: 0.8
      opus_ParaCrawl/v9: 0
      ...
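
To illustrate how the default and per-dataset thresholds interact, here is a minimal Python sketch. It is not the pipeline's actual code: the function names `resolve_threshold` and `keep_pair` are hypothetical, the dictionary simply mirrors the example config above, and the strict `>` comparison is an assumption based on the rule that a pair is kept when its score is above the threshold.

```python
# Hypothetical sketch; the values mirror the example config above.
bicleaner_config = {
    "default-threshold": 0.5,
    "dataset-thresholds": {
        "opus_CCAligned/v1": 0.7,
        "opus_OpenSubtitles/v2018": 0.8,
        "opus_ParaCrawl/v9": 0,
    },
}


def resolve_threshold(config: dict, dataset: str) -> float:
    """Return the per-dataset threshold, or the default if none is set."""
    per_dataset = config.get("dataset-thresholds", {})
    return per_dataset.get(dataset, config["default-threshold"])


def keep_pair(score: float, threshold: float) -> bool:
    """Decide whether a scored sentence pair survives cleaning."""
    # A threshold of 0 means cleaning is skipped, so every pair is kept.
    if threshold == 0:
        return True
    # Assumption: "above the threshold" is a strict comparison.
    return score > threshold
```

With this sketch, a dataset that is not listed under `dataset-thresholds` uses 0.5, while `opus_ParaCrawl/v9` resolves to 0 and is passed through unfiltered.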

Models and caching

In the current implementation, an appropriate model is downloaded for a language pair on demand and then cached. For example, for en-ru the multilingual en-xx model will be downloaded, since a dedicated model is not available for this pair. For pt-en, the en-pt model will be downloaded, since all the models have English as the first language.
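
The selection logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the pipeline: `resolve_model_pair` is a hypothetical name, and `available_models` is a made-up stand-in for whatever list of dedicated models the real implementation consults.

```python
def resolve_model_pair(src: str, trg: str, available_models: set) -> str:
    """Pick the model name to download for a requested language pair."""
    # Models list English first, so a pair such as pt-en is flipped to en-pt.
    if trg == "en":
        candidate = f"en-{src}"
    else:
        candidate = f"{src}-{trg}"

    # Use the dedicated model when one exists.
    if candidate in available_models:
        return candidate

    # Otherwise fall back to the multilingual model.
    return "en-xx"


# Hypothetical availability list, for illustration only.
available_models = {"en-pt"}

model_for_pt_en = resolve_model_pair("pt", "en", available_models)
model_for_en_ru = resolve_model_pair("en", "ru", available_models)
```

Under these assumptions, `model_for_pt_en` is `en-pt` and `model_for_en_ru` is `en-xx`, matching the two examples in the text.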

The downloaded model is cached in Taskcluster under the requested language pair (en-ru, pt-en), not under the name of the model that was actually downloaded. If a new model is added to the Hugging Face repo, it is a good idea to invalidate the caches manually by editing pipeline/bicleaner/download_pack.py. We do not do this automatically in the current implementation. We will rethink this strategy if new models are added often.