0.3.1 Quality improvements

eu9ene released this 06 Dec 23:09
· 349 commits to main since this release
3b3f33b
  • Update teacher hyperparameters
  • Use the chrF metric to select the best model (https://discourse.translatelocally.com/t/marian-configuration-to-use/24)
  • Update SacreBLEU and add the chrF metric to evaluation
  • Add evaluation of each model and the whole ensemble of teachers.
  • Stop early based on ce-mean-words instead of bleu-detok
  • Continue teacher training on parallel data only (train on augmented data for N epochs first)
  • Do cleaning per dataset
  • Add per-dataset fixes from https://github.com/ZJaume/clean/tree/master/fixes
  • Use bicleaner per dataset with customizable thresholds
  • Remove punctuation normalization
  • Add alphabets for more languages in the cleaning scripts
  • Replace absolute paths with relative ones
  • Add Snakemake cross-workflow caching. Caching works, but Snakemake apparently has a bug: it does not recognize symlinks after caching. Disabled for now.
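
The per-dataset bicleaner thresholds mentioned above can be pictured as a lookup with a default fallback. The following is only a minimal sketch of that idea: the dataset names, threshold values, and function names are hypothetical and are not the pipeline's actual configuration keys or code.

```python
# Hypothetical illustration only: dataset names, threshold values, and
# function names are assumptions, not the pipeline's real configuration.
DEFAULT_THRESHOLD = 0.5

# Per-dataset overrides; datasets not listed here fall back to the default.
DATASET_THRESHOLDS = {
    "example_web_crawl": 0.7,    # noisy source, filter more strictly
    "example_curated_set": 0.0,  # trusted source, keep everything
}


def threshold_for(dataset):
    """Return the score threshold to apply to the given dataset."""
    return DATASET_THRESHOLDS.get(dataset, DEFAULT_THRESHOLD)


def filter_scored(pairs, dataset):
    """Keep (source, target) pairs whose score meets the dataset threshold.

    `pairs` is an iterable of (source, target, score) triples, where the
    score is assumed to already have been computed by the classifier.
    """
    threshold = threshold_for(dataset)
    return [(src, trg) for src, trg, score in pairs if score >= threshold]
```

For example, `filter_scored([("a", "b", 0.6)], "example_web_crawl")` drops the pair because 0.6 is below the stricter 0.7 override, while the same pair passes for a dataset that uses the default threshold.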