Repository for the project translating the Social IQA dataset into Portuguese, developed for the IA024 class at Unicamp.

fabiograssiotto/SocialIQA_pt

SocialIQA_pt

Repository for the project that translates the Social IQA dataset into Portuguese.

Translation Workflow:

(Workflow diagram; the image is stored in the \images folder.)

Workflow notebooks:
Step I - read_dataset_en.ipynb - Reads the source-language (en) dataset in JSONL format.
Step II - translator_* - Machine-translates the dataset using Hugging Face models.
Steps III/IV - evaluator_gemba_*.ipynb - Evaluates the translations using a modified GEMBA technique.
Step V - publish_dataset_pt.ipynb - Publishes the target dataset in JSONL format.
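
The steps above can be sketched in plain Python to show how data moves through the workflow. This is a minimal illustration, not the code in the notebooks. The field names in `TEXT_FIELDS`, the `translate` callable, and the wording of the evaluation prompt are all assumptions; in particular, the prompt only follows the general direct-assessment idea behind GEMBA and is not the modified prompt used in this repo.

```python
import json
import re
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional

# Assumed Social IQA text fields; adjust to the actual JSONL schema.
TEXT_FIELDS = ("context", "question", "answerA", "answerB", "answerC")


def read_jsonl(path: Path) -> List[Dict]:
    """Step I: read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def translate_records(
    records: Iterable[Dict], translate: Callable[[str], str]
) -> List[Dict]:
    """Step II: apply `translate` to each text field; other keys are kept as-is.

    `translate` is a hypothetical callable wrapping whichever model is used.
    """
    out = []
    for rec in records:
        new = dict(rec)
        for field in TEXT_FIELDS:
            if field in new:
                new[field] = translate(new[field])
        out.append(new)
    return out


def build_eval_prompt(source: str, translation: str) -> str:
    """Steps III/IV: a GEMBA-style scoring prompt (hypothetical wording)."""
    return (
        "Score the following translation from English to Portuguese "
        "on a scale from 0 to 100.\n"
        f'English source: "{source}"\n'
        f'Portuguese translation: "{translation}"\n'
        "Score: "
    )


def parse_score(reply: str) -> Optional[int]:
    """Return the first integer in the reply if it lies in 0..100, else None."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


def write_jsonl(records: Iterable[Dict], path: Path) -> None:
    """Step V: write one JSON object per line, keeping accented characters."""
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Passing `ensure_ascii=False` keeps Portuguese characters such as "ç" and "ã" readable in the output file instead of escaping them.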

Utility notebooks:
splitter_training_set.ipynb - Splits the training set into parts to stay within OpenAI rate limits during the GEMBA evaluation.
merger_training_set.ipynb - Merges the training set parts after the GEMBA evaluation.
metrics.ipynb - Plots translation metrics.
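
The split/merge idea behind the two training-set notebooks can be sketched as below. This is an illustration only: the function names and the fixed `chunk_size` are assumptions, and the notebooks may partition the data differently.

```python
from typing import Dict, Iterable, Iterator, List, Sequence


def split_into_chunks(
    records: Sequence[Dict], chunk_size: int
) -> Iterator[List[Dict]]:
    """Yield consecutive chunks of at most `chunk_size` records."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(records), chunk_size):
        yield list(records[start:start + chunk_size])


def merge_chunks(chunks: Iterable[List[Dict]]) -> List[Dict]:
    """Concatenate evaluated chunks back together in their original order."""
    merged: List[Dict] = []
    for chunk in chunks:
        merged.extend(chunk)
    return merged
```

Each chunk can then be evaluated in its own run, spacing requests out over time, and the results merged once every chunk has been scored.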

Data folders:
\data - Folder for source language data in CSV format.
\dataset_en - Source language dataset folder (en).
\dataset_pt - Target language dataset folder (pt).
\images - For the workflow image.
\rankings - Storage for GEMBA translation evaluations.
\translated - Storage for temporary translated strings.

The resulting Portuguese (pt) dataset is available at https://huggingface.co/datasets/fabiogr/social_i_qa_pt
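
A minimal sketch of loading the published dataset with the Hugging Face `datasets` library follows. The dataset id is taken from the URL above; the assumption that it loads with default settings (no configuration name or extra arguments) is not confirmed here, and the available splits should be checked on the dataset page.

```python
# Dataset id taken from the Hugging Face URL above.
DATASET_ID = "fabiogr/social_i_qa_pt"


def load_social_iqa_pt():
    """Download and return the dataset (requires `pip install datasets`).

    Assumes the default configuration is enough; a configuration name
    or extra arguments may be needed in practice.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(DATASET_ID)
```

Calling `load_social_iqa_pt()` needs network access. Printing the returned object lists the splits and their columns.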
