Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8 replicate demetr results for bleu #10

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

klh5
Copy link
Collaborator

@klh5 klh5 commented Dec 6, 2024

Closes #8

Runs BLEU, SacreBLEU, BLASER 2.0, and COMET over the DEMETR dataset.

@klh5 klh5 requested a review from jack89roberts December 6, 2024 15:30
@klh5 klh5 linked an issue Dec 6, 2024 that may be closed by this pull request
3 tasks
Copy link
Collaborator

@jack89roberts jack89roberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments, mostly around some ideas for refactoring that might make things a bit easier to extend to different datasets/metrics, though I realised later that it's not as straightforward to have a common metrics interface as I thought so I'm open to different ideas/not spending time going down that route.

The actions are failing with linting and mypy errors, setting up pre-commit or running ruff locally should fix the linting ones (mostly it just doesn't like a blank newline at the start of functions), and adding ignore_missing_imports = true to the mypy section in pyproject.toml should get mypy to quieten down (e.g. see here).


def __init__(self) -> None:
self.blaser_ref = load_blaser_model("blaser_2_0_ref").eval()
self.blaser_qe = load_blaser_model("blaser_2_0_qe").eval()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For BLASER and COMET would it make more sense for the ref and qe versions to be different classes?

We could then also have a common Metrics parent class that's always expected to have a function get_score(self, reference, prediction, source)/similar, with some being optional/ignored depending on the metric, which might make some things a bit easier to extend later (though it doesn't look like we'll be using many more metrics).

if "BLASER_ref" in metrics_to_use or "BLASER_qe" in metrics_to_use:
self.blaser = BLASERScore()
if "COMET_ref" in metrics_to_use or "COMET_qe" in metrics_to_use:
self.comet = COMETScore()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also re having a common Metrics class - ProcessDEMETR could then just have a list of metrics classes, and the loop in process_demetr_category could be more like:

for metric in self.metrics:
     metric.get_score(ref_txt, mt_txt, src_txt) 

without the need for the big if block.

I hadn't considered BLASER also needing a lang code though, so get_score would need to support kwargs somehow too and it's messier than I thought (incidentally, it's interesting that BLASER requires this but COMET doesn't).

import os


def load_json(json_path: os.PathLike | str) -> list:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return type could also be dict


if ds_cat in cats_to_process or not cats_to_process:
print(f"Processing input file {ds}")
reverse_acc = ds_cat == 35
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment would be helpful here to explain why category 35 needs to be reversed.

@@ -10,7 +10,7 @@ authors = [
]
description = "Evaluation of Metrics for Speech Translation (M4ST)"
readme = "README.md"
requires-python = ">=3.10"
requires-python = "==3.11.*"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is one of the dependencies forcing this or can it be more flexible?


src_embs = self.text_embedder.predict([source], source_lang=source_lang_code)
ref_embs = self.text_embedder.predict([reference], source_lang="eng_Latn")
mt_embs = self.text_embedder.predict([prediction], source_lang="eng_Latn")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the prediction & reference are English is probably fine for what we're doing, but I'd be slightly paranoid about forgetting to change the language here if we do try anything in a different language. So we could consider making reference and mt language arguments/class attributes that have to be set when creating an instance or calling the score, so it's more explicit that we're assuming English when we use it? Or otherwise just a comment/note somewhere that makes it clear we're assuming English.

@klh5
Copy link
Collaborator Author

klh5 commented Dec 13, 2024

On hold while I also implement #13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replicate DEMETR results for BLEU
2 participants