8 replicate demetr results for bleu #10
base: main
Conversation
Added a few comments, mostly around some ideas for refactoring that might make things a bit easier to extend to different datasets/metrics, though I realised later that a common metrics interface is not as straightforward as I thought, so I'm open to different ideas / not spending time going down that route.

The actions are failing with linting and mypy errors. Setting up pre-commit or running ruff locally should fix the linting ones (mostly it just doesn't like a blank newline at the start of functions), and adding `ignore_missing_imports = true` to the mypy section in pyproject.toml should get mypy to quieten down (e.g. see here).
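A minimal sketch of that change, assuming the `[tool.mypy]` section already exists (or is added) in pyproject.toml:

```toml
[tool.mypy]
# Silence errors about third-party packages that ship without type stubs
ignore_missing_imports = true
```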
src/m4st/metrics.py
Outdated

```python
def __init__(self) -> None:
    self.blaser_ref = load_blaser_model("blaser_2_0_ref").eval()
    self.blaser_qe = load_blaser_model("blaser_2_0_qe").eval()
```
For BLASER and COMET, would it make more sense for the ref and qe versions to be different classes? We could then also have a common `Metrics` parent class that's always expected to have a function `get_score(self, reference, prediction, source)` or similar, with some arguments being optional/ignored depending on the metric. That might make some things a bit easier to extend later (though it doesn't look like we'll be using many more metrics).
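A rough sketch of what that interface could look like; the class and method names here are illustrative, and the SacreBLEU subclass is just an example of a metric that ignores `source`:

```python
from abc import ABC, abstractmethod

from sacrebleu.metrics import BLEU


class Metric(ABC):
    """Common interface: score a (reference, prediction, source) triple."""

    @abstractmethod
    def get_score(self, reference: str, prediction: str, source: str) -> float: ...


class SacreBLEUScore(Metric):
    def __init__(self) -> None:
        # effective_order avoids zero scores on short sentence-level inputs
        self.bleu = BLEU(effective_order=True)

    def get_score(self, reference: str, prediction: str, source: str) -> float:
        # source is ignored: BLEU only compares prediction against reference
        return self.bleu.sentence_score(prediction, [reference]).score
```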
src/m4st/process_demetr.py
Outdated

```python
if "BLASER_ref" in metrics_to_use or "BLASER_qe" in metrics_to_use:
    self.blaser = BLASERScore()
if "COMET_ref" in metrics_to_use or "COMET_qe" in metrics_to_use:
    self.comet = COMETScore()
```
Also re having a common `Metrics` class: `ProcessDEMETR` could then just have a list of metric instances, and the loop in `process_demetr_category` could be more like:

```python
for metric in self.metrics:
    metric.get_score(ref_txt, mt_txt, src_txt)
```

without the need for the big `if` block.

I hadn't considered BLASER also needing a lang code though, so `get_score` would need to support kwargs somehow too, and it's messier than I thought (incidentally, it's interesting that BLASER requires this but COMET doesn't).
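One hedged option for the lang code problem, with illustrative names: let `get_score` accept `**kwargs` in the base class, so a metric that needs extra inputs can pull them out while the others simply never look at them.

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    @abstractmethod
    def get_score(
        self, reference: str, prediction: str, source: str, **kwargs: str
    ) -> float: ...


class BLASERRefScore(Metric):
    def get_score(
        self, reference: str, prediction: str, source: str, **kwargs: str
    ) -> float:
        # BLASER is the only metric that needs the source language, so it
        # reads it from kwargs; other metrics ignore kwargs entirely.
        source_lang_code = kwargs["source_lang_code"]
        ...  # embed the texts and run the BLASER model


# The loop stays uniform, always passing everything a metric might need:
# for metric in self.metrics:
#     metric.get_score(ref_txt, mt_txt, src_txt, source_lang_code=lang_code)
```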
```python
import os


def load_json(json_path: os.PathLike | str) -> list:
```
Return type could also be `dict`.
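For example, assuming the function is a thin `json.load` wrapper, the annotation could be broadened to:

```python
import json
import os


def load_json(json_path: os.PathLike | str) -> list | dict:
    with open(json_path) as f:
        return json.load(f)
```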
```python
if ds_cat in cats_to_process or not cats_to_process:
    print(f"Processing input file {ds}")
    reverse_acc = ds_cat == 35
```
A comment would be helpful here to explain why category 35 needs to be reversed.
pyproject.toml

```diff
@@ -10,7 +10,7 @@ authors = [
 ]
 description = "Evaluation of Metrics for Speech Translation (M4ST)"
 readme = "README.md"
-requires-python = ">=3.10"
+requires-python = "==3.11.*"
```
Is one of the dependencies forcing this or can it be more flexible?
src/m4st/metrics.py
Outdated

```python
src_embs = self.text_embedder.predict([source], source_lang=source_lang_code)
ref_embs = self.text_embedder.predict([reference], source_lang="eng_Latn")
mt_embs = self.text_embedder.predict([prediction], source_lang="eng_Latn")
```
Assuming the prediction and reference are English is probably fine for what we're doing, but I'd be slightly paranoid about forgetting to change the language here if we do try anything in a different language. We could consider making the reference and MT languages arguments/class attributes that have to be set when creating an instance or calling the score, so it's more explicit that we're assuming English when we use it. Otherwise, just a comment/note somewhere that makes the English assumption clear.
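A sketch of the explicit-language option, with illustrative names (the text embedder is passed in here only to keep the sketch self-contained):

```python
class BLASERRefScore:
    def __init__(self, text_embedder, ref_lang_code: str = "eng_Latn") -> None:
        self.text_embedder = text_embedder
        # English by default, but now visible and overridable at construction
        self.ref_lang_code = ref_lang_code

    def get_score(
        self, reference: str, prediction: str, source: str, source_lang_code: str
    ) -> float:
        src_embs = self.text_embedder.predict([source], source_lang=source_lang_code)
        ref_embs = self.text_embedder.predict([reference], source_lang=self.ref_lang_code)
        mt_embs = self.text_embedder.predict([prediction], source_lang=self.ref_lang_code)
        ...  # feed the embeddings to the BLASER model and return its score
```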
On hold while I also implement #13
Closes #8
Runs BLEU, SacreBLEU, BLASER 2.0, and COMET over the DEMETR dataset.