This repository has been archived by the owner on Jan 9, 2024. It is now read-only.
It would give great flexibility in metrics, but I don't think it has to be implemented in RAGchain.
There are mainly two reasons.
Each metric may have a conventional setting.
We target the RAG workflow and its research, so there must be conventional hyperparameters or setups for each metric, because ideally all benchmarks should be run under the same settings. So it is better that we suggest a conventional setup for each metric.
Users can calculate scores with their own metrics after getting their results.
We return the whole pd.DataFrame that contains the question, answer, gt answer, etc. Users can easily calculate scores with their own metric from this DataFrame. So if someone wants to score their results with a new metric, they can do that easily. (Maybe we can write a guide for that later. I did it once with the Rare F1 metric.)
Plus, I think it would make our evaluator too complicated to use. Sometimes a framework should restrict flexibility for ease of use.
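To make the second reason concrete, here is a minimal sketch of scoring a result DataFrame with a user-defined metric. The column names (`question`, `answer`, `gt_answer`) and the `exact_match` helper are assumptions for illustration only, not the actual RAGchain output schema or API.

```python
import pandas as pd


def exact_match(prediction: str, ground_truth: str) -> float:
    """Return 1.0 if the two strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())


# Stand-in for the DataFrame returned by an evaluation run.
# The column names here are hypothetical.
result_df = pd.DataFrame(
    {
        "question": ["Who wrote Hamlet?", "What is the capital of France?"],
        "answer": ["William Shakespeare", "Lyon"],
        "gt_answer": ["william shakespeare", "Paris"],
    }
)

# Apply the custom metric row by row and store it as a new column.
result_df["exact_match"] = [
    exact_match(pred, gt)
    for pred, gt in zip(result_df["answer"], result_df["gt_answer"])
]

# Aggregate to a single score for the whole run.
mean_em = result_df["exact_match"].mean()
```

Any function that takes a prediction and a ground truth could be swapped in for `exact_match` without touching the evaluator itself.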
I think adding an EM (Exact Match) metric is one of the defined steps.
(This is my conclusion after surveying many benchmarks.)
Actually, I can't identify a setup that I could suggest as conventional.
The BLEU and ROUGE scores wrap the official (maybe...? at least the commonly used) libraries...
They have many variations depending on n or perspective.
This problem clearly needs to be solved by our evaluator.
In my opinion, the evaluator should expose the following functions:
add a custom normalizer and tokenizer
add a metric function to the existing metric list
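The two extension points above could look roughly like the sketch below. Everything in it (`SketchEvaluator`, `add_metric`, the default normalizer and tokenizer, `token_f1`) is a hypothetical name made up for illustration; it is not the existing RAGchain evaluator interface.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

Normalizer = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
MetricFn = Callable[[List[str], List[str]], float]


def default_normalizer(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def default_tokenizer(text: str) -> List[str]:
    """Split on whitespace."""
    return text.split()


def token_f1(pred_tokens: List[str], gt_tokens: List[str]) -> float:
    """Token-level F1 based on multiset overlap."""
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


class SketchEvaluator:
    """Hypothetical evaluator with a pluggable normalizer, tokenizer, and metric list."""

    def __init__(
        self,
        normalizer: Normalizer = default_normalizer,
        tokenizer: Tokenizer = default_tokenizer,
    ) -> None:
        self.normalizer = normalizer
        self.tokenizer = tokenizer
        self.metrics: Dict[str, MetricFn] = {"token_f1": token_f1}

    def add_metric(self, name: str, fn: MetricFn) -> None:
        """Register a user metric next to the built-in ones."""
        self.metrics[name] = fn

    def score(self, prediction: str, ground_truth: str) -> Dict[str, float]:
        """Normalize and tokenize both strings, then run every registered metric."""
        pred = self.tokenizer(self.normalizer(prediction))
        gt = self.tokenizer(self.normalizer(ground_truth))
        return {name: fn(pred, gt) for name, fn in self.metrics.items()}


evaluator = SketchEvaluator()
evaluator.add_metric("exact_match", lambda pred, gt: float(pred == gt))
scores = evaluator.score("The  Eiffel Tower", "the eiffel tower")
```

Because the normalizer and tokenizer are plain callables, a conventional default can still ship with the framework while users override it per run.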
How about a metric_expanded-version..?
Like metric.py, metric_expanded-version would be a collection of various metrics that are not officially(?) accepted.
The name is tentative.
We have a few metrics per metric category, each controllable with a few parameters (for n-gram-based metrics, you can choose n).
More flexible!
That is my opinion.
Do you have more ideas?
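As an illustration of one metric per category with a small parameter, here is a sketch of clipped n-gram precision where the caller picks n. This is only the building block that BLEU-style scores use, not a full BLEU implementation, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(prediction: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision: matched prediction n-grams / all prediction n-grams."""
    pred_counts = ngrams(prediction.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(pred_counts.values())
    if total == 0:
        # The prediction is shorter than n tokens.
        return 0.0
    # Counter intersection clips each n-gram count to its count in the reference.
    matched = sum((pred_counts & ref_counts).values())
    return matched / total


prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
unigram_precision = ngram_precision(prediction, reference, n=1)
bigram_precision = ngram_precision(prediction, reference, n=2)
```

The same function covers the unigram and bigram variants, so a single `n` parameter replaces several near-duplicate metric classes.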