
evaluate vs test #222

Open

IreneSucameli opened this issue Dec 6, 2021 · 5 comments

@IreneSucameli

Hi, the difference between the evaluate.py and test.py scripts in the NLU folder is not clear to me. What does each of them evaluate? The results they produce are completely different even when they take the same test set as input.

@zqwerty
Member

zqwerty commented Dec 6, 2021

evaluate.py uses the unified interface inherited from NLU, e.g., class BERTNLU(NLU). Each NLU model should provide such a class so that we can compare different models on the same inputs. test.py is the test script for BERTNLU only, and it may use different preprocessing. However, the difference should not be large.
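For readers landing here later, a minimal sketch of what that unified-interface idea looks like in spirit. The class and method bodies below are simplified stand-ins, not the actual ConvLab-2 code; the real BERTNLU loads a fine-tuned BERT model.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

# A dialog act is an (intent, domain, slot, value) tuple throughout this thread.
DialogAct = Tuple[str, str, str, str]

class NLU(ABC):
    """Unified interface: every NLU model maps an utterance to dialog acts."""
    @abstractmethod
    def predict(self, utterance: str, context: Optional[List[str]] = None) -> List[DialogAct]:
        ...

class ToyBERTNLU(NLU):
    """Stand-in for BERTNLU(NLU), just to show the shared interface."""
    def predict(self, utterance: str, context: Optional[List[str]] = None) -> List[DialogAct]:
        if "cheap" in utterance.lower():
            return [("Inform", "Restaurant", "Price", "cheap")]
        return []

# An evaluation script only needs predict(), so any model exposing it can be compared.
print(ToyBERTNLU().predict("I want a cheap restaurant"))
```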

@IreneSucameli
Author

OK, so evaluate.py is used to compare the performance of different NLU models, while if I want to test only BERTNLU I should use test.py?
It is not clear to me why test.py calls the functions is_slot_da, calculate_F1 and recover_intent while evaluate.py doesn't. On what basis is the overall performance computed in evaluate.py, if neither slots nor intents are recovered? Thanks

@zqwerty
Member

zqwerty commented Dec 7, 2021

Yes, evaluate.py is used to compare the performance of different NLU models. It will be slower than test.py since it uses batch_size=1. If you only want to test BERTNLU (e.g., to tune some hyper-parameters) and do not need to compare with other NLU models, you can use test.py for verification. The difference should not be large.

evaluate.py will call recover_intent in BERTNLU: https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/nlu/jointBERT/multiwoz/nlu.py#L106. And calculate_F1 will be called by both evaluate.py and test.py -> https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/nlu/jointBERT/multiwoz/postprocess.py#L13

is_slot_da decides whether a dialog act (intent, domain, slot, value) is non-categorical, which means the value appears in the sentence and we use the slot-tagging method in BERTNLU to extract it (e.g., informing the name of a restaurant). If is_slot_da is False, we use the [CLS] token to do binary classification that judges whether such a dialog act exists (e.g., requesting the name of a restaurant). We evaluate the two kinds of dialog acts separately and report slot F1 and intent F1 respectively. However, these metrics may not apply to other NLU models, such as a generative model, so they are not included in evaluate.py.
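A toy illustration of that split. The real is_slot_da() in postprocess.py consults the MultiWOZ annotation scheme; the rule below is a made-up placeholder only meant to show how predictions get routed to the two heads and the two F1 scores.

```python
# (intent, domain, slot, value) tuples, as in the rest of this thread.
predictions = [
    ("Inform", "Restaurant", "Name", "Pizza Hut"),  # value appears in the sentence
    ("Request", "Restaurant", "Phone", "?"),        # existence only, no span to tag
]

def toy_is_slot_da(da):
    """Placeholder rule: treat the act as non-categorical when the value is a free span.

    The real check uses the dataset's ontology, not this heuristic.
    """
    intent, domain, slot, value = da
    return value not in ("?", "none", "yes", "no")

slot_das = [da for da in predictions if toy_is_slot_da(da)]        # slot-tagging head -> slot F1
intent_das = [da for da in predictions if not toy_is_slot_da(da)]  # [CLS] classification head -> intent F1
print(slot_das)
print(intent_das)
```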

In evaluate.py we directly evaluate the overall dialog act F1 by comparing two lists of (intent, domain, slot, value) tuples.
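Concretely, that overall dialog-act F1 can be sketched as below. This is a simplified re-implementation for illustration, not the exact code in evaluate.py.

```python
def dialog_act_f1(golds, preds):
    """Micro-averaged P/R/F1 over (intent, domain, slot, value) tuples, per turn."""
    tp = fp = fn = 0
    for gold, pred in zip(golds, preds):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # predicted tuples that match the gold annotation
        fp += len(pred_set - gold_set)   # predicted tuples with no gold match
        fn += len(gold_set - pred_set)   # gold tuples the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [[("Inform", "Restaurant", "Price", "cheap")]]
pred = [[("Inform", "Restaurant", "Price", "cheap"), ("Request", "Restaurant", "Area", "?")]]
print(dialog_act_f1(gold, pred))  # (0.5, 1.0, 0.666...)
```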

@IreneSucameli
Author

IreneSucameli commented Apr 15, 2022

Hi, thanks for your reply.
Could you please specify whether the recall, precision, and F1 scores in test.py are micro-averaged?

@zqwerty
Member

zqwerty commented Apr 18, 2022

Yes, they are micro-averaged: TP, FP, and FN are accumulated across the whole test set before precision, recall, and F1 are computed.
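In other words, a small sketch with made-up per-turn counts, just to spell out the micro-averaging arithmetic:

```python
# (tp, fp, fn) per test-set turn; the numbers are invented for illustration.
per_turn_counts = [(2, 0, 1), (0, 1, 0), (3, 1, 1)]

# Micro-averaging: sum the counts over the whole test set, then compute the
# metrics once. (Macro-averaging would instead average per-turn F1 scores.)
tp = sum(t for t, _, _ in per_turn_counts)
fp = sum(f for _, f, _ in per_turn_counts)
fn = sum(n for _, _, n in per_turn_counts)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.714..., 0.714..., 0.714...
```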
