
evaluate vs test #222

Open

IreneSucameli opened this issue Dec 6, 2021 · 5 comments

@IreneSucameli

Hi, the difference between the evaluate.py and test.py scripts in the NLU folder is not clear to me. What does each of them evaluate? The results they produce are completely different even when they take the same test set as input.

@zqwerty
Member

zqwerty commented Dec 6, 2021

evaluate.py uses the unified interface inherited from NLU, e.g., class BERTNLU(NLU). Each NLU model should provide such a class so that we can compare different models on the same inputs. test.py is the test script for BERTNLU only, and it may use different preprocessing. However, the difference should not be large.
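For readers landing here later, a minimal sketch of what that unified-interface idea looks like in spirit. The class and method bodies below are simplified stand-ins, not the actual ConvLab-2 code; the real BERTNLU loads a fine-tuned BERT model.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

# A dialog act is an (intent, domain, slot, value) tuple throughout this thread.
DialogAct = Tuple[str, str, str, str]

class NLU(ABC):
    """Unified interface: every NLU model maps an utterance to dialog acts."""
    @abstractmethod
    def predict(self, utterance: str, context: Optional[List[str]] = None) -> List[DialogAct]:
        ...

class ToyBERTNLU(NLU):
    """Stand-in for BERTNLU(NLU), just to show the shared interface."""
    def predict(self, utterance: str, context: Optional[List[str]] = None) -> List[DialogAct]:
        if "cheap" in utterance.lower():
            return [("Inform", "Restaurant", "Price", "cheap")]
        return []

# An evaluation script only needs predict(), so any model exposing it can be compared.
print(ToyBERTNLU().predict("I want a cheap restaurant"))
```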

@IreneSucameli
Author

OK, so evaluate.py is used to compare the performance of different NLU models, while if I want to test only BERTNLU I should use test.py?
It is not clear to me why test.py calls the functions is_slot_da, calculate_F1 and recover_intent while evaluate.py doesn't. On what basis is the overall performance computed in evaluate.py, if neither slots nor intents are recovered? Thanks

@zqwerty
Member

zqwerty commented Dec 7, 2021

Yes, evaluate.py is used to compare the performance of different NLU models. It will be slower than test.py since it uses batch_size=1. If you only want to test BERTNLU (e.g., to tune some hyper-parameters) and do not need to compare with other NLU models, you can use test.py for verification. The difference should not be large.

evaluate.py will call recover_intent in BERTNLU: https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/nlu/jointBERT/multiwoz/nlu.py#L106. And calculate_F1 will be called by both evaluate.py and test.py -> https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/nlu/jointBERT/multiwoz/postprocess.py#L13

is_slot_da decides whether a dialog act (intent, domain, slot, value) is non-categorical, which means the value appears in the sentence and we use the slot-tagging method in BERTNLU to extract it (e.g., informing the name of a restaurant). If is_slot_da is False, we use the [CLS] token to do binary classification that judges whether such a dialog act exists (e.g., requesting the name of a restaurant). We evaluate the two kinds of dialog acts separately and report slot F1 and intent F1 respectively. However, these metrics may not apply to other NLU models, such as a generative model, so they are not included in evaluate.py.
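A toy illustration of that split. The real is_slot_da() in postprocess.py consults the MultiWOZ annotation scheme; the rule below is a made-up placeholder only meant to show how predictions get routed to the two heads and the two F1 scores.

```python
# (intent, domain, slot, value) tuples, as in the rest of this thread.
predictions = [
    ("Inform", "Restaurant", "Name", "Pizza Hut"),  # value appears in the sentence
    ("Request", "Restaurant", "Phone", "?"),        # existence only, no span to tag
]

def toy_is_slot_da(da):
    """Placeholder rule: treat the act as non-categorical when the value is a free span.

    The real check uses the dataset's ontology, not this heuristic.
    """
    intent, domain, slot, value = da
    return value not in ("?", "none", "yes", "no")

slot_das = [da for da in predictions if toy_is_slot_da(da)]        # slot-tagging head -> slot F1
intent_das = [da for da in predictions if not toy_is_slot_da(da)]  # [CLS] classification head -> intent F1
print(slot_das)
print(intent_das)
```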

In evaluate.py we directly evaluate the overall dialog act F1 by comparing two lists of (intent, domain, slot, value) tuples.
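Concretely, that overall dialog-act F1 can be sketched as below. This is a simplified re-implementation for illustration, not the exact code in evaluate.py.

```python
def dialog_act_f1(golds, preds):
    """Micro-averaged P/R/F1 over (intent, domain, slot, value) tuples, per turn."""
    tp = fp = fn = 0
    for gold, pred in zip(golds, preds):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # predicted tuples that match the gold annotation
        fp += len(pred_set - gold_set)   # predicted tuples with no gold match
        fn += len(gold_set - pred_set)   # gold tuples the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [[("Inform", "Restaurant", "Price", "cheap")]]
pred = [[("Inform", "Restaurant", "Price", "cheap"), ("Request", "Restaurant", "Area", "?")]]
print(dialog_act_f1(gold, pred))  # (0.5, 1.0, 0.666...)
```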

@IreneSucameli
Author

IreneSucameli commented Apr 15, 2022

Hi, thanks for your reply.
Could you please specify whether the recall, precision, and F1 scores in test.py are micro-averaged?

@zqwerty
Member

zqwerty commented Apr 18, 2022

Yes, they are micro-averaged: TP, FP, and FN are accumulated across the whole test set before precision, recall, and F1 are computed.
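In other words, a small sketch with made-up per-turn counts, just to spell out the micro-averaging arithmetic:

```python
# (tp, fp, fn) per test-set turn; the numbers are invented for illustration.
per_turn_counts = [(2, 0, 1), (0, 1, 0), (3, 1, 1)]

# Micro-averaging: sum the counts over the whole test set, then compute the
# metrics once. (Macro-averaging would instead average per-turn F1 scores.)
tp = sum(t for t, _, _ in per_turn_counts)
fp = sum(f for _, f, _ in per_turn_counts)
fn = sum(n for _, _, n in per_turn_counts)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.714..., 0.714..., 0.714...
```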
