Prepare non-English SQuAD #1549

yingzwang · 2021-09-28T18:28:24Z

yingzwang
Sep 28, 2021

I saw that you released GermanQuAD. (https://huggingface.co/datasets/deepset/germanquad)
Could you share some details on how you prepared the dataset?
I want to train a monolingual Dutch QA model. I plan to use an existing pre-trained Dutch BERT base model, and prepare some Dutch SQuAD samples myself. I have following ideas,

machine translation of English SQuAD.
human annotation of 500 Dutch question-passage-answer (Q-P-A) triplets in my domain.

If I skip 2), would 1) already be enough?
If I add 2), would 500 Q-P-A samples be enough?
By enough I mean at least better than a multilingual model.

I've tried the multilingual model deepset/xlm-roberta-large-squad2 on my Dutch samples. It works a bit however the performance is not satisfactory, which is expected. I feel that training a monolingual Dutch QA model will be much better. Your advice and tips are much appreciated!

Answered by julian-risch

Sep 30, 2021

Hi @yingzwang Appendix A contains all labeling instructions, yes. As we had a team of annotators, we did regular meetings during the annotation process where we discussed newly annotated samples. For example, we discussed the phrasing of the question and whether we can rephrase the question so that there is less lexical overlap with the answer string. Another example is that we discussed whether the question is self-sufficient or whether there should be some additional information in the question for open-domain question answering.

English-to-Dutch machine translation of SQuAD should already give you quite good results. At first glance, the translations look very good, I agree. However, s…

View full answer

bogdankostic · 2021-09-29T09:18:36Z

bogdankostic
Sep 29, 2021

Hi @yingzwang! We released a paper where we summarize our learnings from creating the GermanQuAD dataset. We saw that using human annotations instead of machine-translated labels results in a better performance and also did some experiments on how the number of training samples affects the performance (see Figure 3).
Also, we released a blogpost about QA on different languages than English.

Let me know if you have further questions :)

0 replies

yingzwang · 2021-09-29T17:33:41Z

yingzwang
Sep 29, 2021
Author

Thanks for sharing the paper @bogdankostic ! At a quick glance I see a lot of useful stuff that's exactly what I'm looking for.
The result of Table 4 is a bit surprising to me. The monolingual model trained on translated SQuAD is even worse than the XLM model trained on English SQuAD! Is the machine translated SQuAD that bad? I manually checked a few machine translated (English-to-Dutch) SQuAD samples, and they look pretty OK, i.e., the translated text is quite accurate and natural. What could seriously go wrong with the machine translated SQuAD, so that the monolingual model cannot learn well?

The performance boost by human annotated GermanQuAD is significant. I'm encouraged to invest more effort preparing more human annotated DutchQuAD :) Also need to explore a bit your annotation tool.

0 replies

yingzwang · 2021-09-29T17:52:06Z

yingzwang
Sep 29, 2021
Author

Also in the paper you mentioned about "detailed labeling instructions". Could you share your complete labeling instructions? Or is it already everything in Appendix A?

0 replies

julian-risch · 2021-09-30T10:07:10Z

julian-risch
Sep 30, 2021
Maintainer

Hi @yingzwang Appendix A contains all labeling instructions, yes. As we had a team of annotators, we did regular meetings during the annotation process where we discussed newly annotated samples. For example, we discussed the phrasing of the question and whether we can rephrase the question so that there is less lexical overlap with the answer string. Another example is that we discussed whether the question is self-sufficient or whether there should be some additional information in the question for open-domain question answering.

English-to-Dutch machine translation of SQuAD should already give you quite good results. At first glance, the translations look very good, I agree. However, sometimes there are small mistakes or the words used in the translation do not really fit the context. Sometimes machine-translation could increase the lexical overlap of the question and the answer and thus simplify the task for the model. That might be one reason why training on machine-translated data is not as good as training on carefully hand-annotated data.

I think 500 hand-annotated samples could already be enough to see an improvement, yes. I would recommend that you try to make these 500 samples not too simple for the model.

Please keep us updated about the progress of your project and don't hesitate to contact us again if there are any questions coming up. Feel free to close the issue for now if there are no further questions at the moment and re-open it later. Good luck with training a monolingual Dutch QA model! 👍

0 replies

yingzwang · 2021-09-30T20:51:45Z

yingzwang
Sep 30, 2021
Author

Thanks for the elaborated answer @julian-risch ! You have really adopted a robust quality control workflow. How large is your annotation team? Not sure if I can afford a similar setup but I'll try.

I found the manual of your annotation tool here. There's one rule saying "Don’t create question that elaborate on context", in page 13. Why is that? A user is very likely to ask a question in that way, especially in a conversation. Why should we avoid it in training data?

Your labeling instructions are insightful. From ~50 annotated samples I've collected so far, some already violated the "no lexical overlap rule" and "prefer short answer over long answer rule". I read your paper just in time :)

0 replies

julian-risch · 2021-10-01T09:38:06Z

julian-risch
Oct 1, 2021
Maintainer

Great to hear that you find our documentation helpful! @yingzwang
Questions shouldn't have too many words in common with the context around the answer because that simplifies the question answering task and simple questions are not very helpful for the training. You are right, that users might ask such questions in an application. That's no problem for the model as it is a simple task to then find the corresponding document and answer. There is no need for training data that represents this relatively simple task.
We had a team of five annotators when we created the GermanQuAD dataset. Having a team of annotators does not only speed up the annotation process. It also helps at formulating more diverse questions and answers. For that reason I would recommend to have at least two annotators.

3 replies

yingzwang Oct 1, 2021
Author

I was referring to this example in the manual (pp. 13) of your annotation tool.

Don’t create question that elaborate on context
○ Example

Don’t: “Considering the acquisition of ABC Inc., how did it impact the balance sheet?”

Do: “How did the acquisition of ABC Inc. impact the balance sheet?”
or “Did the acquisition of ABC Inc. impact the balance sheet?”

I don't understand why the question in Don't is worse for training than the questions in Do. They used same words, only ordering is a bit different. @julian-risch

julian-risch Oct 2, 2021
Maintainer

Thanks for clarifying. I think this example could also be termed "refrain from using subordinate clauses", if I am not mistaken. Maybe @Timoeller can confirm this, please? I would say: Try to phrase the question as concise as possible, without a long description of a situation or some other additional constraints. The reason is that these additions might introduce ambiguity in the question, which makes it unnecessarily hard to find the correct answer. For example, “Considering the acquisition of ABC Inc., how did it impact the balance sheet?” requires coreference resolution: the model needs to infer the meaning of "it".

yingzwang Oct 5, 2021
Author

Okay that makes sense. Subordinate clauses make the question a bit ambiguous which is bad for training. Thanks!

aantti · 2021-10-01T09:45:25Z

aantti
Oct 1, 2021
Collaborator

There's also this recent article we did on annotation https://www.deepset.ai/blog/labeling-data-with-haystack-annotation-tool (and the latest one is about evaluation). Just in case it might be helpful in addition to the documentation pages.

0 replies

julian-risch · 2021-10-01T10:14:19Z

julian-risch
Oct 1, 2021
Maintainer

I just converted this issue to a discussion. @yingzwang you can mark the answer in this discussion rather than having to close the issue. That should also improve searchability for other community members.

0 replies

yingzwang · 2021-10-05T18:14:28Z

yingzwang
Oct 5, 2021
Author

Another question, about preparing machine translated SQuAD.

In your paper you mentioned that you used Facebook's MLQA dataset to warm start the training. Unfortunately MLQA does not contain Dutch. It seems that I have to prepare the machine translated Dutch SQuAD myself. Do you have any suggestions or tips?

I found this repo implementing an translate-align-retrieve method. It seems useful?

2 replies

Timoeller Oct 6, 2021

We did it manually, though at a quick glance the linked repo seems useful.

I remember the biggest problem in the converting English SQuAD to German was finding the correct answer span in the translation, not so much the translation itself. You should check weather the conversion tricks work for your translation system and language. The most common trick is adding quotation marks around the answer - with google translate the conversion accuracy differed immense across quotation marks, e.g. https://unicode-table.com/en/sets/quotation-marks/

yingzwang Oct 6, 2021
Author

Thanks for the advice!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prepare non-English SQuAD #1549

{{title}}

Replies: 9 comments 5 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Prepare non-English SQuAD #1549

yingzwang Sep 28, 2021

Replies: 9 comments · 5 replies

bogdankostic Sep 29, 2021

yingzwang Sep 29, 2021 Author

yingzwang Sep 29, 2021 Author

julian-risch Sep 30, 2021 Maintainer

yingzwang Sep 30, 2021 Author

julian-risch Oct 1, 2021 Maintainer

yingzwang Oct 1, 2021 Author

julian-risch Oct 2, 2021 Maintainer

yingzwang Oct 5, 2021 Author

aantti Oct 1, 2021 Collaborator

julian-risch Oct 1, 2021 Maintainer

yingzwang Oct 5, 2021 Author

Timoeller Oct 6, 2021

yingzwang Oct 6, 2021 Author

yingzwang
Sep 28, 2021

Replies: 9 comments 5 replies

bogdankostic
Sep 29, 2021

yingzwang
Sep 29, 2021
Author

yingzwang
Sep 29, 2021
Author

julian-risch
Sep 30, 2021
Maintainer

yingzwang
Sep 30, 2021
Author

julian-risch
Oct 1, 2021
Maintainer

yingzwang Oct 1, 2021
Author

julian-risch Oct 2, 2021
Maintainer

yingzwang Oct 5, 2021
Author

aantti
Oct 1, 2021
Collaborator

julian-risch
Oct 1, 2021
Maintainer

yingzwang
Oct 5, 2021
Author

yingzwang Oct 6, 2021
Author