Prepare non-English SQuAD #1549
-
I saw that you released GermanQuAD. (https://huggingface.co/datasets/deepset/germanquad)
If I skip 2), would 1) already be enough? I've tried the multilingual model
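For reference, this is roughly what I tried for 1): fine-tuning a multilingual QA model on SQuAD-format Dutch data with Haystack's `FARMReader`. The model name, file names, and hyperparameters below are just placeholders, and the import path may differ between Haystack versions:

```python
# Sketch of step 1): fine-tune a multilingual QA model on SQuAD-format
# Dutch data with Haystack's FARMReader.
# Model name and file names below are placeholders.
from haystack.nodes import FARMReader  # import path may differ by Haystack version

reader = FARMReader(model_name_or_path="deepset/xlm-roberta-base-squad2", use_gpu=True)
reader.train(
    data_dir="data",                          # directory containing the training file
    train_filename="dutch_squad_train.json",  # SQuAD-format annotations
    n_epochs=2,
    save_dir="dutch_qa_model",
)
```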
-
Hi @yingzwang! We released a paper where we summarize our learnings from creating the GermanQuAD dataset. We saw that using human annotations instead of machine-translated labels results in better performance, and we also ran experiments on how the number of training samples affects performance (see Figure 3). Let me know if you have further questions :)
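If you want a quick look at the data itself, GermanQuAD is on the Hugging Face hub; assuming you have the `datasets` library installed, something like this loads it:

```python
# Load GermanQuAD straight from the Hugging Face hub for a quick look.
from datasets import load_dataset

germanquad = load_dataset("deepset/germanquad")
print(germanquad)              # available splits and their sizes
print(germanquad["train"][0])  # one sample: context, question, answers
```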
-
Thanks for sharing the paper @bogdankostic ! At a quick glance I see a lot of useful stuff that's exactly what I'm looking for. The performance boost from the human-annotated GermanQuAD is significant. I'm encouraged to invest more effort into preparing a human-annotated DutchQuAD :) I also need to explore your annotation tool a bit.
-
Also, in the paper you mentioned "detailed labeling instructions". Could you share your complete labeling instructions? Or is everything already in Appendix A?
-
Hi @yingzwang Appendix A contains all labeling instructions, yes. As we had a team of annotators, we held regular meetings during the annotation process where we discussed newly annotated samples. For example, we discussed the phrasing of questions and whether a question could be rephrased so that there is less lexical overlap with the answer string. Another example is that we discussed whether a question is self-sufficient or whether it needs additional information to work for open-domain question answering.

English-to-Dutch machine translation of SQuAD should already give you quite good results. At first glance, the translations look very good, I agree. However, sometimes there are small mistakes, or the words used in the translation do not really fit the context. Machine translation can also increase the lexical overlap between the question and the answer and thus simplify the task for the model. That might be one reason why training on machine-translated data is not as good as training on carefully hand-annotated data.

I think 500 hand-annotated samples could already be enough to see an improvement, yes. I would recommend that you try to make these 500 samples not too simple for the model. Please keep us updated about the progress of your project, and don't hesitate to contact us again if any questions come up. Feel free to close the issue for now if there are no further questions at the moment and re-open it later. Good luck with training a monolingual Dutch QA model! 👍
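If you want a rough automatic check for the lexical-overlap guideline while annotating, a simple token-overlap heuristic can flag candidates for rephrasing. To be clear, this is just an illustrative sketch, not the check we used for GermanQuAD:

```python
# Illustrative heuristic (not the GermanQuAD procedure): fraction of question
# tokens that also occur in the sentence containing the answer. High values
# suggest the question is lexically too close to the context.
import re

def token_overlap(question: str, answer_sentence: str) -> float:
    q_tokens = set(re.findall(r"\w+", question.lower()))
    a_tokens = set(re.findall(r"\w+", answer_sentence.lower()))
    return len(q_tokens & a_tokens) / len(q_tokens) if q_tokens else 0.0

question = "In welk jaar werd de universiteit opgericht?"
answer_sentence = "De universiteit werd in 1575 opgericht."
if token_overlap(question, answer_sentence) > 0.7:  # threshold is arbitrary
    print("High lexical overlap - consider rephrasing the question.")
```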
-
Thanks for the detailed answer @julian-risch ! You have adopted a really robust quality-control workflow. How large is your annotation team? I'm not sure I can afford a similar setup, but I'll try. I found the manual of your annotation tool here. There's one rule on page 13 saying "Don't create questions that elaborate on context". Why is that? A user is very likely to ask a question in that way, especially in a conversation. Why should we avoid it in the training data? Your labeling instructions are insightful. Of the ~50 annotated samples I've collected so far, some already violate the "no lexical overlap" rule and the "prefer short answers over long answers" rule. I read your paper just in time :)
-
Great to hear that you find our documentation helpful! @yingzwang
-
There's also this recent article we did on annotation: https://www.deepset.ai/blog/labeling-data-with-haystack-annotation-tool (and the latest one is about evaluation). Just in case it might be helpful in addition to the documentation pages.
-
I just converted this issue to a discussion. @yingzwang you can mark the answer in this discussion rather than having to close the issue. That should also improve searchability for other community members.
-
Another question, about preparing a machine-translated SQuAD. In your paper you mentioned that you used Facebook's MLQA dataset to warm-start the training. Unfortunately, MLQA does not contain Dutch, so it seems I have to prepare a machine-translated Dutch SQuAD myself. Do you have any suggestions or tips? I found this repo implementing a translate-align-retrieve method. Does it seem useful?
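For context, the simplest baseline I can think of (much cruder than the translate-align-retrieve repo above) would be to translate question and context with an open MT model such as Helsinki-NLP's OPUS-MT English-Dutch model, and keep only samples where the translated answer string still occurs verbatim in the translated context. A sketch of that idea, with `translate_sample` being my own hypothetical helper:

```python
# Naive translate-then-match baseline for a machine-translated Dutch SQuAD.
# Much cruder than translate-align-retrieve: samples whose translated answer
# cannot be found verbatim in the translated context are simply discarded.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-nl")

def translate(text: str) -> str:
    # Long SQuAD contexts may need to be split into sentences first.
    return translator(text, max_length=512)[0]["translation_text"]

def translate_sample(context: str, question: str, answer: str):
    nl_context = translate(context)
    nl_answer = translate(answer)
    start = nl_context.find(nl_answer)
    if start == -1:
        return None  # answer span lost in translation; drop the sample
    return {
        "context": nl_context,
        "question": translate(question),
        "answers": {"text": [nl_answer], "answer_start": [start]},
    }
```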