Tasks that are difficult/painful to follow/understand/solve #467

danyaljj · 2021-10-21T01:46:04Z

danyaljj
Oct 21, 2021

@Palipoor @yeganehkordi While you're addressing human eval feedback #276 let's use this thread to keep track of the tasks that are difficult for humans to understand (**) and we don't have a good way of improving them. We can collectively discuss ways to improve them or drop them. For completeness, here is the result of human evaluation with crowdworkers' feedback.

(**) not to be confused with tasks that are easy to understand but humans score low in terms of automatic eval.

Palipoor · 2021-10-21T02:12:49Z

Palipoor
Oct 21, 2021

Ok!
I didn't know Yeganehis also working on addressing the feedbacks.
@yeganehkordi Do we need to plan on which task numbers you work on and which I do, so there won't be redundancies?

0 replies

yeganehkordi · 2021-10-21T05:38:13Z

yeganehkordi
Oct 21, 2021

Sure!
@Palipoor Yeah, I'm working on tasks 200-230 now. To avoid any conflicts, we can work on groups of 100 tasks.

2 replies

Palipoor Oct 21, 2021

oh ok! I've already done the 201 - 205. I will jump to 300+ for now.

yeganehkordi Oct 21, 2021

Thanks!

Palipoor · 2021-10-21T17:39:06Z

Palipoor
Oct 21, 2021

Two observations:

People become overwhelmed when task inputs are long.
The tasks with "incorrect answer generation" tend to confuse them.

7 replies

Palipoor Oct 21, 2021

The task task324_jigsaw_classification_disagree doesn't make sense to me(and the annotators!)

Palipoor Oct 21, 2021

[!!Warning for swear words below!!]

In the task task326_jigsaw_classification_obscene it seems like the gold labels are "obscene" for sentences containing words like "damn" or "shit", although according to the dictionary definition, and the definition of the task, obscene is something offensive that is somehow related to sex. Should we change the definition and title?

Palipoor Oct 21, 2021

task 333 has lots of noisy instances!

Palipoor Oct 21, 2021

task 335 doesn't make sense to me( and the annotators)

Palipoor Oct 21, 2021

task 345 (part of speech tagging) was difficult for the annotators. (Why do they say "we need to memorize tags?! Aren't the instructions in the same page?)

danyaljj · 2021-10-21T18:45:24Z

danyaljj
Oct 21, 2021
Author

Few pieces of intuition from #464:

For Winogrande tasks, I think the definitions and the examples were clear. Maybe the tasks themselves are difficult.

Tasks 036 and 046 didn't make sense to me, so I couldn't fix them. Either the definition is not good or the instances are too noisy.

In task 201 all the outputs are "2".

0 replies

yeganehkordi · 2021-10-22T07:57:19Z

yeganehkordi
Oct 22, 2021

In task 204, based on the paper, genres refer to the source of collecting the premises. I don't think if the classification of the sentences based on the genres is practical. For example, two-sided, telephone conversations that took place in 1990 or 1991 (TELEPHONE) and Collection of two-sided, in-person conversations that took place in the early 2000s (FACE-TO-FACE) are hard to distinguish.

1 reply

Palipoor Oct 22, 2021

I agree!

yeganehkordi · 2021-10-22T18:38:39Z

yeganehkordi
Oct 22, 2021

Structured Text Generation tasks based on the logic2text (e.g., task211_logic2text_classification.json) seem to be difficult for humans who don't have a programming or math background. However, the definitions are good.

2 replies

danyaljj Nov 7, 2021
Author

If the definitions are clear (to some human expert) and the instance labels are well-defined, we're good. It's okay if average humans struggle, but there have to be >0 humans that can solve them.

yeganehkordi Nov 8, 2021

I think it's okay! Most of the programmers can solve them. Maybe we can use this task in the train split.

yeganehkordi · 2021-10-22T23:15:32Z

yeganehkordi
Oct 22, 2021

In the task task226_english_language_answer_relevance_classification, the first half of the instances have the "yes" outputs, and the second half has "no". Maybe we need to shuffle instances of the tasks later.

0 replies

Palipoor · 2021-10-24T17:50:33Z

Palipoor
Oct 24, 2021

task383_matres_classification was difficult for annotators to understand. I didn't understand it either!

1 reply

Palipoor Oct 24, 2021

For task "task386_semeval_2018_task3_irony_detection" I had a difficult time understanding what it meant by "irony". The example was "sarcastic" in my opinion. It may be because I am not a native speaker. The annotators seemed to be confused too.

Palipoor · 2021-10-30T22:53:37Z

Palipoor
Oct 30, 2021

@yeganehkordi Task 456 was difficult. I think I didn't understand it either.

2 replies

Palipoor Oct 30, 2021

Task 471 has very difficult instances, yet the examples are easy.

yeganehkordi Nov 3, 2021

I'll try to make task 456 more clear.

swarooprm · 2021-11-04T07:09:44Z

swarooprm
Nov 4, 2021

@pulkitverma25 wants to help in addressing crowdworker evaluation feedback. What task numbers he can work on, so that it does not conflict with the ones you are working on @Palipoor @yeganehkordi ?

3 replies

swarooprm Nov 4, 2021

@aarunku5 also wants to help here, let's decide task numbers for her as well.

Palipoor Nov 4, 2021

Hi!
So now Yeganeh has tasks 500 - 600, and I have tasks 600 - 700. So you can split the rest (700+) between the two of you! @aarunku5 @pulkitverma25
We can also keep a sheet if there're more people involved.

pulkitverma25 Nov 4, 2021

I will start working on tasks 701-800.

aarunku5 · 2021-11-06T04:30:58Z

aarunku5
Nov 6, 2021

I can do the next 100, i.e., 801-900

…

On Thu, Nov 4, 2021 at 4:25 PM Pulkit Verma ***@***.***> wrote: I will start working on tasks 701-800. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/allenai/natural-instructions-expansion/discussions/467*discussioncomment-1591865__;Iw!!IKRxdwAv5BmarQ!IXBk6Qgjk1bINegP51LAGkzBIulKlTzEmWCLDGxUuCNpDIsuJfyqsAJLeJ7N-IQ$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AM6E2NCPCDYJS5JCFSVOI63UKMI7FANCNFSM5GM6TRIQ__;!!IKRxdwAv5BmarQ!IXBk6Qgjk1bINegP51LAGkzBIulKlTzEmWCLDGxUuCNpDIsuJfyqsAJLegwJ8cQ$> . Triage notifications on the go with GitHub Mobile for iOS <https://urldefense.com/v3/__https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675__;!!IKRxdwAv5BmarQ!IXBk6Qgjk1bINegP51LAGkzBIulKlTzEmWCLDGxUuCNpDIsuJfyqsAJLYNMoM0c$> or Android <https://urldefense.com/v3/__https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign*3Dnotification-email*26utm_medium*3Demail*26utm_source*3Dgithub__;JSUlJSU!!IKRxdwAv5BmarQ!IXBk6Qgjk1bINegP51LAGkzBIulKlTzEmWCLDGxUuCNpDIsuJfyqsAJLVatz7BM$>.

0 replies

Palipoor · 2021-11-06T15:12:12Z

Palipoor
Nov 6, 2021

task667_mmmlu_answer_generation_business_ethics seems to be difficult for people.

1 reply

Palipoor Nov 6, 2021

The mmmlu tasks are difficult for people. They require field knowledge.

yeganehkordi · 2021-11-08T06:26:29Z

yeganehkordi
Nov 8, 2021

Tasks 268, 274, and 276 were difficult for the annotators. I think the definitions are sufficient.

0 replies

yeganehkordi · 2021-11-09T21:45:49Z

yeganehkordi
Nov 9, 2021

@Palipoor In task522, here is the question of one of the crowdworkers:
Can I use parts of sentences to create a whole new sentence? Can I just use one sentence?
Based on the instances, it seems that they can't combine different sentences. Am I right?

2 replies

Palipoor Nov 9, 2021

No, they can't. Unless they're consecutive themselves.

yeganehkordi Nov 10, 2021

Thanks!

Palipoor · 2021-11-11T00:41:10Z

Palipoor
Nov 11, 2021

@aarunku5 @yeganehkordi @pulkitverma25 I will pick up tasks 900 - 1100.

4 replies

Palipoor Nov 11, 2021

@aarunku5 @yeganehkordi @pulkitverma25 I will pick up tasks 1100 - 1200.

yeganehkordi Nov 12, 2021

@aarunku5 @Palipoor @pulkitverma25 I will pick up tasks 1200-1300.

Palipoor Nov 14, 2021

@yeganehkordi @pulkitverma25 @aarunku5 I will pick up tasks 1300 - 1400.

yeganehkordi Nov 15, 2021

@aarunku5 @Palipoor @pulkitverma25 I will pick up tasks 1400-1500.

yeganehkordi · 2021-11-11T01:01:09Z

yeganehkordi
Nov 11, 2021

task569_recipe_nlg_text_generation and task573_air_dialogue_classification were difficult for the annotators. Although, the definitions are good.

2 replies

yeganehkordi Nov 13, 2021

Task592 was difficult for annotators.

danyaljj Nov 15, 2021
Author

If you feel the definition are okay (and the task is well-defined), then we're okay.

Palipoor · 2021-11-12T00:02:35Z

Palipoor
Nov 12, 2021

task1148_maximum_ascii_value
People didn't know the ascii values and it was difficult for them.

1 reply

Palipoor Nov 12, 2021

task1186_nne_hrngo_classification seems to be difficult for people, as the numbers are not very meaningful.

pulkitverma25 · 2021-11-13T09:26:05Z

pulkitverma25
Nov 13, 2021

Tasks 664-667, 685-737:
A common feedback for many tasks is something on these lines: "They should have provided examples with an incorrect selection, rather than just instances that showed answers that didn't accord with the instructions."
Should I add a negative example with an incorrect selection? Or that is not needed?

For now, I created PR#607 without addressing this.

2 replies

Palipoor Nov 13, 2021

I think it's good to do this! I also saw some feedbacks on tasks like that.

pulkitverma25 Nov 13, 2021

Ok. This might take time, so I'll create a new PR to address this.

Palipoor · 2021-11-14T08:32:01Z

Palipoor
Nov 14, 2021

People seem to be bothered by tasks that may need a google search. Like task1321_country_continent

1 reply

danyaljj Nov 15, 2021
Author

Right, I should have specified in the evaluation that they can use Google.

CodeHime · 2021-11-17T22:47:59Z

CodeHime
Nov 17, 2021

I will take up changes in the tasks 1500-1600

0 replies

Palipoor · 2021-11-18T00:58:54Z

Palipoor
Nov 18, 2021

I will pick up tasks 1600 - 1700.

0 replies

Palipoor · 2021-11-18T03:36:03Z

Palipoor
Nov 18, 2021

task1625_disfl_qa_asnwer_generation:
A crowdworker has commented:
Confusing only because I don't understand why I'd read the disfluent question, when i'd much rather read the proper question and answer it.
I agree with them!

0 replies

pulkitverma25 · 2021-11-18T16:56:39Z

pulkitverma25
Nov 18, 2021

Picking 1701+

0 replies

pulkitverma25 · 2021-11-18T19:59:22Z

pulkitverma25
Nov 18, 2021

task743_eurlex_summarization seems difficult for annotators (and me).
One feedback:
"writing articles for the headline requires knowledge about anything legal related and I don't have that knowledge"

I agree with the feedback. This is not something you can Google or answer using your knowledge or inference unless you have "legal" knowledge.

2 replies

pulkitverma25 Nov 18, 2021

Same feedback for task744_eurlex_classification
Too much text to read, which is also full of "legal" vocabulary.
We can maybe mention that good knowledge of this kind is required for this task.

pulkitverma25 Nov 19, 2021

@danyaljj Do you think eurlex tasks (743 and 744) make sense for the users? The passages are long and need knowledge of legal terminology to answer. And their answers can't be Googled too.

Tasks that are difficult/painful to follow/understand/solve #467

Replies: 24 comments · 33 replies

danyaljj Oct 21, 2021 Author

danyaljj Nov 7, 2021 Author

danyaljj Nov 15, 2021 Author

danyaljj Nov 15, 2021 Author

Replies: 24 comments 33 replies

danyaljj
Oct 21, 2021
Author

danyaljj Nov 7, 2021
Author

danyaljj Nov 15, 2021
Author

danyaljj Nov 15, 2021
Author