Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjudication Recipe #106

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from
Open

Conversation

laurejt
Copy link
Contributor

@laurejt laurejt commented Oct 23, 2024

Primary change: adding adjudication recipe for text span annotations.

Secondary change: modified existing annotation recipes to use Prodigy API and option to fetch media.

@laurejt laurejt self-assigned this Oct 23, 2024
Copy link
Collaborator

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's obvious that you've figured out a lot more things about how prodigy works, and the refactors to rely more on built-in prodigy functionality looks good. My suggestions pretty much all relate to adding more comments to clarify what's going on; I do wonder if the annotation merging is worth unit testing, but not sure how hard that is.

src/corppa/poetry_detection/annotation/recipe.py Outdated Show resolved Hide resolved
"view_id": "blocks",
"config": config,
}

if fetch_media:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably worth a comment here to explain why this step needs to be done

"ner_manual_highlight_chars": True,
"global_css_dir": CURRENT_DIR,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this already in the common config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I wasn't originally copying the full common config.

Comment on lines 283 to 300
for version in versions:
session_id = version[SESSION_ID_ATTR]
# Assume: session name does not contain -
session_name = session_id.rsplit("-", maxsplit=1)[1]
if session_id not in session_counts:
session_counts[session_id] = 1
else:
session_name += f"-{session_counts[session_id]}"
session_counts[session_id] += 1
sessions.append(session_name)
if "spans" not in version:
# Not sure when an annotated example would be missing a spans field
continue
for span in version["spans"]:
new_span = span.copy()
span_label = span["label"]
new_span["label"] = f"{session_name}: {span_label}"
merged_spans.append(new_span)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of know what this is doing because you showed us what the interface looks like, but it's hard to follow from just the code. Would help to add some more comments here, and I wonder if this bit is worth a unit test.
What is a version ? How does it relate to sessions and annotation spans?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that helps answer how much clarity the type annotations provide. versions is a list of annotated tasks (List[TaskType]) to be merged. These tasks are grouped by input hash (this grouping occurs in get_review_stream. So, version is an annotated example (TaskType).

This code is making a single "merged" example for review and does it in the following way:

  1. Copy the contents from one of the version
  2. Then for each version:
    a. Determine the session name by removing the dataset prefix and adding a numerical suffix if a session has multiple annotated examples (hopefully only a headache for round 1)
    b. Copy its spans (but with modified label fields) to merged_spans list.
  3. Set "merged" example spans to merged_spans
  4. Add "sessions" field containing list of session names to "merged" example

@laurejt
Copy link
Contributor Author

laurejt commented Oct 23, 2024

I do wonder if the annotation merging is worth unit testing, but not sure how hard that is.

I've been considering making a separate, stand-alone function for merging annotated examples, that would make it easier to unit test. That said, I think I need to think more about whether it's worth further splitting out the logic for merging "spans" fields

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Under Review
Development

Successfully merging this pull request may close these issues.

2 participants