
WIP #60 #681

Open
wants to merge 10 commits into
base: main

Conversation

shamikbose
Contributor

@shamikbose shamikbose commented Jun 4, 2022

Issues with download_and_extract()

NOTE: This is WIP. Folder structures are being lost when using download_and_extract()
Manual extraction:
image

Extraction using download_manager:
image

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Issues with `download_and_extract()`
Updated to show the code which was being run earlier.
Format is "CRAFT-5.0.0\concept-annotation\key\key"
- Passes all tests
- Warnings logged for multiple annotations
@shamikbose
Contributor Author

UPDATE: This passes all tests as a local dataset. However, I have a few more follow-up questions:

  1. About 2k of the 97k annotations consist of multiple non-contiguous spans, and the spanned text is only partial. How should these be handled? At the moment, I'm logging a warning with the entity_id and the filename. Example annotation below:
<annotation>
    <mention id="GO_BP_2016_02_16_Instance_19281" />
    <annotator id="GO_BP_2016_02_16_Instance_10000">Mike Bada, University of Colorado Anschutz Medical Campus</annotator>
    <span start="1351" end="1362" />
    <span start="1376" end="1378" />
    <span start="1383" end="1386" />
    <spannedText>development ... of ... PPs</spannedText>
</annotation>
  2. MONDO uses a completely different file structure (naming conventions) and annotation structure, and it has duplicate ids. Should this be handled? Example annotation below:
        <annotation annotator="Default" id="11532192-123" motivation="" type="identity">
            <class id="http://purl.obolibrary.org/obo/MONDO_0005041" label="'glaucoma (disease)'"/>
            <span end="1761" id="11532192-124" start="1753">Glaucoma</span>
        </annotation>
        <annotation annotator="Default" id="11532192-129" motivation="" type="identity">
            <class id="http://purl.obolibrary.org/obo/MONDO_0005041" label="'glaucoma (disease)'"/>
            <span end="2047" id="11532192-130" start="2039">glaucoma</span>
        </annotation>
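
For discussion, here is a minimal sketch of how both annotation styles shown above could be read into one common shape that keeps every span instead of only logging a warning. It is not the code in this PR. The function name `parse_annotation` and the returned dict layout are hypothetical, the samples are trimmed copies of the examples above, and splitting `spannedText` on `" ... "` is an assumption about how the gaps are written; slicing the document text with the offsets would be more robust.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the multi-span example above (annotator element omitted).
KNOWTATOR_STYLE = """<annotation>
    <mention id="GO_BP_2016_02_16_Instance_19281" />
    <span start="1351" end="1362" />
    <span start="1376" end="1378" />
    <span start="1383" end="1386" />
    <spannedText>development ... of ... PPs</spannedText>
</annotation>"""

# First MONDO example above.
MONDO_STYLE = """<annotation annotator="Default" id="11532192-123" motivation="" type="identity">
    <class id="http://purl.obolibrary.org/obo/MONDO_0005041" label="'glaucoma (disease)'"/>
    <span end="1761" id="11532192-124" start="1753">Glaucoma</span>
</annotation>"""


def parse_annotation(xml_string):
    """Return id, per-span offsets, and per-span text for one annotation.

    Hypothetical helper: handles the two layouts quoted in this thread only.
    """
    node = ET.fromstring(xml_string)
    spans = node.findall("span")
    # Keep one [start, end] pair per span so discontiguous mentions survive.
    offsets = [[int(s.get("start")), int(s.get("end"))] for s in spans]

    mention = node.find("mention")
    if mention is not None:
        # First style: id lives on <mention>, text lives in <spannedText>.
        ann_id = mention.get("id")
        # Assumption: gaps between fragments are rendered as " ... ".
        texts = node.findtext("spannedText", default="").split(" ... ")
    else:
        # MONDO style: id lives on <annotation>, text lives inside each <span>.
        ann_id = node.get("id")
        texts = [s.text or "" for s in spans]

    return {"id": ann_id, "offsets": offsets, "text": texts}


multi = parse_annotation(KNOWTATOR_STYLE)
single = parse_annotation(MONDO_STYLE)
print(multi)
print(single)
```

Keeping one offset pair and one text fragment per span means a quick consistency check (fragment length equals `end - start`) can flag the annotations that currently only produce a warning.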

@jason-fries @ruisi-su @galtay

shamikbose and others added 4 commits June 4, 2022 20:02
General changes:
- Updated paths to use `os.path.join()` to make it platform-agnostic
MONDO specific changes:
- Specific ways to read annotations
- Specific ways to find corresponding annotations
_PUBMED set to True
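
As an illustration of the `os.path.join()` change mentioned in the commit message above, a small sketch follows. The helper name `annotation_dir` and the example key `GO_BP` are assumptions made for this example; the `concept-annotation/key/key` layout is taken from the earlier commit note.

```python
import os


def annotation_dir(root, key):
    """Build <root>/concept-annotation/<key>/<key> with the platform separator.

    Hypothetical helper; the real loader may organise its paths differently.
    """
    # os.path.join inserts os.sep, so no backslashes are hard-coded.
    return os.path.join(root, "concept-annotation", key, key)


path = annotation_dir("CRAFT-5.0.0", "GO_BP")
print(path)
```

A hard-coded `"CRAFT-5.0.0\concept-annotation\..."` string only resolves on Windows, which is why joining the components is the safer choice.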
@mariosaenger
Collaborator

@phlobo I adjusted the implementation. Please have a look at it.

@mariosaenger
Collaborator

Resolves #938


2 participants