WIP #60 #681

shamikbose · 2022-06-04T20:42:31Z

Issues with download_and_extract()

NOTE: This is a WIP. The folder structure is being lost when using download_and_extract().
Manual extraction:

Extraction using download_manager:
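
One way to narrow this down is to compare the file listing inside the archive with what actually ends up on disk. The sketch below uses only the standard library; the demo archive and its paths are hypothetical, and the `list_relative_paths` helper could equally be pointed at whatever directory `download_and_extract()` returns.

```python
import os
import tempfile
import zipfile


def list_relative_paths(root):
    """Return the sorted relative paths of all files under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Normalise separators so listings are comparable across platforms.
            paths.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(paths)


def build_demo_archive(archive_path):
    """Create a tiny zip with a nested layout (hypothetical, for illustration)."""
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("CRAFT-demo/articles/txt/0001.txt", "text")
        zf.writestr("CRAFT-demo/concept-annotation/GO_BP/0001.xml", "<annotations/>")


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "demo.zip")
    build_demo_archive(archive)
    target = os.path.join(tmp, "extracted")
    with zipfile.ZipFile(archive) as zf:
        # File entries as recorded in the archive (directory entries skipped).
        expected = sorted(n for n in zf.namelist() if not n.endswith("/"))
        zf.extractall(target)
    extracted = list_relative_paths(target)
    # True when every nested path in the archive exists on disk unchanged.
    structure_preserved = extracted == expected
    print(structure_preserved)
```

If the two listings differ, the diff shows which directory levels were flattened or renamed during extraction.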

Name: CRAFT
Description: CRAFT corpus, a collection of 97 articles from the PubMed Central Open Access subset, each of which has been annotated along a number of different axes spanning structural, coreference, and concept annotation
Paper: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-161
Data: https://github.com/UCDenver-ccp/CRAFT/releases

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

shamikbose · 2022-06-04T23:12:51Z

UPDATE: This passes all tests as a local dataset. However, I have a few more follow-up questions:

Some annotations (~2k of 97k) consist of multiple, non-continuous spans, and the spanned text is only partial. How should these be handled? At the moment, I'm logging a warning with the entity_id and the filename. Example annotation below:

<annotation>
    <mention id="GO_BP_2016_02_16_Instance_19281" />
    <annotator id="GO_BP_2016_02_16_Instance_10000">Mike Bada, University of Colorado Anschutz Medical Campus</annotator>
    <span start="1351" end="1362" />
    <span start="1376" end="1378" />
    <span start="1383" end="1386" />
    <spannedText>development ... of ... PPs</spannedText>
</annotation>
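
One possible way to handle these, instead of only logging a warning, is to keep every span and emit one text fragment per span. The sketch below assumes the target schema accepts a list of offsets with a matching list of texts for a single entity; the document text, the offsets, and the `parse_annotation` helper are hypothetical and chosen only so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical document text and offsets, for illustration only.
DOCUMENT = "The development and growth of the PPs was studied."
ANNOTATION_XML = """
<annotation>
    <mention id="demo_Instance_1" />
    <span start="4" end="15" />
    <span start="27" end="29" />
    <span start="34" end="37" />
    <spannedText>development ... of ... PPs</spannedText>
</annotation>
"""


def parse_annotation(xml_string, document_text):
    """Turn one annotation with several spans into a single entity dict."""
    node = ET.fromstring(xml_string.strip())
    offsets = sorted(
        (int(span.get("start")), int(span.get("end")))
        for span in node.findall("span")
    )
    # Recover each fragment from the document instead of the partial spannedText.
    texts = [document_text[start:end] for start, end in offsets]
    return {
        "id": node.find("mention").get("id"),
        "offsets": offsets,
        "text": texts,
    }


entity = parse_annotation(ANNOTATION_XML, DOCUMENT)
print(entity)
```

Slicing the document text per span avoids relying on the `...` placeholders in `spannedText`.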

MONDO uses a completely different file structure (naming conventions) and annotation structure, and it contains duplicate ids. Should this be handled? Example annotation below:

        <annotation annotator="Default" id="11532192-123" motivation="" type="identity">
            <class id="http://purl.obolibrary.org/obo/MONDO_0005041" label="'glaucoma (disease)'"/>
            <span end="1761" id="11532192-124" start="1753">Glaucoma</span>
        </annotation>
        <annotation annotator="Default" id="11532192-129" motivation="" type="identity">
            <class id="http://purl.obolibrary.org/obo/MONDO_0005041" label="'glaucoma (disease)'"/>
            <span end="2047" id="11532192-130" start="2039">glaucoma</span>
        </annotation>
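
A minimal sketch of reading this format is shown below. It assumes the "duplicate ids" are the repeated `class` ids, so the `id` attribute on `annotation` is used as the unique entity id and the class id is kept as the normalized concept. The `<document>` wrapper element and the `parse_mondo` helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two annotations are copied from the example above; the <document>
# wrapper is an assumption made so the snippet parses on its own.
MONDO_XML = """
<document>
    <annotation annotator="Default" id="11532192-123" motivation="" type="identity">
        <class id="http://purl.obolibrary.org/obo/MONDO_0005041" label="'glaucoma (disease)'"/>
        <span end="1761" id="11532192-124" start="1753">Glaucoma</span>
    </annotation>
    <annotation annotator="Default" id="11532192-129" motivation="" type="identity">
        <class id="http://purl.obolibrary.org/obo/MONDO_0005041" label="'glaucoma (disease)'"/>
        <span end="2047" id="11532192-130" start="2039">glaucoma</span>
    </annotation>
</document>
"""


def parse_mondo(xml_string):
    """Return one entity dict per <annotation>, keyed by the annotation id."""
    root = ET.fromstring(xml_string.strip())
    entities = []
    for ann in root.iter("annotation"):
        spans = ann.findall("span")
        entities.append({
            "id": ann.get("id"),  # unique per annotation
            "concept_id": ann.find("class").get("id"),  # may repeat
            "offsets": [(int(s.get("start")), int(s.get("end"))) for s in spans],
            "text": [s.text for s in spans],  # text is stored inside <span>
        })
    return entities


entities = parse_mondo(MONDO_XML)

# Group annotation ids by concept to see which class ids repeat.
by_concept = defaultdict(list)
for entity in entities:
    by_concept[entity["concept_id"]].append(entity["id"])
print(dict(by_concept))
```

Unlike the other annotation sets, the span text here sits inside the `span` element, so no lookup in the article text is needed.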

@jason-fries @ruisi-su @galtay

mariosaenger · 2024-10-26T08:13:16Z

@phlobo I adjusted the implementation. Please have a look at it.

mariosaenger · 2024-10-27T09:18:32Z

Resolves #938

Initial commit

ff1728f

Issues with `download_and_extract()`

shamikbose requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber, sg-wbi and debajyotidatta as code owners June 4, 2022 20:42

shamikbose added 3 commits June 4, 2022 16:59

Use this version of craft.py to debug

c0ec8eb

Updated to show the code which was being run earlier. Format is "CRAFT-5.0.0\concept-annotation\key\key"

Removed print statements out of shame

87bd4e8

Implemented as a local dataset

f499a83

- Passes all tests
- Warnings logged for multiple annotations

shamikbose and others added 4 commits June 4, 2022 20:02

Can be loaded with load_datasets(). Passes all tests

d55394e

General changes:
- Updated paths to use `os.path.join()` to make it platform-agnostic

MONDO specific changes:
- Specific ways to read annotations
- Specific ways to find corresponding annotations

Update craft.py

c9a17a3

Merge branch 'bigscience-workshop:master' into craft

2875fcb

Update craft.py

9692f3e

_PUBMED set to True

shamikbose mentioned this pull request Jun 7, 2022

Create a dataset loader for CRAFT #60

Open

mariosaenger self-assigned this Oct 26, 2024

Mario Sänger added 2 commits October 26, 2024 10:10

Merge branch 'main' into craft

1ce40e8

refactor: Refactor and improve implementation of CRAFT to hub-style i…

c6bfb36

…ntegration

mariosaenger requested a review from phlobo October 26, 2024 08:12
