CLIA-2 - Funder Mapping Specification #36

gitstart-app · 2024-08-21T16:39:18Z

Run the funder_mapping script with the required input files placed in the same root directory.

gitstart-app · 2024-08-21T16:39:24Z

This PR is estimated to cost between 50 and 70 credits.
🟡 By merging this PR you agree to this estimate. If you disagree, click here.

agt24 · 2024-08-21T18:20:13Z

Thanks for submitting this PR.

I checked out this pull request, copied the two input files (biomedical_research_funders.csv and indicators_all.csv) into the scripts directory, and ran it from the command line with python funder_mapping.py.

It took a few minutes, but it completed with one warning message. The output file had the right number of lines, but it contained no funder matches (i.e., there were no True values when I ran grep -i TRUE pmid-funding-matrix.csv).

It looks like the code converts the funder fields in indicators_all.csv to lowercase but then does a case-sensitive string match against the names in biomedical_research_funders.csv, which are mixed case. I would suggest not lowercasing these fields. Instead, you could make the str.contains() match case-insensitive with case=False.

Also, many publications will include the funder's acronym rather than the full name, so the acronym should be searched for as well, preserving case.
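
The two suggestions above can be sketched roughly as follows. This is a minimal illustration using the standard-library re module rather than the PR's actual code; the function name matches_funder, the word-boundary check on acronyms, and the example funder are assumptions made for illustration. In pandas, the full-name half corresponds to calling str.contains() with case=False.

```python
import re

def matches_funder(text: str, name: str, acronym: str) -> bool:
    """Return True if text mentions the funder by full name or by acronym.

    Hypothetical helper, not taken from funder_mapping.py.
    """
    if not text:
        return False
    # Full name: case-insensitive literal match. re.escape guards against
    # regex metacharacters (e.g. parentheses) in a funder name.
    if re.search(re.escape(name), text, flags=re.IGNORECASE):
        return True
    # Acronym: case-sensitive, and only as a whole token (the boundary
    # check is an added assumption, not something the review asked for).
    # Skipping empty acronyms matters: an empty pattern matches any text.
    if acronym and re.search(rf"(?<!\w){re.escape(acronym)}(?!\w)", text):
        return True
    return False

# Example funder chosen for illustration only.
NAME, ACRONYM = "National Institutes of Health", "NIH"
print(matches_funder("funded by the national institutes of health", NAME, ACRONYM))
print(matches_funder("Supported by NIH grant funding", NAME, ACRONYM))
print(matches_funder("a nihilistic view of funding", NAME, ACRONYM))
```

Keeping the acronym match case-sensitive is what stops a short string such as "NIH" from matching inside ordinary lowercase words.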

agt24 · 2024-08-21T18:22:59Z

Here's the original spec for easy reference, along with the biomedical_research_funders.csv file:
biomedical_research_funders.csv

The large 'indicators_all.csv' file referenced in the ticket from the last job (#29) includes a few columns that have information about how the publication was funded (e.g. 'fund_text', 'fund_pmc_institute', 'fund_pmc_source', 'fund_pmc_anysource'). We would like to search these fields for strings that indicate whether the paper's funders included one or more of the 31 funders listed in the attached CSV file (biomedical_research_funders.csv). We'd like this information to be represented in an output CSV with a line for each publication and 31 columns, one for each funder, with values of TRUE or FALSE. I've attached an example (pmid-funding-matrix.csv). Note that the PMIDs are not real and the TRUE/FALSE values are randomly assigned. Also, the actual output file should have ~2.75M lines.

There are various ways to approach this task, and we can be more prescriptive if you like, but at this point we're just looking for a reasonable first pass at pulling this information out. So feel free to try something simple like string-matching. Then we can try refining it.
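
As a rough sketch of the output shape the spec describes (one line per publication, one TRUE/FALSE column per funder), something like the following could serve as a first pass. It uses only the standard library and tiny in-memory stand-ins for the real files. The 'pmid' column name, the made-up rows, and the funder list are assumptions; only the four fund_* column names come from the spec. Acronym matching is left out to keep the example short.

```python
import csv
import io
import re

# Funding columns named in the spec.
FUND_COLUMNS = ["fund_text", "fund_pmc_institute", "fund_pmc_source", "fund_pmc_anysource"]

def build_matrix(indicator_rows, funders):
    """Yield one dict per publication: its pmid plus a TRUE/FALSE flag per funder."""
    # Compile one case-insensitive literal pattern per funder name up front.
    patterns = {name: re.compile(re.escape(name), re.IGNORECASE) for name in funders}
    for row in indicator_rows:
        # Join the funding fields so each funder is searched once per row;
        # missing or empty fields are treated as empty strings.
        text = " ".join(row.get(col) or "" for col in FUND_COLUMNS)
        out = {"pmid": row["pmid"]}
        for name, pattern in patterns.items():
            out[name] = "TRUE" if pattern.search(text) else "FALSE"
        yield out

# Made-up rows standing in for indicators_all.csv.
indicators_csv = io.StringIO(
    "pmid,fund_text,fund_pmc_institute,fund_pmc_source,fund_pmc_anysource\n"
    "1,Supported by the Wellcome Trust,,,\n"
    "2,,,,\n"
)
# Illustrative funder names; the real list has 31 entries.
funders = ["Wellcome Trust", "Example Research Council"]

rows = list(build_matrix(csv.DictReader(indicators_csv), funders))

# Write the matrix as CSV (to a string here; a real run would write a file).
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["pmid", *funders])
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

Streaming rows through a generator keeps memory use flat, which may matter for an input of roughly 2.75M lines.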

gitstart-app · 2024-08-22T09:38:06Z

This PR is estimated to cost between 50 and 70 credits.
🟡 By merging this PR you agree to this estimate. If you disagree, click here.

gitstart-nimhdsst · 2024-08-22T09:41:12Z

Thanks for the review.
We have updated the PR, kindly review again. @agt24

gitstart-app · 2024-08-23T08:21:08Z

This PR is estimated to cost between 50 and 70 credits.
🟡 By merging this PR you agree to this estimate. If you disagree, click here.

agt24 · 2024-08-23T17:05:10Z

There are still problems with this PR.
Contrary to my suggestion on Wednesday, acronyms are matched with case=False, which results in lots of false-positive matches. Also, for many lines, all 31 funder columns were set to TRUE. I didn’t take the time to determine why. Rather, I addressed the issue in a separate branch here: https://github.com/nimh-dsst/osm/tree/agt-funder-matrix

The new branch resolves the issue. Please discontinue working on this PR. It will not be merged.
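
Both symptoms described in this review (no TRUE values at all, or rows where every funder column is TRUE) can be spotted with a quick summary of the output file. The following is a minimal sketch of such a check, not something from either branch; the function name summarize_matrix, the 'pmid' column name, and the sample rows are assumptions.

```python
import csv
import io

def summarize_matrix(csv_file, id_column="pmid"):
    """Count TRUE values per funder column and rows where every funder is TRUE."""
    reader = csv.DictReader(csv_file)
    funder_cols = [c for c in reader.fieldnames if c != id_column]
    true_counts = dict.fromkeys(funder_cols, 0)
    all_true_rows = 0
    for row in reader:
        # Compare case-insensitively so both "True" and "TRUE" are counted.
        flags = [row[c].strip().upper() == "TRUE" for c in funder_cols]
        for col, flag in zip(funder_cols, flags):
            true_counts[col] += flag
        all_true_rows += all(flags)
    return true_counts, all_true_rows

# Made-up sample standing in for pmid-funding-matrix.csv.
sample = io.StringIO(
    "pmid,Funder A,Funder B\n"
    "1,TRUE,TRUE\n"
    "2,FALSE,TRUE\n"
    "3,FALSE,FALSE\n"
)
counts, all_true = summarize_matrix(sample)
print(counts, all_true)
```

A column with zero matches, or a large number of all-TRUE rows, would be a cue to inspect the matching patterns before trusting the matrix.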

add funder mapping script

396af60

fix review comments

c0a929c

gitstart-nimhdsst requested a review from agt24 August 22, 2024 10:56

remove duplicate

61c6c81

leej3 closed this Aug 24, 2024
