CLIA-2 - Funder Mapping Specification #36

Closed
wants to merge 3 commits into from
Conversation

@gitstart-app gitstart-app bot commented Aug 21, 2024

Run the funder_mapping file, with the required input files placed in the same root directory.

Author

gitstart-app bot commented Aug 21, 2024

This PR is estimated to cost between 50 and 70 credits.
🟡 By merging this PR you agree to this estimate. If you disagree, click here.

@agt24
Contributor

agt24 commented Aug 21, 2024

Thanks for submitting this PR.

I checked out this pull request, copied the two input files (biomedical_research_funders.csv and indicators_all.csv) into the scripts directory and ran it with the command line python funder_mapping.py.

It took a few minutes but completed with one warning message. The output file had the right number of lines, but it found no funder matches (i.e., there were no True values when I ran grep -i TRUE pmid-funding-matrix.csv).
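A per-column count gives a bit more detail than the grep check. This is only a minimal sketch: the `pmid` column name, the funder column names, and the sample rows are assumptions made for illustration, not the real output.

```python
import csv
import io
from collections import Counter


def count_true_per_column(csv_file) -> Counter:
    """Count TRUE cells (any capitalisation) in each non-pmid column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            # "pmid" is an assumed name for the identifier column.
            if column != "pmid" and (value or "").strip().upper() == "TRUE":
                counts[column] += 1
    return counts


# Invented sample standing in for pmid-funding-matrix.csv.
sample = io.StringIO("pmid,Funder A,Funder B\n1,False,False\n2,True,False\n")
counts = count_true_per_column(sample)
print(sum(counts.values()), dict(counts))
```

A total of zero corresponds to the empty grep result described above.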

It looks like the code has converted the funder fields in indicators_all.csv to lowercase, but is doing a case-sensitive string match using the names in biomedical_research_funders.csv, which are mixed case. I would suggest not removing case from these fields. Rather, you could make the str.contains() match case-insensitive with case=False.
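As a rough illustration of the case=False suggestion, here is a small pandas sketch. The sample rows and the funder name are invented for the example, and `regex=False` and `na=False` are extra choices of mine (literal matching, and treating missing text as no match), not something the PR is known to use.

```python
import pandas as pd

# Toy stand-in for the funding text in indicators_all.csv (values invented).
indicators = pd.DataFrame({
    "pmid": [1, 2, 3],
    "fund_text": [
        "Supported by the NATIONAL INSTITUTES OF HEALTH.",
        "Funded by the Wellcome Trust.",
        None,  # missing funding text
    ],
})

# Illustrative funder name; the real names come from biomedical_research_funders.csv.
funder_name = "National Institutes of Health"

# case=False ignores case, so the source text does not need to be lowercased.
# regex=False treats the name as a literal string rather than a pattern.
# na=False reports missing text as False instead of NaN.
matches = indicators["fund_text"].str.contains(
    funder_name, case=False, regex=False, na=False
)
print(matches.tolist())
```

With this approach the original capitalisation of the funding fields is left untouched, which matters for the acronym search.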

Also, many publications will include the funder's acronym rather than the full name, so this should be searched as well, preserving case.
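One possible way to search acronyms while preserving case is sketched below with the standard `re` module. The word-boundary requirement is an additional assumption on my part (it is not in the spec), and "WHO" is used purely as an illustration; it may not be one of the 31 funders.

```python
import re


def acronym_pattern(acronym: str) -> re.Pattern:
    """Compile a case-sensitive, whole-word pattern for an acronym."""
    # No re.IGNORECASE flag, so case is preserved. \b requires a word boundary
    # on each side, which assumes the acronym starts and ends with a letter or digit.
    return re.compile(rf"\b{re.escape(acronym)}\b")


who = acronym_pattern("WHO")
print(bool(who.search("Funded by WHO grant 123.")))          # True
print(bool(who.search("Patients who received treatment.")))  # False
print(bool(who.search("The WHOLE cohort was followed.")))    # False
```

The second and third cases are the kind of false positive that a case-insensitive or substring acronym match would produce.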

@agt24
Contributor

agt24 commented Aug 21, 2024

Here's the original spec for easy reference, along with the biomedical_research_funders.csv file.
biomedical_research_funders.csv

The large `indicators_all.csv` file referenced in the ticket from the last job (#29) includes a few columns that have information about how the publication was funded (e.g. 'fund_text', 'fund_pmc_institute', 'fund_pmc_source', 'fund_pmc_anysource'). We would like to search these fields for strings that indicate whether the paper's funders included one or more of the 31 funders listed in the attached CSV file (biomedical_research_funders.csv). We'd like this information to be represented in an output CSV with a line for each publication and 31 columns, one for each funder, with values of TRUE or FALSE. I've attached an example (pmid-funding-matrix.csv). Note that the PMIDs are not real and the TRUE/FALSE values are randomly assigned. Also, the actual output files should have ~2.75M lines.

There are various ways to approach this task, and we can be more prescriptive if you like, but at this point we're just looking for a reasonable first pass at pulling this information out. So feel free to try something simple like string-matching. Then we can try refining it.
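A simple string-matching first pass along these lines might look like the sketch below. It is not the code from this PR or from any other branch. The four funding column names come from the spec; the `pmid`, `name`, and `acronym` column names and all sample values are assumptions for illustration.

```python
import re

import pandas as pd

# Funding fields named in the spec.
FUND_COLUMNS = ["fund_text", "fund_pmc_institute", "fund_pmc_source", "fund_pmc_anysource"]


def build_funder_matrix(indicators: pd.DataFrame, funders: pd.DataFrame) -> pd.DataFrame:
    """Return one row per publication and one boolean column per funder."""
    # Join the funding fields into a single searchable string per publication.
    text = indicators[FUND_COLUMNS].fillna("").astype(str).apply(" ".join, axis=1)
    matrix = pd.DataFrame({"pmid": indicators["pmid"]})
    # "name" and "acronym" are assumed column names in the funders file.
    for name, acronym in zip(funders["name"], funders["acronym"]):
        # Full names: literal, case-insensitive match.
        hit = text.str.contains(name, case=False, regex=False)
        # Acronyms: case-sensitive, whole-word match; skipped when missing or blank.
        if isinstance(acronym, str) and acronym.strip():
            pattern = rf"\b{re.escape(acronym.strip())}\b"
            hit |= text.str.contains(pattern, case=True, regex=True)
        matrix[name] = hit
    return matrix


# Invented sample data standing in for the two input files.
indicators = pd.DataFrame({
    "pmid": [101, 102, 103],
    "fund_text": ["Supported by the national institutes of health.", "No external funding.", None],
    "fund_pmc_institute": [None, None, "NIH"],
    "fund_pmc_source": ["", "", ""],
    "fund_pmc_anysource": ["", "the Wellcome trust", ""],
})
funders = pd.DataFrame({
    "name": ["National Institutes of Health", "Wellcome Trust"],
    "acronym": ["NIH", None],
})

matrix = build_funder_matrix(indicators, funders)
print(matrix)
```

The row-wise join is easy to read but may be slow at ~2.75M lines, so a vectorised concatenation could be worth trying if runtime becomes a problem.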

@gitstart-nimhdsst
Contributor

gitstart-nimhdsst commented Aug 22, 2024

> Thanks for submitting this PR.
>
> I checked out this pull request, copied the two input files (biomedical_research_funders.csv and indicators_all.csv) into the scripts directory and ran it with the command line python funder_mapping.py.
>
> It took a few minutes but completed with one warning message. The output file had the right number of lines, but it found no funder matches (i.e., there were no True values when I ran grep -i TRUE pmid-funding-matrix.csv).
>
> It looks like the code has converted the funder fields in indicators_all.csv to lowercase, but is doing a case-sensitive string match using the names in biomedical_research_funders.csv, which are mixed case. I would suggest not removing case from these fields. Rather, you could make the str.contains() match case-insensitive with case=False.
>
> Also, many publications will include the funder's acronym rather than the full name, so this should be searched as well, preserving case.

Thanks for the review.
We have updated the PR, kindly review again. @agt24

@agt24
Contributor

agt24 commented Aug 23, 2024

There are still problems with this PR.
Contrary to my suggestion on Wednesday, acronyms are matched with case=False, which results in lots of false positive matches. Also, for many lines, all 31 funder columns were set to TRUE. I didn’t take the time to determine why. Rather, I addressed the issue in a separate branch here: https://github.com/nimh-dsst/osm/tree/agt-funder-matrix

The new branch resolves the issue. Please discontinue working on this PR. It will not be merged.
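For anyone checking a funder matrix for the "every funder column TRUE" symptom mentioned above, a quick pandas sanity check could look like this sketch. It does not diagnose the cause, and the column names and values are invented for illustration.

```python
import pandas as pd

# Invented stand-in for a loaded pmid-funding-matrix.csv.
matrix = pd.DataFrame({
    "pmid": [1, 2, 3],
    "Funder A": [True, False, True],
    "Funder B": [True, False, False],
})

# Every column except the assumed "pmid" identifier is treated as a funder column.
funder_columns = [c for c in matrix.columns if c != "pmid"]

# Rows where every funder column is True are implausible and worth inspecting.
all_true = matrix[funder_columns].all(axis=1)
suspect_pmids = matrix.loc[all_true, "pmid"].tolist()
print(suspect_pmids)
```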

@leej3 leej3 closed this Aug 24, 2024