Update NIH award Format in the award number normalizer #69

Merged: 1 commit, Oct 26, 2023

@tsande16 tsande16 commented Oct 25, 2023

normalizeAwardNumber() does not capture award numbers whose activity code has the format [A-Z][0-9][A-Z]. Most activity codes are one letter followed by two digits, or two letters followed by one digit; however, there are cases of letter, digit, letter, e.g. P2C HD042854.

Note: P2C activity codes make up a very small portion of award numbers. Only 2 were found in 1 year's worth of data from PMC
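
The three activity-code shapes described above can be sketched as a single alternation. This is a minimal illustration, not the regex used by normalizeAwardNumber(): the class name, the method name, and the example codes R01 and UG3 are assumptions added here, and only P2C comes from this PR.

```java
import java.util.regex.Pattern;

public class ActivityCodeSketch {
    // Hypothetical pattern for illustration; not the project's actual regex.
    private static final Pattern ACTIVITY_CODE = Pattern.compile(
        "[A-Z][0-9]{2}"        // one letter, two digits (e.g. R01)
        + "|[A-Z]{2}[0-9]"     // two letters, one digit (e.g. UG3)
        + "|[A-Z][0-9][A-Z]"   // letter, digit, letter (e.g. P2C)
    );

    public static boolean isActivityCode(String code) {
        // matches() requires the whole string to fit one of the three shapes.
        return ACTIVITY_CODE.matcher(code).matches();
    }

    public static void main(String[] args) {
        for (String code : new String[] {"R01", "UG3", "P2C", "123", "ABC"}) {
            System.out.println(code + " -> " + isActivityCode(code));
        }
    }
}
```

Listing each shape as its own alternative keeps the match exact, at the cost of one more branch every time a new shape turns up.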

ref: https://www.era.nih.gov/files/Deciphering-NIH-Application.pdf
https://grants.nih.gov/grants/funding/ac_search_results.htm

To test: mvn verify

@tsande16 tsande16 self-assigned this Oct 25, 2023
@tsande16 tsande16 linked an issue Oct 25, 2023 that may be closed by this pull request

@markpatton markpatton left a comment

I wonder if that regexp needs to be quite so strict and complicated? Instead of adding a special case that allows a letter at the end of the initial part, the rules could just be loosened to match any mix of numbers and letters. But I may be missing some of the point.

Tests pass for me.

@markpatton markpatton merged commit 4f4f262 into main Oct 26, 2023
2 checks passed
@tsande16

@markpatton definitely something to look into. I agree it's complicated. Back in Aug/Sept I ran tests on 4-5k NIH award numbers and 4-5k non-award numbers, and the tests were successful. This particular case was missed because the activity code is not very common (2 within ~9.5k records).

Since non-NIH award numbers are all over the place, my initial idea was to match NIH award numbers as exactly as possible so that non-NIH award numbers are not modified.

The normalization probably wouldn't affect non-NIH award numbers much, since most of it is just removing leading/trailing whitespace and making all characters uppercase. The only thing that could potentially cause a problem is the removal of leading zeros. A loosened regex could match a non-NIH grant and remove a leading zero, and there are 1,401 grants in pass_grants (STAGE) with at least 1 leading zero. More testing on this would need to be performed. I can create a ticket to investigate this?
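
The leading-zero concern can be illustrated with a small sketch. Everything in it is an assumption made for illustration: both patterns, the normalize() helper, the position of the zeros being stripped, and the made-up non-NIH value 0123AB456789. It does not reproduce the project's actual normalizer.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class LeadingZeroSketch {
    // Hypothetical strict pattern: optional leading zeros, one of the three
    // activity-code shapes, optional space, two letters, six digits.
    public static final Pattern STRICT = Pattern.compile(
        "0*(?:[A-Z][0-9]{2}|[A-Z]{2}[0-9]|[A-Z][0-9][A-Z])\\s?[A-Z]{2}[0-9]{6}");

    // Hypothetical loosened pattern: any three letters or digits up front.
    public static final Pattern LOOSE = Pattern.compile(
        "0*[A-Z0-9]{3}\\s?[A-Z]{2}[0-9]{6}");

    // Trim and uppercase every input; strip leading zeros only when the
    // cleaned value looks like an NIH award number under the given pattern.
    public static String normalize(String raw, Pattern nihLike) {
        String cleaned = raw.trim().toUpperCase(Locale.ROOT);
        if (nihLike.matcher(cleaned).matches()) {
            return cleaned.replaceFirst("^0+", "");
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String nonNih = " 0123ab456789 "; // made-up non-NIH grant number
        System.out.println(normalize(nonNih, STRICT)); // leading zero kept
        System.out.println(normalize(nonNih, LOOSE));  // leading zero stripped
        System.out.println(normalize("p2c HD042854", STRICT));
    }
}
```

Under these assumptions the strict pattern leaves the made-up value as 0123AB456789, while the loosened one rewrites it to 123AB456789, which is the kind of silent change that would need testing against the grants with leading zeros.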

@markpatton

@tsande16 Things look good now. It was just a comment about watching that regex grow in size; when we get more types of award numbers to deal with, we may want to revisit it.

@tsande16

@markpatton that makes sense. It could definitely grow unwieldy.

@tsande16 tsande16 deleted the 804-update-nih-award-format branch October 26, 2023 14:58
Successfully merging this pull request may close these issues.

Bug fix - Update NIH format for award numbers