Get_cid pulls many CIDs, some wrong #417

Open
daithi45 opened this issue May 2, 2024 · 2 comments

@daithi45

daithi45 commented May 2, 2024

Hi all, I'm running a dataset of ~1000 CAS numbers through webchem to pull CIDs.
For about half of them, get_cid() returns multiple CIDs.

> get_cid("613-33-2", from = "cas", match = "all")
# A tibble: 2 × 2
  query    cid   
  <chr>    <chr> 
1 613-33-2 11941 
2 613-33-2 170889

In most of these cases, the first CID returned isn't the correct one, so each result needs manual checking. Is there any way to improve my approach and reduce the manual work?

@Aariq
Collaborator

Aariq commented May 2, 2024

At first, I thought "this is just the nature of CAS numbers" or "this is just how searching for CAS numbers on PubChem works", but in this example, searching for 613-33-2 on PubChem returns only one result. @stitam, it might be worth double-checking how we query the PubChem API here, and whether there is an alternative that returns only the best match according to PubChem ("best" is currently not an option for the match argument of get_cid()).

@daithi45
Author

daithi45 commented May 3, 2024

Interestingly, if I take out the from="cas" argument, I only get one CID back. I'll try this on my main dataset and see if it works!

> get_cid("613-33-2", match = "all")
# A tibble: 1 × 2
  query    cid  
  <chr>    <chr>
1 613-33-2 11941 
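
For a dataset of ~1000 CAS numbers, one way to cut down the manual checking might be to count how many distinct CIDs each query returns and review only the ambiguous ones. The sketch below assumes the get_cid() result has character columns `query` and `cid`, as in the output shown in this thread. `flag_ambiguous` is a hypothetical helper, not part of webchem, and the example data simply reproduces the from="cas" result reported above rather than calling PubChem.

```r
# Hypothetical helper (not part of webchem): summarise a get_cid()-style
# result, assumed to have character columns `query` and `cid`.
flag_ambiguous <- function(res) {
  # Group CIDs by query, keeping queries in order of first appearance.
  groups <- split(res$cid, factor(res$query, levels = unique(res$query)))
  # Keep only the distinct, non-missing CIDs for each query.
  cids <- lapply(groups, function(x) unique(x[!is.na(x)]))
  data.frame(
    query = names(cids),
    n_cid = unname(lengths(cids)),
    cids = unname(vapply(cids, paste, character(1), collapse = ";")),
    stringsAsFactors = FALSE
  )
}

# Example data copied from the from = "cas" output earlier in this thread.
res <- data.frame(
  query = c("613-33-2", "613-33-2"),
  cid = c("11941", "170889"),
  stringsAsFactors = FALSE
)

out <- flag_ambiguous(res)
# Anything that did not resolve to exactly one CID still needs a manual look.
out$needs_review <- out$n_cid != 1
print(out)
```

The same helper could be applied to the result of get_cid(cas_numbers, match = "all") with and without from = "cas" (where `cas_numbers` is an assumed character vector of your CAS numbers), so that only queries where the two lookups disagree, or where more than one CID comes back, are checked by hand.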
