Determine the best sorting order #72

Closed
gaurav opened this issue Jun 30, 2023 · 1 comment

Contributor

gaurav commented Jun 30, 2023

On a development version of NameRes, I tried searching for COVID with MONDO|HP filtering turned on, Biolink type filtering set to Disease, and sorting by shortest_name_length. I got back the following results in this order:

  1. MONDO:0100233 "long COVID-19"
  2. MONDO:0100163 "COVID-19–associated multisystem inflammatory syndrome in children"
  3. MONDO:0100319 "COVID-19–associated multisystem inflammatory syndrome in adults"
  4. MONDO:0100096 "COVID-19"

This is because MONDO:0100233 ("PASC") and MONDO:0100163 ("MISC", "PIMS", "PMIS") have shorter synonyms (four characters) than MONDO:0100319 ("MIS-A") and MONDO:0100096 ("β-CoV"). The shortest synonyms of the latter two are both five characters long, so we don't have any way of separating them.

This isn't too bad, since COVID-19 is still in the first five results, but it's not ideal.
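The tie can be reproduced with a minimal sketch. It uses only the names quoted in this issue (the real cliques contain more synonyms), and the `shortest_name_length` helper is a hypothetical stand-in for however NameRes actually computes that field:

```python
# Names below are only those quoted in this issue; the real cliques
# hold more synonyms, so this is an illustration, not NameRes data.
CLIQUES = {
    "MONDO:0100233": ["long COVID-19", "PASC"],
    "MONDO:0100163": [
        "COVID-19–associated multisystem inflammatory syndrome in children",
        "MISC",
        "PIMS",
        "PMIS",
    ],
    "MONDO:0100319": [
        "COVID-19–associated multisystem inflammatory syndrome in adults",
        "MIS-A",
    ],
    "MONDO:0100096": ["COVID-19", "β-CoV"],
}


def shortest_name_length(names):
    """Hypothetical helper: length of the shortest name in a clique."""
    return min(len(name) for name in names)


lengths = {curie: shortest_name_length(names) for curie, names in CLIQUES.items()}

for curie, length in lengths.items():
    print(curie, length)
```

The first two cliques tie at one length and the last two tie at another, so sorting on this field alone cannot order the members of either pair.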

There are other stats we could measure to help improve this situation:

  • preferred_name_length: The length of the preferred name
  • information_content: The information content of the clique (this will be missing for lots of identifiers: we could sort them after every concept we have an information content value for)
  • ??? (@cbizon any ideas?)

Options for a search order:

  1. preferred_name_length > shortest_name_length
  2. information_content > preferred_name_length > shortest_name_length
  3. information_content > shortest_name_length > preferred_name_length
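The three options can be compared offline with a rough sketch like the one below. Several things here are assumptions rather than NameRes behaviour: every field is sorted ascending, records with a missing value sort last, `information_content` is left as `None` because this issue gives no values for it, and the `make_key` and `rank` helpers are hypothetical.

```python
# (curie, preferred name, synonyms) as quoted in this issue only.
RAW = [
    ("MONDO:0100233", "long COVID-19", ["PASC"]),
    ("MONDO:0100163",
     "COVID-19–associated multisystem inflammatory syndrome in children",
     ["MISC", "PIMS", "PMIS"]),
    ("MONDO:0100319",
     "COVID-19–associated multisystem inflammatory syndrome in adults",
     ["MIS-A"]),
    ("MONDO:0100096", "COVID-19", ["β-CoV"]),
]

records = [
    {
        "curie": curie,
        "preferred_name_length": len(preferred),
        "shortest_name_length": min(len(n) for n in [preferred] + synonyms),
        # No information content values appear in the issue, so none are invented.
        "information_content": None,
    }
    for curie, preferred, synonyms in RAW
]


def make_key(fields):
    """Sort ascending by each field in turn; missing (None) values sort last."""
    def key(record):
        return tuple(
            (record[f] is None, record[f] if record[f] is not None else 0)
            for f in fields
        )
    return key


def rank(recs, fields):
    return [r["curie"] for r in sorted(recs, key=make_key(fields))]


STRATEGIES = {
    "option_1": ["preferred_name_length", "shortest_name_length"],
    "option_2": ["information_content", "preferred_name_length", "shortest_name_length"],
    "option_3": ["information_content", "shortest_name_length", "preferred_name_length"],
}

for name, fields in STRATEGIES.items():
    print(name, rank(records, fields))
```

With only the names quoted above, option 1 puts MONDO:0100096 ("COVID-19") first, because it has the shortest preferred name. Options 2 and 3 cannot be told apart from the simpler orderings until real information content values are available.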

Probably the next step will be to include all three of these stats in the Solr database, and then we can make a Vue application to compare different sorting strategies until we find the one we like best.
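Once the stats are indexed, a comparison tool could switch strategies just by changing Solr's standard `sort` query parameter, which takes a comma-separated list of `field direction` clauses. The sketch below only builds the query string. The field names other than `shortest_name_length` are assumed to match the stat names above, the ascending direction is an assumption, and the `solr_sort_param` and `build_query` helpers are hypothetical.

```python
from urllib.parse import urlencode


def solr_sort_param(fields, direction="asc"):
    """Build a Solr `sort` value such as 'a asc, b asc' from an ordered field list."""
    return ", ".join(f"{field} {direction}" for field in fields)


def build_query(query, fields):
    """URL-encode a query string carrying the search term and the sort order."""
    return urlencode({"q": query, "sort": solr_sort_param(fields)})


# Option 2 from the list above, assuming these are the Solr field names.
option_2 = ["information_content", "preferred_name_length", "shortest_name_length"]

print(solr_sort_param(option_2))
print(build_query("COVID", option_2))
```

How documents with a missing `information_content` are ordered would depend on the Solr field configuration (for example the `sortMissingLast` attribute), which this sketch does not cover.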

We can also implement more complex metrics if we need to by adding them to Babel and summarizing them into a score field that we can sort on.

Contributor Author

gaurav commented Sep 26, 2024

We've made lots of improvements to how we sort our results, so I think we can close this issue and focus on the individual issues that cover cases still needing improvement (e.g. #161). But please reopen if I've missed anything!

gaurav closed this as completed Sep 26, 2024