mkdocs search is not optimal #727

Open
lbarraga opened this issue Sep 18, 2024 · 0 comments

The built-in search function provided by mkdocs is not optimal for HPC users.

How searching works now

The following line in hpc.template dictates how text is split into tokens:

  - search:
      separator: '[\_\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

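This regex is an alternation of four parts, each describing a place where text is split:

  1. [\_\s\-,:!=\[\]()"/]+: split on runs of these characters (underscore, whitespace, hyphen and the listed punctuation)
  2. (?!\b)(?=[A-Z][a-z]): split inside a word before an uppercase letter followed by a lowercase one, i.e. split PascalCase and camelCase
  3. \.(?!\d): split at ., but not when followed by a digit (to prevent version strings like 1.2.3 from being split)
  4. &[lg]t;: split on the HTML entities &lt; and &gt;, so that tag names in escaped HTML end up as separate tokens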
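
To see what this separator does to typical HPC search terms, here is a minimal sketch that applies the same pattern with Python's re module. It is only an approximation: the actual search runs in the browser with a JavaScript regex engine, and I am assuming both engines behave the same for this pattern. The tokenize helper is hypothetical and is not part of mkdocs. Splitting on zero-width matches (part 2) needs Python 3.7 or newer.

```python
import re

# Separator copied from the search config quoted above.
SEPARATOR = r'[\_\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'


def tokenize(text):
    """Split text on the separator and drop empty tokens (hypothetical helper)."""
    return [token for token in re.split(SEPARATOR, text) if token]


if __name__ == "__main__":
    samples = ["test.py", "python-bundle", "1.2.3", "PascalCase", "&lt;div&gt;"]
    for sample in samples:
        print(sample, "->", tokenize(sample))
```

In this sketch test.py and python-bundle each come out as two tokens, while the version string 1.2.3 stays in one piece.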

After splitting the text into tokens, mkdocs builds an index for fast searching.

When searching, the search term is tokenized as well. I think this uses a different tokenization than the one used to build the index. After that, an OR search is done over the resulting tokens, and pages with more hits are ranked higher.

Minor problems

  1. Splitting on PascalCase and camelCase is unnecessary
  2. We don't have HTML in the source files, so the &[lg]t; part of the regex is unnecessary

Major problems

OR search

Because mkdocs inherently uses an OR search, the search terms are not treated as one phrase.

Therefore, the highest-ranked page is the one with the most occurrences of any of the search terms combined. For example, when searching for module show, the page that pops up is

(screenshot of the top search result)

which does not even contain show. It simply has the highest combined count of module and show.
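
The effect can be reproduced with a toy model. The sketch below is not how the real search scores pages (the actual engine uses a more elaborate ranking than raw counts); it only illustrates why OR matching lets a page without show win. The page names and texts are made up for the example, and or_rank and and_filter are hypothetical helpers.

```python
from collections import Counter

# Made-up pages, for illustration only.
PAGES = {
    "modules_overview": "module module module module load module avail",
    "module_show": "module show prints details",
}


def or_rank(query, pages):
    """Rank pages by the total number of occurrences of ANY query term."""
    terms = query.split()
    scores = {}
    for name, text in pages.items():
        counts = Counter(text.split())
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def and_filter(query, pages):
    """Keep only pages that contain ALL query terms."""
    terms = query.split()
    return [
        name
        for name, text in pages.items()
        if all(term in text.split() for term in terms)
    ]


if __name__ == "__main__":
    print(or_rank("module show", PAGES))
    print(and_filter("module show", PAGES))
```

With OR ranking the overview page comes first even though it never mentions show; requiring all terms leaves only the page that actually documents module show.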

Splitting

Because the text is split on ., filenames can no longer be found reliably. test.py is split into test and py, and the pages with the most occurrences of test and py combined are ranked highest.
For the same reason, searching for python-bundle returns the pages with the most occurrences of python and bundle, because the term is split on the hyphen.

Available software

The search results are cluttered by the available software pages. A solution could be to rank these pages down or exclude them entirely (and promote use of the search function at https://docs.hpc.ugent.be/Linux/available_software/)

Conclusion

The search is bad mostly because of the OR search, which we can do little about. However, some improvements can be made in how the text is split (e.g. not splitting on .). We could also rank down the available software pages to give the "real docs" a better chance.
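
As a rough idea of what a changed separator would do, the sketch below compares the current pattern with one possible candidate that drops the . split, the PascalCase/camelCase split and the HTML entity part. The candidate is only an assumption for discussion, not a tested configuration, and Python's re module stands in for the JavaScript regex engine that the search really uses. The tokenize helper is hypothetical.

```python
import re

# Separator currently in the search config.
CURRENT = r'[\_\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

# Hypothetical candidate: only the first part of the current pattern is kept.
CANDIDATE = r'[\_\s\-,:!=\[\]()"/]+'


def tokenize(text, separator):
    """Split text on the given separator and drop empty tokens."""
    return [token for token in re.split(separator, text) if token]


if __name__ == "__main__":
    for text in ["run test.py now", "Edit test.py."]:
        print(repr(text))
        print("  current:  ", tokenize(text, CURRENT))
        print("  candidate:", tokenize(text, CANDIDATE))
```

Note the trade-off in the second example: without the . split, a period at the end of a sentence stays attached to the last word in this sketch. Whether the rest of the search pipeline trims such trailing punctuation would need to be checked before changing the config. Removing \- from the character class would keep python-bundle together in the same way.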
