mkdocs search is not optimal #727

lbarraga · 2024-09-18T13:40:06Z

The built-in search function provided by mkdocs is not optimal for HPC users.

How searching works now

The following line in hpc.template dictates how text is split into tokens:

  - search:
      separator: '[\_\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

This can be split into 4 distinct functions:

[\_\s\-,:!=\[\]()"/]+: separate text into tokens, with these characters as separators
(?!\b)(?=[A-Z][a-z]): split PascalCase and camelCase
\.(?!\d): split at ., but not when followed by a digit (to prevent version strings like 1.2.3 from being split)
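&[lg]t;: split at the HTML entities &lt; and &gt; (escaped angle brackets), so that tag names in HTML examples end up as separate tokens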
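
To make the four parts concrete, here is a minimal sketch that applies the same separator with Python's re.split. It only approximates the indexing step: the real tokenization happens inside the search plugin and in JavaScript, and the helper name tokenize is made up for illustration.

```python
import re

# Separator copied verbatim from the search config above.
SEPARATOR = r'[\_\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'


def tokenize(text):
    """Split text on the separator and drop empty tokens (hypothetical helper).

    Splitting on the zero-width PascalCase rule needs Python 3.7 or newer.
    """
    return [token for token in re.split(SEPARATOR, text) if token]


samples = ["module show", "PascalCase", "test.py", "1.2.3", "python-bundle", "&lt;div&gt;"]
for sample in samples:
    print(sample, "->", tokenize(sample))
```

In this sketch test.py and python-bundle each come out as two tokens, while 1.2.3 stays whole.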

After splitting the text into tokens, mkdocs builds an index for fast searching.

When searching for a term, the term is also tokenized (I think with a different tokenization than the one used for building the index). After that, an OR search is done, and pages with more hits are ranked higher.

Minor problems

Splitting on PascalCase and camelCase is unnecessary.
We don't have HTML in the source files, so why include that in the regex?

Major problems

OR search

Because mkdocs inherently uses an OR search, the search terms are not treated as one.

Therefore, the highest-ranked page is the one with the most occurrences of any of the search terms. For example, when searching for module show, the page that pops up first does not even contain show. It simply has the highest combined count of module and show.
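
The effect can be reproduced with a deliberately simplified sketch. It assumes that a page's score is just the raw number of occurrences of the query terms, which is not the real relevance score used by the search, and the page names and contents are made up. An AND check is included only for contrast.

```python
from collections import Counter

# Hypothetical pages; names and contents are made up for illustration.
PAGES = {
    "modules_overview": "module module module module load module avail module list",
    "module_show_howto": "use module show to inspect a module",
}


def or_score(query, text):
    """Sum the occurrences of every query term: any single term contributes (OR)."""
    counts = Counter(text.split())
    return sum(counts[term] for term in query.split())


def and_match(query, text):
    """Return True only if every query term occurs in the text (AND)."""
    words = set(text.split())
    return all(term in words for term in query.split())


query = "module show"
ranked = sorted(PAGES, key=lambda name: or_score(query, PAGES[name]), reverse=True)
print("OR ranking:", ranked)
print("AND matches:", [name for name in PAGES if and_match(query, PAGES[name])])
```

Under these assumptions the page that never mentions show still ranks first under OR, because its many module hits outweigh the single page that contains both terms.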

Splitting

Because the text is split on ., filenames can no longer be found reliably: test.py is split into test and py, and the pages with the most occurrences of test and py are ranked highest.
The same principle makes python-bundle return the pages with the most occurrences of python and bundle.
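
As a rough sketch of what a less aggressive separator could look like, the snippet below compares the current regex with a hypothetical variant that keeps . and - inside tokens and only splits on a . that ends a sentence. The variant is an assumption for illustration, not a tested recommendation, and it would only help if the search query is tokenized in the same way.

```python
import re

# Current separator, copied from the search config.
CURRENT = r'[\_\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

# Hypothetical variant: "-" is no longer a separator, and "." only splits
# when it is followed by whitespace or the end of the text.
VARIANT = r'[\_\s,:!=\[\]()"/]+|\.(?=\s|$)'


def tokenize(text, separator):
    """Split text on the given separator and drop empty tokens."""
    return [token for token in re.split(separator, text) if token]


for sample in ["test.py", "python-bundle", "Run test.py. Then check 1.2.3"]:
    print(sample)
    print("  current:", tokenize(sample, CURRENT))
    print("  variant:", tokenize(sample, VARIANT))
```

With the variant, test.py and python-bundle survive as single tokens, so a page has to contain the full filename or module name to match that token.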

Available software

The search results are cluttered by the available software pages. A solution could be to rank these pages lower or exclude them entirely (and promote use of the search function at https://docs.hpc.ugent.be/Linux/available_software/).

Conclusion

The search is bad mostly because of the OR search, which we can do little about. However, some improvements can be made in how the text is split (e.g. not splitting on .). We could also rank the available software pages lower to give the "real docs" a better chance.
