The built-in search function provided by mkdocs is not optimal for HPC users.
How searching works now
A single regex in hpc.template dictates how text is split into tokens. It can be split into 4 distinct functions:
[\_\s\-,:!=\[\]()"/]+: separate text into tokens, with any run of these characters acting as a separator
(?!\b)(?=[A-Z][a-z]): split PascalCase and camelCase words before each capitalised part
\.(?!\d): split at ., but not when it is followed by a digit (to prevent version strings like 1.2.3 from being split)
&[lg]t;: split at the escaped angle brackets &lt; and &gt;, so that HTML tag names become separate tokens
After splitting the text into tokens, mkdocs builds an index for fast searching.
When searching for a term, the term is also tokenized. I think this is a different tokenization than the one used for building the index. After that, an OR search is done, ranking pages with more hits higher.
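The effect of the four parts can be tried out offline. The sketch below is a rough Python approximation, assuming the four alternatives are simply joined with `|`; the exact line in hpc.template is not reproduced here, and the real search tokenizes in the browser, so details may differ. It relies on `re.split` handling zero-width matches, which needs Python 3.7 or newer. The name `tokenize` is made up for this illustration.

```python
import re

# Assumption: the four alternatives described above, joined with "|".
SEPARATOR = re.compile(
    r'[\_\s\-,:!=\[\]()"/]+'      # runs of separator characters
    r'|(?!\b)(?=[A-Z][a-z])'      # zero-width split inside PascalCase/camelCase
    r'|\.(?!\d)'                  # a dot that is not followed by a digit
    r'|&[lg]t;'                   # the HTML entities &lt; and &gt;
)


def tokenize(text):
    """Split text on SEPARATOR and drop empty tokens."""
    return [token for token in SEPARATOR.split(text) if token]


for sample in ["module show", "test.py", "python-bundle", "1.2.3", "PascalCase"]:
    print(f"{sample!r} -> {tokenize(sample)}")
```

With this pattern, test.py and python-bundle each come out as two tokens, while 1.2.3 stays in one piece.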
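Minor problems
Splitting on PascalCase and camelCase is unnecessary.
We don't have HTML in the source files, so there is no reason to include that in the regex.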
Major problems
OR search
Because mkdocs inherently uses an OR search, the search terms are not treated as one.
Therefore, the page that is ranked highest is the one that contains the most occurrences of any of the search terms. For example, when searching for module show, the page that pops up first does not even contain show. It just has the highest number of occurrences of module and show combined.
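A toy model makes the ranking problem concrete. This is only a sketch: the page names and counts are made up, and raw hit counts stand in for whatever relevance score the real search computes. It contrasts the OR behaviour described above with a hypothetical AND filter that keeps only pages containing every term.

```python
from collections import Counter

# Hypothetical miniature index: page name -> token counts (made-up data).
PAGES = {
    "modules-overview": Counter({"module": 12, "load": 3}),
    "module-show-howto": Counter({"module": 2, "show": 2}),
    "job-scripts": Counter({"show": 1, "job": 5}),
}


def or_search(terms):
    """Rank pages by total hits for any term; pages with zero hits are dropped."""
    scores = {name: sum(counts[t] for t in terms) for name, counts in PAGES.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > 0]


def and_search(terms):
    """Same ranking, but keep only pages that contain every term."""
    return [name for name in or_search(terms)
            if all(PAGES[name][t] > 0 for t in terms)]


print(or_search(["module", "show"]))
print(and_search(["module", "show"]))
```

In this toy data the OR ranking puts modules-overview first even though it never mentions show, while the AND filter returns only module-show-howto.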
Splitting
By splitting the text on ., filenames can no longer be found reliably. test.py is split into test and py, and the pages with the most occurrences of test and py combined are ranked highest.
The same principle applies to hyphens: python-bundle returns the pages with the most occurrences of python and bundle.
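One possible direction is a more conservative separator. The pattern below is a hypothetical variant, not a tested setting: it no longer treats hyphens as separators, only splits on a dot that ends a word (followed by whitespace or the end of the text), and drops the camelCase and HTML parts. The names `RELAXED_SEPARATOR` and `tokenize_relaxed` are invented for this sketch.

```python
import re

# Hypothetical alternative separator (an assumption, not a tested setting):
# hyphens are kept, and a dot only splits when it is followed by
# whitespace or the end of the text.
RELAXED_SEPARATOR = re.compile(r'(?:[_\s,:!=\[\]()"/]|\.(?=\s|$))+')


def tokenize_relaxed(text):
    """Split text on RELAXED_SEPARATOR and drop empty tokens."""
    return [token for token in RELAXED_SEPARATOR.split(text) if token]


print(tokenize_relaxed("Run test.py. Then load python-bundle version 1.2.3."))
```

The trade-off is that a search for bundle alone would then only find python-bundle if the search also matches partial tokens, so a change like this would need checking against real queries.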
Available software
The search results are cluttered by the available software pages. A solution could be to rank these pages down or exclude them from the search entirely (and promote the use of the search function at https://docs.hpc.ugent.be/Linux/available_software/).
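Ranking down could look roughly like the sketch below. It is only an illustration of the idea: the result list, the URL prefix check, and the penalty factor are all assumptions, and how such a weight would actually be configured in the search setup is not covered here.

```python
# Hypothetical search results as (url, score) pairs; the values are made up.
RESULTS = [
    ("available_software/detail/Python/", 9.0),
    ("available_software/detail/SciPy-bundle/", 7.5),
    ("running_batch_jobs/", 4.0),
]

PENALTY = 0.1  # assumed down-ranking factor for available software pages


def rerank(results, penalty=PENALTY):
    """Scale the score of available software pages, then sort by score."""
    adjusted = [
        (url, score * penalty if url.startswith("available_software/") else score)
        for url, score in results
    ]
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


for url, score in rerank(RESULTS):
    print(f"{score:5.2f}  {url}")
```

With the penalty applied, the regular documentation page moves to the top while the available software pages stay reachable further down the list. Filtering those pages out of the list instead would correspond to excluding them entirely.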
Conclusion
The search is bad mostly because of the OR search, which we can do little about. However, some improvements can be made in how the text is split (e.g. not splitting on .). We could also rank down the available software pages to give the "real docs" a better chance.