This repository has been archived by the owner on Feb 9, 2022. It is now read-only.

[mdanalysis] no new search results after adding new sitemaps to sitemapindex #4699

Open
3 tasks done
orbeckst opened this issue Oct 6, 2021 · 10 comments

@orbeckst
Contributor

orbeckst commented Oct 6, 2021

Bug: no search results after adding new sitemaps

If it is a DocSearch index issue, what is the related index_name ?

What is the current behaviour?

We added two new sitemaps to our sitemap index (see Any other feedback below for details) for "distopia" and "pytng". Searching for new unique content does not return any results:

If the current behaviour is a bug, please provide all the steps to reproduce and screenshots with context.

To perform a search, go to https://www.mdanalysis.org/ and use the search box as shown in the screenshot:

[Screenshot: Algolia search example on mdanalysis.org]

The screenshot shows that the unique term CalcBondsOrtho (for distopia) is not found, as explained in more detail below:

distopia content is not found

distopia failed example text

  1. search for "distopia"
  2. no exact results (only a fuzzy match in a blog post, which is not a correct match)

distopia failed example API docs

Note that this example probably fails because content is in a dl (definition list) with dt/dd tags:

  1. search for "CalcBondsOrtho"
  2. no results:

pytng content is not found

pytng failed example text

  1. search ' "TNG API" '
  2. No results found in pytng (only a blog post)

pytng failed example API docs

Note that this example probably fails to find the API doc because the content is in a dl (definition list) with dt/dd tags; the plain text, however, should still have been found.

  1. search "TNGFileIterator"
  2. No results found

What is the expected behaviour?

Relevant pages from the distopia and pytng docs are found, as indicated above. (It was clearer to include the expected results above for the individual examples).

What have you tried to solve it?

  • checked that all XML files are well formed
  • waited one week to give the algolia crawler time to pick up changes
  • checked that the new content still uses the same selector descriptors that are in the config file (the docs are produced in the same way as most of our other docs with the sphinx documentation generator)
    • still uses the same levels and p, li tags for most of the content
    • However, some content (technical API docs) also uses definition lists (dl with dt/dd tags), and the dt tags are not yet configured as text selectors. (The dd content should be fine because its text is wrapped in p tags.)
    • also uses pre tags for code samples
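To illustrate the dt/dd point above, here is a minimal sketch (stdlib only, with simplified, assumed Sphinx-style markup) showing that text selectors which only target p and li tags pick up the description inside the dd (which Sphinx wraps in p) but miss the function signature that lives directly in the dt:

```python
# Minimal sketch: why p/li-based text selectors miss <dt> content.
# The HTML below is a simplified, assumed approximation of Sphinx API output.
from html.parser import HTMLParser

SPHINX_LIKE_HTML = """
<dl class="py function">
  <dt id="distopia.CalcBondsOrtho">CalcBondsOrtho(coords0, coords1, box)</dt>
  <dd><p>Calculate bond lengths under orthorhombic periodic boundaries.</p></dd>
</dl>
"""

class TextByTag(HTMLParser):
    """Collect text content, remembering which tags enclose it."""
    def __init__(self):
        super().__init__()
        self.stack = []                      # currently open tags
        self.hits = {"p": [], "dt": []}      # text found under each tag
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == tag:
                del self.stack[i]
                break
    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for tag in ("p", "dt"):
            if tag in self.stack:
                self.hits[tag].append(text)

parser = TextByTag()
parser.feed(SPHINX_LIKE_HTML)

# A selector limited to p/li sees only the description text:
print(parser.hits["p"])   # the <dd> description, wrapped in <p>
# The signature itself is only reachable via a dt selector:
print(parser.hits["dt"])  # ['CalcBondsOrtho(coords0, coords1, box)']
```

This is consistent with the hypothesis above: the dd content is indexable through the existing p selector, while CalcBondsOrtho is invisible until dt is added as a selector.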

Any quick clues?

Some content (for example, the API docs in https://www.mdanalysis.org/distopia/api/distopia.html) are stored in definition lists (dl with dt/dd elements) and the dt tags are NOT included as selectors in the algolia config file yet. (see PR #4700)

However, we have no idea why standard text is not appearing; seeing the scraper output might help, but that requires help from Algolia staff.

Any other feedback / questions ?

We added two new sitemaps to our sitemap index https://www.mdanalysis.org/sitemapindex.xml for

<sitemap>
  <loc>https://www.mdanalysis.org/pytng/sitemap.xml</loc>
</sitemap>
<sitemap>
  <loc>https://www.mdanalysis.org/distopia/sitemap.xml</loc>
</sitemap>

Our open issues:

orbeckst added a commit to orbeckst/docsearch-configs that referenced this issue Oct 6, 2021
* new text selectors for definitions (dt tags) in dl (definition lists) as used in the sphinx-generated API docs (to index the function/class names themselves);
  dd (definition description) tags do not need a selector because their text content is wrapped in p tags inside the dd and is already selected
* new text selectors for pre (code blocks)
* see issue algolia#4699
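The kind of change the commit describes might look roughly like the following illustrative fragment of a DocSearch config (the selector strings here are placeholders, not the actual contents of mdanalysis.json; the real change is in the referenced PR):

```json
{
  "selectors": {
    "text": ".body p, .body li, .body dl dt, .body pre"
  }
}
```

The idea is simply that `dl dt` captures the function/class signatures and `pre` captures code samples, while dd content is already reached through the existing p selector.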
@orbeckst
Contributor Author

orbeckst commented Oct 7, 2021

  • created algolia account to have access to dashboard and possibly see index
  • emailed support to ask how to link mdanalysis index to my dashboard and/or get access to the new(?) Crawler management interface to see(?) the index

@shortcuts
Member

emailed support to ask how to link mdanalysis index to my dashboard and/or get access to the new(?) Crawler management interface to see(?) the index

Users will receive access to the new infrastructure on a random basis. Sorry, we are a bit early in the process right now to do bigger batches/deploy certain configs! You can read more here: https://docsearch.algolia.com/docs/migrating-from-legacy#migration-seems-to-have-started-but-i-dont-have-received-any-emails

created algolia account to have access to dashboard and possibly see index

We don't grant access to the dashboard, only to Analytics, sorry! (But it will be available in the new infra :D)

Missing pages

start_urls and stop_urls work as pattern/substring matching, so for example the URL https://www.mdanalysis.org/distopia/index.html will be skipped because of this stop_urls entry: https://www.mdanalysis.org/.*index.html$.

I'd suggest adapting the stop_urls to make sure we don't exclude URLs you'd potentially like to keep!
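A quick way to see the matching behaviour described above is to test the stop pattern with Python's re module (treating the stop_urls entry as a regular expression, as the scraper does):

```python
# Demonstrate that the stop_urls pattern also swallows project index pages.
import re

stop_pattern = r"https://www.mdanalysis.org/.*index.html$"

urls = [
    "https://www.mdanalysis.org/distopia/index.html",        # wanted, but skipped
    "https://www.mdanalysis.org/distopia/api/distopia.html", # not affected
]

skipped = [u for u in urls if re.search(stop_pattern, u)]
print(skipped)  # ['https://www.mdanalysis.org/distopia/index.html']
```

Because of the unanchored `.*`, any page whose URL ends in index.html is excluded, including the distopia landing page.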

Below, the URLs matching distopia:
[Screenshot 2021-10-08 at 15:54:10: crawled URLs matching distopia]

We can see that some selectors don't match certain pages. To make the selectors more specific per page, you can use selectors_key.
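For illustration, a selectors_key setup might look roughly like this (a sketch with placeholder selectors, not the actual mdanalysis.json): a start_urls entry names a key, and the selectors object provides per-key overrides alongside a default:

```json
{
  "start_urls": [
    {
      "url": "https://www.mdanalysis.org/distopia/",
      "selectors_key": "distopia"
    }
  ],
  "selectors": {
    "default": { "text": ".body p, .body li" },
    "distopia": { "text": ".body p, .body li, .body dl dt" }
  }
}
```

Pages matched by that start_urls entry are then scraped with the "distopia" selectors instead of the default ones.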

No index

There was a typo in your config (sorry I didn't see it); I've fixed it in: #4712

@shortcuts
Member

  • previous crawl: 67k records
  • new crawl: 85k records

@orbeckst
Contributor Author

orbeckst commented Oct 9, 2021

I installed the docsearch-scraper locally and I’m able to run it so I can now debug more easily.

@orbeckst
Contributor Author

orbeckst commented Oct 9, 2021

Well... maybe not that simple:

$ ./docsearch run ../docsearch-configs/configs/mdanalysis.json
...
algoliasearch.exceptions.RequestException: Record quota exceeded. Change plan or delete records.

Nb hits: 10415
previous nb_hits: 85975

We will need to see how to work within these limitations.

@orbeckst
Contributor Author

I am now using a scraper with disabled index submission for testing, see orbeckst/docsearch-scraper#1.

@shortcuts
Member

I am now using a scraper with disabled index submission for testing, see orbeckst/docsearch-scraper#1.

That's a good idea! It would be nice to see it as an option indeed

Let me know if I can help you debug your issue

@orbeckst
Contributor Author

Many of the missing terms are due to broken sitemaps. Apparently our Sphinx + GitHub Actions based doc deployment changed at some point, and the sitemaps now contain version information that is not actually present in the deployment URLs. That's a problem on our end.
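A hypothetical sketch of the kind of check involved: parse a sitemap and flag loc entries whose path embeds a version segment that the deployed site does not use. The sample sitemap and the version pattern below are made up for illustration; they are not our actual sitemap contents.

```python
# Flag sitemap <loc> URLs that contain a version segment (e.g. "/1.0.1/")
# which may not exist in the deployed URL layout. Illustrative data only.
import re
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mdanalysis.org/distopia/api/distopia.html</loc></url>
  <url><loc>https://www.mdanalysis.org/distopia/1.0.1/api/distopia.html</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
VERSION_SEGMENT = re.compile(r"/\d+\.\d+(\.\d+)?/")

root = ET.fromstring(SITEMAP)
locs = [el.text for el in root.findall(".//sm:loc", NS)]
suspect = [u for u in locs if VERSION_SEGMENT.search(u)]
print(suspect)  # URLs embedding a version segment, which likely 404
```

In practice one would follow up by requesting each flagged URL and checking for a 404; that step is omitted here to keep the sketch offline.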

I'll leave this issue open for the moment.

@orbeckst
Contributor Author

PR #4751 addresses some of the problems and we also fixed sitemaps. The PR has some more comments on what still seems to be missing, including the output from the scraper (for 0 record pages). Any insights why we're still missing content would be appreciated. Thanks!

@orbeckst
Contributor Author

As mentioned in PR #4751, there are still a number of "0 records" pages, namely under

From #4751 (comment) :

When pages are retrieved but without records, it's usually related to the selectors.

Testing document.querySelectorAll("[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1"); on https://www.mdanalysis.org/GridDataFormats/gridData/basic.html for example, returns no results.

You can either make your selectors broader (we often go with .class heading) to also retrieve content from these pages, or add a new selectors_key field in the start_urls.

Debugging the selectors is the next step...
