Improve SEO and maintenance of documentation versions #3741
And on the topic of the flyout, here's my thinking: I think I have become numb to the whole thing. Now, look at what happens when I change the default version to a specific version number.
By having the number in the URL and also in the flyout by default, I think it's more obvious how the user should go and switch to their version of choice. @stichbury in your opinion, do you think this would make our docs journey more palatable?
I think this is good, but doesn't it mean that you have to remember to increment the version number on every release?
It does... but sadly RTD doesn't allow much customization of the versioning rules for now. It's a small price to pay, though; it would happen only a handful of times per year.
TIL:
To note, RTD has automation rules https://docs.readthedocs.io/en/stable/automation-rules.html#actions-for-versions although the available actions may not cover this case.
Here's the 📣 proposal:
The only thing we need to understand is what the impact on indexing and SEO would be. cc @noklam @ankatiyar. Thoughts, @stichbury?
I've somewhat lost track of what your proposal entails. I would personally consider whether it's sufficient to just keep `stable` indexed.
In principle this is related to our indexing strategy. Let's chat next week.
Renamed this issue to better reflect what we should do here. In readthedocs/readthedocs.org#10648 (comment), RTD staff gave an option to inject a meta `noindex` tag. It's clear that we have to shift our strategy.
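The meta-tag injection hinted at by RTD staff could be sketched roughly like this. The version names and the `INDEXABLE_VERSIONS` policy below are assumptions for illustration, not the actual implementation:

```python
# Sketch: decide which robots meta tag (if any) a docs page should carry,
# based on the version being built. The policy here (index only "stable")
# is an assumption, not Kedro's actual configuration.

INDEXABLE_VERSIONS = {"stable"}  # hypothetical policy


def robots_meta_tag(version: str) -> str:
    """Return the <meta> tag to inject into <head> for this docs version."""
    if version in INDEXABLE_VERSIONS:
        return ""  # no tag: the page may be indexed normally
    return '<meta name="robots" content="noindex">'


# In a Sphinx build, something like this could be exposed via html_context
# and emitted from the theme's layout template for non-indexable versions.
print(robots_meta_tag("stable"))
print(robots_meta_tag("0.19.6"))
```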
Today I had to manually index https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.0 on Google (maybe there are no inbound links?) and I couldn't index 3.0.1 (it's currently blocked by our `robots.txt`).
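Whether a given version URL is blocked can be checked locally with the standard library's robots.txt parser. The `Disallow` rule below is illustrative, not the actual docs.kedro.org file:

```python
import urllib.robotparser

# Illustrative robots.txt rules (not the real docs.kedro.org file):
# block one old version path, allow everything else.
rules = """\
User-agent: *
Disallow: /projects/kedro-datasets/en/kedro-datasets-3.0.1/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

blocked = "https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/"
allowed = "https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.0/"

print(rp.can_fetch("*", blocked))  # False: crawlers are told to skip it
print(rp.can_fetch("*", allowed))  # True
```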
Summary of things to do here:
Refs: https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/, https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls
Today I've been researching this again (yeah, I have weird hobbies...). I noticed that projects hosted on https://docs.rs don't seem to exhibit these SEO problems, and also that they seemingly take a basic, but effective, approach. Compare https://docs.rs/clap/latest/clap/ with https://docs.rs/clap/2.34.0/clap/: there is no trace of anything special on the old version's pages.

What they do, though, is have very lean sitemaps. If you look at https://docs.rs/-/sitemap/c/sitemap.xml, there are only 2 entries for clap:

```xml
<url>
  <loc>https://docs.rs/clap/latest/clap/</loc>
  <lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
  <priority>1.0</priority>
</url>
<url>
  <loc>https://docs.rs/clap/latest/clap/all.html</loc>
  <lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
  <priority>0.8</priority>
</url>
```

Compare it with https://docs.kedro.org/sitemap.xml, which is, in comparison... less than ideal:

```xml
<url>
  <loc>https://docs.kedro.org/en/stable/</loc>
  <lastmod>2024-08-01T18:53:11.571849+00:00</lastmod>
  <changefreq>weekly</changefreq>
  <priority>1</priority>
</url>
<url>
  <loc>https://docs.kedro.org/en/latest/</loc>
  <lastmod>2024-08-09T09:39:27.628501+00:00</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.9</priority>
</url>
<url>
  <loc>https://docs.kedro.org/en/0.19.7/</loc>
  <lastmod>2024-08-01T18:53:11.647322+00:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>
<url>
  <loc>https://docs.kedro.org/en/0.19.6/</loc>
  <lastmod>2024-05-27T16:32:42.584307+00:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.7</priority>
</url>
<url>
  <loc>https://docs.kedro.org/en/0.19.5/</loc>
  <lastmod>2024-04-22T11:56:55.928132+00:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://docs.kedro.org/en/0.19.4.post1/</loc>
  <lastmod>2024-05-17T12:25:27.050615+00:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
</url>
...
```

The way I read this is that RTD is treating tags as long-lived branches, and as a result telling search engines that docs of old versions will be updated monthly, which in our current scheme is incorrect. I am not sure if this is something worth reporting to RTD, but maybe we should look at uploading a custom `sitemap.xml` instead.
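A docs.rs-style lean sitemap could be generated with a short script along these lines. The URLs and priorities are illustrative; this is a sketch, not what RTD produces:

```python
import xml.etree.ElementTree as ET


def build_sitemap(urls):
    """Build a minimal sitemap.xml listing only the given URLs.

    `urls` is a list of (loc, priority) pairs -- e.g. just the
    stable/latest entry points, docs.rs-style, with no old versions.
    """
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for loc, priority in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "priority").text = str(priority)
    return ET.tostring(urlset, encoding="unicode")


sitemap = build_sitemap([
    ("https://docs.kedro.org/en/stable/", 1.0),
    ("https://docs.kedro.org/en/latest/", 0.9),
])
print(sitemap)
```

The point is that old tags simply never appear, so there is nothing for crawlers to be told about them.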
Xref in case it's useful: https://github.com/jdillard/sphinx-sitemap
Would it make sense to manually create the sitemap first and see if it works as expected? If successful, we could then consider incorporating an automated generation process in the next step, if needed.
For reference, I tried the redirection trick described in #3741 (comment) for kedro-datasets #4145 (comment) and it seems to be working. I don't want to boil the ocean right now because we're in the middle of a delicate SEO experimentation phase, but when the dust settles, I will propose this for all our projects.
The sitemap hasn't changed 😬 https://docs.kedro.org/sitemap.xml
Newsflash: RTD now excludes hidden versions from the automatically generated sitemap readthedocs/readthedocs.org#11675 |
After a discussion with @astrojuanlu and an unsuccessful attempt to apply a custom sitemap.xml to the Kedro documentation in issue #4261, we changed all Kedro documentation versions, except for "stable" and "latest", to hidden in the Read the Docs (RTD) web dashboard. This immediately updated our `sitemap.xml`. However, there is still an issue with the subfolders, "viz" and "datasets": hiding versions for these subfolders does not have the same effect.
I received an answer from the RTD team:
If I understand correctly, this means that a manual `sitemap.xml` is feasible. I think we should give this a try. What do you think, @astrojuanlu?

After we hid all versions of the main Kedro project, the search results improved for Kedro, but for Datasets and Viz they still seem to reference old versions. For example, if I search "kedro matplotlib dataset" on Google, I see everything except the correct link.
From my understanding, removing all old versions from the sitemap didn't hide them from the search results: in fact, none of these URLs are referenced in any of our current sitemaps, yet they still show up.
Long story short, the hypothesis I proposed in #3741 (comment) has been disproven: just limiting the sitemap does not remove old versions from the index. Now, if we want them gone, the method suggested by Google has 2 flavors:

- a `<meta name="robots" content="noindex">` tag in the page's HTML, or
- an `X-Robots-Tag: noindex` HTTP response header.
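The two flavors Google documents can be sketched as follows. This is a hedged illustration of the mechanisms, not our build code:

```python
# Two ways to tell crawlers not to index a page:

# 1. An HTML meta tag in <head> -- something we can inject ourselves
#    from the documentation build.
def noindex_meta_tag() -> str:
    return '<meta name="robots" content="noindex">'


# 2. An HTTP response header -- requires control over the web server,
#    which we don't have on Read the Docs hosting.
def noindex_header() -> tuple[str, str]:
    return ("X-Robots-Tag", "noindex")


print(noindex_meta_tag())
print(noindex_header())
```

Since RTD serves the pages for us, the meta-tag flavor is the one we can realistically apply.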
@astrojuanlu, I agree that to achieve more reliable blocking of old documentation versions from being indexed, we should use content="noindex". I can work on implementing this approach in our Sphinx build. Additionally, if we continue with the current autogenerated setup, it's likely that recent versions of the Datasets and Viz documentation will remain unindexed, as we've observed. Therefore, I think we should consider reverting to our previous custom-generated `sitemap.xml`.
Indeed, I'd say let's split this problem in two?
@astrojuanlu, I explored a few approaches in PR #4261, and one of them works: commit ff07526. This solution adds a `noindex` meta tag to the pages of old versions.
For the current release, I propose moving forward only with a manual update to `sitemap.xml`.
Yes, let's move forward with this for now 👍🏼
I know I'm a pain in the neck 😬 but I'll leave this ticket open until we're happy with the solution...
The new
In #2980 we discussed the fact that too many Kedro versions appear in search results.
We fixed that in #3030 by manually controlling which versions we wanted to be indexed.
This caused a number of issues though, most importantly #3710: we had accidentally excluded our subprojects from our search results.
We fixed that in #3729 in a somewhat unsatisfactory fashion. In particular, there are concerns about consistency and maintainability #3729 (comment) (see also #2600 (comment) about the problem of projects under kedro-org/kedro-plugins not having a `stable` version).

In addition, my mind has evolved a bit and I think we should only index 1 version in search engines: `stable`. There were concerns about users not understanding the flyout menu #2980 (comment), and honestly the `latest` part is also quite confusing (#2823, readthedocs/readthedocs.org#10674), but that's a whole separate discussion.

For now, the problem we want to solve is the maintenance of `robots.txt`, ideally by not having to ever touch it again.
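For context, a hand-maintained `robots.txt` of the kind we want to stop touching looks roughly like this. The paths are illustrative, not the live file:

```text
# Illustrative robots.txt: allow only the stable docs, block every
# other version under /en/. Per RFC 9309, the most specific matching
# rule wins, so Allow overrides the broader Disallow.
User-agent: *
Allow: /en/stable/
Disallow: /en/
```

The maintenance burden comes from rules like these needing an edit every time versions are added or retired, which is exactly what we want to automate away.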