Improve SEO and maintenance of documentation versions #3741

astrojuanlu · 2024-03-26T15:34:50Z

In #2980 we discussed the fact that too many Kedro versions appear in search results.

We fixed that in #3030 by manually controlling which versions we wanted to be indexed.

This caused a number of issues though, most importantly #3710: we had accidentally excluded our subprojects from our search results.

We fixed that in #3729 in a somewhat unsatisfactory fashion. In particular, there are concerns about consistency and maintainability #3729 (comment) (see also #2600 (comment) about the problem of projects under kedro-org/kedro-plugins not having a stable version).

In addition, my mind has evolved a bit and I think we should only index 1 version in search engines: stable. There were concerns about users not understanding the flyout menu #2980 (comment) and honestly the latest part is also quite confusing (#2823, readthedocs/readthedocs.org#10674) but that's a whole separate discussion.

For now, the problems we want to solve are

Control excessive Kedro versions in search results #2980 again (not reopening it, hence this issue) by allowing only 1 version, the most recent stable one, and
The ongoing maintenance of robots.txt, ideally by not having to ever touch it again.


astrojuanlu · 2024-03-26T15:36:49Z

And on the topic of the flyout, here's my thinking:

I think I have become numb to the whole stable/latest from RTD, but I think @stichbury is right that this is not at all obvious.

Now, look at what happens when I change the default version to be 0.19.3 instead of the current stable:

User types https://docs.kedro.org
User gets redirected to https://docs.kedro.org/en/0.19.3/
The flyout shows the number:

By having the number in the URL and also in the flyout by default, I think it's more obvious how the user should go and switch to their version of choice.

@stichbury in your opinion, do you think this would make our docs journey more palatable?

stichbury · 2024-03-26T16:27:16Z

I think this is good, but doesn't it mean that you have to remember to increment the version number for stable in the control panel each time you make a release? If you don't it makes it hard to find the docs for that release (which incidentally are the latest stable 🤦 docs).

astrojuanlu · 2024-03-26T16:55:47Z

doesn't it mean that you have to remember to increment the version number for stable in the control panel each time you make a release?

It does... but sadly RTD doesn't allow lots of customization about the versioning rules for now. It's a small price to pay though, would happen only a handful of times per year.

astrojuanlu · 2024-03-26T18:07:41Z

TIL: robots.txt and pages actually indexed by Google are completely orthogonal #3708 (comment)

astrojuanlu · 2024-04-02T06:27:34Z

To note, RTD has automation rules https://docs.readthedocs.io/en/stable/automation-rules.html#actions-for-versions although the stable/latest rules are unfortunately implicit readthedocs/readthedocs.org#5319

astrojuanlu · 2024-04-12T08:14:32Z

I think the /stable/:splat -> /page/:splat redirection trick that was recommended to us in readthedocs/readthedocs.org#11183 (comment) can also solve the long-standing problem of not having stable versions for repos in kedro-plugins #2600 (comment)

Here's the 📣 proposal

We turn /stable into a redirection to, well, the most recent stable version, in all subprojects (framework, viz, datasets)
All links to /stable will keep working, but instead of staying in /stable, they will get automatically redirected to the corresponding version, for example /0.19.4 or /projects/kedro-datasets/3.0.0
/latest will continue being /latest because it's not possible to rename it (readthedocs/readthedocs.org#10674), but it will continue having a "This is the latest development version" banner that we can tweak with CSS in the future
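
As a rough illustration of the proposal above, this is how a /stable/:splat rule would map incoming paths. The function name, the `/en` prefix default and the example page path are assumptions made for the sketch; the real redirect would be configured on Read the Docs, not in code.

```python
def redirect_stable(path: str, target_version: str, prefix: str = "/en") -> str:
    """Rewrite ``<prefix>/stable/<splat>`` to ``<prefix>/<target_version>/<splat>``.

    Paths that are not under the stable prefix are returned unchanged.
    """
    stable_root = f"{prefix}/stable"
    if path == stable_root or path.startswith(stable_root + "/"):
        splat = path[len(stable_root):]  # keeps the leading "/" (or is empty)
        return f"{prefix}/{target_version}{splat}"
    return path


# Hypothetical page path, used only to show the mapping.
print(redirect_stable("/en/stable/get_started/install.html", "0.19.4"))
print(redirect_stable("/en/latest/index.html", "0.19.4"))
```

The second call shows that /latest is left alone, which matches the third point of the proposal.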

The only thing we need to understand is what would be the impact on indexing and SEO cc @noklam @ankatiyar

Thoughts @stichbury ?

stichbury · 2024-04-12T09:58:24Z

I've somewhat lost track of what your robots.txt changes have been, but as I understand it, you want to index just 1 version and this would be stable and this would be what is shown in search results (but in fact, if the user navigates to stable they're redirected to a numbered version). Is this workable -- does the google crawler cope with redirects?

I would personally consider if it's sufficient to just keep stable as the indexed version and avoid the redirecting shenanigans. It is introducing complexity which makes maintenance harder. I understand the reasoning (I think, you can brief me in our next call) but is this helping users? (I think most users can cope with the concept of "stable" after all and some may actively seek it out). Let's discuss on Monday but if you need/want to go ahead in the meantime, please do, under some vague level of advisement!

astrojuanlu · 2024-04-12T12:59:08Z

In principle this is related to our indexing strategy, robots.txt etc but goes beyond that, it's more about keeping /stable as something our users get used to, or moving away from that to establish consistency across the subprojects.

Let's chat next week

astrojuanlu · 2024-05-21T15:37:13Z

Renamed this issue to better reflect what we should do here.

In readthedocs/readthedocs.org#10648 (comment), RTD staff gave an option to inject meta noindex tags on the docs depending on the versioning. That technique is very similar to the one described in https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/ (discovered by @noklam).

It's clear that we have to shift our strategy by:

Avoiding mangling robots.txt going forward
Improving how we craft our sitemaps
Adding some templating tricks to our docs so that meta noindex and link rel=canonical HTML tags are properly generated

astrojuanlu · 2024-06-05T07:41:10Z

Today I had to manually index https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.0 on Google (maybe there are no inbound links?) and I couldn't index 3.0.1 (it's currently blocked by our robots.txt).
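
For anyone wanting to check this kind of blocking locally, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content below is hypothetical (it is not the real docs.kedro.org file); it only reproduces the situation where one version is explicitly allowed and a newer one is not.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real docs.kedro.org file.
# Allow comes first because Python's parser applies the first matching rule.
ROBOTS_TXT = """\
User-agent: *
Allow: /projects/kedro-datasets/en/kedro-datasets-3.0.0/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://docs.kedro.org/projects/kedro-datasets/en"
for version in ("kedro-datasets-3.0.0", "kedro-datasets-3.0.1"):
    url = f"{base}/{version}/"
    status = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(url, "->", status)
```

With an allow-list like this, every new release stays blocked until someone edits the file, which is exactly the maintenance burden described in this issue.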

astrojuanlu · 2024-07-16T12:02:43Z

Summary of things to do here:

Stop manually crafting our robots.txt and use the default one generated by Read the Docs (docs)
Add some logic to our kedro-sphinx-theme so that rel=canonical links pointing to /stable are inserted in older versions, as suggested in the RTD issue Add meta tags "noindex, nofollow" for hidden version, readthedocs/readthedocs.org#10648 (comment)
Consider making those changes retroactive for a few versions, and if too much work or not feasible, propose alternatives
Pause and evaluate results of efforts so far
Consider crafting a sitemap.xml manually (docs)
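
To make the rel=canonical item more concrete, here is a sketch of the URL mapping it implies, assuming versioned docs live under /en/<version>/ as they do on docs.kedro.org. The function name and the regular expression are illustrative assumptions, not existing kedro-sphinx-theme code.

```python
import re

# Assumption: versioned docs live under /en/<version>/ as on docs.kedro.org.
VERSION_SEGMENT = re.compile(r"/en/[^/]+/")


def canonical_link(page_url: str) -> str:
    """Return a rel=canonical tag pointing at the /en/stable/ twin of page_url."""
    stable_url = VERSION_SEGMENT.sub("/en/stable/", page_url, count=1)
    return f'<link rel="canonical" href="{stable_url}">'


print(canonical_link("https://docs.kedro.org/en/0.18.3/index.html"))
```

One caveat of this simple mapping: a page that exists in an old version may have been moved or removed in stable, in which case the canonical link would point at a missing page.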

Refs: https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/, https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls

astrojuanlu · 2024-08-10T11:08:01Z

Today I've been researching this again (yeah, I have weird hobbies...)

I noticed that projects hosted on https://docs.rs don't seem to exhibit these SEO problems, and also that they seemingly take a basic, but effective, approach.

Compare https://docs.rs/clap/latest/clap/ with https://docs.rs/clap/2.34.0/clap/. There is no trace of <meta name="robots" content="noindex, nofollow"> tags.

What they do have, though, is very lean sitemaps. If you look at https://docs.rs/-/sitemap/c/sitemap.xml, there are only 2 entries for clap:

  <url>
    <loc>https://docs.rs/clap/latest/clap/</loc>
    <lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://docs.rs/clap/latest/clap/all.html</loc>
    <lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
    <priority>0.8</priority>
  </url>

Compare it with https://docs.kedro.org/sitemap.xml, which is, in comparison... less than ideal:

  <url>
    <loc>https://docs.kedro.org/en/stable/</loc>
    <lastmod>2024-08-01T18:53:11.571849+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://docs.kedro.org/en/latest/</loc>
    <lastmod>2024-08-09T09:39:27.628501+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://docs.kedro.org/en/0.19.7/</loc>
    <lastmod>2024-08-01T18:53:11.647322+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://docs.kedro.org/en/0.19.6/</loc>
    <lastmod>2024-05-27T16:32:42.584307+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://docs.kedro.org/en/0.19.5/</loc>
    <lastmod>2024-04-22T11:56:55.928132+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.6</priority>
  </url>
  <url>
    <loc>https://docs.kedro.org/en/0.19.4.post1/</loc>
    <lastmod>2024-05-17T12:25:27.050615+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
...

The way I read this is that RTD is treating tags as long-lived branches, and as a result telling search engines that docs of old versions will be updated monthly, which in our current scheme is incorrect.

I am not sure if this is something worth reporting to RTD, but maybe we should look at uploading a custom sitemap.xml before doing the whole retroactive meta tag story.
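
If we do go the custom sitemap.xml route, generating a lean one is straightforward. Below is a sketch using only the standard library; the function name is an assumption, and the list of URLs to include is up to us (the call at the bottom lists only stable and latest as an example).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing only the given URLs."""
    # Register the sitemap namespace as the default one so tags have no prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_sitemap([
    "https://docs.kedro.org/en/stable/",
    "https://docs.kedro.org/en/latest/",
]))
```

This deliberately omits lastmod, changefreq and priority, which are optional in the sitemap protocol, so there is no risk of advertising a misleading update frequency for old versions.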

astrojuanlu · 2024-09-05T12:56:33Z

xref in case it's useful https://github.com/jdillard/sphinx-sitemap

DimedS · 2024-09-05T13:21:52Z

xref in case it's useful https://github.com/jdillard/sphinx-sitemap

Would it make sense to manually create the sitemap first and see if it works as expected? If successful, we could then consider incorporating an automated generation process in the next step, if needed.

astrojuanlu · 2024-09-06T07:20:53Z

For reference, I tried the redirection trick described in #3741 (comment) for kedro-datasets #4145 (comment) and it seems to be working.

I don't want to boil the ocean right now because we're in the middle of some delicate SEO experimentation phase, but when the dust settles, I will propose this for all our projects.

astrojuanlu · 2024-10-10T20:53:04Z

The sitemap hasn't changed 😬 https://docs.kedro.org/sitemap.xml

astrojuanlu · 2024-10-15T10:03:45Z

Newsflash: RTD now excludes hidden versions from the automatically generated sitemap readthedocs/readthedocs.org#11675

DimedS · 2024-11-13T13:39:20Z

After a discussion with @astrojuanlu and an unsuccessful attempt to apply a custom sitemap.xml to the Kedro documentation in issue #4261, we changed all Kedro documentation versions, except for "stable" and "latest," to hidden in the Read the Docs (RTD) web dashboard. This immediately updated our robots.txt and sitemap.xml to the desired state for the Kedro project.

However, there is still an issue with subfolders, "viz" and "datasets." Hiding versions for these subfolders does not affect robots.txt and sitemap.xml, so we currently don’t know how to manage them properly. We’ve contacted the RTD team with a question about this via the support portal.

DimedS · 2024-11-18T10:51:50Z

I received an answer from the RTD team:

As noted in our docs, the way to use a custom sitemap is via the robots.txt: https://docs.readthedocs.io/en/stable/reference/sitemaps.html#custom-sitemap-xml

The robots.txt is served from the default version of the docs, because it's served at the top level, we have to choose a version to serve it from. So if you merge the updated robots.txt into your default version, that should be served.

If I understand correctly, this means that to implement a manual sitemap.xml, we also need to use a manual robots.txt and include the link to sitemap.xml there.
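
A quick way to sanity-check that a hand-written robots.txt really advertises the sitemap is Python's `urllib.robotparser` (its `site_maps()` method needs Python 3.8+). The file content and the sitemap location below are hypothetical, chosen only to illustrate the `Sitemap:` directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical content: the sitemap location is an assumption for illustration.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Allow: /

Sitemap: https://docs.kedro.org/en/stable/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

# site_maps() returns the list of Sitemap URLs, or None if there are none.
print(sitemap_parser.site_maps())
```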

I think we should give this a try. What do you think, @astrojuanlu? After we hid all versions of the main Kedro project, the search results improved for Kedro, but for datasets and Viz, it still seems to be referencing old versions. For example, if I search "kedro matplotlib dataset" on Google, I see everything except the correct link:

https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-5.1.0/api/kedro_datasets.matlab.MatlabDataset.html

astrojuanlu · 2024-11-18T11:51:10Z

There were changes definitely.

The only 2 pages on the current sitemap.xml now have indexing problems though, in that they're detected as duplicates:

Will elaborate a bit more later.

astrojuanlu · 2024-11-18T19:26:55Z

From my understanding, removing all old versions from the sitemap didn't hide them from the search results:

In fact, none of these URLs are referenced in any of our current sitemaps. Not even /stable (confirming that "child" URLs of a sitemap aren't automatically included):

0.18.3	stable

Long story short, the hypothesis I proposed in #3741 (comment) has been disproven. Just limiting the sitemap.xml didn't work.

Now, if we use robots.txt to de-index those pages, we'll go back to square 1 and get errors again:

The method suggested by Google has 2 flavors:

A <meta name="robots" content="noindex"> tag. This requires regenerating the HTML of all old versions, and preparing the theme to do it dynamically going forward
An X-Robots-Tag HTTP header. This would need to be implemented on Read the Docs. I proposed it in the RTD issue Add meta tags "noindex, nofollow" for hidden version, readthedocs/readthedocs.org#10648 (comment)
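
Whichever flavor we pick, we can verify it from the outside. Here is a sketch, standard library only, that checks a page's HTML and response headers for either signal; the class and function names are assumptions, and the header lookup expects the exact key `X-Robots-Tag`.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            self.directives.append((attr_map.get("content") or "").lower())


def is_noindex(html: str, headers: dict) -> bool:
    """True if either the HTML or the X-Robots-Tag header asks for noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    header_value = headers.get("X-Robots-Tag", "").lower()
    in_meta = any("noindex" in directive for directive in finder.directives)
    return in_meta or "noindex" in header_value


print(is_noindex('<head><meta name="robots" content="noindex"></head>', {}))
print(is_noindex("<head></head>", {"X-Robots-Tag": "noindex"}))
```

Keep in mind that a crawler can only see either signal if the page is not blocked by robots.txt, which is why combining robots.txt blocking with noindex does not work.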

DimedS · 2024-11-19T11:03:00Z

@astrojuanlu, I agree that to achieve more reliable blocking of old documentation versions from being indexed, we should use content="noindex". I can work on implementing this approach in our Sphinx build.

Additionally, if we continue with the current autogenerated setup, it’s likely that recent versions of the Datasets and Viz documentation will remain unindexed, as we've observed. Therefore, I think we should consider reverting to our previous custom-generated robots.txt, possibly alongside a custom-generated sitemap.xml.

astrojuanlu · 2024-11-19T14:37:05Z

Indeed, I'd say let's split this problem in two?

P0: Index subprojects (viz, datasets)
P1: De-index old versions

DimedS · 2024-11-21T09:43:19Z

@astrojuanlu, I explored a few approaches in PR #4261, and one of them works: commit ff07526. This solution adds <meta name="robots" content="noindex, nofollow"> to the <head> section during the build. However, I realised:

It’s unlikely we can rebuild old docs since they are based on tagged commits in our repo, which cannot be modified.
Creating separate branches for each tagged commit to rebuild the docs is a potential workaround but seems fragile.
After reviewing the docs for Airflow and MLflow, I didn’t find similar noindex code in their HTML <head> sections. This suggests it may not be a common solution.

For the current release, I propose moving forward only with a manual update to robots.txt and sitemap.xml that will solve P0.

astrojuanlu · 2024-11-25T06:45:14Z

For the current release, I propose moving forward only with a manual update to robots.txt and sitemap.xml that will solve P0.

Yes let's move forward with this for now 👍🏼

astrojuanlu · 2024-11-26T11:04:58Z

I know I'm a pain in the neck 😬 but I'll leave this ticket open until we're happy with the solution...

astrojuanlu · 2024-11-27T17:21:37Z

The new robots.txt was picked up https://docs.kedro.org/robots.txt

astrojuanlu · 2024-11-27T17:23:31Z

kedro-viz is indexed

astrojuanlu added the Component: Documentation 📄 (Issue/PR for markdown and API documentation) and Component: DevOps (Issue/PR that addresses automation, CI, GitHub setup) labels Mar 26, 2024

astrojuanlu mentioned this issue Mar 26, 2024

Expand robots.txt for Kedro-Viz and Kedro-Datasets docs #3729

Merged

7 tasks

github-actions bot mentioned this issue Apr 1, 2024

Monthly issue metrics report #3764

Open

astrojuanlu mentioned this issue Apr 2, 2024

Versions: let users define algorithm to define stable readthedocs/readthedocs.org#11183

Open

astrojuanlu changed the title ~~Maintenance of documentation versions is complex~~ Improve SEO and maintenance of documentation versions May 21, 2024

astrojuanlu added this to Kedro Framework May 21, 2024

astrojuanlu mentioned this issue May 23, 2024

Update robots.txt for 0.19.6 #3885

Merged

7 tasks

astrojuanlu mentioned this issue Jun 10, 2024

Documentation: Catch up documentation with Framework kedro-org/kedro-viz#1724

Closed

astrojuanlu mentioned this issue Jun 25, 2024

[DRAFT] Fix/favicon and active tab kedro-org/kedro-sphinx-theme#6

Closed

merelcht assigned astrojuanlu Jul 15, 2024

astrojuanlu removed their assignment Jul 16, 2024

merelcht moved this to To Do in Kedro Framework Jul 22, 2024

merelcht assigned DimedS Jul 22, 2024

DimedS moved this from To Do to In Progress in Kedro Framework Aug 1, 2024

DimedS linked a pull request Aug 2, 2024 that will close this issue

Remove custom robots.txt in favor of RTD default #4055

Merged

7 tasks

astrojuanlu mentioned this issue Aug 21, 2024

Remove custom robots.txt in favor of RTD default #4055

Merged

7 tasks

DimedS closed this as completed in #4055 Aug 21, 2024

DimedS moved this from In Review to In Progress in Kedro Framework Sep 5, 2024

DimedS mentioned this issue Sep 5, 2024

Manually created sitemap.xml for improved control over indexed docs pages #4145

Merged

7 tasks

DimedS linked a pull request Sep 5, 2024 that will close this issue

Manually created sitemap.xml for improved control over indexed docs pages #4145

Merged

7 tasks

lrcouto closed this as completed in #4145 Oct 10, 2024

github-project-automation bot moved this from In Review to Done in Kedro Framework Oct 10, 2024

astrojuanlu reopened this Oct 10, 2024

github-project-automation bot moved this from Done to In Progress in Kedro Framework Oct 10, 2024

DimedS linked a pull request Oct 28, 2024 that will close this issue

Manually create a sitemap.xml for docs SEO #4261

Merged

7 tasks

DimedS mentioned this issue Nov 22, 2024

Manually create a sitemap.xml for docs SEO #4261

Merged

7 tasks

DimedS closed this as completed in #4261 Nov 26, 2024

github-project-automation bot moved this from In Review to Done in Kedro Framework Nov 26, 2024

astrojuanlu reopened this Nov 26, 2024

github-project-automation bot moved this from Done to In Progress in Kedro Framework Nov 26, 2024

astrojuanlu mentioned this issue Dec 2, 2024

Plugins in this monorepo are hard to find from search engines kedro-org/kedro-plugins#401

Open
